The generic auth plugin sends a discovery request before sending
the auth request. If the host running the keystone service is offline,
the auth request takes twice the specified timeout
before failing. This causes subcloud audit requests to become
severely backed up over time when the system has a large number
of offline subclouds. Consequently, a newly deployed or
rehomed subcloud would take many hours to come online.
With the v3 password auth plugin, the auth request does not include
a discovery request. Hence, a request to an offline subcloud fails
at the specified timeout. As a result, dcmanager audit workers
have enough time to process all requests in their queues within the
audit cycle.
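
A back-of-the-envelope sketch of why the extra discovery request matters.
The model is deliberately simplified (each worker handles its share of
offline subclouds one after another), and the subcloud count, worker count
and timeout are hypothetical values, not figures from this change.

```python
import math


def worst_case_audit_seconds(offline_subclouds, workers, timeout,
                             requests_per_attempt):
    """Worst-case time to time out once on every offline subcloud.

    requests_per_attempt is 2 when a discovery request precedes the auth
    request (generic plugin) and 1 when only the auth request is sent
    (v3 password plugin).
    """
    per_subcloud = timeout * requests_per_attempt
    # Simplified model: each worker processes its share sequentially.
    rounds = math.ceil(offline_subclouds / workers)
    return rounds * per_subcloud


# Hypothetical numbers: 1000 offline subclouds, 50 workers, 10 s timeout.
generic = worst_case_audit_seconds(1000, 50, 10, requests_per_attempt=2)
v3_password = worst_case_audit_seconds(1000, 50, 10, requests_per_attempt=1)
print(generic, v3_password)
```

Under this model the time spent waiting on offline subclouds halves, which
is the headroom the audit workers need to drain their queues.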
Test Plan:
Deploy a large number (1K or more) of subclouds in a DC system
then take all of them offline. After a few hours:
- Start up one of the offline subclouds and verify that it
becomes online shortly after the startup sequence.
- Deploy a new simplex subcloud and verify that the subcloud
becomes online shortly after the initial unlock.
Closes-Bug: 2067771
Co-authored-by: Rei Oliveira <Reinildes.JoseMateusOliveira@windriver.com>
Change-Id: I6ab8f41cecac3b3909aaee7085cc69fa9fed6388
Signed-off-by: Tee Ngo <tee.ngo@windriver.com>
Introducing a precheck state for subcloud software
orchestration. This state conducts checks to validate
the orchestration process by:
1. Identifying any VIM strategy created in the subcloud.
2. Verifying if the release has prestaged data.
Test Plan (inservice patch/AIO simplex):
- Execute the software orchestration and confirm that
the precheck runs for the subcloud.
Story: 2010676
Task: 48727
Change-Id: Ib3966e702cdf1c141196fe44d15249ebf036b3a2
Signed-off-by: Hugo Brito <hugo.brito@windriver.com>
The following changes aim to reduce the number of database
requests by dcmanager:
- Move logic to filter out non-qualified subclouds from audit
worker to audit manager.
- Add new db api to mark end audit timestamp in bulk.
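
The two changes above can be sketched as follows. This is a minimal
illustration using an in-memory SQLite database; the table layout, column
names, state names and the helpers split_subclouds and bulk_end_audit are
hypothetical and do not reflect the actual dcmanager schema or db api.

```python
import sqlite3
from datetime import datetime, timezone


def split_subclouds(subclouds, skip_states):
    """Manager-side filter: return (to_audit, to_skip) lists of ids."""
    to_audit, to_skip = [], []
    for sc in subclouds:
        target = to_skip if sc["deploy_status"] in skip_states else to_audit
        target.append(sc["id"])
    return to_audit, to_skip


def bulk_end_audit(conn, subcloud_ids, finished_at):
    """Set the end audit timestamp for all ids in one transaction."""
    if not subcloud_ids:
        return 0
    placeholders = ",".join("?" for _ in subcloud_ids)
    with conn:  # commits once on success, rolls back on error
        cur = conn.execute(
            "UPDATE subclouds SET audit_finished_at = ? "
            f"WHERE id IN ({placeholders})",
            [finished_at, *subcloud_ids],
        )
    return cur.rowcount


# Hypothetical data: two subclouds are still being deployed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subclouds ("
    "id INTEGER PRIMARY KEY, deploy_status TEXT, audit_finished_at TEXT)"
)
rows = [(1, "complete"), (2, "installing"),
        (3, "bootstrapping"), (4, "complete")]
conn.executemany(
    "INSERT INTO subclouds (id, deploy_status) VALUES (?, ?)", rows)
conn.commit()

subclouds = [{"id": i, "deploy_status": s} for i, s in rows]
to_audit, to_skip = split_subclouds(
    subclouds, {"installing", "bootstrapping"})
updated = bulk_end_audit(
    conn, to_skip, datetime.now(timezone.utc).isoformat())
```

Only the ids in to_audit would be sent to the workers; the skipped ones are
stamped with a single UPDATE instead of one request per subcloud.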
Test Plan:
Perform a batch subcloud deployment
- Verify that subclouds being deployed are excluded from the
audit_subclouds RPC requests to dcmanager audit workers.
- Verify that the end audit timestamp of skipped subclouds
is set in one database transaction.
Story: 2011106
Task: 50218
Change-Id: Ie4f9804a0ef870f81eb726fb9cd451b5284962ab
Signed-off-by: Tee Ngo <tee.ngo@windriver.com>
This commit addresses the odd cases where the
/var/www/pages/iso/<rel>/ostree_repo bind mount
becomes missing or stale.
We add detection of missing content, and also detect a stale bind mount.
A stale bind mount is detected by comparing the inode numbers
of the bind-mounted /var/www/pages/iso/<rel>/ostree_repo and original
/var/www/pages/feed/rel-<rel>/ostree_repo directory.
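
A minimal sketch of the inode comparison described above. The helper name
bind_mount_is_stale is hypothetical, and treating a missing path as stale
is an assumption made for the example. The demo does not create a real
bind mount (that needs root); it stands in for one by keeping the old
directory under a different name.

```python
import os
import tempfile


def bind_mount_is_stale(mount_path, source_path):
    """Return True if mount_path no longer shows the source directory."""
    try:
        mounted = os.stat(mount_path)
        source = os.stat(source_path)
    except FileNotFoundError:
        # Assumption: missing content is also treated as needing a remount.
        return True
    # A healthy bind mount reports the same inode as its source directory.
    return mounted.st_ino != source.st_ino


with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "ostree_repo")
    os.mkdir(source)
    fresh = bind_mount_is_stale(source, source)

    # Replace the source directory; the old one keeps its inode.
    old = source + ".orig"
    os.rename(source, old)
    os.mkdir(source)
    stale = bind_mount_is_stale(old, source)

    missing = bind_mount_is_stale(os.path.join(root, "absent"), source)
```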
NOTES:
- The self.www_root variable is changed to self.www_iso_root to make
it more obvious that this is the /var/www/pages/iso path, not the feed
path.
- Now using the 'sh' python library for the mount commands, which is
much more convenient and straight-forward than the subprocess library
Test Plan:
PASS:
- Unmount (but do not delete) the /var/www/pages/iso/<rel>/ostree_repo
directory. When a subcloud add or deploy operation is done, the bind
mount is recreated.
- Stale mount:
# Replace the original
sudo cp -a /var/www/pages/feed/rel-24.09/ostree_repo \
/var/www/pages/feed/rel-24.09/ostree_repo.orig
sudo rm -rf /var/www/pages/feed/rel-24.09/ostree_repo
sudo cp -a /var/www/pages/feed/rel-24.09/ostree_repo.orig \
/var/www/pages/feed/rel-24.09/ostree_repo
When a subcloud add or deploy operation is done, the stale bind
mount is detected. The /var/www/pages/iso/24.09/ostree_repo is
unmounted, and the directory is removed.
When a subcloud add or deploy operation is done, the bind
mount is recreated.
Closes-Bug: 2066411
Change-Id: I25911722b1e333cd352f142664526d7dfa73e9e8
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Add a validation to the dcmanager subcloud delete command to
ensure that a subcloud with an active operation, such
as bootstrapping, configuring or installing, cannot
be deleted. Instead, dcmanager subcloud abort
should be executed.
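
The check can be sketched as below. The state set, the exception class and
the error wording are illustrative assumptions, not the actual dcmanager
code.

```python
# Hypothetical set of deploy states that count as an active operation.
ACTIVE_DEPLOY_STATES = {"bootstrapping", "configuring", "installing"}


class InvalidOperation(Exception):
    """Raised when a delete is requested during an active operation."""


def validate_subcloud_delete(name, deploy_status):
    """Reject the delete while an operation is still in progress."""
    if deploy_status in ACTIVE_DEPLOY_STATES:
        raise InvalidOperation(
            f"Cannot delete subcloud {name} while it is {deploy_status}; "
            "run 'dcmanager subcloud abort' first."
        )


# States with no active operation pass the check silently.
validate_subcloud_delete("subcloud1", "bootstrap-failed")
validate_subcloud_delete("subcloud1", "complete")
```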
Test plan:
1. Run the delete command for subclouds in bootstrapping
and configuring deploy status and verify that an error
is raised stating that the operation is invalid.
2. Run the delete command for subclouds in bootstrap-failed
and complete states and verify that the subcloud is deleted.
Closes-Bug: 2065595
Change-Id: Ic9dbe0ddce11cc211b64b0ba687b012594b88351
Signed-off-by: Raphael Lima <Raphael.Lima@windriver.com>
Implement cloud-init nocloud seed iso generation, required to
reconfigure and prepare factory-installed nodes for the
enrollment process. The seed ISO contains the networking and user data
and specifies the configuration applied upon booting the standalone system,
which is done as the initial step of the subcloud enrollment process.
Overall, this commit introduces the SubcloudEnrollment class with the
initial enrollment operations to:
- Build the data files required
to update the user and OAM network configuration
- Generate the seed iso with the data files
- Serve the iso for remote use
These will be utilized in subsequent changes where the
generated iso will be mounted to a factory installed node using RVMC.
Furthermore, these changes will be revisited when integrating with
the enrollment APIs to ensure that the payload is interpreted
correctly as iso_values from the provided
install + bootstrap values.
Test Plan:
1. PASS: Validate seed iso generation:
Invoke generate_seed_iso with payload
and ensure seed.iso is created at
/opt/platform/iso/<rel-version>/nodes/<subcloud>
Furthermore, ensure the iso is available and served
by lighttpd at:
/var/www/pages/iso/<rel-version>/nodes/<subcloud>
2. PASS: Validate iso regeneration:
ensure previous seed iso is
cleaned up and iso is regenerated.
3. PASS: Verify temp dirs are cleaned up
4. PASS: Validate contents of seed iso:
Mount seed iso and ensure both
network-config and cloud-config exist
and that they are correctly generated
based on the payload.
Story: 2011100
Task: 50053
Change-Id: Icef8258852746aef2d0a3f025a1fe85fff93980e
Signed-off-by: Salman Rana <salman.rana@windriver.com>
Updating software client to reflect changes in the USM's
new REST API release. Adjusting other files due to method
name and return value modifications.
Test Plan:
- Manage a subcloud and check if usm_sync_status is in-sync
- Use the software_client to perform the following commands:
- list, show, delete, commit, deploy precheck
Story: 2010676
Task: 50015
Change-Id: Ifa15b50a3d163c981a72adbaeffd462102f7c42d
Signed-off-by: Hugo Brito <hugo.brito@windriver.com>
Integrate dcorch master clients with the optimized OpenStackDriver.
Update the methods of creating subcloud keystone, dcdbsync and sysinv
clients.
Add subcloud management ip parameter to RPC calls between dcorch-engine
master and workers services to construct client endpoints.
Test Plan:
PASS: Change the admin password on the system controller using
the command "openstack --os-region-name SystemController user
password set". Verify that the admin password is synchronized
to the subcloud and the dcorch receives the corresponding sync
request, followed by successful execution of sync resources for
the subcloud.
PASS: Unmanage and then manage a subcloud, and verify that the initial
sync is executed successfully for that subcloud.
PASS: Verify successful dcorch audits every 5 minutes.
Story: 2011106
Task: 50113
Change-Id: Idfa493068dc7d2bac21aac2871238b9f0de12c9d
Signed-off-by: lzhu1 <li.zhu@windriver.com>
This commit adds a new management_ip field to the dcorch subcloud
table. This field will be used to build the subcloud service endpoints
after the OptimizedOpenStackDriver [1] is integrated into dcorch.
The DB upgrade script and related upgrade tests will be done in a
separate commit.
Test Plan:
1. PASS - Run the dcorch database migration script to update it to
version 009, verify that the management_ip column is added
to the subcloud table.
2. PASS - Add a new subcloud and verify in dcorch DB that a new
subcloud item was added with the correct management_ip field.
3. PASS - Run a subcloud update with network reconfiguration, changing
the management_ip, and verify in dcorch DB that the subcloud
item was updated correctly.
[1]: https://review.opendev.org/c/starlingx/distcloud/+/918311
Story: 2011106
Task: 50105
Change-Id: If1c299700fd769dc8f89172c5088fe7de66d0774
Signed-off-by: Gustavo Herzmann <gustavo.herzmann@windriver.com>
dcmanager-orchestrator calls the k8s python client to perform a
number of operations. The k8s python client creates temp files under
/tmp and continues to use these temp files for the life-cycle of the
processes.
However, systemd-tmpfiles-clean.service runs every day to clean up
files in the /tmp dir that are older than 10 days. If the k8s client code
is not triggered for more than 10 days (thus its temp files are not
accessed for more than 10 days), these temp files will be removed as
part of the cleanup. Certain dcmanager-orchestrator operations then
start to fail with an error that the temp file is no longer there.
This is a known issue of kubernetes python client:
https://github.com/kubernetes-client/python/issues/765
The commit fixes this issue by setting TMPDIR to
/var/run/dcmanager_orchestrator_tmp when sm starts dcmanager-orchestrator.
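
The fix relies on Python's tempfile module honouring the TMPDIR environment
variable. A minimal sketch of that behaviour follows; the helper
temp_dir_for is hypothetical, and a throwaway directory stands in for the
real /var/run path so the example runs without root.

```python
import os
import tempfile


def temp_dir_for(tmpdir_value):
    """Return the directory tempfile uses once TMPDIR is set."""
    os.environ["TMPDIR"] = tmpdir_value
    # tempfile caches its choice; clear it so TMPDIR is read again.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Stand-in for the service-private directory.
private = tempfile.mkdtemp(prefix="orchestrator_tmp_")
original = os.environ.get("TMPDIR")
try:
    chosen = temp_dir_for(private)
    with tempfile.NamedTemporaryFile() as handle:
        created_in = os.path.dirname(handle.name)
finally:
    # Restore the previous environment and clean up the demo directory.
    if original is None:
        os.environ.pop("TMPDIR", None)
    else:
        os.environ["TMPDIR"] = original
    tempfile.tempdir = None
    os.rmdir(private)
```

In the real service the variable is set before the process starts, so no
cache reset is needed; it appears here only because the demo changes TMPDIR
at runtime.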
The following similar commits were added for the sysinv and dcmanager
services in the past:
https://review.opendev.org/c/starlingx/config/+/736761
https://review.opendev.org/c/starlingx/distcloud/+/736247
Closes-bug: 2066048
Change-Id: I3d39f5b034e3ef2e6ad9636e86f26f0e93f16d45
Signed-off-by: amantri <ayyappa.mantri@windriver.com>
This commit modifies the dcmanager audit service to use the new
OptimizedOpenStackDriver [1].
It also fixes a typo in the OptimizedOpenStackDriver where
get_cached_region_clients_for_thread was referencing the
OpenStackDriver class.
Test Plan:
1. PASS - Add a new subcloud and verify that it becomes online and that
its sync_status becomes in-sync;
2. PASS - Verify through the logs that the fetch_subcloud_mgmt_ips
function is being called to populate the endpoint cache;
3. PASS - Remove the subcloud endpoints from the keystone database,
restart the audit service and verify that the subcloud is
still audited correctly;
4. PASS - Leave the system running for 12h and check that new tokens
are obtained whenever they are close to expiring (~1h).
[1]: https://review.opendev.org/c/starlingx/distcloud/+/918311
Story: 2011106
Task: 50111
Change-Id: Ia24a72a77a60d36cee5a31482fe71a341d2e7d83
Signed-off-by: Gustavo Herzmann <gustavo.herzmann@windriver.com>
1. Refactor dcorch's generic_sync_manager.py and initial_sync_manager
into a main process manager and a worker manager. The main manager
will handle the allocation of eligible subclouds to each worker.
2. Rename the current EngineService to EngineWorkerService and introduce
a new EngineService for the main process, similar to
DCManagerAuditService and DCManagerAuditWorkerService.
3. Rename the current RPC EngineClient to EngineWorkerClient and
introduce a new EngineClient. Adapt the RPC methods to accommodate
the modifications in these main process managers and worker managers.
4. Move master resources data retrieval from each sync_thread to engine
workers.
5. Implement 2 new db APIs for subcloud batch sync and state updates.
6. Remove code related to sync_lock and its associated db table schema.
7. Add ocf script for managing the start and stop of the dcorch
engine-worker service, and make changes in packaging accordingly.
8. Bug fixes for the issues related to the usage of
base64.urlsafe_b64encode and base64.urlsafe_b64decode in python3.
9. Update unit tests for the main process and worker managers.
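
Regarding item 8: the message does not say what the base64 bugs were. A
common Python 3 pitfall is that base64.urlsafe_b64encode only accepts
bytes and returns bytes, so str values need explicit encoding and decoding.
The helpers below are a hypothetical sketch of that pattern, not the dcorch
code.

```python
import base64


def encode_key(text):
    """Encode a str into a URL-safe base64 str."""
    raw = text.encode("utf-8")  # urlsafe_b64encode requires bytes
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_key(token):
    """Decode a URL-safe base64 str back into the original str."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


original = "subcloud/region?name=one"
token = encode_key(original)
restored = decode_key(token)

# Passing a str straight to the encoder fails on Python 3.
try:
    base64.urlsafe_b64encode(original)
    str_accepted = True
except TypeError:
    str_accepted = False
```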
Test Plan:
PASS: Verify that the dcorch audit runs properly every 5 minutes.
PASS: Verify that the initial sync runs properly every 10 seconds.
PASS: Verify that the sync subclouds operation runs properly every 5
seconds.
PASS: Successfully start and stop the dcorch-engine and
dcorch-engine-worker services using the sm commands.
PASS: Change the admin password on the system controller using
the command "openstack --os-region-name SystemController user
password set". Verify that the admin password is synchronized
to the subcloud and the dcorch receives the corresponding sync
request, followed by successful execution of sync resources for
the subcloud.
PASS: Unmanage and then manage a subcloud, and verify that the initial
sync is executed successfully for that subcloud.
PASS: Verify the removal of the sync_lock table from the dcorch db.
Story: 2011106
Task: 50013
Change-Id: I329847bd1107ec43e67ec59bdd1e3111b7b37cd3
Signed-off-by: lzhu1 <li.zhu@windriver.com>
This commit introduces the enroll command API.
Test Plan:
PASS: Deploy a system controller and run subcloud add
enroll in CLI without bootstrap-values. Verify that
the API returns an error.
PASS: Deploy a system controller and run subcloud add
enroll passing all required parameters in CLI. Verify
in dcmanager log that the API returned a success code.
Story: 2011100
Task: 50005
Change-Id: I525d26166dbb7d7afcb26b96191b5045eee7b52d
Signed-off-by: Gustavo Pereira <gustavo.lyrapereira@windriver.com>
This commit implements an optimized OpenStackDriver that builds the
endpoints for subclouds directly using their management IPs instead of
retrieving them from the keystone database. Subcloud endpoints will be
removed from Keystone due to performance reasons in a future commit.
- The driver now accepts a fetch_subcloud_ips function as an argument.
- This function retrieves a dictionary of subcloud region names to
their management IPs (without a region argument) or a specific
subcloud's management IP (with a region argument).
- Dcmanager services and dcorch should implement their own
fetch_subcloud_ips function to provide the driver with subcloud
IP information.
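
The contract of the fetch function can be sketched as follows. The
in-memory table, the build_endpoint helper, the port and the path are
hypothetical placeholders; the real services would read the IPs from their
own databases and use their own endpoint layout.

```python
import ipaddress

# Hypothetical data; documentation-range addresses.
_MGMT_IPS = {
    "subcloud1": "192.0.2.10",
    "subcloud2": "2001:db8::20",
}


def fetch_subcloud_ips(region_name=None):
    """Return all region -> IP mappings, or one IP if a region is given."""
    if region_name is None:
        return dict(_MGMT_IPS)
    return _MGMT_IPS[region_name]


def build_endpoint(management_ip, port, path=""):
    """Build an endpoint URL directly from a management IP."""
    ip = ipaddress.ip_address(management_ip)
    # IPv6 literals must be bracketed inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"https://{host}:{port}{path}"


all_ips = fetch_subcloud_ips()
one_ip = fetch_subcloud_ips("subcloud1")
endpoints = {
    region: build_endpoint(ip, 5001, "/v3")  # port and path are assumed
    for region, ip in all_ips.items()
}
```

Because the endpoints are derived from the IPs, no Keystone lookup is
needed for subclouds, which is the point of the optimized driver.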
This approach improves performance and prepares for the removal of
subcloud endpoints from Keystone.
NOTE: The original OpenStackDriver, KeystoneClient and EndpointCache
will be removed in a future commit, after the DC services are updated
to use the new optimized OpenStackDriver. The optimized one will be
integrated with the DC services in separate commits.
Test Plan:
Remove the subcloud endpoints from the keystone DB, modify the
dcmanager-audit service to use the new classes and then run the
following tests:
1. PASS - Verify that audit is able to get both the RegionOne and
subclouds endpoints without issues using the new driver.
2. PASS - Verify that the hourly token refresh only triggers the
refresh of central region token and endpoints.
3. PASS - Verify that when adding a new subcloud, the endpoint cache
is updated to include the endpoints for the new subcloud.
Story: 2011106
Task: 50035
Change-Id: I146592eb17f6a5433eae25f20e8de2f01c813055
Signed-off-by: Gustavo Herzmann <gustavo.herzmann@windriver.com>
This commit includes new unit tests for system_peer_manager.py,
covering new test cases in sync subclouds, delete, update
operations.
Test plan:
1) PASS: Run tox py39, pylint and pep8 envs and
verify that they are all passing.
2) PASS: Check 'tox -e cover' command output.
Coverage increased from 69% to 92%
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/915055
Story: 2007082
Task: 49713
Change-Id: I0b5a7d024f7b3a5c5ef4adb4aa29dc9d0e7f9de4
Signed-off-by: Swapna Gorre <swapna.gorre@windriver.com>
This commit adds a new parameter (patch) to the patch orchestration,
allowing the upload and application of a specific patch file to a subcloud.
This change is essential for enabling the new USM feature on subclouds
running an older version.
Test Plan:
PASS: Patch orchestration using the --patch parameter fails when
the subcloud and systemcontroller are on the same version.
PASS: Perform patch orchestration using --patch parameter
- The patch should be uploaded, applied and installed to the subcloud
PASS: Perform patch orchestration using --patch and --upload-only
- The patch should be uploaded to the subcloud
Obs.:
1. Tests were performed without the patch being applied to the
systemcontroller
2. Tests were performed with subcloud in-sync and out-of-sync
Story: 2010676
Task: 50012
Change-Id: I7eb2940c708668b17ff93977b5622c3cff4cb3da
Signed-off-by: Hugo Brito <hugo.brito@windriver.com>
This commit updates default password occurrences in
distcloud files to comply with the new password rules:
- Minimum 12 characters
- At least 1 Uppercase letter
- At least 1 number
- At least 1 special character
- Cannot reuse past 5 passwords
- Default password expiry period should be set to 90 days.
The default passwords are updated as follows:
St8rlingX* -> St8rlingXCloud*
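
A minimal sketch that checks the static rules listed above against the old
and new default passwords. The helper is hypothetical, treats
string.punctuation as the set of special characters (an assumption), and
does not cover the reuse or expiry rules, which need password history.

```python
import string


def password_violations(password):
    """Return the list of static rules the password breaks."""
    problems = []
    if len(password) < 12:
        problems.append("shorter than 12 characters")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("no number")
    # Assumption: any ASCII punctuation counts as a special character.
    if not any(c in string.punctuation for c in password):
        problems.append("no special character")
    return problems


old_default = password_violations("St8rlingX*")
new_default = password_violations("St8rlingXCloud*")
```

The old default only fails the length rule, which is why appending "Cloud"
is enough to make it compliant.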
Test Plan:
PASS: Run build-pkgs -c -p distributedcloud
Story: 2011084
Task: 49824
Change-Id: I8c954ae023493048fb98d64b2df8df97a00ae1b7
Signed-off-by: Karla Felix <karla.karolinenogueirafelix@windriver.com>
This commit includes new unit tests for subcloud_manager.py,
covering new test cases in deploy, backup, migrate, install,
rename operations.
Test plan:
1) PASS: Run tox py39, pylint and pep8 envs and
verify that they are all passing.
2) PASS: Check 'tox -e cover' command output.
Coverage increased from 79% to 90%
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/914075
Story: 2007082
Task: 49618
Change-Id: I1219ce7c5f6cebc0d1cb564905eb5cc5b4045540
Signed-off-by: Swapna Gorre <swapna.gorre@windriver.com>
Update the proposed action displayed by "dcmanager subcloud error"
command when a subcloud is in bootstrap-failed state.
Instead of suggesting the deletion and reinstall of the subcloud,
it should indicate the usage of "dcmanager subcloud deploy resume"
after the cause of the failure has been resolved.
Test plan:
1. PASS: deploy a subcloud with the wrong password in
bootstrap-values file and verify that the error message
displayed in "dcmanager subcloud error <subcloud>"
informs the new proposed action.
Closes-Bug: 2065189
Change-Id: Ie41b38c5b527424bdd64ca5af1ed59c91bf03e70
Signed-off-by: Raphael Lima <Raphael.Lima@windriver.com>
Created system_peer_manager object.
Leveraged mock methods from base.py and
moved duplicate mock specifications to
TestSystemPeerManager
Test plan:
1) PASS: Run tox py39, pylint and pep8 envs and
verify that they are all passing.
2) PASS: Check 'tox -e cover' command output.
Story: 2007082
Task: 49713
Change-Id: I4a45dba61308e2e108f315423d314fd94c99aac1
Signed-off-by: Swapna Gorre <swapna.gorre@windriver.com>
Add usm to the list of users whose credentials need to be
replicated in the subcloud. For any software commands to work
inside the subcloud after it is moved to the 'managed' state, the 'usm'
user credentials need to be replicated in the subcloud.
Test Plan:
PASS: Install DC subcloud, ensure it is in managed state,
and execute software commands (Eg. software list)
Closes-Bug: 2063460
Change-Id: I4af841dcc51dc7fea2a6a12a37728cb9e0f8b59c
Signed-off-by: Joseph Vazhappilly <joseph.vazhappillypaily@windriver.com>
The DC OCF scripts were not updated during the switch to Debian
in StarlingX 8.0. As a result, orphan processes could be left behind
after a service restart or controller swact. The orphan processes
consume resources and perform duplicate/obsolete tasks (e.g.
auditing the same subclouds as the corresponding worker processes)
until their work queues are empty.
This commit fixes up the pgrep option to restore the functionality
of the confirm_stop function of the OCF script. Processes that
fail to be terminated will get killed.
Test Plan:
- Deploy a small DC system. Verify that all DC services can
be started, stopped and restarted by SM.
- Deploy a large DC system with many subclouds. Reduce the
thread_pool_size of dcmanager-audit-worker. Let the system
soak for a couple of hours. Restart the service in the
middle of the audit cycle. Verify that the dcmanager-audit-worker
service was successfully restarted and there are no orphan
processes.
Closes-Bug: 2064368
Change-Id: Ie5cbc89cde374e32d4e0a3799a9f8833c071d206
Signed-off-by: Tee Ngo <tee.ngo@windriver.com>
The sysinv API call for certificate installation with type
openldap_ca will extract ca data included in the certificate bundle
and include it in the 'system-local-ca' ca.crt field.
Modified dcmanager to perform the call using this structure, passing
a bundle with TLS cert + CA cert + TLS key from the 'system-local-ca'
in the SystemController.
This code is called during DX subcloud upgrade and is used to keep
the current 'system-local-ca' on the subcloud consistent with the
one in the SystemController.
Test plan:
PASS: In a DC w/ DX subcloud in stx 9:
- Perform cert-manager migration.
- Upgrade the SystemController.
- Verify system-local-ca secret content in the
SystemController and the subcloud.
- Start orchestrated upgrade for the DX subcloud.
- Verify dcmanager/state.log. After the step
"Stage: 2, State: transferring CA certificate"
verify the system-local-ca secret content in the subcloud.
The secret should have been replaced to match the one in
the SystemController.
- While c1 is upgrading, verify OpenLDAP by creating user
in SystemController with 'ldapusersetup' and log into it
in the subcloud.
Story: 2009811
Task: 49044
Change-Id: I42e6308f066126f903738f4e3c319c6027c8cb0b
Signed-off-by: Marcelo Loebens <Marcelo.DeCastroLoebens@windriver.com>
StarlingX stopped supporting CentOS builds after release 7.0.
This update strips CentOS from our code base. It also removes
references to the failed OpenSUSE feature.
Story: 2011110
Task: 49949
Change-Id: If8c5d8d04e0a5ae766239912886f93332614fa4e
Signed-off-by: Scott Little <scott.little@windriver.com>
Improves unit test coverage for dcmanager's subclouds API
from 55% to 90%.
Test plan:
All of the tests were created taking into account the
output of 'tox -c tox.ini -e cover' command
Story: 2007082
Task: 49725
Change-Id: If31005a8f3420e94dd17f15bdac97af5253e8d5a
Signed-off-by: Raphael Lima <Raphael.Lima@windriver.com>
Refactor the subcloud_get_all_with_status function to
query only the endpoint_type and sync_status from the
subcloud_status table for improved efficiency.
Test Plan:
PASS: Get all subclouds with the right sync_status
PASS: Create a dcmanager strategy
- Fail if sync_status = Unknown
Story: 2011106
Task: 49893
Change-Id: Ie99abc6cb820800a632f1fd90ee7d7e0869a8312
Signed-off-by: Hugo Brito <hugo.brito@windriver.com>
This commit includes new unit tests for subcloud_manager.py,
covering new test cases in deploy, add, delete, update, compose,
backup and restore, redeploy, backup, prestage and migrate
operations.
Test plan:
1) PASS: Run tox py39, pylint and pep8 envs and
verify that they are all passing.
2) PASS: Check 'tox -e cover' command output.
Coverage increased from 70% to 79%
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/914074
Story: 2007082
Task: 49618
Change-Id: Ibfd30fc616c5c756ad73f3a33432411d7d189812
Signed-off-by: Swapna Gorre <swapna.gorre@windriver.com>
This commit updates the peer group association sync status to
'out-of-sync' after the user updates the
peer-controller-gateway-address attribute of the system-peer object.
This commit also modifies the subcloud update function to update the
subcloud route whenever the systemcontroller_gateway_address is
updated on the primary side and synced to the secondary.
It also adds an informative message to remind the caller to run the
sync command after updating the peer-controller-gateway-address.
Test Plan:
1. PASS: Do the following steps:
- Create a system peer with an incorrect systemcontroller
gateway address that's inside the management subnet, but
outside the reserved IP range and then create an association.
Verify that the secondary subcloud and a route were created
using the incorrect IP.
- Update the system peer with the correct systemcontroller
gateway address on the primary site. Verify that the PGA
sync status is set to 'out-of-sync' on both sites.
- Sync the PGA and verify that the secondary subcloud
systemcontroller gateway address was updated and that the
old route was deleted and a new one using the new address
was created.
- Migrate the SPG to the non-primary site and verify that
it completes successfully and that the subcloud becomes
online and managed.
2. PASS: Repeat the first step of test case #1, but use an incorrect
address that's outside the management subnet. Then create
a PGA and verify that it fails due to the following
validation:
"systemcontroller_gateway_address invalid: Address must be in
subnet <management subnet>"
3. PASS: Repeat the first step of test case #1, but use an incorrect
address that's inside the reserved IP range. Then create
a PGA and verify that it fails due to the following
validation:
"systemcontroller_gateway_address invalid, is within
management pool <ip range>"
4. PASS: Create a system peer with a correct systemcontroller gateway
address for the first time and then create an association.
Verify that the secondary subcloud and a route were created
using the correct IP.
5. PASS: Update an attribute of the subcloud (e.g. the subcloud
description) on the primary site and verify that the sync
status changes to 'out-of-sync' on both sites, then run
the PGA sync operation and verify that the attribute was
synced to the secondary subcloud on the peer site.
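
The two address validations exercised in test cases 2 and 3 can be sketched
with the standard ipaddress module. The function name, its signature and
the example addresses are hypothetical; only the error wording follows the
messages quoted in the test plan.

```python
import ipaddress


def validate_gateway(address, management_subnet, pool_start, pool_end):
    """Check a gateway address against the subnet and reserved pool."""
    ip = ipaddress.ip_address(address)
    subnet = ipaddress.ip_network(management_subnet)
    if ip not in subnet:
        raise ValueError(
            "systemcontroller_gateway_address invalid: "
            f"Address must be in subnet {subnet}"
        )
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    if start <= ip <= end:
        raise ValueError(
            "systemcontroller_gateway_address invalid, "
            f"is within management pool {start}-{end}"
        )


def check(address):
    """Return the validation error text, or None if the address is valid."""
    try:
        validate_gateway(address, "192.168.101.0/24",
                         "192.168.101.2", "192.168.101.50")
    except ValueError as exc:
        return str(exc)
    return None


valid = check("192.168.101.1")         # in subnet, outside the pool
outside_subnet = check("10.0.0.1")     # test case 2
inside_pool = check("192.168.101.10")  # test case 3
```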
Closes-Bug: 2062372
Change-Id: Ibffe6c86656a56a85d10deca54c161bbed7f0d17
Signed-off-by: Gustavo Herzmann <gustavo.herzmann@windriver.com>