From 4d5177f95fb909e67f85f2876ef11ce82d67be80 Mon Sep 17 00:00:00 2001
From: Ngairangbam Mili
Date: Thu, 21 Mar 2024 15:36:07 +0000
Subject: [PATCH] Long Latency Between System Controller and Subclouds (r9, dsr8MR3)

Added a new section for rehoming subcloud with expired certificates

Story: 2010815
Task: 49748

Change-Id: Icb523fc50ada181d44caab46dcd7e9b30e0bc32c
Signed-off-by: Ngairangbam Mili
---
 ...ng-subcloud-with-expired-certificates.rest |   2 +
 .../index-dist-cloud-kub-95bef233eef0.rst     |   1 +
 ...with-expired-certificates-00549c4ea6e2.rst | 205 ++++++++++++++++++
 3 files changed, 208 insertions(+)
 create mode 100644 doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
 create mode 100644 doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst

diff --git a/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest b/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
new file mode 100644
index 000000000..674e49bdb
--- /dev/null
+++ b/doc/source/_includes/rehoming-subcloud-with-expired-certificates.rest
@@ -0,0 +1,2 @@
+.. licenseexpirationalarm-begin
+.. licenseexpirationalarm-end
diff --git a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
index a6fce71a6..f119a3a86 100644
--- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
+++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
@@ -58,6 +58,7 @@ Operation
    delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
    restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
    rehoming-a-subcloud
+   rehoming-subcloud-with-expired-certificates-00549c4ea6e2
    rename-subcloud-e303565e7192
    prestage-a-subcloud-using-dcmanager-df756866163f
    add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9
diff --git a/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst b/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
new file mode 100644
index 000000000..41f371ac9
--- /dev/null
+++ b/doc/source/dist_cloud/kubernetes/rehoming-subcloud-with-expired-certificates-00549c4ea6e2.rst
@@ -0,0 +1,205 @@
+.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:
+
+===========================================
+Rehoming Subcloud with Expired Certificates
+===========================================
+
+The rehoming procedure for a subcloud that has been powered off for a long period of
+time differs from the regular rehoming procedure. Depending on how long the
+subcloud has been offline, the platform certificates may have expired and require regeneration.
+
+If the certificates are recoverable, the rehoming playbook will automatically
+recover most of them. However, some certificates will require manual
+intervention. The playbook will fail and :command:`dcmanager subcloud errors subcloud`
+will indicate the actions that need to be taken.
+
+.. rubric:: |proc|
+
+#. Power on controller-0 of the subcloud.
+
+   .. note::
+
+      Ensure that you can ping the |OAM| floating IP from the new system controller
+      before proceeding.
+
+#. SSH to the subcloud as sysadmin. If the password has expired, you will
+   be prompted to update the sysadmin password.
+
+#. Proceed with rehoming.
+
+-----------------
+Multi-node System
+-----------------
+
+In DX and standard subclouds (subclouds with two controllers), the subcloud may be
+in a continuous swact/reboot cycle that leaves the rehoming procedure without a
+stable controller to target.
+
+- Ensure that you power off controller-1 before attempting the rehoming procedure.
+  Otherwise, the playbook will fail with the error ``Certificate
+  recovery in progress. Please power-off controller-1 and try again``.
+
+- The rehoming playbook will run and recover the active controller of the
+  subcloud, after which it will display ``Running certificate recovery on other
+  nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow
+  the logs.``. This means that another Ansible process is running in the
+  subcloud and you can review the log for more details.
+
+- At the ``Running certificate recovery on other nodes`` step, controller-1
+  should be powered on automatically. If not, a message will be written to
+  ``/root/ansible.log`` asking for manual intervention to power it on.
+
+  The following error indicates that controller-1 must be powered off first so that
+  the certificates of the subcloud's active controller can be recovered:
+
+  .. code-block::
+
+     [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+     FAILED rehoming playbook of (subcloud3).
+     detail: fatal: [subcloud3]: FAILED! => changed=false
+       msg: Certificate recovery in progress. Please power-off controller-1 and try again.
+     FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running] Thursday 15 March 2035 00:01:03 +0000 (0:00:00.439) 0:00:08.467
+
+  If you get this error, power off controller-1 and try again.
+
+-----------------------------
+Manually Managed Certificates
+-----------------------------
+
+Manual certificates are those that are manually installed by the user using the
+:command:`system certificate-install` command. Examples include the StarlingX
+REST API & Horizon Server certificate and Local Registry Server certificate.
+It is not possible to automatically recover manual certificates.
+
+As automatic recovery is not possible, the rehoming procedure will fail and ask
+for manual intervention:
+
+.. code-block::
+
+   [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+   FAILED rehoming playbook of (subcloud3).
+   detail: fatal: [subcloud3]: FAILED! => changed=false
+     msg: |-
+       Rest API and Docker Registry certificates are expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
+   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+   Wednesday 14 March 2035 22:52:22 +0000 (0:00:00.026) 0:03:12.115 *******
+   skipping: [subcloud3]
+
+If you get this error, generate replacements for the expired certificates,
+install them with :command:`system certificate-install`, and try again.
+
+.. note::
+
+   This will not be required if the certificates are already managed by cert-manager.
+
+--------------------------------------------------
+Cert-manager Certificates using a Custom CA Issuer
+--------------------------------------------------
+
+If you are using a Cert-manager Issuer other than ``system-local-ca`` for platform
+certificates, you will get the following error:
+
+.. code-block::
+
+   [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
+   FAILED rehoming playbook of (subcloud1).
+   detail: fatal: [subcloud1]: FAILED! => changed=false
+     msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
+       deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
+       again."
+   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+   Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799 ********
+   skipping: [subcloud1]
+   FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042)
+   0:02:42.799
+
+In this case, you must manually update the underlying Issuer's secret.
+
+As an example, the above error mentions ``deployment/cloudplatform-rootca-secret``,
+where ``deployment`` is the K8s namespace and ``cloudplatform-rootca-secret`` is the secret name.
+To update the |CA| certificate in this secret, use the following commands:
+
+.. code-block::
+
+   kubectl -n deployment delete secret cloudplatform-rootca-secret
+   kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
+   rm ca.crt ca.key
+
+``ca.crt`` and ``ca.key`` are in PEM format. They can be obtained from your
+security personnel or the team responsible for certificate management.
+
+---------------------------
+Management Affecting Alarms
+---------------------------
+
+Once the certificate recovery process is completed, the subcloud should be free of
+management affecting alarms. Any that remain will cause the rehoming
+procedure to fail. The subcloud may still be recoverable and the alarms should
+indicate the condition and provide information on the next step.
+
+.. code-block::
+
+   [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
+   FAILED rehoming playbook of (subcloud3).
+   detail: fatal: [subcloud3]: FAILED! => changed=false
+     msg: The subcloud has management affecting alarms which are blocking the rehoming
+       procedure from continuing. The subcloud may still be recoverable, connect to it and
+       run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
+       condition(s) then try again.
+   TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
+   Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295 *******
+   skipping: [subcloud3]
+   FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
+   after use in compute nodes] Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020)
+   0:42:53.295
+
+In this case, review the active alarms and take the necessary actions to resolve them.
+
+.. only:: partner
+
+   .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
+      :start-after: licenseexpirationalarm-begin
+      :end-before: licenseexpirationalarm-end
+
+-------------------
+SSL CA Certificates
+-------------------
+
+SSL CA certificates are not automatically recovered as part of the rehoming procedure.
+
+After a successful rehoming, the system raises an alarm to let users
+know about the expiration of SSL CA certificates:
+
+.. code-block::
+
+   [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+   | Alarm ID | Reason Text                                                                                              | Entity ID                            | Severity | Time Stamp          |
+   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+   | 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired.        | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
+   |          |                                                                                                          | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
+   +----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
+
+The alarm indicates that the certificate has expired. For more information
+about the certificate, run ``sudo show-certs.sh``. The following are the two
+possible resolutions:
+
+- The certificate is no longer needed
+
+  .. code-block::
+
+     system certificate-list | grep ssl_ca
+     system certificate-uninstall -m ssl_ca
+
+- The certificate is needed
+
+  .. code-block::
+
+     system certificate-list | grep ssl_ca
+     system certificate-uninstall -m ssl_ca
+
+  Obtain and install the new version of the required certificate:
+
+  .. code-block::
+
+     system certificate-install -m ssl_ca
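
Review note on the SSL CA section: the UUID of the expired certificate appears in the Reason Text of alarm 500.210. The following is a minimal Python sketch of how that UUID could be pulled out of captured ``fm alarm-list`` output. It is not part of StarlingX. The function name ``expired_ssl_ca_uuids``, the ``SAMPLE`` string, and the regular expression are illustrative assumptions based only on the alarm text shown in the page, so the pattern may need adjusting if the Reason Text differs on a real system.

```python
from __future__ import annotations

import re

# Assumed pattern: matches the Reason Text format shown in the sample alarm,
# "Certificate 'system certificate-show <uuid>' (mode=ssl_ca) expired."
_EXPIRED_SSL_CA = re.compile(
    r"certificate-show\s+([0-9a-fA-F-]{36})'\s+\(mode=ssl_ca\)\s+expired"
)


def expired_ssl_ca_uuids(alarm_text: str) -> list[str]:
    """Return UUIDs of expired ssl_ca certificates, in order, without duplicates."""
    uuids: list[str] = []
    for uuid in _EXPIRED_SSL_CA.findall(alarm_text):
        if uuid not in uuids:
            uuids.append(uuid)
    return uuids


# Hypothetical input: one row copied from the alarm table in the page.
SAMPLE = (
    "| 500.210 | Certificate 'system certificate-show "
    "9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired. |"
)

if __name__ == "__main__":
    for uuid in expired_ssl_ca_uuids(SAMPLE):
        print(uuid)
```

The sketch only parses text. Which certificate to uninstall or reinstall remains a manual decision, as described in the two resolutions in the page.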