.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:

===========================================
Rehoming Subcloud with Expired Certificates
===========================================

The rehoming procedure for a subcloud that has been powered off for a long
period of time differs from the regular rehoming procedure. Depending on how
long the subcloud has been offline, the platform certificates may have expired
and require regeneration.

If the certificates are recoverable, the rehoming playbook automatically
recovers most of them. However, some certificates require manual intervention.
In that case the playbook fails, and
:command:`dcmanager subcloud errors <subcloud-name>` indicates the actions
that need to be taken.

.. rubric:: |proc|

#.  Power on controller-0 of the subcloud.

    .. note::

        Ensure that you can ping the |OAM| floating IP from the new system
        controller before proceeding.

#.  SSH to the subcloud as sysadmin. If the password has expired, you are
    prompted to update the sysadmin password.

#.  Proceed with rehoming.

-----------------
Multi-node system
-----------------

In DX and standard subclouds (subclouds with two controllers), the subcloud
may be in a continuous swact/reboot cycle, which leaves the rehoming procedure
without a stable controller to target.

-   Ensure that you power off controller-1 before attempting the rehoming
    procedure. Otherwise, the playbook fails with the error
    ``Certificate recovery in progress. Please power-off controller-1 and try again``.

-   The rehoming playbook runs and recovers the active controller of the
    subcloud, after which it displays
    ``Running certificate recovery on other nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow the logs.``.
    This means that another Ansible process is running in the subcloud, and
    you can review the log for more details.

-   At the ``Running certificate recovery on other nodes`` step, controller-1
    should be powered on automatically. If not, a message is written to
    ``/root/ansible.log`` asking for manual intervention to power it on.

The following error indicates that controller-1 must be powered off first so
that certificate recovery can run on the active controller of the subcloud:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: Certificate recovery in progress. Please power-off controller-1 and try again.
    FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running]
    Thursday 15 March 2035 00:01:03 +0000 (0:00:00.439) 0:00:08.467

If you get this error, power off controller-1 and try again.

-----------------------------
Manually Managed Certificates
-----------------------------

Manually managed certificates are those that the user installs with the
:command:`system certificate-install` command. Examples include the StarlingX
REST API & Horizon Server certificate and the Local Registry Server
certificate.

These certificates cannot be recovered automatically, so the rehoming
procedure fails and asks for manual intervention:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: |-
        Rest API and Docker Registry certificates are expired. Manual action required!
        On the subcloud, please update the expired certificates with
        `system certificate-install` and then run "dcmanager subcloud delete" and
        "dcmanager subcloud add" again to restart the procedure.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035 22:52:22 +0000 (0:00:00.026) 0:03:12.115 *******
    skipping: [subcloud3]

If you get this error, generate new certificates to replace the expired ones,
install them with :command:`system certificate-install`, and try again.

.. note::

    This will not be required if the certificates are already managed by
    cert-manager.

--------------------------------------------------
Cert-manager Certificates using a Custom CA Issuer
--------------------------------------------------

If you are using a Cert-manager Issuer other than ``system-local-ca`` for
platform certificates, you will get the following error:

.. code-block::

    [sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud error subcloud1
    FAILED rehoming playbook of (subcloud1).
    detail: fatal: [subcloud1]: FAILED! => changed=false
      msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s) deployment/cloudplatform-rootca-secret on the subcloud, manually update and try again."
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799 ********
    skipping: [subcloud1]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes]
    Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799

In this case, you must manually update the secret of the underlying Issuer.
As an example, the above error mentions
``deployment/cloudplatform-rootca-secret``, where ``deployment`` is the K8s
namespace and ``cloudplatform-rootca-secret`` is the secret name. To update
the |CA| certificate in this secret, use the following commands:

.. code-block::

    kubectl -n deployment delete secret cloudplatform-rootca-secret
    kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
    rm ca.crt ca.key

``ca.crt`` and ``ca.key`` are in PEM format. They can be obtained from the
security personnel or the team responsible for certificate management.

---------------------------
Management Affecting Alarms
---------------------------

Once the certificate recovery process is completed, the subcloud should be
free of management affecting alarms. Any remaining management affecting alarms
cause the rehoming procedure to fail. The subcloud may still be recoverable;
the alarms indicate the condition and provide information on the next step.

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
    detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: The subcloud has management affecting alarms which are blocking the rehoming procedure from continuing. The subcloud may still be recoverable, connect to it and run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm condition(s) then try again.
    TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
    Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295 *******
    skipping: [subcloud3]
    FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes]
    Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295

In this case, review the active alarms and take the necessary actions to
resolve them.

.. only:: partner

    .. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
        :start-after: licenseexpirationalarm-begin
        :end-before: licenseexpirationalarm-end

-------------------
SSL CA Certificates
-------------------

SSL CA certificates are not automatically recovered as part of the rehoming
procedure.
After a successful rehoming, an alarm will be raised by the system to let
users know about the expiration of SSL CA certificates:

.. code-block::

    [sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
    +----------+---------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | Alarm ID | Reason Text                                                                                       | Entity ID                            | Severity | Time Stamp          |
    +----------+---------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
    | 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired. | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
    |          |                                                                                                   | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
    +----------+---------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+

The alarm indicates that the certificate has expired. For more information
about the certificate, run ``sudo show-certs.sh``.

The following are the two possible resolutions:

-   The certificate is no longer needed

    .. code-block::

        system certificate-list | grep ssl_ca
        system certificate-uninstall -m ssl_ca

-   The certificate is needed

    .. code-block::

        system certificate-list | grep ssl_ca
        system certificate-uninstall -m ssl_ca

    Obtain and install the new version of the required certificate:

    .. code-block::

        system certificate-install -m ssl_ca
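
Before installing a replacement certificate, it can help to confirm that the
new file is itself still within its validity period. The following is a
minimal sketch, assuming ``openssl`` is available on the host. The helper name
``cert_is_valid`` and the file name ``new-ssl-ca.pem`` are hypothetical and
are not part of the platform tooling.

.. code-block:: shell

    # Hypothetical helper (not a platform command): succeeds only if the PEM
    # certificate in $1 does not expire within the next $2 seconds (default 0).
    cert_is_valid() {
        cert_file="$1"
        window="${2:-0}"
        # openssl's -checkend N exits 0 when the certificate is still valid
        # N seconds from now, and non-zero otherwise (or if the file is
        # missing or unreadable).
        openssl x509 -in "$cert_file" -noout -checkend "$window" >/dev/null 2>&1
    }

    # Example: new-ssl-ca.pem is an assumed file name.
    if cert_is_valid new-ssl-ca.pem 86400; then
        echo "certificate is valid for at least one more day"
    else
        echo "certificate is missing, unreadable, or expires within one day"
    fi

The same check could be run against ``ca.crt`` before recreating an Issuer
secret, so that an already expired file is not installed by mistake.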