Long Latency Between System Controller and Subclouds (r9, dsr8MR3)

Added a new section for rehoming subcloud with expired certificates

Story: 2010815
Task: 49748

Change-Id: Icb523fc50ada181d44caab46dcd7e9b30e0bc32c
Signed-off-by: Ngairangbam Mili <ngairangbam.mili@windriver.com>
Ngairangbam Mili 2024-03-21 15:36:07 +00:00
parent ecef035a3a
commit 4d5177f95f
3 changed files with 208 additions and 0 deletions


@ -0,0 +1,2 @@
.. licenseexpirationalarm-begin
.. licenseexpirationalarm-end


@ -58,6 +58,7 @@ Operation
delete-subcloud-backup-data-using-dcmanager-cli-9cabe48bc4fd
restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e
rehoming-a-subcloud
rehoming-subcloud-with-expired-certificates-00549c4ea6e2
rename-subcloud-e303565e7192
prestage-a-subcloud-using-dcmanager-df756866163f
add-a-horizon-keystone-user-to-distributed-cloud-29655b0f0eb9


@ -0,0 +1,205 @@
.. _rehoming-subcloud-with-expired-certificates-00549c4ea6e2:

===========================================
Rehoming Subcloud with Expired Certificates
===========================================

The rehoming procedure for a subcloud that has been powered off for a long
period of time differs from the regular rehoming procedure. Depending on how
long the subcloud has been offline, its platform certificates may have expired
and require regeneration.

If the certificates are recoverable, the rehoming playbook automatically
recovers most of them. Some certificates, however, require manual
intervention. In that case the playbook fails, and
:command:`dcmanager subcloud errors <subcloud-name>` indicates the actions that
need to be taken.

.. rubric:: |proc|

#. Power on controller-0 of the subcloud.

   .. note::
      Ensure that you can ping the |OAM| floating IP from the new system
      controller before proceeding.

#. SSH to the subcloud as sysadmin. If the password has expired, you are
   prompted to update the sysadmin password.

#. Proceed with rehoming.

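The reachability check in the note above can be scripted before starting the
rehoming. The following is a minimal sketch, not part of the documented
procedure: the helper name ``oam_reachable`` and the address ``10.10.10.2`` are
hypothetical placeholders, and the ``-c`` and ``-W`` options assume the Linux
``ping`` utility.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given address answers ICMP echo.
# Assumes Linux iputils ping (-c = packet count, -W = reply timeout in seconds).
oam_reachable() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Example (run on the new system controller; the address is a placeholder):
#   oam_reachable 10.10.10.2 && ssh sysadmin@10.10.10.2
```

If the check fails, resolve the network path to the subcloud before attempting
the rehoming.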
-----------------
Multi-node system
-----------------
In DX and standard subclouds (subclouds with two controllers), the subcloud may
be in a continuous swact/reboot cycle, which leaves the rehoming procedure
without a stable controller to target.

- Ensure that you power off controller-1 before attempting the rehoming
  procedure. Otherwise, the playbook fails with the error ``Certificate
  recovery in progress. Please power-off controller-1 and try again``.

- The rehoming playbook runs and recovers the active controller of the
  subcloud, after which it displays ``Running certificate recovery on other
  nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow
  the logs.`` This means that another Ansible process is running on the
  subcloud; review the log for more details.

- At the ``Running certificate recovery on other nodes`` step, controller-1
  should be powered on automatically. If it is not, a message is written to
  ``/root/ansible.log`` asking you to power it on manually.

The following error indicates that controller-1 must be powered off before the
certificates of the subcloud's active controller can be recovered:
.. code-block::
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
FAILED rehoming playbook of (subcloud3).
detail: fatal: [subcloud3]: FAILED! => changed=false
msg: Certificate recovery in progress. Please power-off controller-1 and try again.
FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running] Thursday 15 March 2035 00:01:03 +0000 (0:00:00.439) 0:00:08.467

If you get this error, power off controller-1 and run the rehoming procedure
again.
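
One way to confirm that a failed rehoming hit this specific condition is to
search the error text for the message shown above. This is a minimal sketch
that assumes the output matches the example; ``needs_controller1_poweroff`` is a
hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of "dcmanager subcloud errors <name>"
# on stdin and succeeds if it contains the controller-1 power-off message.
needs_controller1_poweroff() {
    grep -q 'Please power-off controller-1 and try again'
}

# Example, using the subcloud name from the output above:
#   dcmanager subcloud errors subcloud3 | needs_controller1_poweroff \
#       && echo "Power off controller-1, then retry the rehoming."
```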
-----------------------------
Manually Managed Certificates
-----------------------------
Manually managed certificates are those that the user installs with the
:command:`system certificate-install` command. Examples include the StarlingX
REST API & Horizon Server certificate and the Local Registry Server certificate.

These certificates cannot be recovered automatically, so the rehoming procedure
fails and asks for manual intervention:
.. code-block::
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
FAILED rehoming playbook of (subcloud3).
detail: fatal: [subcloud3]: FAILED! => changed=false
msg: |-
Rest API and Docker Registry certificates are expired. Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Wednesday 14 March 2035 22:52:22 +0000 (0:00:00.026) 0:03:12.115 *******
skipping: [subcloud3]

If you get this error, generate new certificates to replace the expired ones
named in the message, install them on the subcloud with
:command:`system certificate-install`, and then restart the procedure as the
error message describes.

.. note::
   This is not required if the certificates are already managed by
   cert-manager.
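
Before installing a replacement certificate, it can be worth confirming that
the new file is itself still valid. The sketch below relies on the ``-checkend``
option of ``openssl x509``; the helper name ``cert_not_expired`` and the file
name ``new-cert.pem`` are hypothetical, and the check assumes ``openssl`` is
available where the certificate file resides.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the PEM certificate in $1 has not expired.
# "openssl x509 -checkend N" exits non-zero when the certificate expires
# within the next N seconds; N=0 tests whether it has expired already.
cert_not_expired() {
    openssl x509 -noout -checkend 0 -in "$1" >/dev/null 2>&1
}

# Example (new-cert.pem is a placeholder file name):
#   cert_not_expired new-cert.pem || echo "new-cert.pem has already expired"
```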
--------------------------------------------------
Cert-manager Certificates using a Custom CA Issuer
--------------------------------------------------
If you are using a Cert-manager Issuer other than ``system-local-ca`` for platform
certificates, you will get the following error:
.. code-block::
[sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
FAILED rehoming playbook of (subcloud1).
detail: fatal: [subcloud1]: FAILED! => changed=false
msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
again."
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042) 0:02:42.799 ********
skipping: [subcloud1]
FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] Saturday 03 March 2035 18:56:00 +0000 (0:00:00.042)
0:02:42.799

In this case, you must manually update the secret of the underlying Issuer.
For example, the error above mentions
``deployment/cloudplatform-rootca-secret``, where ``deployment`` is the
Kubernetes namespace and ``cloudplatform-rootca-secret`` is the secret name.

To update the |CA| certificate in this secret, use the following commands:
.. code-block::
kubectl -n deployment delete secret cloudplatform-rootca-secret
kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
rm ca.crt ca.key

``ca.crt`` and ``ca.key`` must be in PEM format. You can obtain them from your
security personnel or the team responsible for certificate management.
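
Because the first command deletes the existing secret, it can be prudent to
save a copy beforehand. The following is a minimal sketch that wraps the
commands above, assuming the namespace and secret name from the example error;
the function name ``replace_ca_secret`` and the backup file name are
hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the commands above: save the current secret,
# then recreate it from the new CA certificate and key (both PEM files).
replace_ca_secret() {
    ns=$1; name=$2; crt=$3; key=$4
    # Keep a copy of the existing secret; stop if it cannot be read.
    kubectl -n "$ns" get secret "$name" -o yaml > "$name.backup.yaml" || return 1
    kubectl -n "$ns" delete secret "$name" || return 1
    kubectl -n "$ns" create secret tls "$name" --key="$key" --cert="$crt"
}

# Example:
#   replace_ca_secret deployment cloudplatform-rootca-secret ./ca.crt ./ca.key
```

The backup file contains the previous private key, so protect it and remove it
once the rehoming has succeeded.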
---------------------------
Management Affecting Alarms
---------------------------
Once the certificate recovery process is complete, the subcloud should be free
of management affecting alarms. Any remaining management affecting alarm causes
the rehoming procedure to fail. The subcloud may still be recoverable; the
alarms indicate the condition and provide information on the next step.
.. code-block::
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
FAILED rehoming playbook of (subcloud3).
detail: fatal: [subcloud3]: FAILED! => changed=false
msg: The subcloud has management affecting alarms which are blocking the rehoming
procedure from continuing. The subcloud may still be recoverable, connect to it and
run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
condition(s) then try again.
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020) 0:42:53.295 *******
skipping: [subcloud3]
FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
after use in compute nodes] Wednesday 14 March 2035 23:45:44 +0000 (0:00:00.020)
0:42:53.295
In this case, review the active alarms and take the necessary actions to resolve them.
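
To decide whether the rehoming can be retried, the alarm table can be reduced
to a simple count. This is a rough sketch that assumes
``fm alarm-list --mgmt_affecting`` prints the usual bordered table in which each
alarm row starts with a numeric alarm ID; ``count_alarm_rows`` is a hypothetical
helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "fm alarm-list" table output on stdin and prints
# the number of alarm rows, i.e. rows whose first column is an alarm ID such
# as 500.210. Header, border, and wrapped continuation rows are not counted.
count_alarm_rows() {
    # grep -c prints 0 but exits non-zero when nothing matches; mask that.
    grep -cE '^\| *[0-9]+\.[0-9]+ ' || true
}

# Example (run on the subcloud):
#   [ "$(fm alarm-list --mgmt_affecting | count_alarm_rows)" -eq 0 ] \
#       && echo "No management affecting alarms; rehoming can be retried."
```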
.. only:: partner
.. include:: /_includes/rehoming-subcloud-with-expired-certificates.rest
:start-after: licenseexpirationalarm-begin
:end-before: licenseexpirationalarm-end
-------------------
SSL CA Certificates
-------------------
SSL CA certificates are not automatically recovered as part of the rehoming procedure.
After a successful rehoming, an alarm will be raised by the system to let users
know about the expiration of SSL CA certificates:
.. code-block::
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| 500.210 | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired. | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
| | | 9062a088-8c71-46c6-b194-6a65908f1080 | | .917781 |
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+

The alarm indicates that the certificate has expired. For more information
about the certificate, run ``sudo show-certs.sh``. There are two possible
resolutions:

- The certificate is no longer needed. Uninstall it:

  .. code-block::

     system certificate-list | grep ssl_ca
     system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

- The certificate is still needed. Uninstall the expired certificate:

  .. code-block::

     system certificate-list | grep ssl_ca
     system certificate-uninstall -m ssl_ca <expired_certificate_uuid>

  Then obtain and install the new version of the required certificate:

  .. code-block::

     system certificate-install -m ssl_ca <new_ssl_ca>
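
When several SSL CA certificates have expired, their UUIDs can be collected
from the alarm text instead of being copied by hand. The sketch below assumes
the reason text has the form shown in the alarm example above;
``expired_ssl_ca_uuids`` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads "fm alarm-list" output on stdin and prints the
# UUID of each certificate reported as expired with mode=ssl_ca.
expired_ssl_ca_uuids() {
    grep -F '(mode=ssl_ca) expired' |
        grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        sort -u
}

# Example: list the UUIDs, then decide per certificate whether to uninstall it.
#   fm alarm-list | expired_ssl_ca_uuids
```

Each UUID printed can then be passed to
:command:`system certificate-uninstall -m ssl_ca` as shown above.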