Types of Errors reported on Subclouds introduced in Distributed Cloud System Controller GEO Redundancy - Phase1 (r9, dsr8MR3)

Story: 2010852
Task: 49449

Change-Id: Icda4df1440c2237917b5dd241bd8561b1c415aba
Signed-off-by: Ngairangbam Mili <ngairangbam.mili@windriver.com>
Ngairangbam Mili 2024-01-23 05:56:01 +00:00
parent a34b9d46c8
commit 7f2bde6ef3
2 changed files with 138 additions and 0 deletions


@@ -184,3 +184,4 @@ Appendix
distributed-cloud-ports-reference
certificate-management-for-admin-rest-api-endpoints
subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae


@@ -0,0 +1,137 @@
.. _subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae:

================================================================
Subcloud GEO Redundancy Error Root Cause and Correction Action
================================================================

This section describes different error scenarios that can occur while using
the GEO Redundancy feature. The error scenarios described here are based on
the assumption that you are dealing with two distributed clouds, site A and
site B. In this context, the GEO Redundancy feature is activated by
designating site A as the primary site and site B as the non-primary site.
The GEO Redundancy feature allows migration of subclouds to the non-primary
site when the primary site becomes unavailable, and also allows migrating
them back to the primary site when it becomes available again.

The error scenarios are divided into the following categories:

.. contents::
   :local:
   :depth: 1

------------------------
Protection group setup
------------------------

This section covers the errors detected during setup of the protection group.

.. table::
:widths: auto
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| Site A goes down temporarily in the middle of association. | Upon site A recovery, the peer group association will automatically change its sync status to ``failed``. |
| | |
| | The administrator can trigger re-sync from the ``primary`` site if ``sync_status`` is either ``failed`` or ``out-of-sync``. |
| | |
| | Possible values of ``sync_status`` include ``syncing``, ``in_sync``, ``out-of-sync``, ``failed``, and ``unknown``. |
| | |
| | Possible values of ``association_type`` include ``primary``, ``non-primary``. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site A is down in the middle of synchronization and remains offline | The administrator can check the peer group association sync status in the non-primary site to decide the next step. If the sync status is ``in-sync``, |
| for an extended period of time. | migration can be initiated. |
| | |
| How does the user check the syncing status from site B to initiate | |
| the migration? | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| After initial sync is completed, site B goes down. | |
| How does site A sync to site B after site B comes back online? | Site A needs to keep track of subcloud group updates when site B is down. The sync status will go into unknown status in site A. |
| | |
| | The peer group association sync status in site A will change to ``unknown`` as soon as site B becomes unavailable. Upon the recovery of site B, the sync status |
| | will become ``in-sync`` on both sites again. |
| | |
| | If changes are made to the peer group while site B is offline, the sync status in site A will change to ``failed``. Upon the recovery of site B, |
| | the sync status in site A will change to ``out-of-sync``. The administrator will need to re-initiate the sync in site A using |
| | the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` command. |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site B is offline while creating peer group association to associate| Creation of association will be accepted but ``sync_status`` will be ``failed``. Protection group cannot be created. |
| peer and a |SPG|. | |
| | The administrator can re-sync the association after site B is online using the :command:`dcmanager peer-group-association sync <SiteA-Peer-Group-Association-ID>` |
| | command. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Swact occurs in site A while a peer group association is syncing. | Expected behavior should be similar to that of site A abrupt shutdown during sync. |
| | Re-sync needs to be done. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Swact occurs in site B while a peer group association is syncing. | Expected behavior should be similar to that of site B abrupt shutdown during sync. |
| | Re-sync needs to be done. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| In the event of either site going down or swact occurring: | a) Use the :command:`dcmanager peer-group-association show <association-id>` command to view the sync status in available site.. |
| | If the status is ``in-sync``, all the subclouds are added, otherwise synchronization has not finished and it needs to be re-initiated in the primary site when |
| | both sites are online. |
| a) How to track secondary subclouds added to site B | b) Run the :command:`dcmanager subcloud-peer-group list-subclouds <peer-group>` command on site B to check |
| and subclouds yet to be added to site B as secondary subcloud? | total number of secondary subclouds and the subcloud details. |
| b) How to track newly added subclouds to peer group and yet to be | |
| added new subclouds to peer group? | |
| | |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
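As a brief illustration of the recovery steps listed above, the following is
a minimal sketch that checks the association sync status, re-initiates the
sync from the primary site, and lists the secondary subclouds on site B. The
association ID ``1`` and the |SPG| name ``spg-1`` are placeholders for your
own values:

.. code-block:: none

   # On the primary site (site A): check the peer group association sync status
   dcmanager peer-group-association show 1

   # If sync_status is "failed" or "out-of-sync", re-initiate the sync
   dcmanager peer-group-association sync 1

   # On the non-primary site (site B): list the secondary subclouds
   # synchronized for the peer group
   dcmanager subcloud-peer-group list-subclouds spg-1
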

------------
Migration
------------

Assumption: Subclouds will be migrated to site B if site A goes down.

The following are the error scenarios that can occur during peer group
migration.

.. table::
:widths: auto
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| What will be the status of the |SPG| if some subclouds failed | After the migration, you can use :command:`dcmanager subcloud-peer-group list-subclouds` to check the subclouds status under this |SPG| and you can check the |
| to migrate? | |SPG| status using :command:`dcmanager subcloud-peer-group status`. |
| | |
| | Re-run the :command:`dcmanager subcloud-peer-group migrate PEER_GROUP` command after fixing the failure. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| How to recover when the subcloud rehome fails because of | When site A goes down, migrate |SPG| to site B. The subcloud will go to the ``rehome-failed`` deploy status when it has the wrong bootstrap address or bootstrap |
| incorrect bootstrap address or bootstrap values and site A cannot | values. You can update the bootstrap address and bootstrap values if the subcloud migration fails and the primary site is down using the |
| recover in a time period? | :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands. You do not need to remove |
| | the rehome failed subcloud from the |SPG|. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| How to fix when the subcloud has incorrect bootstrap address | Check the |SPG| migration status using the command :command:`dcmanager subcloud-peer-group status` command to confirm if it has a subcloud in ``rehoming`` status.|
| or bootstrap values in the following situations of the |SPG| | If there is no subcloud in ``rehoming`` status, it means the |SPG| migration was completed and you need to migrate the |SPG| back to site A. You can update the |
| migration of site B? | subcloud after the migration failure and try again. If you want to recover the subcloud, follow the instructions below: |
| | |
| - Site A is recovered during migration. | - When site A is recovered during migration, you can update the subcloud on site A. After the update, you need to wait for the |SPG| migration process to finish. |
| | You can then migrate |SPG| back to site A to recover the subcloud. |
| - Site A is recovered post migration. | - When site A is recovered post migration, you can migrate the |SPG| back to site A. If the subcloud rehome fails again in site A, you can update the subcloud. |
| | |
| - Site A is online before the migration process. | - When site A is online before the migration process, you can update the subcloud on site A and sync the updated subcloud to site B. |
| | |
| | Use the :command:`dcmanager subcloud update --bootstrap-address` and :command:`dcmanager subcloud update --bootstrap-values` commands to update the subcloud. |
| | You do not need to remove the rehome failed subcloud from the |SPG|. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Site B goes down during |SPG| migration. | Re-execute the |SPG| migration if there is any subcloud with ``rehome-failed`` deploy status after site B is online. |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
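The following is a minimal sketch of the recovery flow for a failed subcloud
rehome, based on the commands referenced in the table above. The |SPG| name
``spg-1``, the subcloud name ``subcloud1``, the bootstrap address
``10.10.10.2``, and the file ``subcloud1-bootstrap-values.yaml`` are
placeholders for your own values:

.. code-block:: none

   # Check the peer group status to confirm that no subcloud is still rehoming
   dcmanager subcloud-peer-group status spg-1

   # Correct the bootstrap address and bootstrap values of the failed subcloud
   dcmanager subcloud update subcloud1 \
     --bootstrap-address 10.10.10.2 \
     --bootstrap-values subcloud1-bootstrap-values.yaml

   # Re-run the migration after fixing the failure
   dcmanager subcloud-peer-group migrate spg-1
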

----------------
Post migration
----------------

Audit operations are triggered when the network is restored, or when the
retrieved ``migration_status`` of the peer group changes to ``complete``.

.. table::
:widths: auto
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Error scenarios | Recovery mechanism |
+=====================================================================+====================================================================================================================================================================+
| | |
| Site B goes down after the |SPG| has been migrated to its site. | Upon site A recovery, the administrator can trigger the migration of the |SPG| back to site A. |
| | |
+---------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
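For example, once site A recovers, the |SPG| can be migrated back to the
primary site. The following is a minimal sketch, where the |SPG| name
``spg-1`` is a placeholder for your own value:

.. code-block:: none

   # Trigger migration of the peer group back to the primary site (site A)
   dcmanager subcloud-peer-group migrate spg-1

   # Verify the subclouds after the migration completes
   dcmanager subcloud-peer-group list-subclouds spg-1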