From 7f2bde6ef34657a66c1efdc013c2e4a90dc88813 Mon Sep 17 00:00:00 2001
From: Ngairangbam Mili
Date: Tue, 23 Jan 2024 05:56:01 +0000
Subject: [PATCH] Types of Errors reported on Subclouds introduced in
 Distributed Cloud System Controller GEO Redundancy - Phase1 (r9, dsr8MR3)

Story: 2010852
Task: 49449

Change-Id: Icda4df1440c2237917b5dd241bd8561b1c415aba
Signed-off-by: Ngairangbam Mili
---
 .../index-dist-cloud-kub-95bef233eef0.rst     |   1 +
 ...t-cause-correction-action-43449d658aae.rst | 137 ++++++++++++++++++
 2 files changed, 138 insertions(+)
 create mode 100644 doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst

diff --git a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
index b44a68f1e..ec7a1301c 100644
--- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
+++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
@@ -184,3 +184,4 @@ Appendix
 
    distributed-cloud-ports-reference
    certificate-management-for-admin-rest-api-endpoints
+   subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae

diff --git a/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst b/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst
new file mode 100644
index 000000000..c02a1a390
--- /dev/null
+++ b/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst
@@ -0,0 +1,137 @@
+.. _subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae:
+
+==============================================================
+Subcloud GEO Redundancy Error Root Cause and Correction Action
+==============================================================
+
+This section describes error scenarios that can occur while using the GEO
+Redundancy feature. The scenarios described here assume two distributed
+clouds, site A and site B, with the GEO Redundancy feature activated and
+site A designated as the primary site and site B as the non-primary site.
+The GEO Redundancy feature allows subclouds to be migrated to the
+non-primary site when the primary site becomes unavailable, and to be
+migrated back to the primary site when it becomes available again.
+
+The error scenarios are divided into the following categories:
+
+.. contents::
+   :local:
+   :depth: 1
+
+----------------------
+Protection group setup
+----------------------
+
+This section covers errors detected during setup of the protection group.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - Site A goes down temporarily in the middle of association.
+     - Upon site A recovery, the peer group association automatically
+       changes its sync status to ``failed``.
+
+       The administrator can trigger a re-sync from the ``primary`` site
+       if ``sync_status`` is either ``failed`` or ``out-of-sync``.
+
+       Possible values of ``sync_status`` are ``syncing``, ``in-sync``,
+       ``out-of-sync``, ``failed``, and ``unknown``.
+
+       Possible values of ``association_type`` are ``primary`` and
+       ``non-primary``.
+   * - Site A goes down in the middle of synchronization and remains
+       offline for an extended period of time.
+
+       How does the user check the sync status from site B to initiate
+       the migration?
+     - The administrator can check the peer group association sync status
+       on the non-primary site to decide the next step. If the sync
+       status is ``in-sync``, migration can be initiated.
+   * - After the initial sync is completed, site B goes down.
+
+       How does site A sync to site B after site B comes back online?
+     - Site A keeps track of subcloud peer group updates while site B is
+       down. The peer group association sync status on site A changes to
+       ``unknown`` as soon as site B becomes unavailable. Upon the
+       recovery of site B, the sync status becomes ``in-sync`` on both
+       sites again.
+
+       If changes are made to the peer group while site B is offline, the
+       sync status on site A changes to ``failed``. Upon the recovery of
+       site B, the sync status on site A changes to ``out-of-sync``. The
+       administrator then needs to re-initiate the sync on site A using
+       the :command:`dcmanager peer-group-association sync` command.
+   * - Site B is offline while creating a peer group association to
+       associate a peer and a |SPG|.
+     - Creation of the association is accepted, but ``sync_status`` is
+       set to ``failed`` and the protection group cannot be created.
+
+       The administrator can re-sync the association after site B is
+       online using the :command:`dcmanager peer-group-association sync`
+       command.
+   * - A swact occurs on site A while a peer group association is
+       syncing.
+     - The expected behavior is similar to that of an abrupt shutdown of
+       site A during sync. A re-sync needs to be done.
+   * - A swact occurs on site B while a peer group association is
+       syncing.
+     - The expected behavior is similar to that of an abrupt shutdown of
+       site B during sync. A re-sync needs to be done.
+   * - In the event of either site going down or a swact occurring:
+
+       a) How to track secondary subclouds already added to site B and
+          subclouds not yet added to site B as secondary subclouds?
+       b) How to track subclouds newly added to the peer group and new
+          subclouds not yet added to the peer group?
+     - a) Use the :command:`dcmanager peer-group-association show`
+          command to view the sync status on the available site. If the
+          status is ``in-sync``, all the subclouds have been added;
+          otherwise, synchronization has not finished and needs to be
+          re-initiated on the primary site when both sites are online.
+       b) Run the :command:`dcmanager subcloud-peer-group list-subclouds`
+          command on site B to check the total number of secondary
+          subclouds and the subcloud details.
+
+---------
+Migration
+---------
+
+Assumption: subclouds will be migrated to site B if site A goes down.
+
+The following error scenarios can occur during peer group migration.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - What will be the status of the |SPG| if some subclouds failed to
+       migrate?
+     - After the migration, use the
+       :command:`dcmanager subcloud-peer-group list-subclouds` command to
+       check the status of the subclouds under this |SPG|, and the
+       :command:`dcmanager subcloud-peer-group status` command to check
+       the |SPG| status.
+
+       Re-run the
+       :command:`dcmanager subcloud-peer-group migrate PEER_GROUP`
+       command after fixing the failure.
+   * - How to recover when a subcloud rehome fails because of an
+       incorrect bootstrap address or bootstrap values and site A cannot
+       recover within a reasonable time period?
+     - When site A goes down, migrate the |SPG| to site B. The subcloud
+       goes to the ``rehome-failed`` deploy status when it has a wrong
+       bootstrap address or bootstrap values. If the subcloud migration
+       fails while the primary site is down, update the bootstrap address
+       and bootstrap values using the
+       :command:`dcmanager subcloud update --bootstrap-address` and
+       :command:`dcmanager subcloud update --bootstrap-values` commands.
+       You do not need to remove the rehome-failed subcloud from the
+       |SPG|.
+   * - How to fix a subcloud that has an incorrect bootstrap address or
+       bootstrap values in the following situations during the |SPG|
+       migration to site B?
+
+       - Site A is recovered during migration.
+       - Site A is recovered post migration.
+       - Site A is online before the migration process.
+     - Check the |SPG| migration status using the
+       :command:`dcmanager subcloud-peer-group status` command to confirm
+       whether it has a subcloud in ``rehoming`` status. If there is no
+       subcloud in ``rehoming`` status, the |SPG| migration has completed
+       and you need to migrate the |SPG| back to site A. You can update
+       the subcloud after the migration failure and try again. To recover
+       the subcloud, follow the instructions below:
+
+       - When site A is recovered during migration, you can update the
+         subcloud on site A. After the update, wait for the |SPG|
+         migration process to finish. You can then migrate the |SPG| back
+         to site A to recover the subcloud.
+       - When site A is recovered post migration, you can migrate the
+         |SPG| back to site A. If the subcloud rehome fails again on
+         site A, you can update the subcloud.
+       - When site A is online before the migration process, you can
+         update the subcloud on site A and sync the updated subcloud to
+         site B.
+
+       Use the :command:`dcmanager subcloud update --bootstrap-address`
+       and :command:`dcmanager subcloud update --bootstrap-values`
+       commands to update the subcloud. You do not need to remove the
+       rehome-failed subcloud from the |SPG|.
+   * - Site B goes down during the |SPG| migration.
+     - Re-execute the |SPG| migration if any subcloud is in the
+       ``rehome-failed`` deploy status after site B is back online.
+
+--------------
+Post migration
+--------------
+
+Audit operations are triggered when the network is restored or when the
+retrieved peer group ``migration_status`` changes to ``complete``.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - Site B goes down after the |SPG| has been migrated to site B.
+     - Upon site A recovery, the administrator can trigger the migration
+       of the |SPG| back to site A.
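
The sync-status transitions described in the protection group setup table
(site B down → ``unknown``; changes while down → ``failed``; recovery →
``out-of-sync`` or ``in-sync``; administrator re-sync → ``in-sync``) can be
sketched as a minimal state machine. This is an illustrative model only, not
dcmanager source code; the class and method names below are assumptions made
for this sketch.

```python
# Illustrative sketch of the peer group association sync-status
# lifecycle documented above. Not the dcmanager implementation;
# all class/method names here are hypothetical.

SYNC_STATUSES = {"syncing", "in-sync", "out-of-sync", "failed", "unknown"}


class PeerGroupAssociation:
    """Models the sync status as seen from the primary site (site A)."""

    def __init__(self):
        self.sync_status = "in-sync"
        self._pending_changes = False  # peer group updated while peer down

    def peer_site_down(self):
        # Site B becomes unavailable: status changes to "unknown".
        self.sync_status = "unknown"

    def peer_group_updated(self):
        # A peer group change while site B is offline marks the
        # association "failed".
        if self.sync_status == "unknown":
            self.sync_status = "failed"
            self._pending_changes = True

    def peer_site_recovered(self):
        # On site B recovery: "out-of-sync" if changes were made while it
        # was down, otherwise back to "in-sync" on both sites.
        self.sync_status = "out-of-sync" if self._pending_changes else "in-sync"

    def sync(self):
        # Administrator-triggered re-sync, allowed from the primary site
        # only when the status is "failed" or "out-of-sync".
        if self.sync_status not in ("failed", "out-of-sync"):
            raise RuntimeError(f"re-sync not applicable in {self.sync_status}")
        self.sync_status = "in-sync"
        self._pending_changes = False
```

A walk-through of the "site B goes down, peer group changes, site B recovers"
row: the status moves ``unknown`` → ``failed`` → ``out-of-sync``, and the
re-sync restores ``in-sync``.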