From 7f2bde6ef34657a66c1efdc013c2e4a90dc88813 Mon Sep 17 00:00:00 2001
From: Ngairangbam Mili
Date: Tue, 23 Jan 2024 05:56:01 +0000
Subject: [PATCH] Types of Errors reported on Subclouds introduced in
 Distributed Cloud System Controller GEO Redundancy - Phase1 (r9, dsr8MR3)

Story: 2010852
Task: 49449

Change-Id: Icda4df1440c2237917b5dd241bd8561b1c415aba
Signed-off-by: Ngairangbam Mili
---
 .../index-dist-cloud-kub-95bef233eef0.rst     |   1 +
 ...t-cause-correction-action-43449d658aae.rst | 137 ++++++++++++++++++
 2 files changed, 138 insertions(+)
 create mode 100644 doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst

diff --git a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
index b44a68f1e..ec7a1301c 100644
--- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
+++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst
@@ -184,3 +184,4 @@ Appendix
 
    distributed-cloud-ports-reference
    certificate-management-for-admin-rest-api-endpoints
+   subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae

diff --git a/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst b/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst
new file mode 100644
index 000000000..c02a1a390
--- /dev/null
+++ b/doc/source/dist_cloud/kubernetes/subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae.rst
@@ -0,0 +1,137 @@
+.. _subcloud-geo-redundancy-error-root-cause-correction-action-43449d658aae:
+
+==============================================================
+Subcloud GEO Redundancy Error Root Cause and Correction Action
+==============================================================
+
+This section describes error scenarios that can occur while using the GEO
+Redundancy feature. The scenarios described here assume two distributed
+clouds, site A and site B, with the GEO Redundancy feature activated and
+site A designated as the primary site and site B as the non-primary site.
+The GEO Redundancy feature allows subclouds to be migrated to the
+non-primary site when the primary site becomes unavailable, and to be
+migrated back to the primary site when it becomes available again.
+
+The error scenarios are divided into the following categories:
+
+.. contents::
+   :local:
+   :depth: 1
+
+----------------------
+Protection group setup
+----------------------
+
+This section covers errors detected during setup of the protection group.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - Site A goes down temporarily in the middle of association.
+     - Upon site A recovery, the peer group association automatically
+       changes its sync status to ``failed``.
+
+       The administrator can trigger a re-sync from the ``primary`` site
+       if ``sync_status`` is either ``failed`` or ``out-of-sync``.
+
+       Possible values of ``sync_status`` are ``syncing``, ``in-sync``,
+       ``out-of-sync``, ``failed``, and ``unknown``.
+
+       Possible values of ``association_type`` are ``primary`` and
+       ``non-primary``.
+   * - Site A goes down in the middle of synchronization and remains
+       offline for an extended period of time.
+
+       How does the user check the sync status from site B to initiate
+       the migration?
+     - The administrator can check the peer group association sync status
+       on the non-primary site to decide the next step. If the sync
+       status is ``in-sync``, migration can be initiated.
+   * - After the initial sync is completed, site B goes down.
+
+       How does site A sync to site B after site B comes back online?
+     - Site A keeps track of subcloud peer group updates while site B is
+       down. The peer group association sync status on site A changes to
+       ``unknown`` as soon as site B becomes unavailable. Upon the
+       recovery of site B, the sync status becomes ``in-sync`` on both
+       sites again.
+
+       If changes are made to the peer group while site B is offline, the
+       sync status on site A changes to ``failed``. Upon the recovery of
+       site B, the sync status on site A changes to ``out-of-sync``. The
+       administrator then needs to re-initiate the sync on site A using
+       the :command:`dcmanager peer-group-association sync` command.
+   * - Site B is offline while creating a peer group association to
+       associate a peer and a |SPG|.
+     - Creation of the association is accepted, but ``sync_status`` is
+       set to ``failed`` and the protection group cannot be created.
+
+       The administrator can re-sync the association after site B is
+       online using the :command:`dcmanager peer-group-association sync`
+       command.
+   * - A swact occurs on site A while a peer group association is
+       syncing.
+     - The expected behavior is similar to that of an abrupt shutdown of
+       site A during sync. A re-sync needs to be done.
+   * - A swact occurs on site B while a peer group association is
+       syncing.
+     - The expected behavior is similar to that of an abrupt shutdown of
+       site B during sync. A re-sync needs to be done.
+   * - In the event of either site going down or a swact occurring:
+
+       a) How to track secondary subclouds already added to site B and
+          subclouds not yet added to site B as secondary subclouds?
+       b) How to track subclouds newly added to the peer group and new
+          subclouds not yet added to the peer group?
+     - a) Use the :command:`dcmanager peer-group-association show`
+          command to view the sync status on the available site. If the
+          status is ``in-sync``, all the subclouds have been added;
+          otherwise, synchronization has not finished and needs to be
+          re-initiated on the primary site when both sites are online.
+       b) Run the :command:`dcmanager subcloud-peer-group list-subclouds`
+          command on site B to check the total number of secondary
+          subclouds and the subcloud details.
+
+---------
+Migration
+---------
+
+Assumption: subclouds will be migrated to site B if site A goes down.
+
+The following error scenarios can occur during peer group migration.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - What will be the status of the |SPG| if some subclouds failed to
+       migrate?
+     - After the migration, use the
+       :command:`dcmanager subcloud-peer-group list-subclouds` command to
+       check the status of the subclouds under this |SPG|, and the
+       :command:`dcmanager subcloud-peer-group status` command to check
+       the |SPG| status.
+
+       Re-run the
+       :command:`dcmanager subcloud-peer-group migrate PEER_GROUP`
+       command after fixing the failure.
+   * - How to recover when a subcloud rehome fails because of an
+       incorrect bootstrap address or bootstrap values and site A cannot
+       recover within a reasonable time period?
+     - When site A goes down, migrate the |SPG| to site B. The subcloud
+       goes to the ``rehome-failed`` deploy status when it has a wrong
+       bootstrap address or bootstrap values. If the subcloud migration
+       fails while the primary site is down, update the bootstrap address
+       and bootstrap values using the
+       :command:`dcmanager subcloud update --bootstrap-address` and
+       :command:`dcmanager subcloud update --bootstrap-values` commands.
+       You do not need to remove the rehome-failed subcloud from the
+       |SPG|.
+   * - How to fix a subcloud that has an incorrect bootstrap address or
+       bootstrap values in the following situations during the |SPG|
+       migration to site B?
+
+       - Site A is recovered during migration.
+       - Site A is recovered post migration.
+       - Site A is online before the migration process.
+     - Check the |SPG| migration status using the
+       :command:`dcmanager subcloud-peer-group status` command to confirm
+       whether it has a subcloud in ``rehoming`` status. If there is no
+       subcloud in ``rehoming`` status, the |SPG| migration has completed
+       and you need to migrate the |SPG| back to site A. You can update
+       the subcloud after the migration failure and try again. To recover
+       the subcloud, follow the instructions below:
+
+       - When site A is recovered during migration, you can update the
+         subcloud on site A. After the update, wait for the |SPG|
+         migration process to finish. You can then migrate the |SPG| back
+         to site A to recover the subcloud.
+       - When site A is recovered post migration, you can migrate the
+         |SPG| back to site A. If the subcloud rehome fails again on
+         site A, you can update the subcloud.
+       - When site A is online before the migration process, you can
+         update the subcloud on site A and sync the updated subcloud to
+         site B.
+
+       Use the :command:`dcmanager subcloud update --bootstrap-address`
+       and :command:`dcmanager subcloud update --bootstrap-values`
+       commands to update the subcloud. You do not need to remove the
+       rehome-failed subcloud from the |SPG|.
+   * - Site B goes down during the |SPG| migration.
+     - Re-execute the |SPG| migration if any subcloud is in the
+       ``rehome-failed`` deploy status after site B is back online.
+
+--------------
+Post migration
+--------------
+
+Audit operations are triggered when the network is restored or when the
+retrieved peer group ``migration_status`` changes to ``complete``.
+
+.. list-table::
+   :widths: 40 60
+   :header-rows: 1
+
+   * - Error scenario
+     - Recovery mechanism
+   * - Site B goes down after the |SPG| has been migrated to site B.
+     - Upon site A recovery, the administrator can trigger the migration
+       of the |SPG| back to site A.
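
The sync-status transitions described in the protection group setup table
(site B down → ``unknown``; changes while down → ``failed``; recovery →
``out-of-sync`` or ``in-sync``; administrator re-sync → ``in-sync``) can be
sketched as a minimal state machine. This is an illustrative model only, not
dcmanager source code; the class and method names below are assumptions made
for this sketch.

```python
# Illustrative sketch of the peer group association sync-status
# lifecycle documented above. Not the dcmanager implementation;
# all class/method names here are hypothetical.

SYNC_STATUSES = {"syncing", "in-sync", "out-of-sync", "failed", "unknown"}


class PeerGroupAssociation:
    """Models the sync status as seen from the primary site (site A)."""

    def __init__(self):
        self.sync_status = "in-sync"
        self._pending_changes = False  # peer group updated while peer down

    def peer_site_down(self):
        # Site B becomes unavailable: status changes to "unknown".
        self.sync_status = "unknown"

    def peer_group_updated(self):
        # A peer group change while site B is offline marks the
        # association "failed".
        if self.sync_status == "unknown":
            self.sync_status = "failed"
            self._pending_changes = True

    def peer_site_recovered(self):
        # On site B recovery: "out-of-sync" if changes were made while it
        # was down, otherwise back to "in-sync" on both sites.
        self.sync_status = "out-of-sync" if self._pending_changes else "in-sync"

    def sync(self):
        # Administrator-triggered re-sync, allowed from the primary site
        # only when the status is "failed" or "out-of-sync".
        if self.sync_status not in ("failed", "out-of-sync"):
            raise RuntimeError(f"re-sync not applicable in {self.sync_status}")
        self.sync_status = "in-sync"
        self._pending_changes = False
```

A walk-through of the "site B goes down, peer group changes, site B recovers"
row: the status moves ``unknown`` → ``failed`` → ``out-of-sync``, and the
re-sync restores ``in-sync``.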