diff --git a/doc/source/_vendor/vendor_strings.txt b/doc/source/_vendor/vendor_strings.txt index 74a94fe3f..575911ddb 100755 --- a/doc/source/_vendor/vendor_strings.txt +++ b/doc/source/_vendor/vendor_strings.txt @@ -23,6 +23,8 @@ .. |os-prod-hor| replace:: OpenStack |prod-hor| .. |prod-img| replace:: https://mirror.starlingx.windriver.com/mirror/starlingx/ .. |prod-abbr| replace:: StX +.. |prod-dc-geo-red| replace:: Distributed Cloud Geo Redundancy +.. |prod-dc-geo-red-long| replace:: Distributed Cloud System controller Geographic Redundancy .. Guide names; will be formatted in italics by default. .. |node-doc| replace:: :title:`StarlingX Node Configuration and Management` diff --git a/doc/source/dist_cloud/kubernetes/backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42.rst b/doc/source/dist_cloud/kubernetes/backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42.rst index 2abd63f0b..1dfa66495 100644 --- a/doc/source/dist_cloud/kubernetes/backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42.rst +++ b/doc/source/dist_cloud/kubernetes/backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42.rst @@ -16,6 +16,12 @@ system data backup file has been generated on the subcloud, it will be transferred to the system controller and stored at a dedicated central location ``/opt/dc-vault/backups//``. +.. note:: + + Enabling the GEO Redundancy function will affect some of the subcloud + backup functions. For more information on GEO Redundancy and its + restrictions, see :ref:`configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662`. + Backup data creation requires the subcloud to be online, managed, and in healthy state. diff --git a/doc/source/dist_cloud/kubernetes/configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662.rst b/doc/source/dist_cloud/kubernetes/configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662.rst new file mode 100644 index 000000000..01c6f94b1 --- /dev/null +++ b/doc/source/dist_cloud/kubernetes/configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662.rst @@ -0,0 +1,617 @@ +.. _configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662: + +============================================================ +Configure Distributed Cloud System Controller GEO Redundancy +============================================================ + +.. rubric:: |context| + +You can configure a distributed cloud System Controller GEO Redundancy +using DC manager |CLI| commands. + +System administrators can follow the procedures below to enable and +disable the GEO Redundancy feature. + +.. Note:: + + In this release, the GEO Redundancy feature supports only two + distributed clouds in one protection group. + +.. contents:: + :local: + :depth: 1 + +--------------------- +Enable GEO Redundancy +--------------------- + +Set up a protection group for two distributed clouds, making these two +distributed clouds operational in 1+1 active GEO Redundancy mode. + +For example, let us assume we have two distributed clouds, site A and site B. +When the operation is performed on site A, the local site is site A and the +peer site is site B. When the operation is performed on site B, the local +site is site B and the peer site is site A. + +.. rubric:: |prereq| + +The peer system controller's |OAM| network is accessible to each other and can +access the subclouds via both |OAM| and management networks. 
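
For example, a quick way to confirm this connectivity from site A is to ping the
peer site's |OAM| floating IP address and query its public Keystone endpoint. This
is only a suggested sanity check; the address and port below are the example site B
values used later in this procedure, so substitute the values for your own deployment.

.. code-block:: none

    # On site A: confirm that site B's OAM floating IP and public Keystone
    # endpoint are reachable. 10.10.10.2:5000 is the example site B endpoint
    # used in this procedure.
    ~(keystone_admin)]$ ping -c 3 10.10.10.2
    ~(keystone_admin)]$ curl -sS http://10.10.10.2:5000/v3/
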
+ +For security of production system, it is important to ensure the safety and +identification of peer site queries. To meet this objective, it is essential to +have an HTTPS-based system API in place. This necessitates the presence of a +well-known and trusted |CA| to enable secure HTTPS communication between peers. +If you are using an internally trusted |CA|, ensure that the system trusts the |CA| by installing +its certificate with the following command. + +.. code-block:: none + + ~(keystone_admin)]$ system certificate-install --mode ssl_ca + +where: + +```` + is the path to the intermediate or Root |CA| certificate associated + with the |prod| REST API's Intermediate or Root |CA|-signed certificate. + +.. rubric:: |proc| + +You can enable the GEO Redundancy feature between site A and site B from the +command line. In this procedure, the subclouds managed by site A will be +configured to be managed by GEO Redundancy protection group that consists of site +A and site B. When site A is offline for some reasons, an alarm notifies the +administrator, who initiates the group based batch migration +to rehome the subclouds of site A to site B for centralized management. + +Similarly, you can also configure the subclouds managed by site B to be +taken over by site A when site B is offline by following the same procedure where +site B is local site and site A is peer site. + +#. Log in to the active controller node of site B and get the required + information about the site B to create a protection group. + + * Unique |UUID| of the central cloud of the peer system controller + * URI of Keystone endpoint of peer system controller + * Gateway IP address of the management network of peer system controller + + For example: + + .. code-block:: bash + + # On site B + sysadmin@controller-0:~$ source /etc/platform/openrc + ~(keystone_admin)]$ system show | grep -i uuid + | uuid | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a | + + ~(keystone_admin)]$ openstack endpoint list --service keystone \ + --interface public --region RegionOne -c URL + +-----------------------------+ + | URL | + +-----------------------------+ + | http://10.10.10.2:5000 | + +-----------------------------+ + + ~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$" + gateway + 10.10.27.1 + +#. Log in to the active controller node of the central cloud of site A. Create + a System Peer instance of site B on site A so that site A can access information of + site B. + + .. code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager system-peer add \ + --peer-uuid 223fcb30-909d-4edf-8c36-1aebc8e9bd4a \ + --peer-name siteB \ + --manager-endpoint http://10.10.10.2:5000 \ + --peer-controller-gateway-address 10.10.27.1 + Enter the admin password for the system peer: + Re-enter admin password to confirm: + + +----+--------------------------------------+-----------+-----------------------------+----------------------------+ + | id | peer uuid | peer name | manager endpoint | controller gateway address | + +----+--------------------------------------+-----------+-----------------------------+----------------------------+ + | 2 | 223fcb30-909d-4edf-8c36-1aebc8e9bd4a | siteB | http://10.10.10.2:5000 | 10.10.27.1 | + +----+--------------------------------------+-----------+-----------------------------+----------------------------+ + +#. Collect the information from site A. + + .. 
code-block:: bash + + # On site A + sysadmin@controller-0:~$ source /etc/platform/openrc + ~(keystone_admin)]$ system show | grep -i uuid + ~(keystone_admin)]$ openstack endpoint list --service keystone --interface public --region RegionOne -c URL + ~(keystone_admin)]$ system host-route-list controller-0 | awk '{print $10}' | grep -v "^$" + +#. Log in to the active controller node of the central cloud of site B. Create + a System Peer instance of site A on site B so that site B has information about site A. + + .. code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager system-peer add \ + --peer-uuid 3963cb21-c01a-49cc-85dd-ebc1d142a41d \ + --peer-name siteA \ + --manager-endpoint http://10.10.11.2:5000 \ + --peer-controller-gateway-address 10.10.25.1 + Enter the admin password for the system peer: + Re-enter admin password to confirm: + +#. Create a |SPG| for site A. + + .. code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager subcloud-peer-group add --peer-group-name group1 + +#. Add the subclouds needed for redundancy protection on site A. + + Ensure that the subclouds bootstrap data is updated. The bootstrap data is + the data used to bootstrap the subcloud, which includes the |OAM| and + management network information, system controller gateway information, and docker + registry information to pull necessary images to bootstrap the system. + + For an example of a typical bootstrap file, see :ref:`installing-and-provisioning-a-subcloud`. + + #. Update the subcloud information with the bootstrap values. + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud update subcloud1 \ + --bootstrap-address \ + --bootstrap-values + + #. Update the subcloud information with the |SPG| created locally. + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud update \ + --peer-group + + For example, + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group group1 + + #. If you want to remove one subcloud from the |SPG|, run the + following command: + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud update --peer-group none + + For example, + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud update subcloud1 --peer-group none + + #. Check the subclouds that are under the |SPG|. + + .. code-block:: bash + + ~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds + +#. Create an association between the System Peer and |SPG|. + + .. code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager peer-group-association add \ + --system-peer-id \ + --peer-group-id \ + --peer-group-priority + + The ``peer-group-priority`` parameter can accept an integer value greater + than 0. It is used to set the priority of the |SPG|, which is + created in peer site using the peer site's dcmanager API during association + synchronization. + + * The default priority in the |SPG| is 0 when it is created + in the local site. + + * The smallest integer has the highest priority. + + During the association creation, the |SPG| in the association + will be synchronized from the local site to the peer site, and the subclouds + belonging to the |SPG|. + + Confirm that the local |SPG| and its subclouds have been synchronized + into site B with the same name. + + * Show the association information just created in site A and ensure that + ``sync_status`` is ``in-sync``. + + .. 
code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager peer-group-association list + + +----+---------------+----------------+---------+-----------------+---------------------+ + | id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority | + +----+---------------+----------------+---------+-----------------+---------------------+ + | 1 | 1 | 2 | primary | in-sync | 2 | + +----+---------------+----------------+---------+-----------------+---------------------+ + + * Show ``subcloud-peer-group`` in site B and ensure that it has been created. + + * List the subcloud in ``subcloud-peer-group`` in site B and ensure that all + the subclouds have been synchronized as secondary subclouds. + + .. code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager subcloud-peer-group show + ~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds + + When you create the primary association on site A, a non-primary association + on site B will automatically be created to associate the synchronized |SPG| + from site A and the system peer pointing to site A. + + You can check the association list to confirm if the non-primary association + was created on site B. + + .. code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager peer-group-association list + +----+---------------+----------------+-------------+-------------+---------------------+ + | id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority | + +----+---------------+----------------+-------------+-------------+---------------------+ + | 2 | 26 | 1 | non-primary | in-sync | None | + +----+---------------+----------------+-------------+-------------+---------------------+ + +#. (Optional) Update the protection group related configuration. + + After the peer group association has been created, you can still update the + related resources configured in the protection group: + + * Update subcloud with bootstrap values + * Add subcloud(s) into the |SPG| + * Remove subcloud(s) from the |SPG| + + After any of the above operations, ``sync_status`` is changed to ``out-of-sync``. + + After the update has been completed, you need to use the :command:`sync` + command to push the |SPG| changes to the peer site that + keeps the |SPG| the same status. + + .. code-block:: bash + + # On site A + dcmanager peer-group-association sync + + .. warning:: + + The :command:`dcmanager peer-group-association sync` command must be run + after any of the following changes: + + - Subcloud is removed from the |SPG| for the subcloud name change. + + - Subcloud is removed from the |SPG| for the subcloud management network + reconfiguration. + + - Subcloud updates one or both of these parameters: + ``--bootstrap-address``, ``--bootstrap-values parameters``. + + Similarly, you need to check the information has been synchronized by + showing the association information just created in site A, ensuring that + ``sync_status`` is ``in-sync``. + + .. code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager peer-group-association show + + +----+---------------+----------------+---------+-----------------+---------------------+ + | id | peer_group_id | system_peer_id | type | sync_status | peer_group_priority | + +----+---------------+----------------+---------+-----------------+---------------------+ + | 1 | 1 | 2 | primary | in-sync | 2 | + +----+---------------+----------------+---------+-----------------+---------------------+ + +.. 
rubric:: |result| + +You have configured a GEO Redundancy protection group between site A and site B. +If site A is offline, the subclouds configured in the |SPG| can be +migrated in batch to site B for central management manually. + +---------------------------- +Health Monitor and Migration +---------------------------- + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Peer monitoring and alarming +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After the peer protection group is formed, if site A cannot be connected to +site B, there will be an alarm message on site B. + +For example: + +.. code-block:: bash + + # On site B + ~(keystone_admin)]$ fm alarm-list + +----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+ + | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp | + +----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+ + | 280.004 | Peer siteA is in disconnected state. Following subcloud peer groups are impacted: group1. | peer=223fcb30-909d-4edf- | major | 2023-08-18T10:25:29. | + | | | 8c36-1aebc8e9bd4a | | 670977 | + | | | | | | + +----------+--------------------------------------------------------------------------------------------------------------------------+--------------------------------------+----------+--------------------------+ + +Administrator can suppress the alarm with the following command: + +.. code-block:: bash + + # On site B + ~(keystone_admin)]$ fm event-suppress --alarm_id 280.004 + +----------+------------+ + | Event ID | Status | + +----------+------------+ + | 280.004 | suppressed | + +----------+------------+ + +--------- +Migration +--------- + +If site A is down, after receiving the alarming message the administrator +can choose to perform the migration on site B, which will migrate the +subclouds under the |SPG| from site A to site B. + +.. note:: + + Before initiating the migration operation, ensure that ``sync-status`` of the + peer group association is ``in-sync`` so that the latest updates from site A + have been successfully synchronized to site B. If ``sync_status`` is not + ``in-sync``, the migration may fail. + +.. code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager subcloud-peer-group migrate + + # For example: + ~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1 + +During the batch migration, you can check the status of the migration of each +subcloud in the |SPG| by showing the details of the |SPG| being migrated. + +.. code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager subcloud-peer-group status + +After successful migration, the subcloud(s) should be in +``managed/online/complete`` status on site B. + +For example: + +.. 
code-block:: bash + + # On site B + ~(keystone_admin)]$ dcmanager subcloud list + +----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+ + | id | name | management | availability | deploy status | sync | backup status | backup datetime | + +----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+ + | 45 | subcloud3-node2 | managed | online | complete | in-sync | None | None | + | 46 | subcloud1-node6 | managed | online | complete | in-sync | None | None | + +----+---------------------------------+------------+--------------+---------------+-------------+---------------+-----------------+ + +-------------- +Post Migration +-------------- + +If site A is restored, the subcloud(s) should be adjusted to +``unmanaged/secondary`` status in site A. The administrator can receive an +alarm on site A that notifies that the |SPG| is managed by a peer site (site +B), because this |SPG| on site A has the higher priority. + +.. code-block:: bash + + ~(keystone_admin)]$ fm alarm-list + +----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+ + | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp | + +----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+ + | 280.005 | Subcloud peer group (peer_group_name=group1) is managed by remote system | subcloud_peer_group=7 | warning | 2023-09-04T04:51:58. | + | | (peer_uuid=223fcb30-909d-4edf-8c36-1aebc8e9bd4a) with lower priority. | | | 435539 | + | | | | | | + +----------+-------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+-----------------------+ + +Then, the administrator can decide if and when to migrate the subcloud(s) back. + +.. code-block:: bash + + # On site A + ~(keystone_admin)]$ dcmanager subcloud-peer-group migrate + + # For example: + ~(keystone_admin)]$ dcmanager subcloud-peer-group migrate group1 + +After successful migration, the subcloud status should be back to the +``managed/online/complete`` status. + +For example: + +.. code-block:: bash + + +----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+ + | id | name | management | availability | deploy status | sync | backup status | backup datetime | + +----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+ + | 33 | subcloud3-node2 | managed | online | complete | in-sync | None | None | + | 34 | subcloud1-node6 | managed | online | complete | in-sync | None | None | + +----+---------------------------------+------------+--------------+---------------+---------+---------------+-----------------+ + +Also, the alarm mentioned above will be cleared after migrating back. + +.. code-block:: bash + + ~(keystone_admin)]$ fm alarm-list + +---------------------- +Disable GEO Redundancy +---------------------- + +You can disable the GEO Redundancy feature from the command line. 
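
Before removing the configuration, you can review the current state from the
command line, for example by listing the peer group associations and the
subclouds in the |SPG| (``group1`` is the example |SPG| name used earlier in
this procedure):

.. code-block:: bash

    # On site A: review the association, the peer group membership, and the
    # current management state of the subclouds before tearing down the group.
    ~(keystone_admin)]$ dcmanager peer-group-association list
    ~(keystone_admin)]$ dcmanager subcloud-peer-group list-subclouds group1
    ~(keystone_admin)]$ dcmanager subcloud list
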
+ +Ensure that you have a stable environment to disable the GEO Redundancy +feature, ensuring that the subclouds are managed by the expected site. + +.. rubric:: |proc| + +#. Delete the primary association on both the sites. + + .. code-block:: bash + + # site A + ~(keystone_admin)]$ dcmanager peer-group-association delete + +#. Delete the |SPG|. + + .. code-block:: bash + + # site A + ~(keystone_admin)]$ dcmanager subcloud-peer-group delete group1 + +#. Delete the system peer. + + .. code-block:: bash + + # site A + ~(keystone_admin)]$ dcmanager system-peer delete siteB + # site B + ~(keystone_admin)]$ dcmanager system-peer delete siteA + +.. rubric:: |result| + +You have torn down the protection group between site A and site B. + +--------------------------- +Backup and Restore Subcloud +--------------------------- + +You can backup and restore a subcloud in a distributed cloud environment. +However, GEO redundancy does not support the replication of subcloud backup +files from one site to another. + +A subcloud backup is valid only for the current system controller. When a +subcloud is migrated from site A to site B, the existing backup becomes +unavailable. In this case, you can create a new backup of that subcloud on site +B. Subsequently, you can restore the subcloud from this newly created backup +when it is managed under site B. + +For information on how to backup and restore a subcloud, see +:ref:`backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42` +and :ref:`restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e`. + +------------------------------------------- +Operations Performed by Protected Subclouds +------------------------------------------- + +The table below lists the operations that can/cannot be performed on the protected subclouds. + +**Primary site**: The site where the |SPG| was created. + +**Secondary site**: The peer site where the subclouds in the |SPG| can be migrated to. + +**Protected subcloud**: The subcloud that belongs to a |SPG|. + +**Local/Unprotected subcloud**: The subcloud that does not belong to any |SPG|. + ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Operation | Allow (Y/N/Maybe) | Note | ++==========================================+==================================+=================================================================================================+ +| Unmanage | N | Subcloud must be removed from the |SPG| before it can be manually unmanaged. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Manage | N | Subcloud must be removed from the |SPG| before it can be manually managed. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Delete | N | Subcloud must be removed from the |SPG| before it can be manually unmanaged | +| | | and deleted. 
| ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Update | Maybe | Subcloud can only be updated while it is managed in the primary site because the sync command | +| | | can only be issued from the system controller where the |SPG| was created. | +| | | | +| | | .. warning:: | +| | | | +| | | The subcloud network cannot be reconfigured while it is being managed by the secondary | +| | | site. If this operation is necessary, perform the following steps: | +| | | | +| | | #. Remove the subcloud from the |SPG| to make it a local/unprotected | +| | | subcloud. | +| | | #. Update the subcloud. | +| | | #. (Optional) Manually rehome the subcloud to the primary site after it is restored. | +| | | #. (Optional) Re-add the subcloud to the |SPG|. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Rename | Yes | - If the subcloud in the primary site is already a part of |SPG|, we need to remove it from the | +| | | |SPG| and then unmanage, rename, and manage the subcloud, and add it back to |SPG| and perform| +| | | the sync operation. | +| | | | +| | | - If the subcloud is in the secondary site, perform the following steps: | +| | | | +| | | #. Remove the subcloud from the |SPG| to make it a local/unprotected subcloud. | +| | | | +| | | #. Unmange the subcloud. | +| | | | +| | | #. Rename the subcloud. | +| | | | +| | | #. (Optional) Manually rehome the subcloud to the primary site after it is restored. | +| | | | +| | | #. (Optional) Re-add the subcloud to the |SPG|. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Patch | Y | .. warning:: | +| | | | +| | | There may be a patch out-of-sync alarm when the subcloud is migrated to another site. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Upgrade | Y | All the system controllers in the protection group must be upgraded first before upgrading | +| | | any of the subclouds. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Rehome | N | Subcloud cannot be manually rehomed while being part of the |SPG| | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Backup | Y | | +| | | | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Restore | Maybe | - If the subcloud in the primary site is already a part of |SPG|, we need to remove it from the | +| | | |SPG| and then unmanage and restore the subcloud, and add it back to |SPG| and perform | +| | | the sync operation. | +| | | | +| | | - If the subcloud is in the secondary site, perform the following steps: | +| | | | +| | | #. Remove the subcloud from the |SPG| to make it a local/unprotected subcloud. | +| | | | +| | | #. Unmange the subcloud. | +| | | | +| | | #. 
Restore the subcloud from the backup. | +| | | | +| | | #. (Optional) Manually rehome the subcloud to the primary site after it is restored. | +| | | | +| | | #. (Optional) Re-add the subcloud to the |SPG|. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Prestage | Y | .. warning:: | +| | | | +| | | The prestage data will get overwritten because it is not guaranteed that both the system | +| | | controllers always run on the same patch level (ostree repo) and/or have the same images | +| | | list. | +| | | | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Reinstall | Y | | +| | | | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Remove from |SPG| | Maybe | Subcloud can be removed from the |SPG| in the primary site. Subcloud can | +| | | only be removed from the |SPG| in the secondary site if the primary site is | +| | | currently down. | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ +| Add to |SPG| | Maybe | Subcloud can only be added to the |SPG| in the primary site as manual sync is required. | +| | | | +| | | | ++------------------------------------------+----------------------------------+-------------------------------------------------------------------------------------------------+ + + + + + diff --git a/doc/source/dist_cloud/kubernetes/figures/dcg1695034653874.png b/doc/source/dist_cloud/kubernetes/figures/dcg1695034653874.png new file mode 100644 index 000000000..3e149b83e Binary files /dev/null and b/doc/source/dist_cloud/kubernetes/figures/dcg1695034653874.png differ diff --git a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst index b44a68f1e..8cb05739d 100644 --- a/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst +++ b/doc/source/dist_cloud/kubernetes/index-dist-cloud-kub-95bef233eef0.rst @@ -175,6 +175,16 @@ Upgrade Orchestration for Distributed Cloud SubClouds failure-prior-to-the-installation-of-n-plus-1-load-on-a-subcloud failure-during-the-installation-or-data-migration-of-n-plus-1-load-on-a-subcloud +-------------------------------------------------- +Distributed Cloud System Controller GEO Redundancy +-------------------------------------------------- + +.. toctree:: + :maxdepth: 1 + + overview-of-distributed-cloud-geo-redundancy + configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662 + -------- Appendix -------- diff --git a/doc/source/dist_cloud/kubernetes/overview-of-distributed-cloud-geo-redundancy.rst b/doc/source/dist_cloud/kubernetes/overview-of-distributed-cloud-geo-redundancy.rst new file mode 100644 index 000000000..f0d6cd013 --- /dev/null +++ b/doc/source/dist_cloud/kubernetes/overview-of-distributed-cloud-geo-redundancy.rst @@ -0,0 +1,118 @@ + +.. eho1558617205547 +.. 
_overview-of-distributed-cloud-geo-redundancy:

============================================
Overview of Distributed Cloud GEO Redundancy
============================================

The |prod-long| |prod-dc-geo-red| configuration supports recovery from a
catastrophic event that requires subclouds to be rehomed away from the failed
system controller site to the available site(s) that have enough spare capacity.
This way, even if the failed site cannot be restored in a short time, the
subclouds can still be rehomed to an available peer system controller for
centralized management.

In this configuration, the following items are addressed:

* 1+1 GEO redundancy

  - Active-Active redundancy model
  - The total number of subclouds should not exceed 1000

* Automated operations

  - Synchronization and liveness checks between peer systems
  - Alarm generation if a peer system controller is down

* Manual operations

  - Batch rehoming from the surviving peer system controller

---------------------------------------------
Distributed Cloud GEO Redundancy Architecture
---------------------------------------------

The 1+1 Distributed Cloud GEO Redundancy architecture consists of two local
high-availability Distributed Cloud clusters. They are mutual peers that form a
protection group, as illustrated in the figure below:

.. image:: figures/dcg1695034653874.png

The architecture features a synchronized distributed control plane for
geographic redundancy, where a system peer instance is created in each local
Distributed Cloud cluster, pointing to the other cluster via its Keystone
endpoint, to form a system protection group.

If the administrator wants the peer site to take over the subclouds when the
local system controller fails, a |SPG| must be created and subclouds assigned
to it. Then, a peer group association must be created to link the system peer
and the |SPG| together. The |SPG| information and the subclouds in it are
synchronized to the peer site via the endpoint information stored in the
system peer instance.

The peer sites perform health checks using the endpoint information stored in
the system peer instance. If the local site detects that the peer site is not
reachable, it raises an alarm to alert the administrator.

If the failed site cannot be restored quickly, the administrator initiates
batch subcloud migration by performing a migration of the |SPG| on the healthy
peer of the failed site.

When the failed site has been restored and is ready for service, the
administrator can initiate batch subcloud migration from the restored site to
migrate all the subclouds in the |SPG| back for geographic proximity.

**Protection Group**
A group of peer sites configured to monitor each other and decide how to take
over the subclouds (based on a predefined |SPG|) if any peer in the group fails.

**System Peer**
A logical entity created in a system controller site. The system controller
site uses the information (Keystone endpoint, credentials) stored in the system
peer for health checks and data synchronization.

**Subcloud Secondary Deploy State**
A newly introduced state for a subcloud. If a subcloud is in the secondary
deploy state, the subcloud instance is only a placeholder holding the
configuration parameters that can be used to migrate the corresponding subcloud
from the peer site.
After rehoming, the subcloud's state changes from secondary to complete, and
the subcloud is managed by the local site. The subcloud instance on the peer
site is changed to secondary.

**Subcloud Peer Group**
A group of locally managed subclouds that is duplicated into a peer site as
secondary subclouds. The |SPG| instance is also created in the peer site, where
it contains all the duplicated secondary subclouds.

Multiple |SPGs| are supported, and |SPG| membership is decided by the
administrator. This way, the administrator can divide local subclouds into
different groups.

A |SPG| can be used to initiate subcloud batch migration. For example, when the
peer site is detected to be down and the local site is expected to take over
the management of the subclouds in the failed peer site, the administrator can
perform a |SPG| migration to migrate all the subclouds in the |SPG| to the
local site for centralized management.

**Subcloud Peer Group Priority**
The priority is an attribute of the |SPG| instance. The |SPG| is synchronized
to each peer site in the protection group with a different priority value.

In a protection group, there can be multiple system peers. The site that owns
the |SPG| with the highest priority (smallest value) is the leader site, which
initiates the batch migration to take over the subclouds grouped by the |SPG|.

**Subcloud Peer Group and System Peer Association**
An association is the binding relationship between a |SPG| and a system peer.
When the association between a |SPG| and a system peer is created on the local
site, the |SPG| and the subclouds in the group are duplicated to the peer site
to which the system peer in this association points. This way, when the local
site is down, the peer site has enough information to initiate the |SPG|-based
batch migration and take over centralized management of the subclouds
previously managed by the failed site.

One system peer can be associated with multiple |SPGs|. One |SPG| can be
associated with multiple system peers, with a priority specified for each
association. This priority is used to decide which site takes over the
subclouds when batch migration must be performed.
diff --git a/doc/source/dist_cloud/kubernetes/rehoming-a-subcloud.rst b/doc/source/dist_cloud/kubernetes/rehoming-a-subcloud.rst
index f3f97191d..6bfc56f26 100644
--- a/doc/source/dist_cloud/kubernetes/rehoming-a-subcloud.rst
+++ b/doc/source/dist_cloud/kubernetes/rehoming-a-subcloud.rst
@@ -17,6 +17,12 @@ controller using the rehoming playbook. The rehoming playbook does not work
with freshly installed/bootstrapped subclouds.

+.. note::
+
+   Manual rehoming is not possible if a subcloud is included in a |SPG|.
+   Use the :command:`dcmanager subcloud-peer-group migrate` command for automatic
+   rehoming. For more information, see :ref:`configure-distributed-cloud-system-controller-geo-redundancy-e3a31d6bf662`.
+
.. note::

    The system time should be accurately configured on the system controllers
@@ -27,7 +33,7 @@ controller using the rehoming playbook.

    Do not rehome a subcloud if the RECONCILED status on the system resource or
    any host resource of the subcloud is FALSE. To check the RECONCILED status,
    run the :command:`kubectl -n deployment get system` and :command:`kubectl -n deployment get hosts` commands.
-
+
Use the following procedure to enable subcloud rehoming and to update the new
subcloud configuration (networking parameters, passwords, etc.)
to be compatible with the new system controller.