.. Greg updates required for -High Security Vulnerability Document Updates
.. uzk1552923967458
.. _restoring-starlingx-system-data-and-storage:
========================================
Restore Platform System Data and Storage
========================================
You can perform a system restore (controllers, workers, including or excluding
storage nodes) of a |prod| cluster from a previous system backup and bring it
back to the operational state it was in when the backup was taken.

There are two restore modes: optimized restore and legacy restore. Optimized
restore must be used on |AIO-SX| systems, and legacy restore must be used on
systems that are not |AIO-SX|.
.. rubric:: |context|

Kubernetes configuration will be restored, and pods that are started from
repositories accessible from the internet or from external repositories will
start immediately. |prod| specific applications must be re-applied once a
storage cluster is configured.

Everything is restored as it was when the backup was created, except for
optional data that was not defined.
See :ref:`Back Up System Data <backing-up-starlingx-system-data>` for more
details on the backup.
.. warning::
The system backup file can only be used to restore the system from which
the backup was made. You cannot use this backup file to restore the system
to different hardware.
To restore the backup, use the same version of the boot image (ISO) and
patches that were installed at the time of the backup.
The |prod| restore supports the following optional modes:

.. _restoring-starlingx-system-data-and-storage-ol-tw4-kvc-4jb:

- To keep the Ceph cluster data intact (false - default option), use the
  following parameter when passing the extra arguments to the Ansible Restore
  playbook command:

  .. code-block:: none

      wipe_ceph_osds=false

- To wipe the Ceph cluster entirely (true), where the Ceph cluster will need
  to be recreated, or if the Ceph partition was previously wiped, such as
  during a fresh install between backup and restore or during reinstall, use
  the following parameter:

  .. code-block:: none

      wipe_ceph_osds=true
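
For example, the parameter is appended to the extra arguments (``-e``) of the
restore playbook command. The following is a minimal sketch only; the playbook
path and extra-vars mirror the ``restore_user_images.yml`` example later in
this procedure, and the backup filename and password are placeholders:

.. code-block:: none

    $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin backup_filename=<platform_backup>.tgz ansible_become_pass=<sysadmin-password> wipe_ceph_osds=true"
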
Restoring a |prod| cluster from a backup file is done by reinstalling the
ISO on controller-0, applying updates (patches), running the Ansible Restore
Playbook, unlocking controller-0, and then powering on and unlocking the
remaining hosts one at a time: first the controllers, then the storage hosts
(ONLY if required), and lastly the compute (worker) hosts. Finally, run the
:command:`system restore-complete` command.
.. rubric:: |prereq|
Before you start the restore procedure you must ensure the following
conditions are in place:
.. _restoring-starlingx-system-data-and-storage-ul-rfq-qfg-mp:
- All cluster hosts must be prepared for network boot and then powered
down. You can prepare a host for network boot.
.. note::
If you are restoring system data only, do not lock, power off or
prepare the storage hosts to be reinstalled.
- The backup file is accessible locally, if restore is done by running
Ansible Restore playbook locally on the controller. The backup file is
accessible remotely, if restore is done by running Ansible Restore playbook
remotely.
- You have the original |prod| ISO installation image available on a USB
flash drive. It is mandatory that you use the exact same version of the
software used during the original installation, otherwise the restore
procedure will fail.
- The restore procedure requires all hosts but controller-0 to boot
over the internal management network using the |PXE| protocol. Ideally, the
old boot images are no longer present, so that the hosts boot from the
network when powered on. If this is not the case, you must configure each
host manually for network boot immediately after powering it on.
- If you are restoring a |prod-dc| subcloud first, ensure it is in
an **unmanaged** state on the Central Cloud (SystemController) by using
the following commands:

.. code-block:: none

    $ source /etc/platform/openrc
    ~(keystone_admin)]$ dcmanager subcloud unmanage <subcloud-name>

where ``<subcloud-name>`` is the name of the subcloud to be unmanaged.

For more information, see:

- :ref:`Backup a Subcloud/Group of Subclouds using DCManager CLI <backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42>`

- :ref:`Restore a Subcloud/Group of Subclouds from Backup Data Using DCManager CLI <restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e>`
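
Before proceeding, you can confirm that the subcloud is unmanaged; a minimal
sketch (the subcloud name is a placeholder):

.. code-block:: none

    ~(keystone_admin)]$ dcmanager subcloud show <subcloud-name> | grep management
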
.. rubric:: |proc|
#. Power down all hosts.
If you have a storage host and want to retain Ceph data, then power down
all the nodes except the storage hosts; the cluster has to be functional
during a restore operation.
.. caution::
Do not use :command:`wipedisk` before a restore operation. This will
lead to data loss on your Ceph cluster. It is safe to use
:command:`wipedisk` during an initial installation, while reinstalling
a host, or during an upgrade.
#. Install the |prod| ISO software on controller-0 from the USB flash
drive.
You can now log in using the host's console.
#. Log in to the console as user **sysadmin** with password **sysadmin**.
#. Install network connectivity required for the subcloud.
#. Ensure that the system is at the same patch level as it was when the backup
was taken. On |AIO-SX| systems, you must manually reinstall any previous
patches; this may require a reboot.
For steps on how to install patches using the :command:`sw-patch install-local` command, see :ref:`aio_simplex_install_kubernetes_r7`;
``Install Software on Controller-0``.
After the reboot, you can verify that the updates were applied.
.. only:: partner
.. include:: /_includes/restore-platform-system-data-and-storage-b92b8bdaf16d.rest
:start-after: sw-patch-query-begin
:end-before: sw-patch-query-end
.. note::

    On systems that are not |AIO-SX|, you can skip this step as long as
    ``skip_patching=true`` is not used; patches are automatically
    reinstalled from the backup by default.
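
To confirm that the patch level matches the backup, you can list the applied
patches; a minimal sketch (exact output depends on the release and the patches
installed):

.. code-block:: none

    $ sudo sw-patch query
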
#. Ensure that the backup files are available on the controller. Run both
Ansible Restore playbooks, restore_platform.yml and restore_user_images.yml.
For more information on restoring the back up file, see :ref:`Run Restore
Playbook Locally on the Controller
<running-restore-playbook-locally-on-the-controller>`, and :ref:`Run
Ansible Restore Playbook Remotely
<system-backup-running-ansible-restore-playbook-remotely>`.
.. note::
The backup files contain the system data and updates.
The restore operation will pull missing images from the upstream registries.
#. Restore the local registry using the ``restore_user_images.yml`` playbook.

Example:

.. code-block:: none

    ~(keystone_admin)]$ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_user_images.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_user_images_backup_2023_07_15_21_24_22.tgz ansible_become_pass=St8rlingX*"

.. note::

    - This step applies only if the user images backup file was created during
      the backup operation.

    - The ``user_images_backup*.tgz`` file is created during backup only if
      ``backup_user_images`` is true.

This must be done before unlocking controller-0.
#. Unlock controller-0.

.. code-block:: none

    ~(keystone_admin)]$ system host-unlock controller-0

After you unlock controller-0, storage nodes become available and Ceph
becomes operational.
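
Before continuing, you can confirm that controller-0 is reported as unlocked,
enabled, and available; a minimal sketch:

.. code-block:: none

    ~(keystone_admin)]$ system host-show controller-0 | grep -e administrative -e operational -e availability
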
#. If the system is a Distributed Cloud system controller, restore the **dc-vault**
using the restore_dc_vault.yml playbook. Perform this step after unlocking
controller-0:

.. code-block:: none

    $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_dc_vault.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_dc_vault_backup_2020_07_15_21_24_22.tgz ansible_become_pass=St0rlingX*"

.. note::
The dc-vault backup archive is created by the backup.yml playbook.
#. Authenticate the system as Keystone user **admin**.
Source the **admin** user environment as follows:

.. code-block:: none

    $ source /etc/platform/openrc

#. Apps transition from the 'restore-requested' state to the 'applying' state,
and then from the 'applying' state to the 'applied' state.

If apps revert from the 'applying' state back to the 'restore-requested' state,
ensure there is network access and access to the Docker registry.

The process is repeated once per minute until all apps reach the 'applied' state.
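
You can monitor the application states as they transition; a minimal sketch
using :command:`system application-list`:

.. code-block:: none

    ~(keystone_admin)]$ system application-list
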
#. If you have a Duplex system, restore the **controller-1** host.
#. List the current state of the hosts.

.. code-block:: none

    ~(keystone_admin)]$ system host-list
    +----+--------------+-------------+----------------+-------------+--------------+
    | id | hostname     | personality | administrative | operational | availability |
    +----+--------------+-------------+----------------+-------------+--------------+
    | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
    | 2  | controller-1 | controller  | locked         | disabled    | offline      |
    | 3  | storage-0    | storage     | locked         | disabled    | offline      |
    | 4  | storage-1    | storage     | locked         | disabled    | offline      |
    | 5  | compute-0    | worker      | locked         | disabled    | offline      |
    | 6  | compute-1    | worker      | locked         | disabled    | offline      |
    +----+--------------+-------------+----------------+-------------+--------------+

#. Power on the host.
Ensure that the host boots from the network, and not from any disk
image that may be present.
The software is installed on the host, and then the host is
rebooted. Wait for the host to be reported as **locked**, **disabled**,
and **online**.
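
The install can take some time; you can periodically check the host state
until it is reported as **locked**, **disabled**, and **online**. A minimal
sketch for controller-1:

.. code-block:: none

    ~(keystone_admin)]$ system host-show controller-1 | grep -e administrative -e operational -e availability
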
#. Unlock controller-1.

.. code-block:: none

    ~(keystone_admin)]$ system host-unlock controller-1
    +-----------------+--------------------------------------+
    | Property        | Value                                |
    +-----------------+--------------------------------------+
    | action          | none                                 |
    | administrative  | locked                               |
    | availability    | online                               |
    | ...             | ...                                  |
    | uuid            | 5fc4904a-d7f0-42f0-991d-0c00b4b74ed0 |
    +-----------------+--------------------------------------+

#. Verify the state of the hosts.

.. code-block:: none

    ~(keystone_admin)]$ system host-list
    +----+--------------+-------------+----------------+-------------+--------------+
    | id | hostname     | personality | administrative | operational | availability |
    +----+--------------+-------------+----------------+-------------+--------------+
    | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
    | 2  | controller-1 | controller  | unlocked       | enabled     | available    |
    | 3  | storage-0    | storage     | locked         | disabled    | offline      |
    | 4  | storage-1    | storage     | locked         | disabled    | offline      |
    | 5  | compute-0    | worker      | locked         | disabled    | offline      |
    | 6  | compute-1    | worker      | locked         | disabled    | offline      |
    +----+--------------+-------------+----------------+-------------+--------------+

#. Restore storage configuration. If :command:`wipe_ceph_osds` is set to
**True**, follow the same procedure used to restore **controller-1**,
beginning with host **storage-0** and proceeding in sequence.
.. note::
This step should be performed ONLY if you are restoring storage hosts.
#. For storage hosts, there are two options:
With the controller software installed and updated to the same level
that was in effect when the backup was performed, you can perform
the restore procedure without interruption.
Standard with Controller Storage install or reinstall depends on the
:command:`wipe_ceph_osds` configuration:
#. If :command:`wipe_ceph_osds` is set to **true**, reinstall the
storage hosts.
#. If :command:`wipe_ceph_osds` is set to **false** (default
option), do not reinstall the storage hosts.
.. caution::
Do not reinstall or power off the storage hosts if you want to
keep previous Ceph cluster data. A reinstall of storage hosts
will lead to data loss.
#. Ensure that the Ceph cluster is healthy. Verify that the three Ceph
monitors (controller-0, controller-1, storage-0) are running in
quorum.

.. code-block:: none

    ~(keystone_admin)]$ ceph -s
      cluster:
        id:     3361e4ef-b0b3-4f94-97c6-b384f416768d
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum controller-0,controller-1,storage-0
        mgr: controller-0(active), standbys: controller-1
        osd: 10 osds: 10 up, 10 in

      data:
        pools:   5 pools, 600 pgs
        objects: 636 objects, 2.7 GiB
        usage:   6.5 GiB used, 2.7 TiB / 2.7 TiB avail
        pgs:     600 active+clean

      io:
        client: 85 B/s rd, 336 KiB/s wr, 0 op/s rd, 67 op/s wr

.. caution::
Do not proceed until the Ceph cluster is healthy and the message
HEALTH_OK appears.
If the message HEALTH_WARN appears, wait a few minutes and then try
again. If the warning condition persists, consult the public
documentation for troubleshooting Ceph monitors (for example
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/).
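
If the cluster remains in HEALTH_WARN, you can list the failing health checks
before consulting the Ceph troubleshooting documentation; a minimal sketch:

.. code-block:: none

    ~(keystone_admin)]$ ceph health detail
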
#. Restore the compute (worker) hosts, one at a time.
Restore the compute (worker) hosts following the same procedure used to
restore controller-1.
#. Allow the Calico and CoreDNS pods to be recovered by Kubernetes. They should
all be in the 'N/N Running' state.

The state of the pods when the restore operation is complete is as follows:

.. code-block:: none

    ~(keystone_admin)]$ kubectl get pods -n kube-system | grep -e calico -e coredns
    calico-kube-controllers-5cd4695574-d7zwt    1/1   Running
    calico-node-6km72                           1/1   Running
    calico-node-c7xnd                           1/1   Running
    coredns-6d64d47ff4-99nhq                    1/1   Running
    coredns-6d64d47ff4-nhh95                    1/1   Running

#. If **wipe_ceph_osds** is set to true and all the system hosts are in an
unlocked/enabled/available state, do the following:
#. Remove and reapply **platform-integ-apps**. This step re-creates the
default Ceph pools, which were deleted (see the verification sketch after
these sub-steps):

.. code-block:: none

    $ system application-remove platform-integ-apps
    $ system application-apply platform-integ-apps

#. Completely delete and reapply all the applications that have persistent
volumes (OpenStack or custom apps). For example, for OpenStack, run the
following commands:

.. parsed-literal::

    $ system application-remove |prefix|-openstack
    $ system application-delete |prefix|-openstack
    $ system application-upload |prefix|-openstack-20.12-0.tgz
    $ system application-apply |prefix|-openstack
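
After **platform-integ-apps** is reapplied, you can confirm that the default
Ceph pools were re-created by listing them; a minimal sketch (pool names vary
by configuration):

.. code-block:: none

    ~(keystone_admin)]$ ceph osd lspools
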
#. Run the :command:`system restore-complete` command.

.. code-block:: none

    ~(keystone_admin)]$ system restore-complete

#. The 750.006 alarms disappear one at a time as the apps are automatically applied.
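
You can monitor the remaining 750.006 alarms as the apps are applied; a minimal
sketch using :command:`fm alarm-list`:

.. code-block:: none

    ~(keystone_admin)]$ fm alarm-list | grep 750.006
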
.. rubric:: |postreq|
.. _restoring-starlingx-system-data-and-storage-ul-b2b-shg-plb:
- Passwords for local user accounts must be restored manually since they
are not included as part of the backup and restore procedures.
- After restoring a |prod-dc| subcloud, you need to bring it back
to the **managed** state on the Central Cloud (SystemController), by
using the following commands:

.. code-block:: none

    $ source /etc/platform/openrc
    ~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

where ``<subcloud-name>`` is the name of the subcloud to be managed.
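
To confirm that the subcloud is back in the **managed** state, you can list
the subclouds; a minimal sketch:

.. code-block:: none

    ~(keystone_admin)]$ dcmanager subcloud list
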
.. comments in steps seem to throw numbering off.
.. xreflink removed from step 'Install the |prod| ISO software on controller-0 from the USB flash
drive.':
For details, refer to the |inst-doc|: :ref:`Installing Software on
controller-0 <installing-software-on-controller-0>`. Perform the
installation procedure for your system and *stop* at the step that
requires you to configure the host as a controller.
.. xreflink removed from step 'Install network connectivity required for the subcloud.':
For details, refer to the |distcloud-doc|: :ref:`Installing and
Provisioning a Subcloud <installing-and-provisioning-a-subcloud>`.