StarlingX Bare Metal and Node Management, Hardware Maintenance
Commit 79d8644b1e by Eric MacDonald (2023-11-02 20:58:00 +00:00)
Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat-failing node early in its
          startup sequence, say over a swact or an SM service
          restart, and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC, then
          the one-time-reset request is skipped.

Issue #2: When the issue #1 race condition does not occur, i.e. BMC
          access is established first, the mtcAgent will issue its
          one-time reset to the node. If the node failure involves a
          crashdump, this one-time reset can interrupt the collection
          of the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay, applied upon the detection and during the handling of a failed
node that 'may' need to be reset to recover from being network
isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish the
bmc access required to send the reset command.

To handle significantly long bmc reset delay values, this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.
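
As a rough illustration of that posted-reset flow, here is a minimal
self-contained sketch. The type, member, and function names are
hypothetical, not the actual mtcAgent symbols; only the behavior
(post, wait, cancel-on-online, send-on-expiry) follows this message.

    #include <chrono>
    #include <cstdio>

    enum class ResetPhase { Idle, ResetSendWait, ResetSent, Done };

    struct Node {
        bool online = false;          // mtcAlive seen ; node recovered
        bool bmc_accessible = false;  // BMC session established
        ResetPhase phase = ResetPhase::Idle;
        std::chrono::steady_clock::time_point reset_due;
    };

    // Default delay ; 5 minutes, longer than a typical boot time.
    constexpr std::chrono::minutes BMC_RESET_DELAY{5};

    // Post the reset rather than sending it immediately. The delay
    // protects an in-progress crashdump and gives maintenance time
    // to establish BMC access.
    void post_reset(Node& n) {
        n.phase = ResetPhase::ResetSendWait;
        n.reset_due = std::chrono::steady_clock::now() + BMC_RESET_DELAY;
    }

    // Polled by the handler while the reset is 'in waiting'.
    void reset_send_wait(Node& n) {
        if (n.phase != ResetPhase::ResetSendWait) return;
        if (n.online) {
            n.phase = ResetPhase::Done;  // recovered ; cancel posted reset
            std::puts("posted reset cancelled ; node recovered online");
        } else if (std::chrono::steady_clock::now() >= n.reset_due &&
                   n.bmc_accessible) {
            n.phase = ResetPhase::ResetSent;  // delay expired ; send reset
            std::puts("issuing bmc reset");
        }
    }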

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time, so that in the typical case, where there is
no crashdump happening, the node is not reset late in its almost
complete recovery. The number of seconds remaining in the pending
reset countdown is logged periodically.
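
That periodic countdown log might look something like the sketch
below; the log text and the one-minute throttle interval are
assumptions for illustration, not taken from the implementation.

    #include <chrono>
    #include <cstdio>

    // Log the pending reset countdown, throttled to one log per
    // interval so the frequent poll does not flood the log file.
    void log_reset_countdown(std::chrono::seconds remaining,
                             std::chrono::steady_clock::time_point& last_log) {
        constexpr std::chrono::seconds LOG_INTERVAL{60};  // assumed throttle
        const auto now = std::chrono::steady_clock::now();
        if (now - last_log >= LOG_INTERVAL) {
            std::printf("bmc reset pending ; %lld seconds remaining\n",
                        static_cast<long long>(remaining.count()));
            last_log = now;
        }
    }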

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid a double reboot in the typical case, the bmc reset delay
defaults to 5 minutes, which is longer than a typical boot time.
This means that if the node recovers online before the delay expires,
the reset was not needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs, then although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node's
recovery-to-online time is longer than the bmc reset delay.

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.
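
A plausible reading of where that phase sits, using hypothetical
phase names (the handler's actual phase tokens are not shown in this
message):

    // Assumed ordering ; the new phase sits between posting the reset
    // and sending the BMC command, so the posted reset can still be
    // cancelled if the node recovers online in time.
    enum class ResetProgression {
        GracefulRebootRequest,  // try reboot first, with or without ACK
        ResetSendWait,          // new: wait out bmc_reset_delay
        BmcResetSend,           // delay expired ; send the bmc reset
        OfflineOnlineMonitor    // confirm node goes offline then online
    };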

Some consistency-driven logging improvements were also implemented.

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if the node comes back online
      without a reboot before the bmc reset delay expires while
      mtcAlive is still seen on one or more links. This handles the
      cluster-host-only heartbeat loss case; the node is still
      rebooted, with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change via a manual config change and SIGHUP
      (see the reload sketch after this list)
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown
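
For the reset delay configuration items above, a minimal sketch of a
SIGHUP-driven reload is shown below. The bmc_reset_delay option name
appears in this test plan and the 300 second default is stated above,
but the reload mechanics here are assumptions for illustration.

    #include <atomic>
    #include <csignal>

    // Assumed option ; seconds to wait before sending the bmc reset.
    std::atomic<int> bmc_reset_delay{300};   // default is 5 minutes
    std::atomic<bool> reload_config{false};

    // SIGHUP handler ; flags a config re-read so a manual change to
    // bmc_reset_delay takes effect without a daemon restart.
    extern "C" void on_sighup(int) { reload_config = true; }

    int main() {
        std::signal(SIGHUP, on_sighup);
        // ... the daemon main loop would check reload_config and
        // re-read its config file, picking up a new bmc_reset_delay ...
    }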

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

Repository contents (path, last commit, date):

api-ref/source Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
bsp-files Add patch extract from load 2023-08-08 12:15:00 -03:00
devstack Security: Handle nospectre_v1 in the bootargs 2020-01-28 18:21:13 -05:00
doc Fix tox-docs failing sphinx 2023-08-29 16:50:22 -04:00
installer Fix kickstarts patching 2023-10-11 14:40:38 +00:00
kickstart Prestaged ISO: copy ostree_repo to versioned platform-backup 2023-10-16 10:04:35 -04:00
mtce Add bmc reset delay in the reset progression command handler 2023-11-02 20:58:00 +00:00
mtce-common Add bmc reset delay in the reset progression command handler 2023-11-02 20:58:00 +00:00
mtce-compute Update mtce debian package ver based on git 2023-03-02 14:50:35 +00:00
mtce-control Update mtce debian package ver based on git 2023-03-02 14:50:35 +00:00
mtce-storage Update mtce debian package ver based on git 2023-03-02 14:50:35 +00:00
releasenotes Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
tools Set longer shutdown time and fix power state error log 2023-10-05 17:12:19 -04:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Fix github mirroring for this repo 2023-04-28 12:38:51 -04:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst starlingx/metal README improvement 2023-07-19 12:32:13 -03:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-15 19:19:45 +08:00
centos_iso_image.inc Remove unused inventory and python-inventoryclient 2020-01-08 14:12:05 -06:00
centos_pkg_dirs rvmc: remove un-used build data 2020-01-16 08:39:54 -08:00
centos_stable_docker_images.inc Utility to install a server via Redfish 2019-12-31 15:34:54 +00:00
debian_build_layer.cfg Add debian_build_layer.cfg file 2021-10-05 14:08:23 -04:00
debian_iso_image.inc Debian: metal: update debian_iso_image.inc 2022-11-16 12:06:51 +08:00
debian_pkg_dirs Include upgrades meta files to Debian ISO 2022-08-02 21:01:58 +00:00
debian_stable_docker_images.inc debian: port rvmc docker image to Debian 2022-08-12 16:30:01 +00:00
pylint.rc Add pylint py3 portability checks for the metal repo 2021-09-13 11:57:42 -03:00
test-requirements.txt Removed wait_for_worker_config_init in AIO systems 2021-07-08 18:48:28 -04:00
tox.ini Update tox.ini to work with tox 4 2022-12-26 23:26:54 +00:00

README.rst

metal

The starlingx/metal repository handles StarlingX Bare Metal Management [1].

This repository is not intended to be developed standalone, but rather as part of the StarlingX Source System, which is defined by the StarlingX manifest [2].

References


  1. https://docs.starlingx.io/api-ref/metal

  2. https://opendev.org/starlingx/manifest.git