StarlingX Bare Metal and Node Management, Hardware Maintenance


Eric MacDonald 48978d804d	Improved maintenance handling of spontaneous active controller reboot	2021-04-30 15:35:53 +00:00

Performing a forced reboot of the active controller sometimes results in a second reboot of that controller. The second reboot occurred because the uptime reported in the first mtcAlive message following the reboot was greater than 10 minutes.

Maintenance has a long-standing graceful recovery threshold of 10 minutes. If a host loses heartbeat and enters Graceful Recovery, and the uptime value extracted from the first mtcAlive message following the recovery of that host exceeds 10 minutes, then maintenance concludes that the host did not reboot. If a host goes absent for longer than this threshold then, for reasons not limited to security, maintenance declares the host as 'failed' and force re-enables it through a reboot.

With the introduction of containers and the addition of new features over the last few releases, boot times on some servers are approaching the 10 minute threshold, and in this case exceeded it.

The primary fix in this update is to increase this long-standing threshold to 15 minutes to account for the evolution of the product.

During the debug of this issue a few other undesirable behaviors related to Graceful Recovery were observed, and the following additional changes were implemented.

- Remove hbsAgent process restart in ha service management failover failure recovery handling. This change is in the ha git with a loose dependency placed on this update.
  Reason: https://review.opendev.org/c/starlingx/ha/+/788299

- Prevent the hbsAgent from sending heartbeat clear events to maintenance in response to a heartbeat stop command.
  Reason: Maintenance receiving these clear events while in Graceful Recovery causes it to pop out of Graceful Recovery only to re-enter as a retry, which needlessly consumes one retry count (of a maximum of 5).

- Prevent successful Graceful Recovery until all heartbeat monitored networks recover.
  Reason: If heartbeat of one network (say, cluster) recovers but another (management) does not, then it's possible for the maximum Graceful Recovery retries to be reached quite quickly while only one network has recovered, causing maintenance to fail the host and force a full enable with reboot.

- Extend the wait for the hbsClient ready event in the graceful recovery handler timeout from 1 minute to the worker config timeout.
  Reason: To give the worker config time to complete before force starting the recovery handler's heartbeat soak.

- Add Graceful Recovery Wait state recovery over process restart.
  Reason: Avoid double reboot of a Gracefully Recovering host over SM service bounce.

- Add requirement for a valid out-of-band mtce flags value before declaring a configuration error in the subfunction enable handler.
  Reason: Rebooting the active controller can sometimes result in a falsely reported configuration error due to the subfunction enable handler interpreting a zero value as a configuration error.

- Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
  Reason: To assist log analysis and issue debug.

Test Plan:

PASS: Verify handling active controller reboot
      cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
      cases: with and without timeout, with and without bmc
      cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
      cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby takeover that has been up for less than 15 minutes.

Regression:

PASS: Verify MNFA feature; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
      cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
      cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
      cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby controller' and worker nodes in AIO Plus.
PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing

Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
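
The uptime check and the all-networks requirement described in the commit message above can be sketched roughly as follows. This is a minimal illustration in Python, not the mtce implementation (which lives in the mtce and mtce-common directories): the function name, return strings, network names, and the order of the two checks are assumptions made for the example. Only the 10 and 15 minute thresholds and the rule that every heartbeat monitored network must recover come from the commit message.

```python
# Illustrative sketch only. All names and return values here are hypothetical
# and are not taken from the mtce source code.
OLD_THRESHOLD_SECS = 10 * 60  # threshold before this commit
NEW_THRESHOLD_SECS = 15 * 60  # threshold after this commit


def graceful_recovery_decision(uptime_secs, networks_recovered,
                               threshold_secs=NEW_THRESHOLD_SECS):
    """Decide what to do with a host after its first mtcAlive message.

    uptime_secs        -- uptime reported in the first mtcAlive message
    networks_recovered -- dict mapping network name to heartbeat-recovered flag
    """
    # Uptime above the threshold is read as "the host did not reboot",
    # so the host is failed and force re-enabled through a reboot.
    if uptime_secs > threshold_secs:
        return "fail_and_reboot"
    # Recovery only succeeds once every monitored network has recovered.
    if networks_recovered and all(networks_recovered.values()):
        return "recovered"
    # Otherwise keep waiting rather than consuming a retry.
    return "wait"


if __name__ == "__main__":
    nets = {"management": True, "cluster": True}
    # A host that took 11 minutes to boot, under the old and new thresholds.
    print(graceful_recovery_decision(11 * 60, nets, OLD_THRESHOLD_SECS))
    print(graceful_recovery_decision(11 * 60, nets))
```

Under these assumptions, a host reporting 11 minutes of uptime is failed and rebooted with the old 10 minute threshold but recovers gracefully with the new 15 minute threshold, which is the double-reboot case the commit addresses.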
api-ref/source	Switch to newer openstackdocstheme and reno versions	2020-06-04 14:32:46 +02:00
bsp-files	Restrict isolcpu_plugin to nodes with worker function	2021-04-06 14:25:58 +00:00
devstack	Security: Handle nospectre_v1 in the bootargs	2020-01-28 18:21:13 -05:00
doc	Switch to newer openstackdocstheme and reno versions	2020-06-04 14:32:46 +02:00
installer	Add auto-version for remaining stx/metal packages	2020-12-17 13:26:24 -05:00
kickstart	Drop isolcpu from AIO/worker kickstarts	2020-06-19 02:08:28 -04:00
mtce	Improved maintenance handling of spontaneous active controller reboot	2021-04-30 15:35:53 +00:00
mtce-common	Improved maintenance handling of spontaneous active controller reboot	2021-04-30 15:35:53 +00:00
mtce-compute	Add auto-versioning to starlingx/metal mtce packages	2020-05-21 15:18:43 -04:00
mtce-control	Mtce heartbeat cluster state change notification improvement	2021-01-08 09:59:24 -05:00
mtce-storage	Add auto-versioning to starlingx/metal mtce packages	2020-05-21 15:18:43 -04:00
releasenotes	Switch to newer openstackdocstheme and reno versions	2020-06-04 14:32:46 +02:00
tools/rvmc/centos	Redfish Virtual Media Controller enhancements	2020-08-17 21:14:50 +00:00
.gitignore	Update tox.ini files to use stein constraints	2019-06-25 13:20:35 -04:00
.gitreview	OpenDev Migration Patch	2019-04-19 19:52:33 +00:00
.zuul.yaml	Tox and Zuul job for the bandit code scan in starlingx/metal	2020-06-29 08:24:46 +00:00
CONTRIBUTORS.wrs	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
LICENSE	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
README.rst	Followup opendev cleanup and test jobs	2019-04-22 16:42:03 +00:00
centos_build_layer.cfg	Build layering, add layer build config file	2019-10-15 19:19:45 +08:00
centos_iso_image.inc	Remove unused inventory and python-inventoryclient	2020-01-08 14:12:05 -06:00
centos_pkg_dirs	rvmc: remove un-used build data	2020-01-16 08:39:54 -08:00
centos_stable_docker_images.inc	Utility to install a server via Redfish	2019-12-31 15:34:54 +00:00
pylint.rc	Add pylint checks for python files in metal	2020-01-03 13:27:00 -06:00
test-requirements.txt	Tox and Zuul job for the bandit code scan in starlingx/metal	2020-06-29 08:24:46 +00:00
tox.ini	Use newer flake8 to run on ubuntu-focal Zuul machines	2020-09-09 17:59:49 -04:00

README.rst

metal

StarlingX Bare Metal Management