StarlingX Bare Metal and Node Management, Hardware Maintenance
Go to file
Eric MacDonald 2fc05673d1 Add SysRq crash dump support for pmon quorum health messaging loss
The hostwd process supports failure handling for two pmon
quorum failure modes.
 1. persistent pmon quorum process failure
 2. persistent absence of pmon's quorum health report

This update adds a new configuration option and associated
implementation required to force a crash dump action for
failure mode 2 above.

This means that if the Process Monitor itself gets stalled or stops
running for 3 (default config) minutes then the hostwd will trigger
a SysRq to force a crash dump.

Test Plan:

PASS: Verify kdump for pmon quorum health report message loss
PASS: Verify no kdump when kdump_on_stall is disabled
PASS: Verify handling when kdump service is not active
PASS: Verify sighup config change detection and handling

Regression:

PASS: Verify softdog timeout handling and logs
PASS: Verify quorum threshold config change and handling
PASS: Verify handling with reboot/reset recovery methods disabled
PASS: Verify enable reboot_on_err config change handling
PASS: Verify reboot/reset actions are ignored while host is locked
PASS: Verify pmon failure recovery handling before threshold reached

Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
Partial-Bug: 1894889
Depends-On: https://review.opendev.org/#/c/750806
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-13 12:38:16 -05:00
api-ref/source Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
bsp-files Revert "Enable 'rcu_nocb_poll' kernel config option" 2020-10-27 13:53:04 +00:00
devstack Security: Handle nospectre_v1 in the bootargs 2020-01-28 18:21:13 -05:00
doc Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
installer Remove reference to cgcs-centos-repo 2020-09-18 16:04:44 -04:00
kickstart Drop isolcpu from AIO/worker kickstarts 2020-06-19 02:08:28 -04:00
mtce Add SysRq crash dump support for pmon quorum health messaging loss 2020-11-13 12:38:16 -05:00
mtce-common Add SysRq crash dump support for pmon quorum health messaging loss 2020-11-13 12:38:16 -05:00
mtce-compute Add auto-versioning to starlingx/metal mtce packages 2020-05-21 15:18:43 -04:00
mtce-control Fix heartbeat messaging when interface is set to 'lo' 2020-06-26 14:16:41 +00:00
mtce-storage Add auto-versioning to starlingx/metal mtce packages 2020-05-21 15:18:43 -04:00
releasenotes Switch to newer openstackdocstheme and reno versions 2020-06-04 14:32:46 +02:00
tools/rvmc/centos Redfish Virtual Media Controller enhancements 2020-08-17 21:14:50 +00:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Tox and Zuul job for the bandit code scan in starlingx/metal 2020-06-29 08:24:46 +00:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-22 16:42:03 +00:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-15 19:19:45 +08:00
centos_iso_image.inc Remove unused inventory and python-inventoryclient 2020-01-08 14:12:05 -06:00
centos_pkg_dirs rvmc: remove un-used build data 2020-01-16 08:39:54 -08:00
centos_stable_docker_images.inc Utility to install a server via Redfish 2019-12-31 15:34:54 +00:00
pylint.rc Add pylint checks for python files in metal 2020-01-03 13:27:00 -06:00
test-requirements.txt Tox and Zuul job for the bandit code scan in starlingx/metal 2020-06-29 08:24:46 +00:00
tox.ini Use newer flake8 to run on ubuntu-focal Zuul machines 2020-09-09 17:59:49 -04:00

README.rst

metal

StarlingX Bare Metal Management