metal/mtce-common/src/common
Eric MacDonald 2fc05673d1 Add SysRq crash dump support for pmon quorum health messaging loss
The hostwd process supports failure handling for two pmon
quorum failure modes:
 1. persistent pmon quorum process failure
 2. persistent absence of pmon's quorum health report

This update adds a new configuration option, and the
implementation behind it, to force a crash dump for
failure mode 2 above.

This means that if the Process Monitor (pmon) itself stalls or stops
running for 3 minutes (the default configuration), hostwd triggers
a SysRq to force a crash dump.
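
The decision described above can be sketched roughly as follows. This is a
minimal illustration, not the hostwd implementation: the HostwdConfig struct,
the function names and the stall_threshold field are hypothetical. Only the
kdump_on_stall option name and the 3 minute default come from this commit
message. Writing 'c' to /proc/sysrq-trigger is the standard Linux way to
request a kernel crash; it needs root, and a vmcore is only captured when
kdump is active.

```cpp
#include <chrono>
#include <fstream>

// Hypothetical settings holder. Only the kdump_on_stall name and the
// 3 minute default threshold are taken from the commit message; the
// 'false' default below is an arbitrary choice for this sketch.
struct HostwdConfig {
    bool kdump_on_stall = false;
    std::chrono::seconds stall_threshold{180};
};

// Returns true when pmon's quorum health report has been absent for at
// least the configured threshold and crash dump on stall is enabled.
bool should_force_crash_dump(const HostwdConfig& cfg,
                             std::chrono::seconds since_last_report) {
    return cfg.kdump_on_stall && since_last_report >= cfg.stall_threshold;
}

// Requests a kernel crash by writing 'c' to the SysRq trigger file.
// Returns false if the file cannot be opened or written. On success with
// a real trigger path the kernel panics, so this call would not return.
bool trigger_sysrq_crash(const char* path = "/proc/sysrq-trigger") {
    std::ofstream trigger(path);
    if (!trigger) {
        return false;
    }
    trigger << 'c';
    trigger.flush();
    return trigger.good();
}
```

A caller would check should_force_crash_dump() on each watchdog audit and
call trigger_sysrq_crash() only when it returns true; with kdump_on_stall
disabled the existing recovery handling applies instead.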

Test Plan:

PASS: Verify kdump for pmon quorum health report message loss
PASS: Verify no kdump when kdump_on_stall is disabled
PASS: Verify handling when kdump service is not active
PASS: Verify sighup config change detection and handling

Regression:

PASS: Verify softdog timeout handling and logs
PASS: Verify quorum threshold config change and handling
PASS: Verify handling with reboot/reset recovery methods disabled
PASS: Verify enable reboot_on_err config change handling
PASS: Verify reboot/reset actions are ignored while host is locked
PASS: Verify pmon failure recovery handling before threshold reached

Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
Partial-Bug: 1894889
Depends-On: https://review.opendev.org/#/c/750806
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-13 12:38:16 -05:00
Makefile Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
alarmUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
alarmUtil.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
bmcUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
bmcUtil.h Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
fitCodes.h Add mtcAgent socket initialization failure retry handling. 2020-04-01 19:24:22 +00:00
hostClass.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
hostClass.h Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
hostUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
hostUtil.h Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
httpUtil.cpp MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00
httpUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
ipmiUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
ipmiUtil.h Redfish support for Sensor Monitoring in hwmond 2019-09-12 01:56:42 +08:00
jsonUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
jsonUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
keyClass.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
keyClass.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
logMacros.h Add SysRq crash dump support for pmon quorum health messaging loss 2020-11-13 12:38:16 -05:00
msgClass.cpp Fix mtce-common build error with gcc-8.2.1 2020-04-03 14:49:09 +08:00
msgClass.h Fix BMC access loss handling 2020-01-03 09:34:37 -05:00
nlEvent.cpp Fix heartbeat messaging when interface is set to 'lo' 2020-06-26 14:16:41 +00:00
nlEvent.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
nodeBase.cpp Modify Mtce Reinstall FSM to first power-off BMC provisioned hosts 2020-02-12 15:44:26 +00:00
nodeBase.h Fix heartbeat messaging when interface is set to 'lo' 2020-06-26 14:16:41 +00:00
nodeEvent.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeEvent.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeMacro.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeTimers.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
nodeTimers.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
nodeUtil.cpp Prevent pmond process recovery when system is not running 2020-06-15 11:09:47 -04:00
nodeUtil.h Prevent pmond process recovery when system is not running 2020-06-15 11:09:47 -04:00
pingUtil.cpp Fix BMC access loss handling 2020-01-03 09:34:37 -05:00
pingUtil.h Fix BMC access loss handling 2020-01-03 09:34:37 -05:00
redfishUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
redfishUtil.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
regexUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
regexUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
returnCodes.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
secretUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
secretUtil.h Improve BMC password first fetch handling in hwmon 2019-09-17 18:57:08 +00:00
threadUtil.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
threadUtil.h Enable protocol switch between ipmi and redfish for hwmon 2019-09-22 22:28:30 -04:00
timeUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
timeUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
tokenUtil.cpp Remove references to ceilometer in maintenance 2019-04-30 14:28:12 -04:00
tokenUtil.h MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00