metal/mtce/src
Eric MacDonald 2fc05673d1 Add SysRq crash dump support for pmon quorum health messaging loss
The hostwd process supports failure handling for two pmon
quorum failure modes.
 1. persistent pmon quorum process failure
 2. persistent absence of pmon's quorum health report

This update adds a new configuration option and associated
implementation required to force a crash dump action for
failure mode 2 above.

This means that if the Process Monitor itself gets stalled or stops
running for 3 (default config) minutes then the hostwd will trigger
a SysRq to force a crash dump.

Test Plan:

PASS: Verify kdump for pmon quorum health report message loss
PASS: Verify no kdump when kdump_on_stall is disabled
PASS: Verify handling when kdump service is not active
PASS: Verify sighup config change detection and handling

Regression:

PASS: Verify softdog timeout handling and logs
PASS: Verify quorum threshold config change and handling
PASS: Verify handling with reboot/reset recovery methods disabled
PASS: Verify enable reboot_on_err config change handling
PASS: Verify reboot/reset actions are ignored while host is locked
PASS: Verify pmon failure recovery handling before threshold reached

Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1
Partial-Bug: 1894889
Depends-On: https://review.opendev.org/#/c/750806
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-13 12:38:16 -05:00
..
alarm De-branding in starlingx/metal: Titanium Cloud -> StarlingX 2020-04-03 07:58:25 +02:00
common Fix Mtce Heartbeat period recovery on MNFA Exit 2020-09-18 01:34:11 +00:00
fsmon De-branding in starlingx/metal: Titanium Cloud -> StarlingX 2020-04-03 07:58:25 +02:00
fsync Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
heartbeat Fix heartbeat messaging when interface is set to 'lo' 2020-06-26 14:16:41 +00:00
hostw Add SysRq crash dump support for pmon quorum health messaging loss 2020-11-13 12:38:16 -05:00
hwmon Merge "Fix mtce build error with gcc-8.2.1" 2020-04-28 16:58:19 +00:00
lmon Fix mtce build error with gcc-8.2.1 2020-04-03 14:44:21 +08:00
maintenance Fix AIO SX Lazy Reboot race condition 2020-10-22 17:12:48 -04:00
mtclog Set restricted permissions for mtce logfiles 2019-07-17 18:19:52 -04:00
pmon Prevent pmond process recovery when system is not running 2020-06-15 11:09:47 -04:00
public Fix mtce build error with gcc-8.2.1 2020-04-03 14:44:21 +08:00
scripts Enhance crashDumpMgr with oversized crash dump protection 2020-10-28 16:51:54 +00:00
LICENSE Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
Makefile Remove Resource Monitor ; aka rmon, from the load 2019-03-19 16:12:38 -04:00