metal/mtce/src/maintenance
Eric MacDonald 031818e55b Add in-service test to clear stale config failure alarm
A configuration failure alarm can get stuck asserted if
the affected node experiences an uncontrolled reboot and
then recovers without a configuration failure.

This update adds an in-service test that audits host health
while a configuration failure alarm is raised and clears
that alarm if the failure condition goes away. The condition
can go away when an in-service manifest runs and corrects
the configuration, or when the node reboots and comes back
up in a healthy (properly configured) state.
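
The behavior described above can be pictured with a minimal sketch.
It is not the mtcAgent implementation: the Host struct, the Severity
enum, and config_in_service_test() are hypothetical names invented
for illustration, and raising the alarm at Critical severity is an
assumption. The degrade/fail split follows the test plan (degrade
for the active controller, fail for other hosts).

```cpp
#include <cassert>
#include <string>

// Hypothetical types for illustration; not the real mtce definitions.
enum class Severity { Clear, Minor, Major, Critical };

struct Host {
    std::string name;
    bool active_controller;  // true if this host is the active controller
    bool config_failed;      // latest health report shows a config failure
    Severity config_alarm;   // current config failure alarm severity
    bool degraded;
    bool failed;
};

// One pass of the in-service test for a single host.
// Returns true if the alarm state of the host was changed.
bool config_in_service_test(Host& h)
{
    const bool alarm_raised = (h.config_alarm != Severity::Clear);

    if (alarm_raised && !h.config_failed) {
        // Stale alarm: the failure condition went away, so clear it.
        h.config_alarm = Severity::Clear;
        return true;
    }
    if (!alarm_raised && h.config_failed) {
        // New failure: raise the alarm (severity is an assumption),
        // then degrade the active controller or fail any other host.
        h.config_alarm = Severity::Critical;
        if (h.active_controller) {
            h.degraded = true;
        } else {
            h.failed = true;
        }
        return true;
    }
    return false;  // alarm state already matches host health
}
```

Running the check periodically for each host would give the
self-correcting behavior: a stale alarm is cleared on the first pass
after the host reports healthy.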

Fix a bug that cleared the config alarm severity state
when a heartbeat clear event was received.

This update also goes a step further and introduces an
alarm state audit that detects and corrects maintenance
alarm state mismatches.
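
A state audit of this kind can be sketched as a comparison between
the alarm state maintenance expects and the alarms currently active
in the alarm service. This is only an illustration: the Correction
struct and audit_alarms() are hypothetical names, and treating the
maintenance view as authoritative is an assumption, not something
this commit message states. The audit is skipped when the alarm
service is not running, matching the "fm not running" test case.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical correction record; not part of the real mtce code.
struct Correction {
    std::string alarm_id;
    bool raise;  // true: re-raise a missing alarm; false: clear a stale one
};

// Compare the expected alarm state (alarm id -> asserted?) against the
// set of alarm ids currently active in the alarm service, and return
// the corrections needed to bring the two views back in line.
std::vector<Correction> audit_alarms(
    const std::map<std::string, bool>& expected,
    const std::set<std::string>& active,
    bool service_running)
{
    std::vector<Correction> fixes;
    if (!service_running) {
        return fixes;  // nothing to compare against; retry on a later audit
    }
    for (const auto& entry : expected) {
        const bool is_active = active.count(entry.first) != 0;
        if (entry.second && !is_active) {
            fixes.push_back({entry.first, true});   // expected but missing
        } else if (!entry.second && is_active) {
            fixes.push_back({entry.first, false});  // active but not expected
        }
    }
    return fixes;
}
```

Returning a list of corrections rather than applying them in place
keeps the comparison easy to test and lets the caller log only when
the list is non-empty.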

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail    - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller     - recoverable degrade
PASS: ... other nodes           - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-29 16:39:52 -04:00
Makefile Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcAlarm.cpp Add in-service test to clear stale config failure alarm 2021-03-29 16:39:52 -04:00
mtcAlarm.h Add in-service test to clear stale config failure alarm 2021-03-29 16:39:52 -04:00
mtcBmcUtil.cpp Improve mtcAgent interrupted thread cleanup 2021-03-15 10:51:16 -04:00
mtcBmcUtil.h Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
mtcCmdHdlr.cpp Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
mtcCompMsg.cpp Merge "Make mtcClient stop collectd before shutdown" 2021-02-04 14:21:13 +00:00
mtcCtrlMsg.cpp Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcHttpSvr.cpp Fix Mtce's VIM systems query handling 2019-10-09 09:44:35 -04:00
mtcHttpSvr.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcHttpUtil.cpp MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00
mtcHttpUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcInvApi.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
mtcInvApi.h Fix format-overflow warning in mtcInvApi 2019-08-27 10:33:44 -05:00
mtcNodeComp.cpp Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcNodeComp.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcNodeCtrl.cpp Add in-service test to clear stale config failure alarm 2021-03-29 16:39:52 -04:00
mtcNodeFsm.cpp Improve mtcAgent interrupted thread cleanup 2021-03-15 10:51:16 -04:00
mtcNodeFsm.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcNodeHdlrs.cpp Add in-service test to clear stale config failure alarm 2021-03-29 16:39:52 -04:00
mtcNodeHdlrs.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcNodeMnfa.cpp Fix Graceful Recovery handling while in Graceful Recovery handling 2021-03-17 14:25:19 -04:00
mtcNodeMsg.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcSmgrApi.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcSmgrApi.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcStubs.cpp Implement Active-Active Heartbeat as HA Improvement Fix 2018-12-10 09:57:34 -05:00
mtcSubfHdlrs.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
mtcThreads.cpp Refactor BMC provisioning in Maintenance 2019-12-09 09:39:49 -05:00
mtcThreads.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
mtcVimApi.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcVimApi.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcWorkQueue.cpp [Trivial Fix] fix typos in docstrings 2019-02-21 14:46:06 +08:00