metal

Eric MacDonald 031818e55b Add in-service test to clear stale config failure alarm	2021-03-29 16:39:52 -04:00

A configuration failure alarm can get stuck asserted if that node
experiences an uncontrolled reboot that recovers without a
configuration failure.

This update adds an in-service test that audits host health while
there is a configuration failure alarm raised and clears that alarm
if the failure condition goes away. This could be a result of an
in-service manifest that runs and corrects the configuration or
if the node reboots and comes back up in a healthy (properly
configured) state.

Fixed bug that was clearing config alarm severity state when a
heartbeat clear event is received.

This update also goes a step further and introduces an alarm state
audit that detects and corrects maintenance alarm state mismatches.

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller - recoverable degrade
PASS: ... other nodes - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
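
The two behaviors described in the commit message above (an in-service test that clears a stale config failure alarm, and an audit that corrects alarm state mismatches) can be illustrated with a minimal C++ sketch. This is not the code from mtcAlarm.cpp or mtcNodeHdlrs.cpp: the Host struct, the Severity enum, and both function names are hypothetical, and the audit assumes that maintenance's in-memory state is the side corrected to match the alarm manager's active list, which the commit message does not specify.

```cpp
#include <map>
#include <string>

// Hypothetical types for illustration only; not taken from the mtce sources.
enum class Severity { Clear, Minor, Major, Critical };

struct Host {
    std::string name;
    bool config_failed;     // true while the host still reports a config failure
    Severity config_alarm;  // maintenance's in-memory view of the config alarm
};

// In-service test: while a config failure alarm is raised, check host health
// and clear the alarm once the failure condition is gone.
// Returns true if a stale alarm was cleared.
bool in_service_config_test(Host& host) {
    if (host.config_alarm != Severity::Clear && !host.config_failed) {
        host.config_alarm = Severity::Clear;
        return true;
    }
    return false;
}

// Alarm state audit: compare the in-memory alarm state of every host with
// the list of active alarms reported by the alarm manager. A host missing
// from the list is treated as having no alarm.
// ASSUMPTION: the in-memory state is corrected to match the reported list.
// Returns the number of mismatches corrected.
int audit_alarm_state(std::map<std::string, Host>& hosts,
                      const std::map<std::string, Severity>& active) {
    int corrected = 0;
    for (auto& entry : hosts) {
        Host& host = entry.second;
        auto it = active.find(host.name);
        Severity reported = (it == active.end()) ? Severity::Clear : it->second;
        if (host.config_alarm != reported) {
            host.config_alarm = reported;
            ++corrected;
        }
    }
    return corrected;
}
```

In this sketch the in-service test is a no-op for a host whose failure condition persists, so a genuine config failure keeps its alarm until the host recovers.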
Makefile	Add support for peer controller reset via mtcClient	2021-01-14 16:44:14 -05:00
mtcAlarm.cpp	Add in-service test to clear stale config failure alarm	2021-03-29 16:39:52 -04:00
mtcAlarm.h	Add in-service test to clear stale config failure alarm	2021-03-29 16:39:52 -04:00
mtcBmcUtil.cpp	Improve mtcAgent interrupted thread cleanup	2021-03-15 10:51:16 -04:00
mtcBmcUtil.h	Add redfish support detection to maintenance	2019-08-19 14:03:37 +00:00
mtcCmdHdlr.cpp	Add redfish power/reset/reinstall bmc support to maintenance	2019-09-26 15:59:35 -04:00
mtcCompMsg.cpp	Merge "Make mtcClient stop collectd before shutdown"	2021-02-04 14:21:13 +00:00
mtcCtrlMsg.cpp	Add support for peer controller reset via mtcClient	2021-01-14 16:44:14 -05:00
mtcHttpSvr.cpp	Fix Mtce's VIM systems query handling	2019-10-09 09:44:35 -04:00
mtcHttpSvr.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcHttpUtil.cpp	MTCE: reading BMC passwords from Barbican secret storage.	2019-02-14 09:04:46 -05:00
mtcHttpUtil.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcInvApi.cpp	Refactor BMC provisioning in Maintenance	2019-12-09 09:39:49 -05:00
mtcInvApi.h	Fix format-overflow warning in mtcInvApi	2019-08-27 10:33:44 -05:00
mtcNodeComp.cpp	Add support for peer controller reset via mtcClient	2021-01-14 16:44:14 -05:00
mtcNodeComp.h	Add support for peer controller reset via mtcClient	2021-01-14 16:44:14 -05:00
mtcNodeCtrl.cpp	Add in-service test to clear stale config failure alarm	2021-03-29 16:39:52 -04:00
mtcNodeFsm.cpp	Improve mtcAgent interrupted thread cleanup	2021-03-15 10:51:16 -04:00
mtcNodeFsm.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcNodeHdlrs.cpp	Add in-service test to clear stale config failure alarm	2021-03-29 16:39:52 -04:00
mtcNodeHdlrs.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcNodeMnfa.cpp	Fix Graceful Recovery handling while in Graceful Recovery handling	2021-03-17 14:25:19 -04:00
mtcNodeMsg.h	Add support for peer controller reset via mtcClient	2021-01-14 16:44:14 -05:00
mtcSmgrApi.cpp	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcSmgrApi.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcStubs.cpp	Implement Active-Active Heartbeat as HA Improvement Fix	2018-12-10 09:57:34 -05:00
mtcSubfHdlrs.cpp	Refactor BMC provisioning in Maintenance	2019-12-09 09:39:49 -05:00
mtcThreads.cpp	Refactor BMC provisioning in Maintenance	2019-12-09 09:39:49 -05:00
mtcThreads.h	Add redfish power/reset/reinstall bmc support to maintenance	2019-09-26 15:59:35 -04:00
mtcVimApi.cpp	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcVimApi.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
mtcWorkQueue.cpp	[Trivial Fix] fix typos in docstrings	2019-02-21 14:46:06 +08:00