A configuration failure alarm can get stuck asserted if
that node experiences an uncontrolled reboot that recovers
without a configuration failure.
This update adds an in-service test that audits host health
while there is a configuration failure alarm raised and
clear that alarm if the failure condition goes away. This
could be a result of an in-service manifest that runs and
corrects the configuration or if the node reboots and comes
back up in a healthy (properly configured) state.
Fixed bug that was clearing config alarm severity state
when a heartbeat clear event is received.
This update also goes a step further and introduces an
alarms state audit that detects and corrects maintenance
alarm state mismatches.
Test Plan:
PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
... degrade - active controller
... fail - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch
Regression:
PASS: Verify enable sequence config failure handling
PASS: ... active controller - recoverable degrade
PASS: ... other nodes - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install
Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>