metal/mtce/src
Eric MacDonald 031818e55b Add in-service test to clear stale config failure alarm
A configuration failure alarm can get stuck asserted if
the affected node experiences an uncontrolled reboot and
recovers without a configuration failure.

This update adds an in-service test that audits host health
while a configuration failure alarm is raised and clears
that alarm if the failure condition goes away. This can
happen when an in-service manifest runs and corrects the
configuration, or when the node reboots and comes back up
in a healthy (properly configured) state.
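
The clearing behavior described above can be sketched as follows. This is a minimal illustration in Python, not the maintenance code itself; the `Host` record, its field names, and `in_service_config_test` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host record; field names are illustrative only."""
    name: str
    config_alarm_raised: bool = False  # config failure alarm currently asserted
    config_failed: bool = False        # latest health report shows a config failure


def in_service_config_test(host: Host) -> bool:
    """Clear a stale config failure alarm; return True if one was cleared."""
    if not host.config_alarm_raised:
        return False  # nothing to audit while no alarm is raised
    if host.config_failed:
        return False  # failure condition still present, keep the alarm
    host.config_alarm_raised = False  # host is healthy again, alarm is stale
    return True
```

Calling `in_service_config_test` periodically for each host covers both recovery paths named above, since either one ends with the host reporting healthy while the alarm is still raised.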

Also fixes a bug that cleared the config alarm severity
state when a heartbeat clear event was received.

This update also goes a step further and introduces an
alarm state audit that detects and corrects maintenance
alarm state mismatches.
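
One way to picture such an audit is sketched below in Python. It is a sketch under stated assumptions: the function name, the alarm identifiers, and the choice to treat the maintenance side's expected state as the source of truth are all hypothetical and are not taken from the change itself.

```python
from typing import Dict, Optional, Set


def audit_alarm_state(expected: Dict[str, bool],
                      active: Optional[Set[str]]) -> Dict[str, str]:
    """Return corrective actions keyed by alarm id.

    expected: alarm id -> whether maintenance believes it should be raised
              (assumed here to be the source of truth).
    active:   alarm ids the alarm service reports as raised, or None when
              the alarm service cannot be queried.
    """
    if active is None:
        return {}  # alarm service unavailable; try again on the next audit
    actions: Dict[str, str] = {}
    for alarm_id, should_be_raised in expected.items():
        is_raised = alarm_id in active
        if should_be_raised and not is_raised:
            actions[alarm_id] = "raise"  # alarm missing from the active list
        elif is_raised and not should_be_raised:
            actions[alarm_id] = "clear"  # stale alarm still in the active list
    return actions
```

For example, `audit_alarm_state({"config": False, "heartbeat": True}, {"config"})` returns a "clear" action for `config` and a "raise" action for `heartbeat`, while passing `None` for the active list yields no actions.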

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail    - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller     - recoverable degrade
PASS: ... other nodes           - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-29 16:39:52 -04:00
alarm        De-branding in starlingx/metal: Titanium Cloud -> StarlingX           2020-04-03 07:58:25 +02:00
common       Add in-service test to clear stale config failure alarm               2021-03-29 16:39:52 -04:00
fsmon        De-branding in starlingx/metal: Titanium Cloud -> StarlingX           2020-04-03 07:58:25 +02:00
fsync        Decouple Guest-server/agent from stx-metal                            2018-09-18 17:15:08 -04:00
heartbeat    Merge "Mtce heartbeat cluster state change notification improvement"  2021-01-18 16:15:27 +00:00
hostw        Fix mtce compiling issue with gcc8                                    2021-02-24 18:21:00 -05:00
hwmon        Add NonRecoverable property to Hardware Monitor's Redfish             2021-03-11 11:13:59 -05:00
lmon         Fix mtce build error with gcc-8.2.1                                   2020-04-03 14:44:21 +08:00
maintenance  Add in-service test to clear stale config failure alarm               2021-03-29 16:39:52 -04:00
mtclog       Set restricted permissions for mtce logfiles                          2019-07-17 18:19:52 -04:00
pmon         Add alarmed process audit to Process Monitor                          2021-03-09 08:22:32 -05:00
public       Fix mtce build error with gcc-8.2.1                                   2020-04-03 14:44:21 +08:00
scripts      Fix reinstall of worker nodes                                         2021-02-26 19:07:40 +02:00
LICENSE      Decouple Guest-server/agent from stx-metal                            2018-09-18 17:15:08 -04:00
Makefile     Remove Resource Monitor ; aka rmon, from the load                     2019-03-19 16:12:38 -04:00