metal/mtce-common/src/common
Eric MacDonald f2fedc0446 Add alarm retry support to maintenance alarm handling daemon
The maintenance alarm handling daemon (mtcalarmd) should not
drop alarm requests simply because the FM process is not running.
Instead, it should retry in that case and in other FM error cases
that are likely to succeed in time if retried.

Some error cases, however, do need to be dropped, such as those
that are unlikely to succeed with retries.

Reviewed the FM return codes with the FM designer, which led to a
list of errors that should be dropped and others that should be retried.

This update implements that handling by posting alarm requests to,
and servicing them from, a first-in / first-out alarm queue.

A typical retry case is the NOCONNECT error code, which occurs
when FM is not running.

Alarm ordering and the first-try timestamp are maintained.
Retries and logs are throttled to avoid flooding.
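
The queue handling described above can be sketched roughly as follows.
This is a minimal illustration, not the mtcalarmd code: the names
AlarmQueue, AlarmRequest and SendResult are hypothetical, and the
three-way Ok / Retry / Drop result stands in for the reviewed FM
return codes, which are not listed in this message. Throttling is
assumed to be done by the caller, for example by invoking service()
from a periodic timer.

```cpp
#include <cstddef>
#include <ctime>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical classification of an FM send attempt (assumption).
enum class SendResult { Ok, Retry, Drop };

// One queued alarm request. The first-try timestamp is recorded once
// at post time and is never modified by later retries.
struct AlarmRequest {
    std::string payload;
    std::time_t first_try;
};

class AlarmQueue {
public:
    using Sender = std::function<SendResult(const AlarmRequest&)>;

    explicit AlarmQueue(Sender sender) : sender_(std::move(sender)) {}

    // Post a request to the tail of the queue.
    void post(const std::string& payload, std::time_t now) {
        queue_.push_back(AlarmRequest{payload, now});
    }

    // Service the queue from the head. Stop at the first request that
    // needs a retry so that later alarms never overtake earlier ones.
    // Returns the number of requests removed (sent or dropped).
    std::size_t service() {
        std::size_t handled = 0;
        while (!queue_.empty()) {
            SendResult rc = sender_(queue_.front());
            if (rc == SendResult::Retry) {
                break;  // leave the head in place; try again next pass
            }
            queue_.pop_front();  // Ok: delivered; Drop: not worth retrying
            ++handled;
        }
        return handled;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    Sender sender_;
    std::deque<AlarmRequest> queue_;
};
```

Stopping at the first retryable failure is what keeps ordering intact:
a request that cannot be delivered yet blocks the ones behind it rather
than being requeued at the tail.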

Test Plan:

PASS: Verify success path alarm handling End-to-End.
PASS: Verify retry handling while FM is not running.
PASS: Verify handling of all FM error codes (fit tool).
PASS: Verify alarm handling under stress (inject-alarm script) soak.
PASS: Verify no memory leak over stress soak.
PASS: Verify logging (success, retry, failure).
PASS: Verify alarm posted date is maintained over retry success.

Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
Closes-Bug: 1841653
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-07 09:07:49 -04:00
Makefile Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
alarmUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
alarmUtil.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
bmcUtil.cpp Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
bmcUtil.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
fitCodes.h Add alarm retry support to maintenance alarm handling daemon 2019-10-07 09:07:49 -04:00
hostClass.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
hostClass.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
hostUtil.cpp MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00
hostUtil.h Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
httpUtil.cpp MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00
httpUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
ipmiUtil.cpp Redfish support for Sensor Monitoring in hwmond 2019-09-12 01:56:42 +08:00
ipmiUtil.h Redfish support for Sensor Monitoring in hwmond 2019-09-12 01:56:42 +08:00
jsonUtil.cpp Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
jsonUtil.h Remove all nova and libvirt files from mtce-common 2019-03-19 15:23:36 -05:00
keyClass.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
keyClass.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
logMacros.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
msgClass.cpp Add 50 byte hostname support to maintenance 2019-07-12 12:20:08 +00:00
msgClass.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nlEvent.cpp Add 50 byte hostname support to maintenance 2019-07-12 12:20:08 +00:00
nlEvent.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
nodeBase.cpp Fix maintenance cluster-host messaging 2019-07-18 14:54:45 -04:00
nodeBase.h Add alarm retry support to maintenance alarm handling daemon 2019-10-07 09:07:49 -04:00
nodeEvent.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeEvent.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeMacro.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
nodeTimers.cpp Add network boot support to mtce reinstall handling 2019-05-23 18:30:04 -04:00
nodeTimers.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
nodeUtil.cpp Fix maintenance cluster-host messaging 2019-07-18 14:54:45 -04:00
nodeUtil.h Make Mtce system mode scan case in-sensitive 2019-05-06 19:14:14 +00:00
pingUtil.cpp Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
pingUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
redfishUtil.cpp Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
redfishUtil.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
regexUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
regexUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
returnCodes.h Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
secretUtil.cpp Improve BMC password first fetch handling in hwmon 2019-09-17 18:57:08 +00:00
secretUtil.h Improve BMC password first fetch handling in hwmon 2019-09-17 18:57:08 +00:00
threadUtil.cpp Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
threadUtil.h Enable protocol switch between ipmi and redfish for hwmon 2019-09-22 22:28:30 -04:00
timeUtil.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
timeUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
tokenUtil.cpp Remove references to ceilometer in maintenance 2019-04-30 14:28:12 -04:00
tokenUtil.h MTCE: reading BMC passwords from Barbican secret storage. 2019-02-14 09:04:46 -05:00