metal/mtce-common/cgts-mtce-common-1.0
Eric MacDonald acd2d684f6 Mtce: Debouce heartbeat recovery
When a host suffers a Heartbeat Failure, the Mtce Heartbeat Agent
declares heartbeat recovery upon the first successful heartbeat reply
after the loss is declared; in effect, recovery is edge triggered.

In cases where a networking issue causes heartbeat loss of a group of
hosts, Maintenance tracks the group of hosts that experienced heartbeat
loss and puts the system into 'Multi Node Failure Avoidance' (MNFA)
mode. Maintenance then simply waits up to a configured timeout period
for hosts to regain heartbeat.
As heartbeat is regained for each host, Maintenance attempts to
'Gracefully Recover' that host.

However, if the networking issue persists in a way that lets an
occasional transient heartbeat pulse get through, the maintenance
system can prematurely take hosts, and then the system itself, out of
MNFA mode. It then finds that heartbeat has not actually recovered,
fails each node that is still experiencing heartbeat loss and forces a
reboot/reset of it.

This update changes the heartbeat service from 'edge' to 'level'
sensitive recovery by requiring a number of back-to-back heartbeat
pulses following a failure before that host is declared recovered and
pulled out of the MNFA pool.
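
The 'level' sensitive recovery described above can be sketched as a
per-host counter of consecutive good pulses. This is only an
illustration, not the Mtce implementation: the class, the function
names and the threshold of 3 are all assumptions, since this message
does not state how many back-to-back pulses are required.

```python
from dataclasses import dataclass

# Hypothetical threshold; the commit message does not give the real value.
RECOVERY_THRESHOLD = 3


@dataclass
class HostHeartbeat:
    """Per-host heartbeat state (illustrative, not the Mtce data model)."""
    name: str
    failed: bool = False
    good_pulses: int = 0  # consecutive replies seen since the failure

    def declare_failure(self) -> None:
        """Mark the host as failed and restart the recovery count."""
        self.failed = True
        self.good_pulses = 0

    def on_pulse(self, reply_received: bool) -> bool:
        """Process one heartbeat period; return True when recovery is declared."""
        if not self.failed:
            return False
        if not reply_received:
            # A missed reply breaks the run, so counting starts over.
            self.good_pulses = 0
            return False
        self.good_pulses += 1
        if self.good_pulses >= RECOVERY_THRESHOLD:
            # Enough back-to-back pulses: the host may leave the failed pool.
            self.failed = False
            self.good_pulses = 0
            return True
        return False


def recovered_at(pulses):
    """Return the index of the pulse that triggers recovery, or None."""
    host = HostHeartbeat("compute-0")
    host.declare_failure()
    for index, reply_received in enumerate(pulses):
        if host.on_pulse(reply_received):
            return index
    return None
```

With this sketch, an isolated transient pulse such as
`recovered_at([True, False, True, False])` never declares recovery,
whereas edge triggered recovery would have declared it on the very
first reply.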

In short, this update makes the system's MNFA recovery algorithm more
robust in the face of transient heartbeat loss for a group of hosts.

Story: 2002882
Task: 22845

Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:19 -04:00
..
alarm StarlingX open source release updates 2018-05-31 07:36:43 -07:00
common Mtce: Debouce heartbeat recovery 2018-07-20 11:12:19 -04:00
daemon Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1 2018-07-03 11:04:27 -04:00
fsmon StarlingX open source release updates 2018-05-31 07:36:43 -07:00
guest Shorten "addons/wr-cgcs/layers/cgcs" to just "stx" 2018-07-04 11:03:59 -04:00
heartbeat Mtce: Debouce heartbeat recovery 2018-07-20 11:12:19 -04:00
hostw StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hwmon Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00
maintenance Mtce: Re-add explicit request for mtcAlive in Graceful Recovery handler 2018-07-20 11:11:59 -04:00
mtclog StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pmon pmond: add support for no script label in conf files 2018-07-01 21:18:33 -04:00
public StarlingX open source release updates 2018-05-31 07:36:43 -07:00
rmon Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1 2018-07-03 11:04:27 -04:00
scripts StarlingX open source release updates 2018-05-31 07:36:43 -07:00
.gitignore StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
Makefile StarlingX open source release updates 2018-05-31 07:36:43 -07:00