StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald acd2d684f6 Mtce: Debounce heartbeat recovery
When heartbeat to a host fails, the Mtce Heartbeat Agent declares
heartbeat recovery on the first successful heartbeat reply received
after the loss is declared; in effect, recovery is edge triggered.

In cases where a networking issue causes heartbeat loss for a group of
hosts, Maintenance tracks the hosts that experienced heartbeat loss and
puts the system into 'Multi Node Failure Avoidance' (MNFA) mode.
Maintenance then waits up to a configured timeout period for the hosts
to regain heartbeat. As each host regains heartbeat, Maintenance
attempts to 'Gracefully Recover' that host.

However, if the networking issue persists in a way that lets an
occasional transient heartbeat pulse get through, Maintenance can
prematurely take hosts, and then the system as a whole, out of MNFA
mode. It then finds that heartbeat has not actually recovered, declares
failure, and force reboots/resets each node that is still experiencing
heartbeat loss.

This update changes the heartbeat service from 'edge' to 'level'
sensitive recovery by requiring a number of back-to-back heartbeat
pulses following a failure before that host is declared recovered and
pulled out of the MNFA pool.
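
The debounce idea can be illustrated with a minimal Python sketch. It
is not the mtce implementation (which is not shown here); the class,
the function names and the threshold of 3 pulses are all hypothetical,
since the commit message does not state the actual count. A missed
reply resets the counter, so only back-to-back pulses count toward
recovery.

```python
from dataclasses import dataclass

# Hypothetical threshold; the commit message does not give the real value.
RECOVERY_THRESHOLD = 3


@dataclass
class HostHeartbeat:
    """Per-host heartbeat state (illustrative only, not the mtce data model)."""
    name: str
    failed: bool = False
    good_pulses: int = 0  # consecutive replies seen since the failure

    def declare_failed(self) -> None:
        """Heartbeat loss has been declared for this host."""
        self.failed = True
        self.good_pulses = 0

    def on_miss(self) -> None:
        """A reply was missed; discard any progress toward recovery."""
        self.good_pulses = 0

    def on_pulse(self) -> bool:
        """Record a reply; return True only when recovery is declared."""
        if not self.failed:
            return False
        self.good_pulses += 1
        if self.good_pulses >= RECOVERY_THRESHOLD:
            self.failed = False
            self.good_pulses = 0
            return True
        return False


def replay(events: str) -> bool:
    """Replay a string of '1' (reply) and '0' (miss) against a failed host.

    Returns True if the host was declared recovered at any point.
    """
    host = HostHeartbeat("compute-0")
    host.declare_failed()
    recovered = False
    for event in events:
        if event == "1":
            recovered = host.on_pulse() or recovered
        else:
            host.on_miss()
    return recovered
```

With this sketch, an intermittent pattern such as replay("1010") never
reaches the threshold, while replay("0111") does. Under the previous
edge triggered behaviour, the first '1' alone would have been treated
as recovery.
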

This update makes the system's MNFA recovery algorithm more robust in
the face of transient heartbeat loss for a group of hosts.

Story: 2002882
Task: 22845

Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 11:12:19 -04:00
bsp-files Merge "Update upgrade version to 18.03" 2018-07-10 17:14:33 +00:00
installer Update boot configs to match CentOS 7.5 kernel 2018-07-06 11:26:06 -04:00
kickstart Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-common Mtce: Debounce heartbeat recovery 2018-07-20 11:12:19 -04:00
mtce-compute Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-control Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-storage Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
.gitignore Add default test framework 2018-06-11 18:51:02 -05:00
.gitreview Add .gitreview 2018-05-31 07:36:43 -07:00
.zuul.yaml Remove non-voting gate job 2018-06-29 14:31:56 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst StarlingX open source release updates 2018-05-31 07:36:43 -07:00
centos_pkg_dirs Split centos-pkg-dirs along git boundaries. 2018-06-20 16:25:33 -04:00
mwa-beas.map StarlingX open source release updates 2018-05-31 07:36:43 -07:00
test-requirements.txt Add default test framework 2018-06-11 18:51:02 -05:00
tox.ini Add default test framework 2018-06-11 18:51:02 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management