StarlingX Bare Metal and Node Management, Hardware Maintenance

Eric MacDonald acd2d684f6 Mtce: Debounce heartbeat recovery

When heartbeat failure is declared for a host, the Mtce Heartbeat Agent declares heartbeat recovery upon the first successful heartbeat reply after the loss is declared; in other words, recovery is edge triggered.

When a networking issue causes heartbeat loss for a group of hosts, maintenance tracks the group of hosts that experienced heartbeat loss and puts the system into 'Multi Node Failure Avoidance' (MNFA) mode. Maintenance then waits up to a configured timeout period for hosts to regain heartbeat. As heartbeat is regained for each host, a 'Graceful Recovery' of that host is attempted.

However, if the networking issue persists in a way that lets an occasional transient heartbeat pulse get through, maintenance can prematurely take hosts, and then the system, out of MNFA mode, only to find that heartbeat has not actually recovered. It then fails and force reboots/resets each node that is still experiencing heartbeat loss.

This update changes the heartbeat service from 'edge' to 'level' sensitive recovery by requiring a number of back-to-back heartbeat pulses following a failure before that host is declared recovered and pulled out of the MNFA pool. This makes the system's MNFA recovery algorithm more robust in the face of transient heartbeat loss for a group of hosts.

Story: 2002882
Task: 22845
Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1
Signed-off-by: Jack Ding <jack.ding@windriver.com>

2018-07-20 11:12:19 -04:00
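The difference between edge and level sensitive recovery described in the commit message can be illustrated with a small sketch. This is not the code from mtce-common; the class name `HeartbeatDebouncer`, its methods, and the default threshold of 3 pulses are all hypothetical, chosen only to show the idea of requiring consecutive heartbeat replies before a failed host is treated as recovered.

```python
class HeartbeatDebouncer:
    """Illustrative per-host recovery tracker (hypothetical, not the Mtce code)."""

    def __init__(self, required_pulses=3):
        # required_pulses is an assumed tunable; 1 would reproduce the old
        # edge-triggered behaviour (recover on the first reply).
        if required_pulses < 1:
            raise ValueError("required_pulses must be at least 1")
        self.required_pulses = required_pulses
        self.failed = False
        self.consecutive = 0

    def declare_failure(self):
        """Mark the host as having lost heartbeat and restart the count."""
        self.failed = True
        self.consecutive = 0

    def on_pulse(self):
        """Record a heartbeat reply; return True only when recovery is declared."""
        if not self.failed:
            return False
        self.consecutive += 1
        if self.consecutive >= self.required_pulses:
            self.failed = False
            self.consecutive = 0
            return True
        return False

    def on_miss(self):
        """A missed reply breaks the back-to-back run for a failed host."""
        if self.failed:
            self.consecutive = 0


if __name__ == "__main__":
    host = HeartbeatDebouncer(required_pulses=3)
    host.declare_failure()
    # A single transient pulse followed by a miss does not recover the host.
    print(host.on_pulse())
    host.on_miss()
    # Three back-to-back pulses are needed before recovery is declared.
    print([host.on_pulse() for _ in range(3)])
```

With this kind of counter, an isolated pulse that slips through a still-broken network resets to zero on the next miss, so the host stays in the failed set until replies arrive without interruption.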
bsp-files	Merge "Update upgrade version to 18.03"	2018-07-10 17:14:33 +00:00
installer	Update boot configs to match CentOS 7.5 kernel	2018-07-06 11:26:06 -04:00
kickstart	Rename mwa-* subdirectories to match the git repo name	2018-07-03 16:29:24 -04:00
mtce-common	Mtce: Debounce heartbeat recovery	2018-07-20 11:12:19 -04:00
mtce-compute	Rename mwa-* subdirectories to match the git repo name	2018-07-03 16:29:24 -04:00
mtce-control	Rename mwa-* subdirectories to match the git repo name	2018-07-03 16:29:24 -04:00
mtce-storage	Rename mwa-* subdirectories to match the git repo name	2018-07-03 16:29:24 -04:00
.gitignore	Add default test framework	2018-06-11 18:51:02 -05:00
.gitreview	Add .gitreview	2018-05-31 07:36:43 -07:00
.zuul.yaml	Remove non-voting gate job	2018-06-29 14:31:56 -05:00
CONTRIBUTORS.wrs	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
LICENSE	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
README.rst	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
centos_pkg_dirs	Split centos-pkg-dirs along git boundaries.	2018-06-20 16:25:33 -04:00
mwa-beas.map	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
test-requirements.txt	Add default test framework	2018-06-11 18:51:02 -05:00
tox.ini	Add default test framework	2018-06-11 18:51:02 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management