acd2d684f6
In the event of heartbeat failure for a host, the Mtce Heartbeat Agent declares heartbeat recovery upon the first successful heartbeat reply after the loss is declared; in other words, recovery is edge triggered. When a networking issue causes heartbeat loss for a group of hosts, Maintenance tracks the group of hosts that experienced heartbeat loss and puts the system into 'Multi Node Failure Avoidance' (MNFA) mode. Maintenance then simply waits up to a configured timeout period for hosts to regain heartbeat. As heartbeat is regained for each host, Maintenance attempts to 'Gracefully Recover' that host. However, if the networking issue persists in a way that lets an occasional transient heartbeat pulse get through, the maintenance system can prematurely take hosts, and then the system, out of MNFA mode, only to find that heartbeat has not actually recovered, and then fail and force a reboot/reset of each node that is still experiencing heartbeat loss. This update changes the heartbeat service from 'edge' to 'level' sensitive recovery by requiring a number of back-to-back heartbeat pulses following a failure before the host is declared recovered and pulled out of the MNFA pool. This makes the system's MNFA recovery algorithm more robust in the face of transient heartbeat loss for a group of hosts. Story: 2002882 Task: 22845 Change-Id: Ie36b73a14cfad317d900e3a3a9ddb434326737a1 Signed-off-by: Jack Ding <jack.ding@windriver.com> |
||
---|---|---|
bsp-files | ||
installer | ||
kickstart | ||
mtce-common | ||
mtce-compute | ||
mtce-control | ||
mtce-storage | ||
.gitignore | ||
.gitreview | ||
.zuul.yaml | ||
CONTRIBUTORS.wrs | ||
LICENSE | ||
README.rst | ||
centos_pkg_dirs | ||
mwa-beas.map | ||
test-requirements.txt | ||
tox.ini |
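
The difference between edge and level sensitive recovery described in the commit message above can be illustrated with a small sketch. This is not the Mtce Heartbeat Agent code: the class name `HeartbeatRecoveryTracker`, its methods, and the threshold of 3 pulses are all hypothetical, chosen only to show the idea of requiring a run of back-to-back replies before a failed host counts as recovered.

```python
class HeartbeatRecoveryTracker:
    """Illustrative per-host recovery tracker (hypothetical, not the mtce code)."""

    def __init__(self, required_pulses=3):
        # required_pulses is an assumed value; the commit message only
        # says "a number of" back-to-back pulses are required.
        if required_pulses < 1:
            raise ValueError("required_pulses must be at least 1")
        self.required_pulses = required_pulses
        self.failed = False
        self.consecutive_ok = 0

    def declare_failure(self):
        """Mark the host as having lost heartbeat."""
        self.failed = True
        self.consecutive_ok = 0

    def on_heartbeat(self, reply_received):
        """Process one heartbeat period.

        Returns True only in the period where recovery is declared.
        """
        if not self.failed:
            return False
        if reply_received:
            self.consecutive_ok += 1
        else:
            # A missed reply breaks the run, so a single transient
            # pulse cannot pull the host out of the failed pool.
            self.consecutive_ok = 0
        if self.consecutive_ok >= self.required_pulses:
            self.failed = False
            self.consecutive_ok = 0
            return True
        return False


# One transient reply, a miss, then three back-to-back replies.
tracker = HeartbeatRecoveryTracker(required_pulses=3)
tracker.declare_failure()
results = [tracker.on_heartbeat(r) for r in [True, False, True, True, True]]
```

With `required_pulses=1` the same class behaves like the original edge triggered recovery, declaring the host recovered on the first reply after a failure. With a higher threshold, the lone reply at the start of the sequence above is ignored and recovery is declared only on the third consecutive reply.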
README.rst
stx-metal
StarlingX Bare Metal Management