metal/mtce/src/heartbeat
Eric MacDonald 4fb3ce1121 Mtce: Improve robustness of heartbeat Loss reporting
Closes-Bug: 1806963

In the case where the active controller experiences a
spontaneous reboot failure there is the potential for
a race condition in the new Active-Active Heartbeat
model between the inactive hbsAgent and mtcAgent
starting up on the newly active controller.

The inactive hbsAgent can report a heartbeat Loss before
SM starts up the mtcAgent. As a result, the heartbeat-failed
host goes undetected.

This update modifies the hbsAgent to continue to report
heartbeat Loss at a throttled rate while the hbsAgent
continues to experience heartbeat loss of enabled monitored
hosts. This change is implemented in nodeClass.cpp.
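
The throttled re-reporting described above can be sketched roughly as
follows. This is a minimal illustration only, not the code in
nodeClass.cpp: the LossReporter class, its on_period method, the
once-per-period call pattern and the throttle value are all assumptions
made for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch; none of these names come from nodeClass.cpp.
class LossReporter {
public:
    using SendFn = std::function<void(const std::string& hostname)>;

    // throttle_periods: resend the Loss report once every N periods
    // while the loss persists (value chosen by the caller).
    LossReporter(int throttle_periods, SendFn send)
        : throttle_(throttle_periods), send_(std::move(send)) {
        assert(throttle_periods > 0);
        assert(send_);
    }

    // Assumed to be called once per heartbeat period for an enabled,
    // monitored host. 'loss' is true while the host is in heartbeat loss.
    void on_period(const std::string& hostname, bool loss) {
        if (!loss) {
            periods_in_loss_ = 0;  // recovery resets the throttle
            return;
        }
        // Report immediately on the first period of loss, then repeat
        // at the throttled rate so a late-starting receiver still
        // sees the event.
        if (periods_in_loss_ % throttle_ == 0) {
            send_(hostname);
        }
        ++periods_in_loss_;
    }

private:
    int throttle_;
    int periods_in_loss_ = 0;
    SendFn send_;
};
```

The point of the design is that a report sent before the receiver is
running is not the only one: as long as the loss persists, a later
report reaches the receiver once it has started.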

Debugging this issue also revealed another undesirable race
condition and logging issue when a controller is locked. This
issue is remedied by introducing a 'locked' state in the control
structure that is set on controller lock and checked in the
hbs_cluster_update utility. This change is implemented in
hbsCluster.cpp.
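
A rough sketch of the 'locked' state idea follows. It is not the code
in hbsCluster.cpp: the ControllerCtrl structure, the on_lock/on_unlock
helpers and the cluster_update signature are hypothetical, and the
behavior shown (skipping the update entirely while locked) is an
assumption about what checking the flag means.

```cpp
#include <cassert>
#include <string>

// Hypothetical control structure; field names are illustrative only.
struct ControllerCtrl {
    std::string hostname;
    bool locked = false;       // set on controller lock, cleared on unlock
    int  updates_applied = 0;  // number of cluster updates accepted
};

void on_lock(ControllerCtrl& ctrl)   { ctrl.locked = true; }
void on_unlock(ControllerCtrl& ctrl) { ctrl.locked = false; }

// Assumed behavior: while the controller is locked, its heartbeat
// results are neither folded into the cluster view nor logged.
// Returns true when the update was applied.
bool cluster_update(ControllerCtrl& ctrl, int responding, int monitored) {
    assert(responding >= 0 && responding <= monitored);
    if (ctrl.locked) {
        return false;  // skip: avoids racing with the lock operation
    }
    ++ctrl.updates_applied;
    return true;
}
```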

Two additional hbsAgent logging changes were implemented with
this update.

  1. Only print "missing peer controller cluster view" on a
     state change event. Otherwise, this becomes excessive
     whenever the inactive controller fails.
     hbsAgent.cpp

  2. Don't print the full heartbeat inventory and state banner
     with hbsInv.print_node_info on every heartbeat Loss event.
     Otherwise, this becomes excessive in larger systems.
     hbsCluster.cpp
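
The first logging change, printing only on a state change, can be
sketched as below. The PeerViewLogger class and its sink callback are
hypothetical names for illustration. Only the log text is taken from
the description above, and the actual logic in hbsAgent.cpp may differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of edge-triggered logging.
class PeerViewLogger {
public:
    using Sink = std::function<void(const std::string& line)>;

    explicit PeerViewLogger(Sink sink) : sink_(std::move(sink)) {
        assert(sink_);
    }

    // Called on every audit pass with the current condition.
    // Emits the log only on the transition from present to missing,
    // so a persistently failed peer produces one line, not one per pass.
    // Returns true when a line was emitted.
    bool update(bool view_missing) {
        const bool changed = (view_missing != last_missing_);
        last_missing_ = view_missing;
        if (changed && view_missing) {
            sink_("missing peer controller cluster view");
            return true;
        }
        return false;
    }

private:
    bool last_missing_ = false;
    Sink sink_;
};
```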

Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify locking the inactive controller several times.
PASS: Fail inactive controller several times. Verify detection.
PASS: Reboot active controller several times. Verify detection.
PASS: DOR system several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detection.

Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-20 15:46:03 -05:00
Makefile Set SHELL in Makefiles that use bash constructs 2018-12-07 14:09:48 -06:00
hbsAgent.cpp Mtce: Improve robustness of heartbeat Loss reporting 2018-12-20 15:46:03 -05:00
hbsAlarm.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
hbsAlarm.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
hbsBase.h Mtce: Improve robustness of heartbeat Loss reporting 2018-12-20 15:46:03 -05:00
hbsClient.cpp Mtce: Improve robustness of heartbeat Loss reporting 2018-12-20 15:46:03 -05:00
hbsCluster.cpp Mtce: Improve robustness of heartbeat Loss reporting 2018-12-20 15:46:03 -05:00
hbsCluster.h Mtce: Add heartbeat cluster information for SM query 2018-10-05 22:47:17 +00:00
hbsPmon.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
hbsStubs.cpp Implement Active-Active Heartbeat as HA Improvement 2018-11-20 19:57:18 +00:00
hbsUtil.cpp Implement Active-Active Heartbeat as HA Improvement Fix 2018-12-10 09:57:34 -05:00
mtceHbsCluster.h Mtce: Add heartbeat cluster information for SM query 2018-10-05 22:47:17 +00:00