StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 4fb3ce1121 Mtce: Improve robustness of heartbeat Loss reporting
Closes-Bug: 1806963

In the case where the active controller experiences a
spontaneous reboot failure there is the potential for
a race condition in the new Active-Active Heartbeat
model between the inactive hbsAgent and mtcAgent
starting up on the newly active controller.

The inactive hbsAgent can report a heartbeat Loss before
SM starts up the mtcAgent. As a result, the heartbeat
failure of that host goes undetected.

This update modifies the hbsAgent to continue to report
heartbeat Loss at a throttled rate while the hbsAgent
continues to experience heartbeat loss of enabled monitored
hosts. This change is implemented in nodeClass.cpp.
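
The throttled re-reporting described above can be sketched as
follows. This is a minimal illustration, not the code in
nodeClass.cpp: the class name LossReportThrottle, its methods and
the use of a per-host timestamp map are all assumptions made for
the example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical helper (not taken from nodeClass.cpp): decides whether
// a heartbeat Loss for a host should be reported, or re-reported, now.
class LossReportThrottle
{
public:
    explicit LossReportThrottle(std::uint64_t interval_secs)
        : interval_secs_(interval_secs) {}

    // Returns true when the Loss should be reported now: either this
    // is the first Loss seen for the host, or at least interval_secs_
    // have elapsed since the last report for that host.
    bool should_report(const std::string& hostname, std::uint64_t now_secs)
    {
        auto it = last_report_.find(hostname);
        if (it == last_report_.end())
        {
            last_report_[hostname] = now_secs;
            return true;
        }
        if (now_secs - it->second >= interval_secs_)
        {
            it->second = now_secs;
            return true;
        }
        return false;
    }

    // Call when the host's heartbeat recovers, so that the next Loss
    // is reported immediately rather than being throttled.
    void clear(const std::string& hostname)
    {
        last_report_.erase(hostname);
    }

private:
    std::uint64_t interval_secs_;
    std::unordered_map<std::string, std::uint64_t> last_report_;
};
```

Because the Loss keeps being reported while it persists, a
mtcAgent that starts late still receives a report once it is up.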

Debug of this issue also revealed another undesirable race
condition and logging issue when a controller is locked. This
issue is remedied with the introduction of a control structure
'locked' state that is set on controller lock and checked in
the hbs_cluster_update utility. This change is implemented in
hbsCluster.cpp.
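
One plausible shape for such a guard is sketched below. The
struct and function names are hypothetical, and the sketch
assumes the cluster update is simply skipped while the controller
is locked; the actual behaviour in hbsCluster.cpp may differ.

```cpp
// Hypothetical control structure; the field name is illustrative only.
struct HbsControl
{
    bool locked = false;  // set when this controller is locked
};

// Stand-in for the cluster state that an update would modify.
struct ClusterView
{
    int updates = 0;
};

// Assumed behaviour: skip the update entirely while locked, so a
// locked controller produces no cluster changes and no related logs.
// Returns true if the view was updated.
bool cluster_update(const HbsControl& ctrl, ClusterView& view)
{
    if (ctrl.locked)
    {
        return false;
    }
    ++view.updates;
    return true;
}
```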

Two additional hbsAgent logging changes were implemented with
this update.

  1. Only print "missing peer controller cluster view" on a
     state change event. Otherwise, this becomes excessive
     whenever the inactive controller fails.
     hbsAgent.cpp

  2. Don't print the full heartbeat inventory and state banner
     with hbsInv.print_node_info on every heartbeat Loss event.
     Otherwise, this becomes excessive in larger systems.
     hbsCluster.cpp
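
Logging change 1 amounts to edge-triggered logging. A minimal
sketch follows, assuming a hypothetical PeerViewLogger class that
remembers the previous state; it is not the code in hbsAgent.cpp,
and the vector stands in for the real log stream.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: emits the "missing peer" log only when the
// state changes from present to missing, not on every audit pass.
class PeerViewLogger
{
public:
    // Returns true if a log line was appended to 'log'.
    bool update(bool peer_view_missing, std::vector<std::string>& log)
    {
        const bool changed = (peer_view_missing != missing_);
        missing_ = peer_view_missing;
        if (changed && missing_)
        {
            log.push_back("missing peer controller cluster view");
            return true;
        }
        return false;
    }

private:
    bool missing_ = false;  // last observed state
};
```

With this shape, a peer that stays failed produces one log line
rather than one per heartbeat period.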

Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify Lock inactive controller several times.
PASS: Fail inactive controller several times. Verify detect.
PASS: Reboot active controller several times. Verify detect.
PASS: DOR System several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detect.

Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-20 15:46:03 -05:00
api-ref/source [Doc] openstackdocstheme starlingxdocs theme 2018-10-22 14:37:08 +00:00
bsp-files Refactor patches for openstack-aodh package 2018-11-29 00:12:38 +08:00
doc [Doc] openstackdocstheme starlingxdocs theme 2018-10-22 14:37:08 +00:00
installer Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
inventory SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
kickstart Add python2-ruamel-yaml to controllers 2018-11-08 15:15:02 +00:00
mtce Mtce: Improve robustness of heartbeat Loss reporting 2018-12-20 15:46:03 -05:00
mtce-common Merge "Mtce: Add Thresholded Maintenance Enable Recovery support" 2018-12-13 15:57:44 +00:00
mtce-compute Merge "get rid of duplicate LICENSE files in 3 packages" 2018-10-31 00:58:33 +00:00
mtce-control Implement Active-Active Heartbeat as HA Improvement 2018-11-20 19:57:18 +00:00
mtce-storage get rid of duplicate LICENSE files in 3 packages 2018-10-30 02:55:34 +00:00
python-inventoryclient SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
releasenotes Merge "releasenotes: Grammar edit." 2018-10-30 17:27:12 +00:00
.gitignore [Doc] OpenStack API Reference Guide 2018-09-05 19:59:26 -05:00
.gitreview Add .gitreview 2018-05-31 07:36:43 -07:00
.zuul.yaml Add api-ref and relnotes publish jobs 2018-10-11 08:21:53 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst StarlingX open source release updates 2018-05-31 07:36:43 -07:00
centos_iso_image.inc SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
centos_pkg_dirs SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management