StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 675f49d556 Add mtcAgent support for sm_node_unhealthy condition
When heartbeat over both networks fails, mtcAgent
provides a 5 second grace period for heartbeat to
recover before failing the node.

However, when heartbeat fails over only one of the
networks (management or cluster), mtcAgent does not
honour that 5 second grace period. This is a bug.

When it comes to peer controller heartbeat failure
handling, SM needs that 5 second grace period to handle
swact before mtcAgent declares the peer controller as
failed, resets the node and updates the database.

This update forces a 2 second wait between fast enable
attempts and fixes the fast enable threshold count to
the intended 3 retries. This ensures that at least
5 seconds (6 in the case of single network heartbeat
loss) pass before the node is declared failed.
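
A minimal sketch of the retry timing described above, assuming each fast
enable attempt is preceded by the forced wait and that the node is failed
only once the threshold is reached. The function name `run_fast_enable`,
its parameters and the injected `sleep` callable are hypothetical and are
not taken from the mtcAgent source:

```python
import time


def run_fast_enable(heartbeat_ok, sleep=time.sleep, wait_secs=2, threshold=3):
    """Retry a fast enable up to `threshold` times.

    Hypothetical model only: waits `wait_secs` before each attempt and
    returns (node_failed, seconds_waited).
    """
    waited = 0
    for _ in range(threshold):
        sleep(wait_secs)          # forced wait between fast enable attempts
        waited += wait_secs
        if heartbeat_ok():        # heartbeat recovered, node stays enabled
            return False, waited
    # Threshold reached without recovery: the node is declared failed.
    return True, waited


if __name__ == "__main__":
    # Heartbeat never recovers; skip real sleeping for the demonstration.
    print(run_fast_enable(lambda: False, sleep=lambda secs: None))
```

Under these assumptions, a heartbeat that never recovers leads to
3 x 2 = 6 seconds of waiting before the node is failed, which covers the
5 second grace period SM needs.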

In addition, a special condition is added to detect
and stop work if the active controller is
sm_node_unhealthy. We don't want mtcAgent to make
any database updates while in this failure mode.
This gives SM time to handle the failure according
to the system's controller high availability
handling.
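
The guard described above can be sketched as follows. This is an
illustration under stated assumptions, not the mtcAgent implementation:
the class `MaintenanceAgentSketch`, its methods and the in-memory
`database` dictionary are all hypothetical stand-ins.

```python
class MaintenanceAgentSketch:
    """Hypothetical stand-in for an agent that owns host state updates."""

    def __init__(self):
        self.sm_node_unhealthy = False   # set while SM reports the node unhealthy
        self.database = {}               # stand-in for the real database

    def set_sm_node_unhealthy(self, unhealthy):
        """Record the set or clear of the SM node unhealthy condition."""
        self.sm_node_unhealthy = bool(unhealthy)

    def update_host_state(self, hostname, state):
        """Write host state unless the active controller is unhealthy.

        Returns True if the update was applied, False if it was skipped.
        """
        if self.sm_node_unhealthy:
            # Stop work: leave the database untouched so SM can act first.
            return False
        self.database[hostname] = state
        return True
```

In this sketch the update is simply skipped while the condition is set;
whether the real agent defers, drops or retries such work is not stated
in the commit message.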

Test Plan:

PASS: Verify mtcAgent behavior on set and clear of
      SM node unhealthy state.
PASS: Verify SM has at least 5 seconds to shut down
      mtcAgent when heartbeat to peer controller fails
      for one or both networks.
PASS: Test real case scenario with link pull.
PASS: Verify logging in presence of real failure condition.

Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9
Closes-Bug: 1847657
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-15 15:24:34 -04:00
api-ref/source Clean up and standardize landing pages 2019-01-09 09:34:38 -08:00
bsp-files Support custom kickstart addon for install from USB 2019-09-20 12:42:22 -04:00
devstack Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
doc Fix the error links for metal docs 2019-07-03 09:20:25 -04:00
installer Configurable Host HTTP/HTTPS Port Binding 2019-02-06 16:04:07 -06:00
inventory Merge "fix the spelling mistakes" 2019-10-07 14:41:20 +00:00
kickstart Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce Add mtcAgent support for sm_node_unhealthy condition 2019-10-15 15:24:34 -04:00
mtce-common Add mtcAgent support for sm_node_unhealthy condition 2019-10-15 15:24:34 -04:00
mtce-compute Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
mtce-control Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
mtce-storage Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
python-inventoryclient Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
releasenotes Update config for release notes to include project name 2019-02-05 14:14:17 -08:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Minor zuul and tox cleanup related to package re-org 2019-09-09 10:35:11 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-22 16:42:03 +00:00
centos_iso_image.inc Remove Resource Monitor ; aka rmon, from the load 2019-03-19 16:12:38 -04:00
centos_pkg_dirs SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00

README.rst

metal

StarlingX Bare Metal Management