StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald f01fd85470 Fix MNFA recovery race condition that leads to stuck degrade
Between 0 and 10% of hosts are seen to get stuck in the degrade
state after MNFA recovery.

Clearing host degrade on Multi-Node Failure Avoidance (MNFA)
recovery does not send a degrade clear but does clear the hbs
control states. Instead, it relies on explicit events from
hbsAgent per host/network to do so.

If the MNFA Recovery (exit) event occurs before all hbsAgent
clear messages arrive, then the hbs control clear tricks
the mtcAgent into thinking that no degrade event was
active when one may actually still be.

This fix enables the clear option in the mon_host MNFA Recovery
call so that the host's degrade condition is cleared.
It also removes the unnecessary heartbeat disable call.

Test Plan:

PASS: soak MNFA in a large system over and over to verify
      the 0-10% stuck degrade occurrence rate drops to 0
      after many (more than 20) occurrences.

Regression:

PASS: Verify heartbeat.
PASS: Verify single node graceful recovery.

Change-Id: I699a376af5a95cc8dcc6ea5cc8266dc14fbacd09
Closes-Bug: 1845344
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-03 09:28:32 -04:00
api-ref/source Clean up and standardize landing pages 2019-01-09 09:34:38 -08:00
bsp-files Support custom kickstart addon for install from USB 2019-09-20 12:42:22 -04:00
devstack Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
doc Fix the error links for metal docs 2019-07-03 09:20:25 -04:00
installer Configurable Host HTTP/HTTPS Port Binding 2019-02-06 16:04:07 -06:00
inventory Merge "Add inventory specfile for opensuse" 2019-09-20 14:23:16 +00:00
kickstart Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce Fix MNFA recovery race condition that leads to stuck degrade 2019-10-03 09:28:32 -04:00
mtce-common Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
mtce-compute Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce-control Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
mtce-storage Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
python-inventoryclient Add openSUSE OBS Artifacts for Maintenance services 2019-09-20 09:18:54 -05:00
releasenotes Update config for release notes to include project name 2019-02-05 14:14:17 -08:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Minor zuul and tox cleanup related to package re-org 2019-09-09 10:35:11 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-22 16:42:03 +00:00
centos_iso_image.inc Remove Resource Monitor ; aka rmon, from the load 2019-03-19 16:12:38 -04:00
centos_pkg_dirs SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00

README.rst

metal

StarlingX Bare Metal Management