StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 7da4eb945f Enable host heartbeat in add handler when not in DOR mode
Two Node System: VMs did not switch to ERROR state after host reboot

A logically failed (rebooted) active controller is not being
administratively failed by maintenance. As a result the host's
offline availability state is not reported to the VIM and the
VMs on that (rebooted) All-in-one host are not evacuated.

This issue only applies to two node systems. In the DOR case, the
heartbeat enable of an All-in-one host must be held off until its
compute manifests have applied; otherwise maintenance would fail the
peer controller over a DOR.

The challenge in maintenance is to distinguish between this spontaneous
failure and a DOR. For All-in-one hosts, DOR mode is active for a
whopping 600 seconds, long enough to account for both sets of manifests
to apply.

It's that long delay that is making this silent fault stand out so 
obviously.

This update uses 'active DOR mode' to decide whether or not to enable a
host's heartbeat in the add handler.
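
The decision described above can be sketched roughly as follows. This is
an illustration only, not the maintenance code itself: the Host class,
the add_handler function and the dor_mode_active flag are hypothetical
names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for a host record known to maintenance."""
    name: str
    heartbeat_enabled: bool = False


def add_handler(host: Host, dor_mode_active: bool) -> Host:
    """Sketch of an add handler that gates heartbeat on DOR mode.

    Outside of DOR mode the heartbeat is enabled right away, so a
    rebooted host is detected and failed. While DOR mode is active the
    enable is held off, which avoids failing the peer controller while
    manifests are still applying.
    """
    if not dor_mode_active:
        host.heartbeat_enabled = True
    return host
```

In this sketch, a host added outside of DOR mode starts heartbeating
immediately, while a host added during DOR mode is left for a later
enable step that is not shown here.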

To better handle early active controller failure, the qualifier for DOR
mode was reduced from 20 to 15 minutes. Maintenance DOR mode is now
activated only if the host's uptime is less than 15 minutes, rather
than 20 as it was before this update. Note that the active controller
normally starts maintenance with an uptime of 5-7 minutes.
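
As a minimal sketch of the uptime qualifier, assuming uptime is
available in seconds (the constant and function names below are
hypothetical and not taken from the maintenance source):

```python
# Thresholds taken from the commit text; the names are illustrative.
OLD_DOR_UPTIME_THRESHOLD_SECS = 20 * 60
NEW_DOR_UPTIME_THRESHOLD_SECS = 15 * 60


def dor_mode_qualifies(uptime_secs: float,
                       threshold_secs: int = NEW_DOR_UPTIME_THRESHOLD_SECS) -> bool:
    """Return True if maintenance should start in DOR mode.

    DOR mode qualifies only when the host has been up for less than
    the threshold at the time maintenance starts.
    """
    return uptime_secs < threshold_secs


# An active controller that starts maintenance 6 minutes after boot
# still qualifies under the new threshold.
typical_start = dor_mode_qualifies(6 * 60)

# A start at 17 minutes of uptime qualified under the old 20 minute
# threshold but does not under the new 15 minute one.
late_start_old = dor_mode_qualifies(17 * 60, OLD_DOR_UPTIME_THRESHOLD_SECS)
late_start_new = dor_mode_qualifies(17 * 60)
```

The narrower window leaves the typical 5-7 minute start inside DOR
mode while treating later maintenance starts as non-DOR cases.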

Story: 2002995
Task: 23009
Change-Id: I749aefef45b9db6e86a2c6b81d131ebeccc68926
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
2018-08-16 20:20:16 +00:00
bsp-files Extend cgcs disk partition for gnocchi usage 2018-08-08 15:54:44 -04:00
installer Update boot configs to match CentOS 7.5 kernel 2018-07-06 11:26:06 -04:00
kickstart Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-common Enable host heartbeat in add handler when not in DOR mode 2018-08-16 20:20:16 +00:00
mtce-compute Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-control Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
mtce-storage Rename mwa-* subdirectories to match the git repo name 2018-07-03 16:29:24 -04:00
.gitignore Add default test framework 2018-06-11 18:51:02 -05:00
.gitreview Add .gitreview 2018-05-31 07:36:43 -07:00
.zuul.yaml Remove non-voting gate job 2018-06-29 14:31:56 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst StarlingX open source release updates 2018-05-31 07:36:43 -07:00
centos_pkg_dirs Split centos-pkg-dirs along git boundaries. 2018-06-20 16:25:33 -04:00
mwa-beas.map StarlingX open source release updates 2018-05-31 07:36:43 -07:00
test-requirements.txt Add default test framework 2018-06-11 18:51:02 -05:00
tox.ini Add default test framework 2018-06-11 18:51:02 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management