StarlingX Bare Metal and Node Management, Hardware Maintenance

Eric MacDonald 62532a7eac Fix maintenance cluster-host messaging

Maintenance's success path messaging does not depend on cluster
network messaging. However, there are a number of failure mode
cases that do depend on cluster network messaging to properly
diagnose and offer a higher availability handling for some failure
cases. For instance, when the management interface goes down,
without cluster network messaging remote hosts can be isolated.
Being able to command-reboot a host over cluster-host network
offers higher availability.

Maintenance is designed to use the cluster network, if provisioned,
as a backup path for mtcAlive, node locked, reboot and several other
commands and acknowledgements.

Unfortunately, it was recently observed that maintenance is using
the 'nfs-controller' label to resolve cluster network addressing
which resolves to management network IPs. As a result all messages
intended to be going over the cluster-host network are instead just
redundant management network messages.

During debug of this issue several additional cluster network
messaging related issues were observed and fixed.

This update implements the following fixes

 1. since there is no floating address for the cluster network
    the mtcClient was modified to send messages to both controllers
    where only the active controller will be listening and acting.
 2. fixes port number mtce listens for cluster-host network messages
 3. fixes port number mtce sends cluster-host network messages to.
 4. mtcAlive messages are also sent on provisioned cluster network.
 5. locked state notifications and acks sent on provisioned cluster
    network.
 6. reboot request and acks sent on provisioned cluster network.
 7. fixed command acknowledgement messaging.

This update also

 1. envelopes the mtcAlive gate control to allow debug tracing of
    all gate state changes.
 2. moves graceful recovery handling heartbeat failure state clear
    to the end of the recovery handler, just before heartbeat start.
 3. adds sm unhealthy support to fail and automatically recover the
    inactive controller from an SM UNHEALTHY state.

----------
Test Plan:
----------

Functional:

PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization

System:

PASS: Verify compute system install
PASS: Verify AIO system install

Feature:

PASS: Verify sm node unhealth handling (active:ignore, inactive:recover)

Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

2019-07-18 14:54:45 -04:00
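
The first fix in the commit message above (no floating address on the cluster network, so the client sends to both controllers and only the active one acts) can be illustrated with a small sketch. This is not the mtce implementation, which is not shown on this page. It assumes a datagram (UDP) transport, which the commit message does not state, and the function name, payload, and addresses are hypothetical.

```python
import socket


def send_to_both_controllers(payload, controller_addrs):
    """Send the same datagram to every controller address.

    Hypothetical helper: with no floating address to target, the sender
    addresses each controller directly and relies on only the active
    controller listening and acting on the message.
    Returns the number of sends that did not raise an error.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for addr in controller_addrs:
            try:
                sock.sendto(payload, addr)
                sent += 1
            except OSError:
                # A failure toward one controller must not prevent
                # the attempt toward the other one.
                continue
    return sent
```

A caller would pass the cluster-host address and port of each controller, for example `send_to_both_controllers(b"mtcAlive", [(ctrl0_ip, port), (ctrl1_ip, port)])`, where the addresses and port are placeholders. The trade-off in this design is one redundant datagram per message in exchange for not needing to know which controller is currently active.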
api-ref/source	Clean up and standardize landing pages	2019-01-09 09:34:38 -08:00
bsp-files	Increase cgts-vg size to accommodate new kubelet fs	2019-07-17 10:14:39 -04:00
devstack	Followup opendev cleanup and test jobs	2019-04-22 16:42:03 +00:00
doc	Clean up and standardize landing pages	2019-01-09 09:34:38 -08:00
installer	Configurable Host HTTP/HTTPS Port Binding	2019-02-06 16:04:07 -06:00
inventory	Update tox.ini files to use stein constraints	2019-06-25 13:20:35 -04:00
kickstart	Configurable Host HTTP/HTTPS Port Binding	2019-02-06 16:04:07 -06:00
mtce	Fix maintenance cluster-host messaging	2019-07-18 14:54:45 -04:00
mtce-common	Fix maintenance cluster-host messaging	2019-07-18 14:54:45 -04:00
mtce-compute	SUSE Specfile for mtce-compute Init Script LSB	2019-05-21 17:05:03 -05:00
mtce-control	SUSE Specfile for mtce-control Init Script LSB	2019-05-28 15:17:24 +00:00
mtce-storage	get rid of duplicate LICENSE files in 3 packages	2018-10-30 02:55:34 +00:00
python-inventoryclient	Update tox.ini files to use stein constraints	2019-06-25 13:20:35 -04:00
releasenotes	Update config for release notes to include project name	2019-02-05 14:14:17 -08:00
.gitignore	Update tox.ini files to use stein constraints	2019-06-25 13:20:35 -04:00
.gitreview	OpenDev Migration Patch	2019-04-19 19:52:33 +00:00
.zuul.yaml	Followup opendev cleanup and test jobs	2019-04-22 16:42:03 +00:00
CONTRIBUTORS.wrs	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
LICENSE	StarlingX open source release updates	2018-05-31 07:36:43 -07:00
README.rst	Followup opendev cleanup and test jobs	2019-04-22 16:42:03 +00:00
centos_iso_image.inc	Remove Resource Monitor ; aka rmon, from the load	2019-03-19 16:12:38 -04:00
centos_pkg_dirs	SysInv Decoupling: Create Inventory Service	2018-12-06 13:17:35 -05:00
test-requirements.txt	pep8 job enable and fix pep8 reported issue	2018-09-06 09:45:51 +08:00
tox.ini	Update tox.ini files to use stein constraints	2019-06-25 13:20:35 -04:00

README.rst

metal

StarlingX Bare Metal Management