metal

Eric MacDonald 62532a7eac Fix maintenance cluster-host messaging		2019-07-18 14:54:45 -04:00

Maintenance's success path messaging does not depend on cluster network messaging. However, there are a number of failure mode cases that do depend on cluster network messaging to properly diagnose and offer a higher availability handling for some failure cases.

For instance, when the management interface goes down, without cluster network messaging remote hosts can be isolated. Being able to command-reboot a host over cluster-host network offers higher availability.

Maintenance is designed to use the cluster network, if provisioned, as a backup path for mtcAlive, node locked, reboot and several other commands and acknowledgements.

Unfortunately, it was recently observed that maintenance is using the 'nfs-controller' label to resolve cluster network addressing which resolves to management network IPs. As a result all messages intended to be going over the cluster-host network are instead just redundant management network messages.

During debug of this issue several additional cluster network messaging related issues were observed and fixed.

This update implements the following fixes:

1. since there is no floating address for the cluster network the mtcClient was modified to send messages to both controllers where only the active controller will be listening and acting.
2. fixes port number mtce listens for cluster-host network messages
3. fixes port number mtce sends cluster-host network messages to.
4. mtcAlive messages are also sent on provisioned cluster network.
5. locked state notifications and acks sent on provisioned cluster network.
6. reboot request and acks sent on provisioned cluster network.
7. fixed command acknowledgement messaging.

This update also:

1. envelopes the mtcAlive gate control to allow debug tracing of all gate state changes.
2. moves graceful recovery handling heartbeat failure state clear to the end of the recovery handler, just before heartbeat start.
3. adds sm unhealthy support to fail and automatically recover the inactive controller from an SM UNHEALTHY state.

----------
Test Plan:
----------

Functional:
PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization

System:
PASS: Verify compute system install
PASS: Verify AIO system install

Feature:
PASS: Verify sm node unhealth handling (active:ignore, inactive:recover)

Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
..
Makefile	Set SHELL in Makefiles that use bash constructs	2018-12-07 14:09:48 -06:00
daemon_common.h	Fix maintenance cluster-host messaging	2019-07-18 14:54:45 -04:00
daemon_config.cpp	Refactor infrastructure network in mtce code	2019-04-18 09:32:41 -04:00
daemon_debug.cpp	Refactor infrastructure network in mtce code	2019-04-18 09:32:41 -04:00
daemon_files.cpp	Make Mtce system mode scan case in-sensitive	2019-05-06 19:14:14 +00:00
daemon_ini.cpp	fix compilation warnings in c/cpp files	2018-10-23 07:38:33 +00:00
daemon_ini.h	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00
daemon_main.cpp	Update the init parameters for opts	2019-05-30 11:00:41 +08:00
daemon_option.h	Implement Active-Active Heartbeat as HA Improvement	2018-11-20 19:57:18 +00:00
daemon_signal.cpp	Decouple Guest-server/agent from stx-metal	2018-09-18 17:15:08 -04:00