metal/mtce-common/src/daemon
Eric MacDonald 62532a7eac Fix maintenance cluster-host messaging
Maintenance's success path messaging does not depend on cluster
network messaging. However, there are a number of failure mode
cases that do depend on cluster network messaging to properly
diagnose and offer a higher availability handling for some
failure cases.

For instance, when the management interface goes down, without cluster
network messaging remote hosts can be isolated. Being able to command-
reboot a host over cluster-host network offers higher availability.

Maintenance is designed to use the cluster network, if provisioned, as a
backup path for mtcAlive, node locked, reboot and several other commands
and acknowledgements.

Unfortunately, it was recently observed that maintenance is using
the 'nfs-controller' label to resolve cluster network addressing
which resolves to management network IPs. As a result all messages
intended to be going over the cluster-host network are instead just
redundant management network messages.

During debug of this issue several additional cluster network
messaging related issues were observed and fixed.

This update implements the following fixes

1. since there is no floating address for the cluster network the
   mtcClient was modified to send messages to both controllers where
   only the active controller will be listening and acting.
2. fixes port number mtce listens for cluster-host network messages
3. fixes port number mtce sends cluster-host network messages to.
4. mtcAlive messages are also sent on provisioned cluster network.
5. locked state notifications and acks sent on provisioned cluster network.
6. reboot request and acks sent on provisioned cluster network.
7. fixed command acknowledgement messaging.

This update also

1. envelopes the mtcAlive gate control to allow debug tracing of all gate
   state changes.
2. moves graceful recovery handling heartbeat failure state clear to the
   end of the recovery handler, just before heartbeat start.
3. adds sm unhealthy support to fail and automatically recover the
   inactive controller from an SM UNHEALTHY state.

----------
Test Plan:
----------

Functional:

PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization

System:

PASS: Verify compute system install
PASS: Verify AIO system install

Feature:

PASS: Verify sm node unhealth handling (active:ignore, inactive:recover)

Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-07-18 14:54:45 -04:00
..
Makefile Set SHELL in Makefiles that use bash constructs 2018-12-07 14:09:48 -06:00
daemon_common.h Fix maintenance cluster-host messaging 2019-07-18 14:54:45 -04:00
daemon_config.cpp Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
daemon_debug.cpp Refactor infrastructure network in mtce code 2019-04-18 09:32:41 -04:00
daemon_files.cpp Make Mtce system mode scan case in-sensitive 2019-05-06 19:14:14 +00:00
daemon_ini.cpp fix compilation warnings in c/cpp files 2018-10-23 07:38:33 +00:00
daemon_ini.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
daemon_main.cpp Update the init parameters for opts 2019-05-30 11:00:41 +08:00
daemon_option.h Implement Active-Active Heartbeat as HA Improvement 2018-11-20 19:57:18 +00:00
daemon_signal.cpp Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00