metal/mtce
Eric MacDonald 675f49d556 Add mtcAgent support for sm_node_unhealthy condition
When heartbeat over both networks fails, mtcAgent
provides a 5 second grace period for heartbeat to
recover before failing the node.

However, when heartbeat fails over only one of the
networks (management or cluster), the mtcAgent does
not honour that 5 second grace period, which is a bug.

When it comes to peer controller heartbeat failure
handling, SM needs that 5 second grace period to handle
swact before mtcAgent declares the peer controller as
failed, resets the node and updates the database.

This update forces a 2 second wait between each fast
enable and corrects the fast enable threshold count to
the intended 3 retries. This ensures that at least
5 seconds (6 in the case of single network heartbeat
loss) pass before the node is declared failed.
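
The timing above can be sketched as follows. This is a minimal
illustration only, not the mtcAgent implementation: the constant and
function names are hypothetical, and only the values (2 second wait,
3 retries, 5 second grace period) come from the text above.

```python
# Hypothetical names; only the numeric values come from the commit text.
FAST_ENABLE_WAIT_SECS = 2   # forced wait between fast enable attempts
FAST_ENABLE_THRESHOLD = 3   # intended number of fast enable retries
SM_GRACE_PERIOD_SECS = 5    # time SM needs to handle a swact


def min_seconds_before_failure():
    """Shortest time between heartbeat loss and the node being failed."""
    return FAST_ENABLE_WAIT_SECS * FAST_ENABLE_THRESHOLD


def handle_heartbeat_loss(recovered_at=None):
    """Simulate the retry loop for a single network heartbeat loss.

    recovered_at is the simulated time, in seconds after the loss, at
    which heartbeat comes back (None means it never recovers).
    Returns a (result, elapsed_seconds) tuple.
    """
    elapsed = 0
    for _attempt in range(FAST_ENABLE_THRESHOLD):
        # Wait before each fast enable so retries cannot burn through
        # the threshold faster than the grace period.
        elapsed += FAST_ENABLE_WAIT_SECS
        if recovered_at is not None and recovered_at <= elapsed:
            return ("recovered", elapsed)
    return ("failed", elapsed)
```

With these values the node cannot be declared failed sooner than
2 x 3 = 6 seconds after the loss, which covers the 5 second grace period.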

In addition to that, a special condition is added to
detect and stop work if the active controller is
sm_node_unhealthy. We don't want mtcAgent to make
any database updates while in this failure mode.
This gives SM the time to handle the failure
according to the system's controller high
availability handling feature.
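
The guard described above might look roughly like the sketch below. It
is an assumption-laden illustration: the class, function, and state
names are hypothetical, and how mtcAgent actually detects the
sm_node_unhealthy condition is not shown here.

```python
class FakeDatabase:
    """Hypothetical stand-in for the maintenance database."""

    def __init__(self):
        self.updates = []

    def update(self, hostname, state):
        self.updates.append((hostname, state))


def process_node_failure(db, hostname, sm_node_unhealthy):
    """Record a node failure unless the active controller is unhealthy.

    Returns True if the database was updated, False if work was stopped.
    """
    if sm_node_unhealthy:
        # Stop work: make no database updates while SM handles the
        # failure on the active controller.
        return False
    db.update(hostname, "failed")
    return True
```

In this sketch the check sits in front of every database write, so
setting the condition blocks updates and clearing it lets them resume.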

Test Plan:

PASS: Verify mtcAgent behavior on set and clear of
      SM node unhealthy state.
PASS: Verify SM has at least 5 seconds to shut down
      mtcAgent when heartbeat to peer controller fails
      for one or both networks.
PASS: Test real case scenario with link pull.
PASS: Verify logging in presence of real failure condition.

Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9
Closes-Bug: 1847657
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-15 15:24:34 -04:00
..
centos Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
opensuse Add mtce specfile for opensuse 2019-09-19 18:32:15 -05:00
src Add mtcAgent support for sm_node_unhealthy condition 2019-10-15 15:24:34 -04:00
PKG-INFO Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00