metal/mtce-common
Eric MacDonald 5c83453fdf Fix Graceful Recovery handling while in Graceful Recovery handling
The current Graceful Recovery handler does not properly handle
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA:

 phase 1: waiting for the number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: then a Graceful Recovery Wait period, which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.
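
The two phases above can be read as a small state machine. The
following is only an illustrative sketch, not the mtcAgent code:
the names MnfaPhase, next_phase and SOAK_SECS are hypothetical,
and the only values taken from this message are the "fall below
threshold" condition and the 11 second soak.

```python
from enum import Enum


class MnfaPhase(Enum):
    AVOIDANCE = 1       # phase 1: waiting for failed nodes to drop below threshold
    RECOVERY_WAIT = 2   # phase 2: heartbeat soak
    COMPLETE = 3        # MNFA event declared complete


SOAK_SECS = 11  # soak length stated in the commit message


def next_phase(phase, failed_nodes, threshold, stable_secs):
    """Return the next phase (hypothetical helper, for illustration only)."""
    if phase is MnfaPhase.AVOIDANCE:
        # Phase 1 ends once the failed node count falls below the threshold.
        if failed_nodes < threshold:
            return MnfaPhase.RECOVERY_WAIT
        return MnfaPhase.AVOIDANCE
    if phase is MnfaPhase.RECOVERY_WAIT:
        # Phase 2 ends once the heartbeat has been stable for the full soak.
        if stable_secs >= SOAK_SECS:
            return MnfaPhase.COMPLETE
        return MnfaPhase.RECOVERY_WAIT
    return phase
```

A nested MNFA event, in this picture, is one that arrives while a
node is still in RECOVERY_WAIT.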

The Graceful Recovery Wait status has been seen to be left
uncleared (stuck) on one or more of the affected nodes if
phase 2 of MNFA is interrupted by another MNFA event; aka
MNFA Nesting.

Although this stuck status is not service affecting, it does leave
one or more nodes' host.task field, as observed under host-show,
set to "Graceful Recovery Wait" rather than empty.

This update makes Multi Node Failure Avoidance (MNFA) handling
changes to ensure that, upon MNFA exit, the recovery handler
is properly restarted if MNFA Nesting occurs.

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful Recovery handling time in half

    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds thereby cutting potential for nesting
      in half.

 2. Increased supported Graceful Recovery nesting from 3 to 5

    - Found that some links bounce more than others so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.
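
The effect of the nesting limit in item 2 can be sketched as
follows. This is an assumption-laden illustration, not the actual
handler: recovery_action and MAX_RECOVERY_NESTS are hypothetical
names, and the "full-enable" outcome for a nest beyond the limit
is taken from the test plan in this message.

```python
MAX_RECOVERY_NESTS = 5  # raised from 3 by this update


def recovery_action(nest_count, max_nests=MAX_RECOVERY_NESTS):
    """Return the action for a node after nest_count nested events.

    Hypothetical helper: nests within the limit are recovered
    gracefully; anything beyond it falls back to a full enable.
    """
    if nest_count <= max_nests:
        return "graceful-recovery"
    return "full-enable"
```

With the previous limit, recovery_action(4, max_nests=3) would
give "full-enable", whereas the same 4 nests now stay within
graceful recovery.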

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
      ... nested case that does not time out
      ... case that does not time out
      ... case that does time out

Closes Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-17 14:25:19 -04:00
..
centos Add auto-versioning to starlingx/metal mtce packages 2020-05-21 15:18:43 -04:00
opensuse Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
src Fix Graceful Recovery handling while in Graceful Recovery handling 2021-03-17 14:25:19 -04:00
PKG-INFO Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00