ha/service-mgmt/sm-common
Eric MacDonald 630a777cbb Add unhealthy state recovery audit to service management (sm)
Service Management (SM) monitors connectivity and health of
its peer controller over the OAM, Mgmt and (if provisioned)
Cluster-Host networks.

If SM sees all the links to its peer go 'carrier down' at virtually
the same time, both controllers may declare themselves unhealthy
and go disabled; i.e. shut down all services with no automatic
recovery.

This update adds an 'Unhealthy State Recovery Audit' to SM. The
audit forces a self restart once all of SM's monitored links have
recovered, covering the cases where both controllers go
unhealthy-shutdown or both remain active in split-brain.
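
The audit decision described above can be sketched roughly as follows.
This is an illustration only, not the code added by this change: the
type names, state names and function names are hypothetical, and it
assumes that only provisioned links count as monitored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node states; these names are not taken from the SM source. */
typedef enum {
    NODE_HEALTHY,
    NODE_UNHEALTHY_SHUTDOWN,  /* went disabled after losing all links */
    NODE_ACTIVE_SPLIT_BRAIN   /* stayed active while the peer was unreachable */
} node_state_t;

/* Hypothetical per-link status record. */
typedef struct {
    const char *name;   /* e.g. "oam", "mgmt", "cluster-host" */
    bool provisioned;   /* assumption: unprovisioned links are not monitored */
    bool carrier_up;
} link_status_t;

/* True only when at least one link is monitored and every monitored link is up. */
static bool all_monitored_links_up(const link_status_t *links, size_t count)
{
    size_t monitored = 0;

    for (size_t i = 0; i < count; i++) {
        if (!links[i].provisioned) {
            continue;
        }
        monitored++;
        if (!links[i].carrier_up) {
            return false;
        }
    }
    return monitored > 0;
}

/* Audit decision: request a self restart only from an unhealthy state,
 * and only once every monitored link has recovered. */
static bool recovery_audit_should_restart(node_state_t state,
                                          const link_status_t *links,
                                          size_t count)
{
    if (state == NODE_HEALTHY) {
        return false;
    }
    return all_monitored_links_up(links, count);
}
```

In this sketch a periodic audit would call recovery_audit_should_restart()
and trigger the restart when it returns true; how the real audit is
scheduled and how the restart is performed are not shown here.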

Test Plan:

PASS: Verify AIO SX install
PASS: Verify Standard system install and unhealthy state recovery
PASS: Verify single link failure end to end behavior
PASS: Verify 2 of 3 link failure end to end behavior
PASS: Verify all link failure end to end behavior
PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
PASS: Verify swact back and forth following a recovery
PASS: Verify process restart as part of unhealthy state recovery
PASS: Verify AIO DX install and unhealthy state recovery

Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
Closes-Bug: 1883004
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-06-16 19:09:38 +00:00
centos    Remove version from sm-common folder                              2019-09-26 12:00:43 -05:00
opensuse  openSUSE: Open Build Service Artifacts                            2019-10-09 10:05:20 -05:00
scripts   Remove version from sm-common folder                              2019-09-26 12:00:43 -05:00
src       Add unhealthy state recovery audit to service management (sm)     2020-06-16 19:09:38 +00:00
LICENSE   Remove version from sm-common folder                              2019-09-26 12:00:43 -05:00
Makefile  Remove version from sm-common folder                              2019-09-26 12:00:43 -05:00