StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 3a5c578355 Mtce: Add Thresholded Maintenance Enable Recovery support
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.

When a host reaches a particular failure mode's max failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way, with no further recovery action, until
it is manually locked and unlocked.

The thresholded Enable failure causes are:

 Configuration Failure ....... threshold:2 retry interval:30 secs
 In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
 Start Host Services Failure . threshold:2 retry interval:30 secs
 Heartbeat Soak Failure ...... threshold:2 retry interval:10 minutes
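
The thresholding described above can be illustrated with a minimal
Python sketch. This is not the mtce implementation (which is not shown
here); the names Cause, POLICY, Host, enable_failed, enable_succeeded
and lock_unlock are hypothetical. It also assumes that failures are
counted per cause and that recovery stops once the count reaches the
threshold; the exact counting rules are not stated in this message.

```python
from dataclasses import dataclass, field
from enum import Enum


class Cause(Enum):
    CONFIG = "Configuration Failure"
    GOENABLED = "In-Test GoEnabled Failure"
    HOST_SERVICES = "Start Host Services Failure"
    HEARTBEAT_SOAK = "Heartbeat Soak Failure"


# (threshold, retry interval in seconds), values taken from the list above.
POLICY = {
    Cause.CONFIG: (2, 30),
    Cause.GOENABLED: (2, 30),
    Cause.HOST_SERVICES: (2, 30),
    Cause.HEARTBEAT_SOAK: (2, 600),
}


@dataclass
class Host:
    name: str
    auto_recovery_disabled: bool = False
    failures: dict = field(default_factory=dict)

    def enable_failed(self, cause):
        """Record one Enable failure.

        Returns the number of seconds to wait before the next retry,
        or None when no further automatic recovery should be attempted.
        """
        if self.auto_recovery_disabled:
            return None
        threshold, interval = POLICY[cause]
        count = self.failures.get(cause, 0) + 1
        self.failures[cause] = count
        if count >= threshold:
            # Assumption: reaching the threshold stops recovery; the host
            # stays failed until an operator locks and unlocks it.
            self.auto_recovery_disabled = True
            return None
        return interval

    def enable_succeeded(self):
        # Assumption: a successful Enable ends the back-to-back run.
        self.failures.clear()

    def lock_unlock(self):
        # A manual lock and unlock re-arms automatic recovery.
        self.failures.clear()
        self.auto_recovery_disabled = False
```

In this sketch a second back-to-back Configuration Failure returns None
and leaves the host with auto recovery disabled until lock_unlock() is
called.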

This update refactors the old auto recovery for AIO SX into this
more generic framework.

Story: 2003576
Task: 24905

Test Plan:

PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling

PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)

PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message

Regression:

PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: Verify in-service heartbeat failure handling
PASS: Verify no segfaults during UT

Corner Cases:

PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.

Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-12 08:11:36 -05:00
api-ref/source [Doc] openstackdocstheme starlingxdocs theme 2018-10-22 14:37:08 +00:00
bsp-files Refactor patches for openstack-aodh package 2018-11-29 00:12:38 +08:00
doc [Doc] openstackdocstheme starlingxdocs theme 2018-10-22 14:37:08 +00:00
installer Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
inventory SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
kickstart Add python2-ruamel-yaml to controllers 2018-11-08 15:15:02 +00:00
mtce Mtce: Add Thresholded Maintenance Enable Recovery support 2018-12-12 08:11:36 -05:00
mtce-common Mtce: Add Thresholded Maintenance Enable Recovery support 2018-12-12 08:11:36 -05:00
mtce-compute Merge "get rid of duplicate LICENSE files in 3 packages" 2018-10-31 00:58:33 +00:00
mtce-control Implement Active-Active Heartbeat as HA Improvement 2018-11-20 19:57:18 +00:00
mtce-storage get rid of duplicate LICENSE files in 3 packages 2018-10-30 02:55:34 +00:00
python-inventoryclient SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
releasenotes Merge "releasenotes: Grammar edit." 2018-10-30 17:27:12 +00:00
.gitignore [Doc] OpenStack API Reference Guide 2018-09-05 19:59:26 -05:00
.gitreview Add .gitreview 2018-05-31 07:36:43 -07:00
.zuul.yaml Add api-ref and relnotes publish jobs 2018-10-11 08:21:53 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst StarlingX open source release updates 2018-05-31 07:36:43 -07:00
centos_iso_image.inc SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
centos_pkg_dirs SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini SysInv Decoupling: Create Inventory Service 2018-12-06 13:17:35 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management