metal/mtce-common/cgts-mtce-common-1.0/common
Eric MacDonald 82e851d651 Mtce: Make Multi-Node Failure Avoidance Configurable
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all of those hosts until heartbeat resumes
or a set period of time elapses.

This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the hosts threshold set to 3 and timeout set to 100 secs.

This update enhances that existing feature by making the
'number-of-hosts threshold' and 'timeout period'
customer-configurable service parameters.

The new service parameters are listed under platform:maintenance and
can be displayed with the following command:

> system service-parameter-list

mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that must
fail heartbeat as a group, within the heartbeat failure window
(heartbeat_failure_threshold), before maintenance activates
MNFA mode.

This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.
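
The threshold rule above can be sketched as follows. This is only an
illustration, not the code in nodeClass.cpp: the constant and function
names are hypothetical, and treating "at least mnfa_threshold hosts
failing" as the activation condition is an assumption drawn from the
description above. Only the numeric values come from this commit message.

```cpp
#include <cassert>

// Hypothetical names; the values are the ones stated in the commit message.
constexpr int MNFA_THRESHOLD_MIN     = 2;
constexpr int MNFA_THRESHOLD_MAX     = 100;
constexpr int MNFA_THRESHOLD_DEFAULT = 2;

// True when a requested mnfa_threshold lies inside the configurable range.
constexpr bool mnfa_threshold_valid(int threshold)
{
    return threshold >= MNFA_THRESHOLD_MIN && threshold <= MNFA_THRESHOLD_MAX;
}

// True when enough hosts have failed heartbeat within the same failure
// window for MNFA mode to activate (assumed comparison: >= threshold).
bool mnfa_should_activate(int hosts_failing,
                          int threshold = MNFA_THRESHOLD_DEFAULT)
{
    assert(mnfa_threshold_valid(threshold));
    return hosts_failing >= threshold;
}
```

With the new default of 2, a single failing host is handled as an
ordinary heartbeat failure, while two hosts failing together would
enter MNFA mode.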

mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drops below the mnfa_threshold or this
timer expires. MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout, making the
deactivation criteria solely host based.

This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid timeout range from 100 to 86400 secs (1 day).
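
The timeout rule and the two deactivation criteria above can be sketched
in the same way. Again this is a hedged illustration rather than the
actual maintenance code: all names are hypothetical, and the elapsed
time is passed in as a plain integer instead of using a real timer.

```cpp
#include <cassert>

// Hypothetical names; the values are the ones stated in the commit message.
constexpr int MNFA_TIMEOUT_NONE = 0;      // zero means no timeout
constexpr int MNFA_TIMEOUT_MIN  = 100;    // seconds
constexpr int MNFA_TIMEOUT_MAX  = 86400;  // seconds (1 day)

// Valid values are 0 (no timeout) or anything from 100 to 86400 seconds.
constexpr bool mnfa_timeout_valid(int secs)
{
    return secs == MNFA_TIMEOUT_NONE ||
           (secs >= MNFA_TIMEOUT_MIN && secs <= MNFA_TIMEOUT_MAX);
}

// MNFA mode ends on whichever happens first: the failing host count
// drops below the threshold, or the (non-zero) timeout expires.
bool mnfa_should_deactivate(int hosts_failing, int threshold,
                            int timeout_secs, int elapsed_secs)
{
    assert(mnfa_timeout_valid(timeout_secs));
    if (hosts_failing < threshold)
        return true;   // host based criterion
    if (timeout_secs != MNFA_TIMEOUT_NONE && elapsed_secs >= timeout_secs)
        return true;   // time based criterion
    return false;      // stay in MNFA mode
}
```

With the new default timeout of 0 only the host based criterion applies,
so MNFA mode would persist for as long as the failing host count stays
at or above mnfa_threshold.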

Test Plan:

PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)

DocImpact
Story: 2003576
Task: 24903

Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-08-31 15:35:08 -04:00
..
Makefile StarlingX open source release updates 2018-05-31 07:36:43 -07:00
alarmUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
alarmUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
fitCodes.h Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00
fsync.c StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hostClass.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hostClass.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hostUtil.cpp Controller Services swact/failover time reduction 2018-06-28 15:51:50 -04:00
hostUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
httpUtil.cpp Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00
httpUtil.h Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00
ipmiUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
ipmiUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
jsonUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
jsonUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
keyClass.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
keyClass.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
logMacros.h Mtce: Make Multi-Node Failure Avoidance Configurable 2018-08-31 15:35:08 -04:00
msgClass.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
msgClass.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nlEvent.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nlEvent.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeBase.cpp Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1 2018-07-03 11:04:27 -04:00
nodeBase.h Mtce: Debouce heartbeat recovery 2018-07-20 11:12:19 -04:00
nodeClass.cpp Mtce: Make Multi-Node Failure Avoidance Configurable 2018-08-31 15:35:08 -04:00
nodeClass.h Mtce: Make Multi-Node Failure Avoidance Configurable 2018-08-31 15:35:08 -04:00
nodeCmds.h Add 90s delay before locking storage node for upgrade 2018-07-06 09:18:21 -04:00
nodeEvent.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeEvent.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeMacro.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeTimers.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeTimers.h Add 90s delay before locking storage node for upgrade 2018-07-06 09:18:21 -04:00
nodeUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
nodeUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pgdbClass.cpp.OBS StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pgdbClass.h.OBS StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pgdbUtil.cpp.OBS StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pingUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
pingUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
regexUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
regexUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
returnCodes.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
threadUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
threadUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
timeUtil.cpp StarlingX open source release updates 2018-05-31 07:36:43 -07:00
timeUtil.h StarlingX open source release updates 2018-05-31 07:36:43 -07:00
tokenUtil.cpp Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00
tokenUtil.h Mtce: Implement all token fetches as non-blocking operations. 2018-06-27 15:00:23 -04:00