metal/mtce-common/cgts-mtce-common-1.0/scripts
Eric MacDonald 74c5f89ab4 Mtce: Make Heartbeat Failure Action Configurable
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.
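
The reboot-avoidance check described above can be sketched as follows.
This is only an illustration, not the maintenance code itself: the
function name, its parameters, and the exact comparison are assumptions
made for this example.

```python
def needs_reboot(uptime_secs: float, secs_since_failure: float) -> bool:
    """Hypothetical helper: decide whether a heartbeat-failed host still
    needs a reboot/reset before it is recovered.

    Assumption: if the host's reported uptime is shorter than the time
    elapsed since the heartbeat failure was declared, the host must have
    booted after the failure, so it has already rebooted.
    """
    already_rebooted = uptime_secs < secs_since_failure
    return not already_rebooted


# Host has been up for an hour, failure declared 2 minutes ago:
# no reboot seen since the failure, so one is still required.
print(needs_reboot(uptime_secs=3600, secs_since_failure=120))

# Host has been up for 1 minute, failure declared 2 minutes ago:
# it already rebooted, so a second reboot is avoided.
print(needs_reboot(uptime_secs=60, secs_since_failure=120))
```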

This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can set this
parameter to any one of the following four actions, listed in order of
decreasing impact.

   fail - Host is failed and gracefully recovered.
        - Current Network specific alarms continue to be raised/cleared.
          Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
        - Current Network specific alarms continue to be raised/cleared.
        - The heartbeat degrade reason and the alarms are cleared when
          heartbeat responses resume.
  alarm - The only indication of a heartbeat failure is by alarm.
        - Same set of alarms as in the above actions.
        - In this case only: no degrade, no failure, no reboot/reset.
   none - Heartbeat is disabled; no multicast heartbeat message is sent.
        - All existing heartbeat alarms are cleared.
        - The heartbeat soak as part of the enable sequence is bypassed.

The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
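
As a rough illustration of the four actions listed above, the sketch
below models each one as a set of effects. It is not the maintenance
implementation: the class names, field names, and parse_action helper
are hypothetical, and only the behavior stated in this commit message is
encoded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HeartbeatFailureAction(Enum):
    # Values match the action names listed in the commit message.
    FAIL = "fail"
    DEGRADE = "degrade"
    ALARM = "alarm"
    NONE = "none"


@dataclass(frozen=True)
class Effects:
    """Hypothetical summary of what happens on a heartbeat failure."""
    heartbeat_sent: bool   # is the heartbeat message sent at all
    alarms_raised: bool    # network specific alarms raised/cleared
    host_degraded: bool    # host degraded while failing heartbeat
    host_failed: bool      # host failed and gracefully recovered


# Listed in order of decreasing impact, as in the commit message.
EFFECTS = {
    HeartbeatFailureAction.FAIL: Effects(True, True, False, True),
    HeartbeatFailureAction.DEGRADE: Effects(True, True, True, False),
    HeartbeatFailureAction.ALARM: Effects(True, True, False, False),
    HeartbeatFailureAction.NONE: Effects(False, False, False, False),
}


def parse_action(value: Optional[str] = None) -> HeartbeatFailureAction:
    """Map a configured string to an action; unset means legacy 'fail'."""
    if value is None:
        return HeartbeatFailureAction.FAIL
    return HeartbeatFailureAction(value)  # raises ValueError if unknown


if __name__ == "__main__":
    for action in HeartbeatFailureAction:
        print(action.value, EFFECTS[action])
```

Because the action is a system wide setting, a single parsed value would
apply to every host in such a model.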

This update also:

 1. Removes redundant inservice failure alarm for MNFA case in support
    of degrade only action. Keeping it would make that alarm handling
    case unnecessarily complicated.
 2. Removes the no longer used 'hbs calibration' code (cleanup).
 3. Performs a small amount of heartbeat logging cleanup.

Test Plan:
PASS:    fail: Verify MNFA and recovery
PASS:    fail: Verify Single Host heartbeat failure and recovery
PASS:    fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS:   alarm: Verify MNFA and recovery
PASS:   alarm: Verify Single Host heartbeat failure and recovery
PASS:   alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS:    none: Verify heartbeat disable, fail ignore and no recovery
PASS:    none: Verify Single Host heartbeat ignore and no recovery
PASS:    none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: Verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)

Depends-On: https://review.openstack.org/601264
Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-09-10 13:03:30 -04:00
config StarlingX open source release updates 2018-05-31 07:36:43 -07:00
config.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
dmemchk.sh Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
goenabled StarlingX open source release updates 2018-05-31 07:36:43 -07:00
goenabled.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hbsAgent StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hbsClient StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hbsClient.conf StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hbsClient.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hwclock.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
hwclock.sh Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
mgmtlinkup StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtc.conf Mtce: Make Heartbeat Failure Action Configurable 2018-09-10 13:03:30 -04:00
mtc.ini StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcAgent StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcClient StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcClient.conf StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcClient.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcTest StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtce.logrotate StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtcinit StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtclog StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtclog.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
mtclogd.conf StarlingX open source release updates 2018-05-31 07:36:43 -07:00
runservices StarlingX open source release updates 2018-05-31 07:36:43 -07:00
runservices.service StarlingX open source release updates 2018-05-31 07:36:43 -07:00
sched_trace StarlingX open source release updates 2018-05-31 07:36:43 -07:00
sensor_hp360_v1_ilo_v4.profile StarlingX open source release updates 2018-05-31 07:36:43 -07:00
sensor_hp380_v1_ilo_v4.profile StarlingX open source release updates 2018-05-31 07:36:43 -07:00
sensor_integration_profile.README StarlingX open source release updates 2018-05-31 07:36:43 -07:00
sensor_quanta_v1_ilo_v4.profile StarlingX open source release updates 2018-05-31 07:36:43 -07:00
store_trace StarlingX open source release updates 2018-05-31 07:36:43 -07:00
stress_ras.sh Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
stress_swact.sh Fix linters issues and enable tox/zuul linters job as gate 2018-09-05 09:02:25 +08:00
wipedisk StarlingX open source release updates 2018-05-31 07:36:43 -07:00