metal/mtce/src/maintenance
Eric Macdonald 50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves the maintenance power on/off and reset
handling and makes its use of retries, and of graceful versus
immediate commands, consistent across all three operations.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries use the
graceful action command while the last 5 use the immediate
command.
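
The retry policy described above can be sketched as follows. This is a
minimal illustration, not the code in mtcNodeHdlrs.cpp: the names
BmcCommandMode, kMaxRetries, kGracefulRetries, select_command_mode and
retries_exhausted are hypothetical, and the sketch assumes (per the test
plan) that the initial attempt also uses the graceful command.

```cpp
// Hypothetical sketch of the graceful/immediate retry policy.
// None of these names are taken from the maintenance source.
enum class BmcCommandMode { Graceful, Immediate };

constexpr int kMaxRetries = 10;      // retry limit for power-on, power-off and reset
constexpr int kGracefulRetries = 5;  // retries 1..5 use the graceful command

// retry == 0 is the initial attempt; 1..kMaxRetries are the retries.
// The initial attempt and the first 5 retries are graceful; the
// remaining retries use the immediate command.
constexpr BmcCommandMode select_command_mode(int retry) {
    return (retry <= kGracefulRetries) ? BmcCommandMode::Graceful
                                       : BmcCommandMode::Immediate;
}

// True once every allowed retry has been used.
constexpr bool retries_exhausted(int retry) {
    return retry > kMaxRetries;
}
```

A handler built this way would call select_command_mode() with its
current retry count before each attempt and give up once
retries_exhausted() returns true.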

This update also removes a power-on handling case that could
have led to a stuck state. The sequence of intermittent command
failures required to hit this case made it virtually impossible
to reach, but the handling was fixed anyway.

Issues have been seen with the power-off handling on some servers.
Those servers are suspected to need more time to power off. This
update therefore introduces a 30 second delay following a power-off
command before issuing the power status query, giving the server
time to power off before the power-off command is retried.
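
The delay described above can be sketched as a simple elapsed-time
check. This is only an illustration under assumptions: the names Clock,
kPowerOffQueryDelay and power_query_due are hypothetical, and the actual
change may implement the wait differently (for example with a
maintenance timer).

```cpp
#include <chrono>

// Hypothetical sketch; names are not taken from the maintenance source.
using Clock = std::chrono::steady_clock;

// Delay between sending a power-off command and querying power status.
constexpr std::chrono::seconds kPowerOffQueryDelay{30};

// Returns true once at least 30 seconds have passed since the
// power-off command was sent, meaning the power status query may
// now be issued.
inline bool power_query_due(Clock::time_point power_off_sent,
                            Clock::time_point now) {
    return (now - power_off_sent) >= kPowerOffQueryDelay;
}
```

Using a monotonic clock keeps the check unaffected by wall-clock
adjustments on the controller.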

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: Verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
Makefile Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcAlarm.cpp Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
mtcAlarm.h Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
mtcBmcUtil.cpp Improve maintenance power/reset control command retry handling 2024-01-25 22:42:26 +00:00
mtcBmcUtil.h Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
mtcCmdHdlr.cpp Add bmc reset delay in the reset progression command handler 2023-11-02 20:58:00 +00:00
mtcCompMsg.cpp Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
mtcCtrlMsg.cpp Add bmc reset delay in the reset progression command handler 2023-11-02 20:58:00 +00:00
mtcHttpSvr.cpp Fix Mtce's VIM systems query handling 2019-10-09 09:44:35 -04:00
mtcHttpSvr.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcHttpUtil.cpp Cleanup mtcAgent error logging during startup 2023-02-14 14:18:02 -05:00
mtcHttpUtil.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcInvApi.cpp Prevent mtcClient from sending to uninitialized socket in AIO SX 2021-04-21 10:20:10 -04:00
mtcInvApi.h Fix format-overflow warning in mtcInvApi 2019-08-27 10:33:44 -05:00
mtcNodeComp.cpp Add Debian packaging for mtce packages 2021-10-29 09:17:00 -05:00
mtcNodeComp.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcNodeCtrl.cpp Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
mtcNodeFsm.cpp Prevent mtcClient from sending to uninitialized socket in AIO SX 2021-04-21 10:20:10 -04:00
mtcNodeFsm.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcNodeHdlrs.cpp Improve maintenance power/reset control command retry handling 2024-01-25 22:42:26 +00:00
mtcNodeHdlrs.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcNodeMnfa.cpp Fix Graceful Recovery handling while in Graceful Recovery handling 2021-03-17 14:25:19 -04:00
mtcNodeMsg.h Add support for peer controller reset via mtcClient 2021-01-14 16:44:14 -05:00
mtcSmgrApi.cpp Debian: Fix mtcAgent segfault on SM host state change requests 2022-06-26 20:18:20 +00:00
mtcSmgrApi.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcStubs.cpp Implement Active-Active Heartbeat as HA Improvement Fix 2018-12-10 09:57:34 -05:00
mtcSubfHdlrs.cpp Debian: Make Mtce offline handler more resilient to slow shutdowns 2022-10-24 15:57:43 +00:00
mtcThreads.cpp Improve maintenance power/reset control command retry handling 2024-01-25 22:42:26 +00:00
mtcThreads.h Add redfish power/reset/reinstall bmc support to maintenance 2019-09-26 15:59:35 -04:00
mtcVimApi.cpp Add bmc reset delay in the reset progression command handler 2023-11-02 20:58:00 +00:00
mtcVimApi.h Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
mtcWorkQueue.cpp [Trivial Fix] fix typos in docstrings 2019-02-21 14:46:06 +08:00