metal/mtce/src
Eric Macdonald 50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are done
with the immediate action command.
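
The retry policy described above can be sketched as follows. This is an illustrative Python model only, not the mtce C++ implementation: the names `MAX_RETRIES`, `GRACEFUL_RETRIES` and `select_command` are hypothetical. It assumes, as the test plan below states, that the initial attempt (treated here as retry 0) also uses the graceful command.

```python
# Hypothetical model of the graceful/immediate retry split.
MAX_RETRIES = 10       # retries for power-on, power-off and reset (per this commit message)
GRACEFUL_RETRIES = 5   # retries that still use the graceful command


def select_command(retry: int) -> str:
    """Return the command flavor for a given retry number.

    Retry 0 is the initial attempt. Retries 1..5 use the graceful
    command and retries 6..10 use the immediate command.
    """
    if retry < 0 or retry > MAX_RETRIES:
        raise ValueError(f"retry {retry} is outside 0..{MAX_RETRIES}")
    return "graceful" if retry <= GRACEFUL_RETRIES else "immediate"


# Full schedule: the initial attempt followed by all 10 retries.
schedule = [select_command(r) for r in range(MAX_RETRIES + 1)]
```

Under these assumptions the schedule has 11 entries: the initial attempt plus 5 graceful retries, then 5 immediate retries.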

This update also removes a power-on handling case that could
have led to a stuck state. This case was virtually impossible
to hit, given the required sequence of intermittent command
failures, but the scenario handling was fixed up anyway.

Issues have been seen with the power-off handling on some servers.
It is suspected that those servers need more time to power off. So,
this update introduces a 30 second delay following a power-off command
before issuing the power status query, to give the server some time to
power off before retrying the power-off command.
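
The delay-then-query sequence above might look like the following sketch. It is a simplified Python model, not the actual mtce code: `power_off_with_delay`, `send_power_off` and `query_power_status` are hypothetical names, and the loop does not model command failures or the FSM stages of the real handler. The `sleep` parameter is injectable so the 30 second wait can be replaced in tests.

```python
import time
from typing import Callable

POWER_OFF_QUERY_DELAY_SECS = 30  # delay stated in this commit message


def power_off_with_delay(
    send_power_off: Callable[[int], None],   # hypothetical: issues the power-off command
    query_power_status: Callable[[], str],   # hypothetical: returns "on" or "off"
    sleep: Callable[[float], None] = time.sleep,
    max_retries: int = 10,
) -> bool:
    """Issue power-off, wait, then query status; retry while still on.

    Returns True once the status query reports "off", or False after
    the initial attempt plus max_retries retries are exhausted.
    """
    for retry in range(max_retries + 1):
        send_power_off(retry)
        # Give the server time to power off before asking for its status.
        sleep(POWER_OFF_QUERY_DELAY_SECS)
        if query_power_status() == "off":
            return True
    return False
```

Waiting before the status query avoids treating a server that is still shutting down as a failed power-off, which would otherwise trigger an unnecessary retry.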

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify Redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: Verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
..
alarm Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
common Improve maintenance power/reset control command retry handling 2024-01-25 22:42:26 +00:00
fsmon Replace a file test from fsmond 2023-11-17 08:15:28 -03:00
fsync Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
heartbeat Remove swerr log in hbsAgent cluster delete 2021-06-14 19:04:33 -04:00
hostw Change hostwd emergency log to write to /dev/kmsg 2023-02-01 23:41:14 +00:00
hwmon Re-enable sensor suppression support in Mtce Hardware Monitor 2022-08-06 00:02:29 +00:00
lmon Fix failing mtce services on Debian 2022-01-14 10:50:09 -03:00
maintenance Improve maintenance power/reset control command retry handling 2024-01-25 22:42:26 +00:00
mtclog Set restricted permissions for mtce logfiles 2019-07-17 18:19:52 -04:00
pmon Fix bashate failure in zuul 2022-10-06 17:22:12 +00:00
public Fix mtce build error with gcc-8.2.1 2020-04-03 14:44:21 +08:00
scripts Failure case handling of LUKS service 2023-12-06 00:34:02 -05:00
LICENSE Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00
Makefile Remove Resource Monitor ; aka rmon, from the load 2019-03-19 16:12:38 -04:00