metal/mtce-common
Eric MacDonald 0826882308 Add mtcAgent socket initialization failure retry handling.
The main maintenance process (mtcAgent) exits if socket initialization
fails at process start-up. SM restarts the failed process within
seconds and will swact if the second restart also fails. The time from
startup to swact can be as short as 4 seconds, which is too short to
ride out a collision with a manifest.

This update adds a bounded number of socket initialization retries.
Waiting between retries gives the colliding manifest time to complete,
which extends the time the process has to recover from a socket
initialization failure.

The number of retries and the inter-retry wait time are calibrated to
ensure that a persistently failing mtcAgent process exits in under 40
seconds.

This is to ensure that SM is able to detect and swact away from a
persistently failing maintenance process while also giving the process
a few tries to resolve on its own.
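
The retry behavior described above can be sketched roughly as follows.
This is an illustration only, not the code from this change: the
function name init_sockets_with_retry, the injected callbacks, and the
retry count and wait values are all hypothetical. The only figure taken
from the description is the 40 second exit bound.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical calibration; the real values are not stated in the
// commit message. Only the 40 second bound comes from the description.
constexpr int kMaxSocketInitRetries = 3;   // assumed retry count
constexpr int kRetryWaitSecs        = 10;  // assumed inter-retry wait
constexpr int kExitBudgetSecs       = 40;  // bound from the description

// The waits alone must fit inside the exit budget. Time spent inside
// each initialization attempt is extra and is not modeled here.
static_assert(kMaxSocketInitRetries * kRetryWaitSecs < kExitBudgetSecs,
              "retry waits must leave headroom under the exit budget");

// Default sleeper that really blocks; tests can inject a fake instead.
inline void real_sleep(std::chrono::seconds wait)
{
    std::this_thread::sleep_for(wait);
}

// Try init_sockets once, then up to max_retries more times, sleeping
// between attempts. Returns true as soon as an attempt succeeds, or
// false when every attempt failed (the caller would then exit so that
// SM can detect the failure and swact).
inline bool init_sockets_with_retry(
    const std::function<bool()>& init_sockets,
    int max_retries,
    std::chrono::seconds wait,
    const std::function<void(std::chrono::seconds)>& sleep_fn)
{
    for (int attempt = 0; attempt <= max_retries; ++attempt)
    {
        if (init_sockets())
            return true;
        if (attempt < max_retries)
            sleep_fn(wait);  // give the colliding manifest time to finish
    }
    return false;
}
```

A caller would pass its own socket setup routine, for example
init_sockets_with_retry(setup, kMaxSocketInitRetries,
std::chrono::seconds(kRetryWaitSecs), real_sleep), and exit with a
failure status when it returns false.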

Test Plan:

PASS: Verify socket init failure thresholded retry handling
      with no, persistent and recovered failure conditions.
PASS: Verify swact if socket init failure is persistent
PASS: Verify no swact if socket failure recovers after first exit
PASS: Verify no swact if socket failure recovers over init retry
PASS: Verify an hour long soak of continuous socket open/close retry

Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
Closes-Bug: 1869192
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-04-01 19:24:22 +00:00
centos Add redfish support detection to maintenance 2019-08-19 14:03:37 +00:00
opensuse Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
src Add mtcAgent socket initialization failure retry handling. 2020-04-01 19:24:22 +00:00
PKG-INFO Decouple Guest-server/agent from stx-metal 2018-09-18 17:15:08 -04:00