StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 0826882308 Add mtcAgent socket initialization failure retry handling.
The main maintenance process (mtcAgent) exits if socket initialization
fails during process start-up. SM restarts the failed process within
seconds and will swact if the second restart also fails. The time from
startup to swact can be as short as 4 seconds, which is too short to
ride out a collision with a manifest.

This update adds a number of socket initialization retries to extend
the time the process has to resolve socket initialization failures by
giving the collided manifest time to complete between retries.

The number of retries and the inter-retry wait time are calibrated to
ensure that a persistently failing mtcAgent process exits in under 40
seconds.

This ensures that SM can still detect and swact away from a
persistently failing maintenance process, while giving the process
a few tries to recover on its own.
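
The bounded retry described above can be sketched as follows. This is a
minimal illustration in Python, not the mtcAgent implementation: the
function name, the retry count and the wait time are all assumed values
chosen only so that the worst-case wait stays under the 40 second limit
stated in this commit message.

```python
import time

EXIT_BUDGET_SECS = 40  # from the commit message: a persistent failure must exit in under 40 s
MAX_RETRIES = 3        # assumed value, not taken from the mtce source
RETRY_WAIT_SECS = 10   # assumed value, not taken from the mtce source


def init_with_retries(init_sockets, max_retries=MAX_RETRIES,
                      wait_secs=RETRY_WAIT_SECS, sleep=time.sleep):
    """Call init_sockets() until it succeeds or the retries run out.

    Returns True on success. False tells the caller to exit so that
    SM can restart the process or swact.
    """
    # The worst-case time spent waiting must stay inside the exit budget.
    if max_retries * wait_secs >= EXIT_BUDGET_SECS:
        raise ValueError("retry schedule exceeds the exit budget")

    for attempt in range(max_retries + 1):  # first try plus the retries
        if init_sockets():
            return True
        if attempt < max_retries:
            sleep(wait_secs)  # give a colliding manifest time to finish
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real
delays; a caller would normally leave it at the default.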

Test Plan:

PASS: Verify socket init failure thresholded retry handling
      with no, persistent and recovered failure conditions.
PASS: Verify swact if socket init failure is persistent
PASS: Verify no swact if socket failure recovers after first exit
PASS: Verify no swact if socket failure recovers over init retry
PASS: Verify an hour-long soak of continuous socket open/close retry

Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
Closes-Bug: 1869192
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-04-01 19:24:22 +00:00
api-ref/source Update landing pages for docs, api-ref, and release notes: 2020-02-07 12:24:12 -08:00
bsp-files Update pxeboot kickstart to allow for hybrid install 2020-02-12 11:58:24 -05:00
devstack Security: Handle nospectre_v1 in the bootargs 2020-01-28 18:21:13 -05:00
doc Update landing pages for docs, api-ref, and release notes: 2020-02-07 12:24:12 -08:00
installer Remove unused post_clone_iso_ks.cfg 2020-01-20 18:00:03 -05:00
kickstart Security: Handle nospectre_v1 in the bootargs 2020-01-28 18:21:13 -05:00
mtce Add mtcAgent socket initialization failure retry handling. 2020-04-01 19:24:22 +00:00
mtce-common Add mtcAgent socket initialization failure retry handling. 2020-04-01 19:24:22 +00:00
mtce-compute Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
mtce-control Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
mtce-storage Update openSUSE OBS artifacts to build MTCE packages 2019-10-01 11:07:10 -05:00
releasenotes Update landing pages for docs, api-ref, and release notes: 2020-02-07 12:24:12 -08:00
tools/rvmc/centos Fix rvmc container build 2020-01-20 17:50:27 +00:00
.gitignore Update tox.ini files to use stein constraints 2019-06-25 13:20:35 -04:00
.gitreview OpenDev Migration Patch 2019-04-19 19:52:33 +00:00
.zuul.yaml Adding job to upload commits to GitHub 2020-02-06 11:34:00 -05:00
CONTRIBUTORS.wrs StarlingX open source release updates 2018-05-31 07:36:43 -07:00
LICENSE StarlingX open source release updates 2018-05-31 07:36:43 -07:00
README.rst Followup opendev cleanup and test jobs 2019-04-22 16:42:03 +00:00
centos_build_layer.cfg Build layering, add layer build config file 2019-10-15 19:19:45 +08:00
centos_iso_image.inc Remove unused inventory and python-inventoryclient 2020-01-08 14:12:05 -06:00
centos_pkg_dirs rvmc: remove un-used build data 2020-01-16 08:39:54 -08:00
centos_stable_docker_images.inc Utility to install a server via Redfish 2019-12-31 15:34:54 +00:00
pylint.rc Add pylint checks for python files in metal 2020-01-03 13:27:00 -06:00
test-requirements.txt pep8 job enable and fix pep8 reported issue 2018-09-06 09:45:51 +08:00
tox.ini Add pylint checks for python files in metal 2020-01-03 13:27:00 -06:00

README.rst

metal

StarlingX Bare Metal Management