537935bb0c
All compute hosts seen to self reboot by hostw during patching due to stuck pmond process

The current method of killing a running process leads to a race condition that results in a user-space futex deadlock. The deadlock hangs pmond, and the resulting quorum master 'pmond' failure triggers a watchdog self-reset.

The deadlock was traced to the ordering of the kill steps.

Current steps to kill:
- kill process
- remove pidfile
- unregister pid with kernel

The deadlock is avoided by reversing the kill steps into a more logical order:
- unregister pid with kernel
- remove pidfile
- kill process

Also introduced an audit that registers manually restarted processes with the kernel.

Failure Rate Before Fix: 1 in every 25 process restarts. Mostly fails before 5.
Failure Rate After Fix: No failures after 15000 process restarts across 8 hosts, including all host types, between 2 different labs and 2 different loads (18.07 and 18.08).

Test Method: Pmon restart regression test restarts all processes on a host. Total soak restart of 25 monitored processes for 50 loops over 12 hosts = 15000 restarts. Also regressed process kill / recovery handling (5000 process recoveries).

Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
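
The reordered kill sequence described in the commit message can be illustrated with a minimal Python sketch. This is not the pmond implementation (that code is not shown here). The function name `stop_monitored_process` and the `unregister_pid` callback are hypothetical, and the callback only stands in for the "unregister pid with kernel" step, whose real interface is not given in the commit message. The sketch assumes a POSIX host.

```python
import os
import signal
from typing import Callable, List


def stop_monitored_process(
    pid: int,
    pidfile: str,
    unregister_pid: Callable[[int], None],
    sig: int = signal.SIGKILL,
) -> List[str]:
    """Stop a process using the post-fix step order; return the steps taken.

    Hypothetical helper for illustration only.
    """
    steps: List[str] = []

    # 1. Unregister the pid first. unregister_pid is an assumed stand-in
    #    for the kernel unregistration step named in the commit message.
    unregister_pid(pid)
    steps.append("unregister")

    # 2. Remove the pidfile. A pidfile that is already gone is tolerated.
    try:
        os.remove(pidfile)
    except FileNotFoundError:
        pass
    steps.append("remove_pidfile")

    # 3. Signal the process last. It may already have exited on its own.
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass
    steps.append("kill")

    return steps
```

A caller would pass the pid, the pidfile path, and whatever function performs the kernel unregistration on that platform. The returned list only records the order of the steps, which makes the ordering easy to assert in a test.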

bsp-files
installer
kickstart
mtce-common
mtce-compute
mtce-control
mtce-storage
.gitignore
.gitreview
.zuul.yaml
CONTRIBUTORS.wrs
LICENSE
README.rst
centos_pkg_dirs
mwa-beas.map
test-requirements.txt
tox.ini

README.rst
stx-metal
StarlingX Bare Metal Management