StarlingX Bare Metal and Node Management, Hardware Maintenance
Eric MacDonald 537935bb0c Reorder process restart operations to prevent pmond futex deadlock
All compute hosts were seen to self-reboot, triggered by hostw, during
patching due to a stuck pmond process.

The current method of killing a running process leads to a race condition
that results in a user-space futex deadlock, which hangs pmond and
causes a watchdog self-reset due to quorum master 'pmond' failure.

The deadlock was traced to the ordering of the steps in the kill procedure.

Current steps to kill:

 - kill process
 - remove pidfile
 - unregister pid with kernel

The deadlock is avoided by reversing the kill steps into a more
logical order:

 - unregister pid with kernel
 - remove pidfile
 - kill process
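
The reordered sequence above can be sketched roughly as follows. This is a minimal illustration in Python, not the pmond implementation: `stop_monitored_process`, `demo`, and the `unregister` callback are hypothetical names, and the callback merely stands in for whatever mechanism removes the pid from kernel notification. It assumes a POSIX host.

```python
import os
import signal
import subprocess
import sys
import tempfile


def stop_monitored_process(pid, pidfile, unregister):
    """Stop a process using the reordered steps from the commit message.

    `unregister` is a hypothetical callback; the real kernel
    unregistration mechanism is not shown here.
    """
    steps = []

    # 1. Unregister the pid with the kernel first.
    unregister(pid)
    steps.append("unregister")

    # 2. Remove the pidfile (tolerate it already being gone).
    try:
        os.remove(pidfile)
    except FileNotFoundError:
        pass
    steps.append("remove_pidfile")

    # 3. Kill the process last (tolerate it having already exited).
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    steps.append("kill")
    return steps


def demo():
    """Run the sequence against a throwaway child process."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    registered = {child.pid}  # stand-in for the kernel's registration table
    with tempfile.TemporaryDirectory() as tmp:
        pidfile = os.path.join(tmp, "demo.pid")
        with open(pidfile, "w") as f:
            f.write(str(child.pid))
        steps = stop_monitored_process(child.pid, pidfile, registered.discard)
        child.wait()
        return steps, registered, os.path.exists(pidfile), child.returncode
```

Calling `demo()` spawns a sleeping child, runs the three steps in the new order, and returns the recorded step order along with the leftover state, so the ordering can be checked directly.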

Also introduced an audit that registers manually restarted processes
with the kernel.
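
One way such an audit could work is sketched below, under stated assumptions: a manually restarted process shows up as a pidfile whose pid differs from the one last registered. The names `read_pid`, `pid_alive`, `audit_registrations`, and the `register` callback are all hypothetical; the commit message does not describe the actual audit logic.

```python
import os
import tempfile


def read_pid(pidfile):
    """Return the positive pid stored in `pidfile`, or None if unusable."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
    return pid if pid > 0 else None


def pid_alive(pid):
    """Check for a live process by sending signal 0 (no signal delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def audit_registrations(pidfiles, registered, register):
    """Re-register any live process whose pid differs from the recorded one.

    pidfiles:   dict of process name -> pidfile path
    registered: dict of process name -> pid last registered (updated in place)
    register:   hypothetical callback performing the kernel registration
    """
    fixed = []
    for name, pidfile in pidfiles.items():
        pid = read_pid(pidfile)
        if pid is None or not pid_alive(pid):
            continue  # nothing running to register
        if registered.get(name) != pid:
            register(name, pid)
            registered[name] = pid
            fixed.append(name)
    return fixed
```

Run periodically, a loop like this would pick up a process restarted outside the monitor's control on the next pass, and a second pass with nothing changed would register nothing.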

Failure Rate Before Fix: 1 in every 25 process restarts.
                         Mostly fails before 5.

Failure Rate  After Fix: No failures after 15000 process restarts
across 8 hosts, including all host types, in 2 different labs and on
2 different loads (18.07 and 18.08).

Test Method: The pmon restart regression test restarts all processes
on a host. Total soak: 25 monitored processes x 50 loops x 12 hosts
= 15000 restarts.
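
The soak total can be checked with a few lines of arithmetic. The expected-failure figure is only a rough illustration: it assumes the pre-fix rate of 1 in 25 applied uniformly across restarts, which the commit message does not claim.

```python
# Figures taken from the test method described above.
processes_per_host = 25
loops = 50
hosts = 12

total_restarts = processes_per_host * loops * hosts

# Assumption: failures at the pre-fix rate (1 in 25) are spread evenly.
expected_failures_at_old_rate = total_restarts // 25

print(total_restarts)                 # 15000
print(expected_failures_at_old_rate)  # 600
```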

Also regression-tested process kill / recovery handling
(5000 process recoveries).

Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007
2018-08-16 20:22:15 +00:00
Name                   Last commit                                                         Date
bsp-files              Extend cgcs disk partition for gnocchi usage                        2018-08-08 15:54:44 -04:00
installer              Update boot configs to match CentOS 7.5 kernel                      2018-07-06 11:26:06 -04:00
kickstart              Rename mwa-* subdirectories to match the git repo name              2018-07-03 16:29:24 -04:00
mtce-common            Reorder process restart operations to prevent pmond futex deadlock  2018-08-16 20:22:15 +00:00
mtce-compute           Rename mwa-* subdirectories to match the git repo name              2018-07-03 16:29:24 -04:00
mtce-control           Rename mwa-* subdirectories to match the git repo name              2018-07-03 16:29:24 -04:00
mtce-storage           Rename mwa-* subdirectories to match the git repo name              2018-07-03 16:29:24 -04:00
.gitignore             Add default test framework                                          2018-06-11 18:51:02 -05:00
.gitreview             Add .gitreview                                                      2018-05-31 07:36:43 -07:00
.zuul.yaml             Remove non-voting gate job                                          2018-06-29 14:31:56 -05:00
CONTRIBUTORS.wrs       StarlingX open source release updates                               2018-05-31 07:36:43 -07:00
LICENSE                StarlingX open source release updates                               2018-05-31 07:36:43 -07:00
README.rst             StarlingX open source release updates                               2018-05-31 07:36:43 -07:00
centos_pkg_dirs        Split centos-pkg-dirs along git boundaries.                         2018-06-20 16:25:33 -04:00
mwa-beas.map           StarlingX open source release updates                               2018-05-31 07:36:43 -07:00
test-requirements.txt  Add default test framework                                          2018-06-11 18:51:02 -05:00
tox.ini                Add default test framework                                          2018-06-11 18:51:02 -05:00

README.rst

stx-metal

StarlingX Bare Metal Management