All compute hosts seen to self reboot by hostw during patching due to
stuck pmond process
Current method to kill the running process leads to a race condition
that results in a user space futex dead lock that hangs pmond and
results in a watchdog self-reset due to quorum master 'pmond' failure.
The dead lock was traced to the ordering of the kill process.
Current steps to kill:
- kill process
- remove pidfile
- unregister pid with kernel
Deadlock is avoided by reversing the kill steps to what
is more logical.
- unregister pid with kernel
- remove pidfile
- kill process
Also introduced audit that registers manually restarted processes
with the kernel.
Failure Rate Before Fix: 1 every 25 process restarts.
Mostly fails before 5.
Failure Rate After Fix: No failures after 15000 process restarts
across 8 hosts including all host types between 2 different labs 2
different loads 18.07 and 18.08.
Test Method: Pmon restart regression test restarts all processes on
a host. Total soak restart of 25 monitored processes for 50 loops
over 12 hosts = 15000 restarts.
Also regressed process kill / recovery handling.
(5000 process recoveries)
Change-Id: Icac64df52df9d8074fcd886567dda6e53641572d
Signed-off-by: David Sullivan <david.sullivan@windriver.com>
Story: 2002993
Task: 23007