Add check to avoid restarting running device plugin pod

This script always restarted the local sriov device plugin pod,
which could result in sriov pods not starting properly.

Originally, this sequence of commands would not work properly if the
device plugin was already running:

  kubectl delete pods -n kube-system --selector=app=sriovdp \
    --field-selector=spec.nodeName=${HOST} --wait=false

  kubectl wait pods -n kube-system --selector=app=sriovdp \
    --field-selector=spec.nodeName=${HOST} --for=condition=Ready \
    --timeout=360s

Result when device plugin is running:
pod "kube-sriov-device-plugin-amd64-rbjpw" deleted
pod/kube-sriov-device-plugin-amd64-rbjpw condition met

The wait command succeeds against the pod that was just deleted, so the
script continues. It then deletes the labeled pods without having
confirmed that the device plugin is running, which can result in sriov
pods not starting properly.

Restarting the device plugin pod only when it is not already Ready
prevents the wait condition from passing immediately.
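
The check-then-restart logic described above can be sketched as a
standalone shell function. This is a minimal illustration, not the code
from the script itself: the function name restart_sriovdp_if_not_ready
and the HOST default are assumptions made for the example, while the
kubectl commands and flags are the ones quoted in this message.

```shell
#!/bin/bash
# Hypothetical helper: the function name and the HOST default are
# illustrative assumptions, not taken from the original script.
HOST="${HOST:-$(hostname)}"

restart_sriovdp_if_not_ready() {
    local selector="--selector=app=sriovdp"
    local field="--field-selector=spec.nodeName=${HOST}"

    # --timeout=0s checks the Ready condition once instead of waiting.
    if kubectl wait pods -n kube-system "${selector}" "${field}" \
            --for=condition=Ready --timeout=0s; then
        # Already Ready: leave the running pod alone.
        return 0
    fi

    # Not Ready: delete the pod so it gets recreated, then wait for Ready.
    kubectl delete pods -n kube-system "${selector}" "${field}" --wait=false
    kubectl wait pods -n kube-system "${selector}" "${field}" \
        --for=condition=Ready --timeout=360s
}
```

The function returns the exit status of the last kubectl call, so a
caller can log an error and continue if the final wait times out.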

Closes-Bug: 1928965

Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I1cc576b26a4bba4eba4a088d33f918bb07ef3b0d
Cole Walker 2021-06-09 17:04:56 -04:00
parent 6c61e3b665
commit 8e84309624
1 changed files with 10 additions and 3 deletions

@@ -167,11 +167,18 @@ function _labeled_pods {
     # Don't have to restart device-plugin if no labeled pods are present. System may not be configured for SRIOV.
     if [ ! -z "${PODS}" ]; then
         LOG "Waiting for SRIOV device plugin pod to become available"
-        kubectl delete pods -n kube-system --selector=app=sriovdp --field-selector=spec.nodeName=${HOST} --wait=false
-        kubectl wait pods -n kube-system --selector=app=sriovdp --field-selector=spec.nodeName=${HOST} --for=condition=Ready --timeout=360s
+        # Check if device-plugin is ready, but do not wait
+        kubectl wait pods -n kube-system --selector=app=sriovdp --field-selector=spec.nodeName=${HOST} --for=condition=Ready --timeout=0s
+        # If device plugin is not ready, restart it and wait
         if [ "$?" -ne 0 ]; then
-            ERROR "SRIOV device plugin timed out on ready wait. Continuing anyway. SRIOV pods may not recover."
+            kubectl delete pods -n kube-system --selector=app=sriovdp --field-selector=spec.nodeName=${HOST} --wait=false
+            kubectl wait pods -n kube-system --selector=app=sriovdp --field-selector=spec.nodeName=${HOST} --for=condition=Ready --timeout=360s
+            if [ "$?" -ne 0 ]; then
+                ERROR "SRIOV device plugin timed out on ready wait. Continuing anyway. SRIOV pods may not recover."
+            fi
         fi
     fi