config/sysinv/sysinv/sysinv/sysinv
Saba Touheed Mujawar 4c42927040 Add retry robustness for Kubernetes upgrade control plane
A rare, intermittent failure can occur during the control plane
upgrade step: either puppet hits its timeout before the upgrade
completes, or kubeadm hits its own Upgrade Manifest timeout (at 5m).

This change retries the step by reporting the failure to the
conductor when the puppet manifest apply fails. Because the request
is sent over RPC with options, the return code is not available
directly, so a retry decorator cannot be used. Instead, the sysinv
report callback feature handles the success and failure paths.
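
The callback-driven retry described above can be sketched roughly as
follows. This is a minimal illustration, not the sysinv code: the
class, method, and return values are hypothetical, the state strings
are assumed placeholders, and the limit of 2 retries is taken from the
test plan below.

```python
# Illustrative sketch only: UpgradeTracker and its methods are
# hypothetical names, not sysinv APIs. State values are assumptions.
KUBE_UPGRADING_FIRST_MASTER = "upgrading-first-master"
KUBE_UPGRADING_SECOND_MASTER = "upgrading-second-master"
RETRYABLE_STATES = {KUBE_UPGRADING_FIRST_MASTER,
                    KUBE_UPGRADING_SECOND_MASTER}
MAX_RETRIES = 2  # assumed from the test plan ("retries 2 times")


class UpgradeTracker(object):
    """Tracks one control plane upgrade and retries via callbacks."""

    def __init__(self, state, apply_manifest):
        self.state = state
        self.retries = 0
        # Fire-and-forget request (for example an RPC cast); it returns
        # no result, which is why a retry decorator would not help.
        self._apply_manifest = apply_manifest

    def report_callback(self, success):
        """Called when the manifest apply result is reported back."""
        if self.state not in RETRYABLE_STATES:
            # For example the upgrade was aborted: do not retry.
            return "ignored"
        if success:
            self.state = "upgraded"  # placeholder success state
            return "done"
        if self.retries < MAX_RETRIES:
            self.retries += 1
            self._apply_manifest()
            return "retried"
        self.state = "failed"  # placeholder failure state
        return "failed"
```

The state check comes first so that a late failure report arriving
after an abort does not trigger another manifest apply.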

TEST PLAN:
PASS: Perform simplex and duplex k8s upgrade successfully.
PASS: Install iso successfully.
PASS: Manually send STOP signal to pause the process so that the
      puppet manifest times out, and check that the retry code works
      and that the upgrade completes during the retry attempts.
PASS: Manually decrease the puppet timeout to a very low number
      and verify that the code retries 2 times and updates the
      failure state.
PASS: Perform orchestrated k8s upgrade, manually send STOP
      signal to pause the kubeadm process during step
      upgrading-first-master and perform system kube-upgrade-abort.
      Verify that the upgrade is aborted successfully and that the
      code does not attempt the retry mechanism for
      k8s upgrade control-plane, because the upgrade is not in the
      required KUBE_UPGRADING_FIRST_MASTER or
      KUBE_UPGRADING_SECOND_MASTER state.
PASS: Perform manual k8s upgrade; on k8s upgrade control-plane
      failure, perform a manual upgrade-abort successfully.
      Perform orchestrated k8s upgrade; on k8s upgrade control-plane
      failure after retries, nfv aborts automatically.

Closes-Bug: 2056326

Depends-on: https://review.opendev.org/c/starlingx/nfv/+/912806
            https://review.opendev.org/c/starlingx/stx-puppet/+/911945
            https://review.opendev.org/c/starlingx/integ/+/913422

Change-Id: I5dc3b87530be89d623b40da650b7ff04c69f1cc5
Signed-off-by: Saba Touheed Mujawar <sabatouheed.mujawar@windriver.com>
2024-03-19 08:49:36 -04:00
agent Merge "Report port and device inventory after the worker manifest" 2024-03-11 16:19:09 +00:00
api Merge "Fix LDAP issue for DC subcloud" 2024-03-13 20:18:24 +00:00
cert_alarm Change cert-alarm service audit behavior 2024-03-05 12:52:18 -05:00
cert_mon Merge "Disable cert-mon audit for subclouds being rehomed" 2023-11-09 21:37:53 +00:00
cmd Implement IPsec Cert-Renewal Operation 2024-03-08 12:24:02 -03:00
common Add retry robustness for Kubernetes upgrade control plane 2024-03-19 08:49:36 -04:00
conductor Add retry robustness for Kubernetes upgrade control plane 2024-03-19 08:49:36 -04:00
db Introduce Puppet variables for primary and secondary pool addresses. 2024-03-12 07:25:46 -03:00
helm Merge "Fix delete process to apps that have charts disabled" 2024-03-07 13:43:22 +00:00
ipsec_auth Addition of OTS Token activation procedure 2024-03-13 18:32:13 -03:00
loads Update extract playbooks target directory 2023-05-02 13:50:36 +00:00
objects New RESTful API and DB schema for network to address-pools. 2024-03-06 07:34:14 -03:00
openstack Restore openstack/common/context file 2023-05-24 12:43:16 +00:00
puppet Add retry robustness for Kubernetes upgrade control plane 2024-03-19 08:49:36 -04:00
tests Add retry robustness for Kubernetes upgrade control plane 2024-03-19 08:49:36 -04:00
zmq_rpc New RESTful API and DB schema for network to address-pools. 2024-03-06 07:34:14 -03:00
__init__.py Fix tox certificate issues in python2 2021-11-18 15:14:51 -06:00
_i18n.py Eliminate sdist step from sysinv zuul 2021-04-12 09:34:17 -05:00
netconf.py Fix bad syntax in requirements.txt file 2021-09-14 09:15:56 -05:00
sanity_coverage.py Fix tox-docs failing sphinx 2022-05-31 13:56:30 +00:00
version.py Remove python2 jobs from zuul for this repo 2023-02-07 19:36:45 +00:00