Commit Graph

269 Commits

Author SHA1 Message Date
Joshua Reed 967eedadb7 Apply Helm Overrides to initially disabled charts.
The previous implementation of the _get_list_of_charts
method would not take into account whether or not a
particular application chart was enabled or disabled.

This change now only includes charts that are enabled, or
if the function caller asks for all of them with the
include_disabled override set.  The override is set as a
part of the perform_app_upload routine to ensure overrides
are generated and applied to all charts, including those
which are initially disabled.

This change also seeks to handle issues where the
kustomization-orig.yaml file is not created by the time the
perform_app_upload routine runs _get_list_of_charts by
including an extra check.

Finally, the override generation in the perform_app_apply
function is moved to happen first in the sequence of events
such that the app object is populated with overrides prior
to any other operations occurring.  This must be done to
ensure the correct chart list is used.

This fix ensures that:

1. When all charts are needed (e.g. when determining all the
   container images needed for the application), an option can
   be specified. This is done with the include_disabled flag.
2. All possible charts, as filtered by the metadata/user
   driven and DB stored enabled status, are consistently
   returned regardless of the current state of the top-level
   application kustomization.yaml.
3. A final check for kustomization-orig.yaml is performed and
   the file is created, if missing, before
   _get_list_of_charts executes with include_disabled=True

Test Plan:
PASS: build-pkgs -a && build-image
PASS: AIO-SX full install with clean bootstrap
PASS: Enable the cms-replication chart on the dell-storage app
PASS: Use system helm-override-update to pass
      --set config.clusterID=ClusterA
PASS: system application-apply dell-storage
PASS: Check the YAML structure of the
      configmap/dell-replication-controller-config
      for ClusterA and properly formatted.
PASS: Additional check to ensure that stx-openstack application
      successfully uploads and applies.
PASS: Check that helm overrides are generated even for an
      application that doesn't have a kustomize operator.  This
      was done for the metrics-server app.  A helm override was
      created and the subsequent metrics-server.yaml file in
      /opt/platform/helm contained the override after the
      system application-apply command was run.

Relates to previous attempt at a fix:
https://review.opendev.org/c/starlingx/config/+/890570

Closes-Bug: 2029303

Change-Id: I4c501b982e4061e5067ca0e8e43f37a9eecfcb68
Signed-off-by: Joshua Reed <joshua.reed@windriver.com>
2024-04-12 13:12:28 -06:00
Igor Soares b1b160f48b Fix charts upload when there are existing ones
This fixes a bug that prevents StarlingX application charts from
being uploaded to the helm repository when one or more of them have been
uploaded before.

The charts upload logic was changed to check if all charts provided by
the given application are valid prior to uploading. If a chart is
invalid then no charts for that application will be uploaded, since the
upload process cannot proceed in that scenario.
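
A minimal sketch of the validate-everything-first approach is shown below. It
is an assumption-laden illustration: upload_charts, is_valid and upload are
hypothetical names, and ValueError stands in for whatever exception the real
code raises.

```python
from typing import Callable, Iterable, List


def upload_charts(charts: Iterable[str],
                  is_valid: Callable[[str], bool],
                  upload: Callable[[str], None]) -> List[str]:
    """Upload charts only if every one of them passes validation."""
    charts = list(charts)
    invalid = [chart for chart in charts if not is_valid(chart)]
    if invalid:
        # Nothing has been uploaded yet, so the repository is left untouched.
        raise ValueError(
            "Invalid charts, nothing uploaded: %s" % ", ".join(invalid))
    for chart in charts:
        upload(chart)
    return charts
```

Separating the validation pass from the upload pass is what keeps a single
bad chart from leaving the repository in a partially updated state.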

Test Plan:
PASS: build-pkgs -a && build-image
PASS: AIO-SX fresh install
PASS: Build a platform-integ-apps version containing one existing chart
      and two nonexistent charts in the local Helm repository.
      Update platform-integ-apps to the built version.
      Confirm that the existing chart was not re-uploaded and that the
      nonexistent ones were correctly uploaded to the Helm repository.
PASS: Apply/remove/delete platform-integ-apps

Closes-Bug: 2053074
Depends-on: https://review.opendev.org/c/starlingx/integ/+/912305

Change-Id: I155d457f58be1986cc6f25178929aedfbe1d0693
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2024-04-02 12:05:28 -03:00
Zuul c24d0950bc Merge "Fix delete process to apps that have charts disabled" 2024-03-07 13:43:22 +00:00
David Bastos c9b71ebd65 Fix delete process to apps that have charts disabled
When deleting an application that has one chart or more disabled,
the app framework was not able to correctly delete the disabled
charts from the helm repository.

If, after deleting an app, an attempt was made to upload that same
app, a failure would occur, informing that the charts were already
in the helm repository.

The correction consists of using the kustomization-orig.yaml file
instead of kustomization.yaml in the deletion process to list the
enabled and disabled charts.

Another fix was made in case an application has the status of
"upload failed" and an attempt is made to delete another app. This
caused a Python runtime error because the get_chart_tarball_path
function tried to access the dictionary key and it wasn't there.

The solution was to check if the key for that chart exists and only
then try to access it. New logs are added to alert the user if the
chart does not exist.
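
The guarded lookup could look roughly like the sketch below. The function name
comes from the commit message, but its signature, the tarball_paths mapping
and the log wording are assumptions made for illustration.

```python
import logging

LOG = logging.getLogger(__name__)


def get_chart_tarball_path(tarball_paths, chart_name):
    """Return the tarball path for chart_name, or None if it is unknown.

    tarball_paths is a hypothetical dict mapping chart names to file paths.
    """
    path = tarball_paths.get(chart_name)
    if path is None:
        # Log instead of raising KeyError so the caller can carry on.
        LOG.warning("Chart %s not found in the tarball path map", chart_name)
    return path
```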

Test Plan:
PASS: Build-pkgs
PASS: Upload, apply, remove and delete dell-storage
PASS: Upload, apply, remove and delete oidc-auth-apps
PASS: upload, apply, remove and delete metrics-server
PASS: Deletes app that has charts disabled and all charts are
      deleted from the helm repository correctly.
PASS: After deleting and trying to upload the same app, no error
      occurs and the upload and apply process is completed
      successfully.
PASS: Deleting an app with another app with "upload failed"
      status and no Python runtime error occurs

Closes-Bug: 2055697

Change-Id: I22de414e8780fe3691d06bdd015e4c927dcc10f0
Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
2024-03-05 17:20:31 -03:00
Zuul ff0a24e8db Merge "Fix sysinv-helm command." 2024-03-04 19:00:26 +00:00
Joshua Reed b2e75b771f Fix sysinv-helm command.
Running the sysinv-helm command to create fluxcd app
overrides fails with a stack trace printed to terminal.

This fix pre-populates the apps_metadata_dict variable with
an empty dictionary in the event that it is empty, which it would
be the first time this command is run.

Test Plan:
PASS: Verify command 1 below produces {} in the output yaml.
PASS: Verify command 2 below produces fully populated output
      across several yaml files.
PASS: Verify sysinv conductor starts up correctly with these
      changes in place.

Command 1:
system application-upload \
  /usr/local/share/applications/helm/metrics-server-1.1-44.tgz
sudo sysinv-helm create-fluxcd-app-overrides \
  /home/sysadmin metrics-server metrics-server

Command 2:
sudo sysinv-helm create-fluxcd-app-overrides \
  /home/sysadmin platform-integ-apps kube-system

Closes-Bug: 2055463

Change-Id: I3251776653bcfb1cf11f3dfec388953d476b8465
Signed-off-by: Joshua Reed <joshua.reed@windriver.com>
2024-02-29 13:33:53 -07:00
Zuul 7e9e133870 Merge "Fix misleading app status after failed override update" 2024-02-29 18:11:47 +00:00
David Bastos ce4b7c1eb3 Fix misleading app status after failed override update
Application status was misleading after a failed override update with
illegal values. Application should be in failed (apply-failed) state,
and alarm should be raised accordingly. Instead, we're led to believe
that the update was completed successfully.

The solution consists of adding a default delay to the system of 60
seconds before changing the helmrelease status. This way we ensure
that reconciliation has already been called.

This also ensures that any application can override this default
value via metadata. Just create a variable with the same name with
the amount of time that is needed.
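
One way to picture the default-with-override lookup is the sketch below. The
60-second default and the fluxcd_hr_reconcile_check_delay key are taken from
this commit message; the helper name and the assumption that the key sits at
the top level of the parsed metadata are illustrative only.

```python
# Default delay in seconds, as stated in the commit message.
DEFAULT_RECONCILE_CHECK_DELAY = 60


def get_reconcile_check_delay(app_metadata):
    """Return the per-app delay if the metadata sets one, else the default."""
    metadata = app_metadata or {}
    return metadata.get("fluxcd_hr_reconcile_check_delay",
                        DEFAULT_RECONCILE_CHECK_DELAY)
```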

Test Plan:
PASS: Build-pkgs && build-image
PASS: Upload, apply, delete and update nginx-ingress-controller
PASS: Upload, apply, delete and update platform-integ-apps
PASS: Upload, apply, delete and update metrics-server
PASS: Update user overrides (system user-override-update) with illegal
      values. When reapplying the app it should fail.
PASS: Update user overrides (system user-override-update) with correct
      values. When reapplying the app it should complete successfully.
PASS: If the app has the fluxcd_hr_reconcile_check_delay key in its
      metadata, the system's default delay value must be overwritten.

Closes-Bug: 2053276

Change-Id: I5e75745009be235e2646a79764cb4ff619a93d59
Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
2024-02-16 17:51:14 +00:00
Igor Soares 46f5ccfc55 Update apps during Kubernetes upgrade
Update StarlingX applications during Kubernetes upgrades according to
their metadata.

Applications are updated if they have "k8s_upgrades:auto_update" set to
true on their metadata.yaml file. The ones that have
"k8s_upgrades:timing" set to "pre" are updated during the
"kube-upgrade-start" phase. The ones that set "k8s_upgrades:timing" to
"post" are updated during the "kube-upgrade-complete" phase.

In order to better support application updates during
kube-upgrade-start, two new statuses were added: 'upgrade-starting' and
'upgrade-starting-failed'. The 'upgrade-starting' state is the new
initial state when triggering a Kubernetes upgrade. If starting the
upgrade fails, then the status is updated to 'upgrade-starting-failed'
and users can either abort the upgrade (only available in simplex loads)
or try starting it again. No changes were made to kube-upgrade-complete
in that regard because at that point a new Kubernetes version is already
in place. A review was raised to nfv-vim to reflect the new statuses on
the Kubernetes orchestrated upgrade code:
https://review.opendev.org/c/starlingx/nfv/+/906594

Application auto updates can be retried when restarting a Kubernetes
upgrade that previously failed due to a failing app.

A bug that was preventing the k8s_upgrade metadata section from being
parsed was fixed by this commit as well.
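
The metadata-driven selection described in this commit can be sketched as
below. This is a simplified illustration under assumptions: apps_to_update is
a hypothetical helper, and the metadata is modelled as already-parsed
dictionaries with a nested k8s_upgrades section holding the auto_update and
timing keys named in the commit message.

```python
def apps_to_update(apps_metadata, timing):
    """Return names of apps flagged for auto update at the given timing.

    timing is "pre" (kube-upgrade-start) or "post" (kube-upgrade-complete).
    """
    selected = []
    for name, metadata in apps_metadata.items():
        k8s_upgrades = metadata.get("k8s_upgrades", {})
        if (k8s_upgrades.get("auto_update") is True
                and k8s_upgrades.get("timing") == timing):
            selected.append(name)
    return selected


# Illustrative metadata only; real values live in each app's metadata.yaml.
apps_metadata = {
    "platform-integ-apps": {
        "k8s_upgrades": {"auto_update": True, "timing": "pre"}},
    "metrics-server": {
        "k8s_upgrades": {"auto_update": True, "timing": "post"}},
    "snmp": {
        "k8s_upgrades": {"auto_update": False, "timing": "pre"}},
}
```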

Test Plan:
PASS: build-pkgs -a && build-image
PASS: Create a new platform-integ-apps tarball adding
      "k8s_upgrades:auto_update=true" and "k8s_upgrades:timing=pre" to
      metadata.yaml.
      Add the new tarball to /usr/local/share/applications/helm/.
      Run kube-upgrade-start.
      Check if platform-integ-apps was correctly updated.
      Check if no other apps were updated.
PASS: Create a new metrics-server tarball adding
      "k8s_upgrades:auto_update=true" and "k8s_upgrades:timing=post" to
      metadata.yaml.
      Add the new tarball to /usr/local/share/applications/helm/.
      Run kube-upgrade-complete.
      Check if metrics-server was correctly updated.
      Check if no other apps were updated.
PASS: Create a new platform-integ-apps tarball adding
      "k8s_upgrades:auto_update=true" and "k8s_upgrades:timing=pre" to
      metadata.yaml.
      Add the new tarball to /usr/local/share/applications/helm/.
      Restart sysinv to update the database.
      Replace the platform-integ-apps tarball with another tarball that
      does not have a metadata.yaml file.
      Check if an error is logged when running kube-upgrade-start
      reporting that platform-integ-apps failed to be updated.
      Confirm that the Kubernetes upgrade was not started.
      Abort Kubernetes upgrade
      Check if upgrade was successfully aborted
PASS: Create a new metrics-server tarball adding
      "k8s_upgrades:auto_update=true" and "k8s_upgrades:timing=post" to
      metadata.yaml.
      Add the new tarball to /usr/local/share/applications/helm/.
      Restart sysinv to update the database.
      Replace the snmp tarball with another tarball that does not have
      a metadata.yaml file.
      Check if an error is logged when running kube-upgrade-complete
      reporting that metrics-server failed to be updated.
      Check if the Kubernetes upgrade was marked as completed.
PASS: AIO-SX fresh install
      Manual upgrade to Kubernetes v1.27.5
      Check if upgrade was successfully done
PASS: AIO-SX fresh install
      Orchestrated upgrade to Kubernetes v1.27.5
      Check if upgrade was successfully done
PASS: AIO-SX fresh install with Kubernetes v1.24.4
      Orchestrated upgrade to Kubernetes v1.27.5
      Check if upgrade was successfully done

Story: 2010929
Task: 49416

Change-Id: I31333bf44501c7ad1688635b75c7fcef11513026
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2024-02-13 15:01:54 -03:00
Thiago Miranda caf9de1603 Optimizing image downloads
In this commit, we obtain a list of images already present in
containerd to avoid unnecessary checks and pulls, reducing CPU
consumption.
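
The idea of consulting the locally present images before pulling can be
sketched as follows. How the list is obtained from containerd is not covered
here; images_to_pull and both of its arguments are hypothetical names used
only for illustration.

```python
def images_to_pull(required_images, present_images):
    """Return required images that are missing locally.

    The result preserves the order of required_images and drops duplicates.
    """
    present = set(present_images)
    missing = []
    for image in required_images:
        if image not in present and image not in missing:
            missing.append(image)
    return missing
```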

TEST PLAN:
PASS: Lock/Unlock controllers
PASS: Successfully swact between controllers
PASS: Successfully recover after power down and up both controllers
PASS: Successfully bootstrap (Simplex and Duplex)
PASS: Successfully recover after active controller goes down
PASS: Successfully run the application lifecycle

Story: 2010985
Task: 49228

Change-Id: I58dd11c8d590b60ab100f79a03e17c5921e3721b
Signed-off-by: Thiago Miranda <tmarques@windriver.com>
Co-authored-by: Eduardo Juliano Alberti <eduardo.alberti@windriver.com>
2024-01-16 12:43:29 +00:00
Zuul 63ffee43a0 Merge "Delete Kubernetes resources on application updates" 2023-11-03 15:39:36 +00:00
David Barbosa Bastos 301b401c9a Delete Kubernetes resources on application updates
Within the application update process, it is necessary to
call "kubectl delete -k <manifest_dir>" to delete the
charts that will not be present in the new version of the
application. This way we eliminate unnecessary
remnants of secrets in the N+1 application.

Test Plan:
PASS: Upload/Apply/Remove/Delete cert-manager
PASS: Upload/Apply/Remove/Delete platform-integ-apps
PASS: Platform-integ-app update to the new version with
changed list of the charts
PASS: Secrets no longer used were deleted
PASS: If the update fails, it must remove the secrets that
are in version N+1 and should not be in version N

Closes-Bug: 2040277

Change-Id: I1c281491d30b46a7cbf53211890bb4add021dcc8
Signed-off-by: David Barbosa Bastos <david.barbosabastos@windriver.com>
2023-11-03 14:57:36 +00:00
Zuul 06d1e73c6a Merge "Do not remove charts that are in use by apps" 2023-11-02 18:29:45 +00:00
Igor Soares 80f8754d2b Improve application metadata validation code
Improve metadata validation code to:
    1) Provide better reuse of the typing check code within the
       validation method, since there are a number of duplicated type
       checks in place.
    2) Facilitate reusing the validation code in other parts of the
       system. The validation function is lengthy so it was refactored
       and moved to a new file called app_metadata.py.
    3) Add Kubernetes version validation without incurring huge code
       repetition.
    4) Raise an exception when the k8s_minimum_version section is
       missing from the metadata. This has been introduced and commented
       out for now until at least all default applications have the
       k8s_minimum_version section included on their metadata files.
    5) Raise an exception if the 'auto_update' or 'timing' fields are
       missing from the k8s_upgrades section.
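
The check in item 5 might be sketched as below. This is an illustrative
assumption rather than the contents of app_metadata.py: the helper name is
hypothetical and ValueError stands in for the project's own exception type.

```python
def validate_k8s_upgrades(metadata):
    """Raise ValueError if the k8s_upgrades section is present but incomplete."""
    section = metadata.get("k8s_upgrades")
    if section is None:
        # The section is optional in this sketch; nothing to validate.
        return
    for field in ("auto_update", "timing"):
        if field not in section:
            raise ValueError(
                "k8s_upgrades is missing the '%s' field" % field)
```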

Test Plan:
PASS: build-pkgs -a && build-image
PASS: AIO-SX install
PASS: Upload/apply cert-manager
PASS: Upload/apply platform-integ-apps
PASS: Upload/apply snmp
PASS: Upload/apply a test application that contains the new k8s_upgrade
      metadata section

Story: 2010929
Task: 48948

Change-Id: I6139b8694962855f50c114c4d645b51d7b374f42
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-10-25 14:11:02 +00:00
Igor Soares 8a1852a7fb Do not remove charts that are in use by apps
Prevent charts referenced by multiple StarlingX applications from being
deleted from Helm repositories.

Helm charts are deleted from Helm repositories when applications are
deleted or updated. However, when those charts are in use by other
applications, that deletion should not happen. This commit introduces a
check to cover this scenario and prevent the deletion.
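
A rough sketch of such an in-use check follows. The app_charts mapping and the
charts_safe_to_delete helper are hypothetical; the real code determines chart
usage from its own data sources.

```python
def charts_safe_to_delete(app_name, app_charts):
    """Return the charts of app_name that no other app references.

    app_charts maps each app name to the set of chart tarballs it uses.
    """
    in_use_elsewhere = set()
    for other, charts in app_charts.items():
        if other != app_name:
            in_use_elsewhere |= set(charts)
    return sorted(set(app_charts.get(app_name, ())) - in_use_elsewhere)
```

With two apps sharing snmp-1.0.0.tgz, deleting either one alone yields an
empty list, and the chart only becomes deletable once the other app is gone.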

In addition, the formatting of the error message raised when a chart of
the same version is uploaded was also fixed.

Test Plan:
PASS: build-pkgs -c -p sysinv
PASS: Upload app snmp-1.0-90.tgz containing the snmp-1.0.0.tgz chart.
      Create a sibling app called snmp-clone-1.0-1.tgz containing the
      same chart.
      Upload app snmp-clone-1.0-1.tgz.
      Delete app snmp-1.0-90.tgz.
      Confirm that the snmp-1.0.0.tgz chart still is in the repository.
      Delete app snmp-clone-1.0-1.tgz.
      Confirm that the snmp-1.0.0.tgz chart was removed from the
      repository.
PASS: Upload app snmp-1.0-90.tgz containing the snmp-1.0.0.tgz chart.
      Apply app snmp-1.0-90.tgz.
      Upload app snmp-clone-1.0-1.tgz which contains the same chart.
      Update app snmp-1.0-90 to snmp-1.0-91 which contains the
      snmp-1.0.1.tgz chart.
      Confirm that the snmp-1.0.0.tgz chart still is in the repository.
      Delete app snmp-clone-1.0-1.tgz
      Confirm that the snmp-1.0.0.tgz chart was removed from the
      repository.
      Remove app snmp-1.0-91.
      Delete app snmp-1.0-91.
      Confirm that the snmp-1.0.1.tgz chart was removed from the
      repository.

Story: 2010929
Task: 48882

Change-Id: Ie5307f67726ee2e1e774f22af22b286548dfc78d
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-10-25 09:34:11 -03:00
Igor Soares 120df4573d Enforce Helm chart version uniqueness
During application upload if the incoming chart version already exists
in the target Helm repository but the implementation is different then
reject the upload. Also, if the incoming chart version exists and the
incoming chart is identical to the existing version then just skip the
upload. This relies on the digest and version checks performed
by the helm-upload script.
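
The three outcomes described above (upload, skip, reject) can be sketched as
follows. This is not the helm-upload script: the helper name, the
repo_digests mapping and the choice of SHA-256 as the digest are all
assumptions made for illustration.

```python
import hashlib


def check_chart_upload(version, content, repo_digests):
    """Decide what to do with an incoming chart.

    repo_digests maps chart versions already in the repository to hex
    digests. Returns "upload" or "skip"; raises ValueError when the same
    version exists with different content.
    """
    digest = hashlib.sha256(content).hexdigest()
    existing = repo_digests.get(version)
    if existing is None:
        return "upload"
    if existing == digest:
        return "skip"
    raise ValueError(
        "Chart version %s already exists with different content" % version)
```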

Helm repository management capabilities have been added to the
Application Framework in order to support applications' lifecycle. When
an application is removed, its Helm charts are also removed from the
repository. Similarly, when an application is updated, the charts
related to the previous version are also removed from the repository.
Repositories are re-indexed when charts are deleted and their index
files are updated accordingly.

Test Plan:
PASS: build-pkgs && build-image
PASS: AIO-SX fresh install
PASS: Run "system application-upload vault-1.0-54.tgz"
      Check if the chart was correctly uploaded to the repository
      Run "system application-delete vault"
      Check if the chart was correctly deleted from the repository
      Check if the index.yaml file was regenerated accordingly
PASS: Run "system application-upload vault-1.0-54.tgz"
      Run "system application-apply vault"
      Bump chart version, bump app version and repackage
      Run "system application-update vault-1.0-55.tgz"
      Check if the chart of the previous version was correctly
      removed from the Helm repository
      Check if the index.yaml file was regenerated accordingly
PASS: Run "system application-upload vault-1.0-54.tgz"
      Run "system application-apply vault"
      Bump app version to vault-1.0-55 on metadata.yaml and repackage
      system application-update vault-1.0-55.tgz
      Check that the app was updated and the chart remained untouched
PASS: Run "system application-upload vault-1.0-54.tgz"
      Run "system application-apply vault"
      Change image tag on chart, keep chart version, bump app version on
      metadata and repackage
      Run "system application-update vault-1.0-55.tgz"
      Check that the app update failed

Story: 2010929
Task: 48882

Depends-On: https://review.opendev.org/c/starlingx/integ/+/896870

Change-Id: I421482969f4cc9fd3789309c5c69b9a3f233053a
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-10-16 16:12:23 -03:00
Igor Soares 4a0521d70b Improve AppFwk namespaces handling
This commit improves the application framework namespaces handling by:

1) Removing the requirement for application developers to set a
mandatory namespace on the root level kustomization.yaml. This aims to
uncouple Flux resources namespaces and application specific
namespaces. Establishing a mandatory top level namespace tells
kustomize to put all underlying resources inside that same namespace.
However, Flux resources such as HelmRepository can be on a different
namespace than other resources such as the Chart definition namespace.
An example from Flux's documentation can be seen at [1].
If no namespace is explicitly set for a given resource then the
"default" namespace is assumed following the Kubernetes standard.

2) Allowing plugins to span multiple namespaces within a chart. Even
though the code to generate Helm overrides files for multiple
namespaces within the same chart was already in place, only the
override file named after the same chart's namespace was being copied
over to the system overrides. This behavior is fixed by creating a
single intermediate override file per chart, containing content for all
its namespaces. Single namespaced charts did not have their behavior
changed, which guarantees compatibility with all current apps. No
action is required from application developers unless they want to
start using multiple namespaces per chart. This allows work done on
kubevirt to proceed, since it now requires overrides that span two
namespaces [2].

[1] https://fluxcd.io/flux/faq/#can-i-use-flux-helmreleases-without-gitops
[2] https://review.opendev.org/c/starlingx/app-kubevirt/+/884431/2/python3-k8sapp-kubevirt/k8sapp_kubevirt/k8sapp_kubevirt/helm/kubevirt.py

Test plan:
PASS: build-pkgs -a & build-image
PASS: Full AIO-SX install
PASS: Edit kubevirt manifests to remove the root level kustomization
      namespace. Place the Helm repository and chart definitions on
      flux-helm and kubevirt namespaces respectively.
      Add the CDI namespace to the get_overrides method on the app
      plugin so that two application-specific namespaces are
      available for its single chart. Rebuild the app.
      Upload/apply kubevirt
      Check if the number of pod replicas correctly reflects the
      system overrides.
      Remove/delete kubevirt
PASS: Add user overrides to both kubevirt namespaces changing the number
      of replicas.
      Check if the number of pod replicas correctly reflects the user
      overrides.
PASS: Apply a single namespaced version of kubevirt and update to
      another single namespaced version.
PASS: Apply a single namespaced version of kubevirt and update to a
      multiple namespaced version.
PASS: Apply a multiple namespaced version of kubevirt and update to a
      single namespaced version.
PASS: Apply a multiple namespaced version of kubevirt and update to
      another multiple namespaced version.
PASS: Upload/apply platform-integ-apps
PASS: Upload/apply/remove oidc-auth-apps using user overrides
PASS: Upload/apply/remove/delete snmp
PASS: Upload/apply/remove/delete metrics-server
PASS: Upload/apply/remove/delete vault
PASS: Upload/apply/remove/delete auditd

Closes-Bug: 2025511
Co-Authored-By: David Barbosa Bastos <david.barbosabastos@windriver.com>
Change-Id: I7b66d19c9cd977e6e1a82a613907e794de026cfb
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-08-21 18:18:00 +00:00
David Barbosa Bastos b5db54b0d9 Improve handling kubectl errors
This change corrects unexpected behavior. Before, any warning
in the "kubectl apply/delete -k" command aborted the apply process,
even if the command had been successfully executed. Now, the warning
will be shown, but it will not cancel the apply process.
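
One way to separate warnings from failures is to rely on the exit code rather
than on the presence of stderr output, as in the sketch below. Whether sysinv
distinguishes the two cases this way is an assumption; run_command is a
hypothetical helper, and the Python interpreter is used in place of kubectl
so the example runs anywhere.

```python
import subprocess
import sys


def run_command(cmd):
    """Run cmd and return (stdout, warning).

    Output on stderr is reported as a warning; only a non-zero exit code
    is treated as a failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("Command failed: %s" % result.stderr.strip())
    warning = result.stderr.strip() or None
    return result.stdout, warning


# Stand-in for a command that succeeds but prints a warning to stderr.
stdout, warning = run_command([
    sys.executable, "-c",
    "import sys; sys.stderr.write('Warning: deprecated'); print('applied')",
])
```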

Test Plan:
PASS: Upload/apply/remove/delete kubevirt-app
PASS: Upload/apply/remove/delete metrics-server
PASS: Upload/apply/remove/delete vault
PASS: If "kubectl apply -k" generates a warning the apply process
is not aborted
PASS: If "kubectl apply -k" generates an error, the apply process
is aborted

Closes-bug: 2025511

Change-Id: Ief4bab661aed29ba51f6ea75a1582109741d3599
Signed-off-by: David Barbosa Bastos <david.barbosabastos@windriver.com>
2023-08-01 19:29:00 -03:00
Igor Soares 9c6d1a77eb Improve error handling when loading app plugins
Improve error handling when loading app plugins in scenarios where DRBD
fails to sync the helm overrides folder.

This commit fixes a behavior that, in such scenarios, triggers the
generic app plugin which, in turn, raises the following cryptic error
message: "Automatic apply is disabled".

That behavior was changed to raising a SysinvException while attempting
to load the plugins. In addition, the exception now contains a more
descriptive message, clearly stating that there was a failure while
loading the application plugins, also mentioning the specific plugin
folders that could not be found. That exception is handled and logged
so that sysinv can keep running. If DRBD later succeeds in syncing,
plugins are properly loaded and normal operation can resume.

Test Plan:
PASS: build-pkgs and build-image
PASS: AIO-SX full system deploy
PASS: Simulate the DRBD sync failure by renaming /opt/platform/helm and
      restarting sysinv. Then watch the exception being raised and
      logged referencing the correct plugin folder, rather than
      displaying the "Automatic apply is disabled for
      platform-integ-apps" message. Rename the overrides folder back to
      its original name and check if plugins were correctly loaded.
PASS: Same as above but renaming the folder back and forth while
      sysinv is running.

Closes-Bug: 2024491
Change-Id: Iefa4259fd468a9ae582fc1138b1d1022eba36b0d
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-06-23 15:15:53 -03:00
Joshua Reed 50aa301e6d Correct Log bug in sysinv conductor.
In file kube_app.py, the class FluxCDHelper has a function
called make_fluxcd_operation.  Inside there is a LOG.error
with an invalid python string formatter.  The manifest_dir
string is missing, and this change corrects this issue.

Test Plan:
PASS: build-pkg -a && build-image
PASS: full AIO-SX install
PASS: run system application-upload
          /usr/local/share/applications/helm/security-profiles-operator-22.12-1.tgz
      run system application-list to validate the security
          profile application is uploaded successfully
      run system application-apply security-profiles-operator to deploy
          the application
      Last: observe the corrected log output in /var/log/sysinv.log

Closes-Bug: 2024020
Change-Id: Icfffd04309721193b71654927751b783b9c6ace2
Signed-off-by: Joshua Reed <joshua.reed@windriver.com>
2023-06-15 12:44:10 -07:00
Zuul 20e16db383 Merge "SX host-lock failed by "Timeout while waiting on R"" 2023-06-09 04:51:28 +00:00
Zuul 45814e8e48 Merge "New image parsing pattern that supports "registry"" 2023-06-05 13:10:33 +00:00
David Barbosa Bastos 12bb149f92 New image parsing pattern that supports "registry"
Added support for new "registry" pattern. Image settings
inside charts can now have the following pattern:

image:
   registry: <str>
   repository: <str>
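
As a rough illustration of how such a block could be turned into an image
reference, consider the sketch below. The image_reference helper is
hypothetical, and the optional tag key is an assumption borrowed from common
Helm chart conventions; only registry and repository appear in this commit
message.

```python
def image_reference(image):
    """Build an image reference from a chart's image values.

    Supports both the plain pattern (repository only) and the pattern
    with a separate registry key.
    """
    repository = image["repository"]
    registry = image.get("registry")
    reference = "%s/%s" % (registry, repository) if registry else repository
    tag = image.get("tag")  # assumed optional key, not from the commit
    return "%s:%s" % (reference, tag) if tag else reference
```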

Test Plan:
PASS: Upload and apply process successfully completed with
tarball changed to new pattern using "registry"
PASS: metrics-server, nginx-ingress-controller, vault and
sts-silicom upload and apply process without "registry"
completed successfully.

Closes-bug: 2019730

Change-Id: Id5cadafedf9b85891700dffcede9b0b09ee64359
Signed-off-by: David Barbosa Bastos <david.barbosabastos@windriver.com>
2023-06-01 15:56:56 +00:00
David Barbosa Bastos e2af795583 SX host-lock failed by "Timeout while waiting on R"
When sysinv is restarted and there is an application stuck,
the _abort_operation function was called with a parameter
different from the expected one. The parameter needs to be
an instance of AppOperation.Application.

The function call was changed to pass the correct parameter, and
documentation was added to the _abort_operation function.

Test Plan:
PASS: Restart sysinv successfully
PASS: Restart sysinv with stuck kubevirt app performed
successfully
PASS: Successfully lock and unlock the controller
PASS: Shows the name of the chart that caused the app to abort
PASS: Individually show the image that failed when trying to
apply the app
PASS: Command "system application-abort" executed and output
message "operation aborted by user" displayed in
application-list as expected

Closes-bug: 2022007

Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
Change-Id: I948ec8f9700d188a5f8e099a4992853822735b95
2023-06-01 14:17:08 +00:00
Erickson Silva de Oliveira 74bb9d3bc1 Create lifecycle hook on application recovery
There was no lifecycle hook during recovery.
Thus, in platform-integ-apps it was not possible
to identify when this happened.

To solve this, a new operation constant was created and
the lifecycle hook was triggered inside _perform_app_recover().

Test Plan:
PASS: When forcing recovery, it was possible to observe the
      lifecycle with the "recover" operation (SX/DX/Storage)

Story: 2010688
Task: 48081

Change-Id: I44447ca2246a8461d98f8ea64e2e16c127c357a6
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2023-05-26 14:54:23 +00:00
Zuul f4e310609f Merge "Fix application rollback strategy for Flux" 2023-05-17 19:34:58 +00:00
Zuul 0f5d272208 Merge "Add more info to alarms and progress messages" 2023-05-16 15:21:59 +00:00
Igor Soares fb8fdfe9ab Fix application rollback strategy for Flux
Update application rollback strategy for proper compatibility with
Flux.

This fixes a bug where applications fail to downgrade on DX and Standard
systems due to incompatibility between the previous existing rollback
approach used for Armada and Flux-based apps.

The deprecated code used for rolling back Armada-based applications was
removed. That code was still being called and causing an exception to be
raised due to Armada related chart attributes not being available
anymore.

The rollback to a previous version is now done by applying that
version using "kubectl apply -k <manifest dir>". That way Flux is able
to detect the version we are rolling back to and properly applies it.

Test Plan:
PASS: build-pkgs -a
PASS: build-image
PASS: full AIO-DX install
PASS: update cert-manager-1.0-64.tgz to cert-manager-1.0-65.tgz then
      check with "helm release -A -a" if chart and app versions were
      properly updated.
PASS: downgrade cert-manager-1.0-65.tgz to cert-manager-1.0-64.tgz then
      check with "helm release -A -a" if chart and app versions were
      properly downgraded.
PASS: full AIO-SX install

Closes-Bug: 2019259
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: Ice1e4d58ff228aea1d4d530e4679ee07263d83f9
2023-05-12 17:09:33 -04:00
Robert Church 2785f64e54 AppFrmwk: Cleanup unique helm releases over update
When updating an application some helm releases are unique to a specific
application version. This requires that when an application
successfully or unsuccessfully updates, specific helm releases must be
removed by the framework, as they will not be managed by the new (or old)
version of the application that is being applied during update (or
recovery).

Changes include:
 - When helm releases are cleaned up via delete_helm_release() also
   remove the FluxCD helmrelease CRD so that the helm controller will
   not re-deploy the helm release.
 - Refactor calls to delete_helm_v3_release() to delete_helm_release()
   as helm v2 is no longer supported, so differentiation is irrelevant.
 - Refactor retrieve_helm_releases() by removing the wrapper function
   and renaming retrieve_helm_v3_releases().
 - Refactor HelmTillerFailure exception to HelmFailure. Tiller is no
   longer present in the system as helm v3 is tillerless and the Armada
   pod containing the Tiller container is no longer supported.
 - Fix issue that when an application does not specify any images in any
   chart values.yaml an exception is thrown when applying the
   application due to a null dict being written to the application
   images file.

Test Plan:
PASS - Build, install, deploy AIO-SX
PASS - Build custom platform-integ-apps without the ceph audit chart.
       Perform application update and confirm that the unique helm
       release from the previous application version is properly
       cleaned up.

Closes-Bug: #2019138
Signed-off-by: Robert Church <robert.church@windriver.com>
Change-Id: I3a14f8f6b990351f8415a3fe3ce0b9637672dbcb
2023-05-11 07:57:22 -05:00
David Barbosa Bastos 6a95a0eb42 Add more info to alarms and progress messages
Improved app crash progress messages. It is now possible to
identify failure to download a specific image or if it failed
to download all. It is also possible to check in the progress
column if any specific helmchart failed.

Test Plan:
PASS: Shows specific error message when failing to download all
docker images
PASS: Individually show the image that failed when trying to
apply the app
PASS: Shows the name of the chart that caused the app to abort

Story: 2010736
Task: 47964

Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
Change-Id: If953120852ad7812971adebf23a675ca2134cca1
2023-05-10 19:14:30 +00:00
Fabricio Henrique Ramos c937f46ece Remove armada and helm v2
With the application framework moving to FluxCD,
Armada is no longer supported and its configuration
files and resources are no longer necessary.

The same applies to helm v2 (Tiller) with the system
now using helm v3.

Test Plan:
PASS: app operations still work
PASS: deploy and all system apps uploaded/applied
PASS: deploy unlocked/enabled/available
PASS: upgrade sx
PASS: upgrade dx
PASS: deploy dc
PASS: bnr

Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: I48bf128afa3b85295e83f524c827ced8a5e3da75
2023-03-23 17:19:33 -03:00
Igor Soares 2cba79c276 Recover from pre-upgrade hook timeout
Implement logic to recover from Helm pre-upgrade hook timeout.

In order to successfully recover in this scenario, applications
need to be removed prior to retrying. A new constant was created
to store Helm error messages that can trigger this recovery logic.
More errors can be added to this constant in the future if needed.

The recovery logic leverages the already existing retry mechanism
triggered when an ApplicationApplyFailure exception is raised.
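
A minimal sketch of how such a constant could drive the recovery path.
The constant name, the error string and the helper functions are
assumptions for illustration; only the ApplicationApplyFailure name comes
from the commit message, and the class below is a local stand-in.

```python
# Hypothetical constant: substrings of Helm errors that trigger recovery.
# The entry below is an illustrative example, not the real list.
RECOVERABLE_HELM_ERRORS = (
    "timed out waiting for the condition",
)


class ApplicationApplyFailure(Exception):
    """Local stand-in for the exception that triggers the retry mechanism."""


def is_recoverable(message):
    """Return True if the Helm error message matches a known pattern."""
    return any(err in message for err in RECOVERABLE_HELM_ERRORS)


def handle_helm_error(message, remove_app):
    """Remove the app and raise so that the existing retry logic runs."""
    if not is_recoverable(message):
        return False
    remove_app()  # the application must be removed prior to retrying
    raise ApplicationApplyFailure(message)
```

More patterns can be supported by appending to the tuple without touching
the matching logic.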

Test Plan:
PASS: build-pkgs -p sysinv
PASS: Add pre-upgrade hook to platform-integ-apps running
      "sleep 300" and set the Helm release timeout to 3
      minutes. Then rebuild package, update app, observe the
      hook timeout being triggered, delete flux pods, and
      watch the recovery logic successfully finish updating
      the app.
PASS: upload/apply/remove/delete unmodified snmp app

Closes-Bug: 2011850
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: Ib2cf97ea728e8a9bec4559de04d3731f34f35f1b
2023-03-20 14:05:07 -04:00
Zuul df0fda20d0 Merge "Remove metadata from apps_metadata dict when deleting app" 2023-03-09 15:42:14 +00:00
Joao Victor Portal f5a8ba9f22 Support old Barbican registry secrets
Older versions of StarlingX created Barbican registry secrets as
"text/plain", which are retrieved as "string" in Python 3. This change
adds support for these secrets, which may be used in upgraded systems. In
non-upgraded newer systems, these secrets are created as
"application/octet-stream" and are retrieved as "bytes", to be decoded
using UTF-8 in Python 3.
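
The handling of both payload types can be sketched as follows. The
function name is hypothetical and the snippet does not call Barbican; it
only shows the str/bytes distinction described above.

```python
def secret_payload_to_text(payload):
    """Return the secret payload as text for both content types."""
    if isinstance(payload, bytes):
        # "application/octet-stream" secrets arrive as bytes
        return payload.decode("utf-8")
    # "text/plain" secrets already arrive as str
    return payload
```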

Test Plan:

PASS: Check that the system can retrieve the registries usernames and
passwords when the secrets have the content type as "text/plain".

PASS: Check that the system can retrieve the registries usernames and
passwords when the secrets have the content type as
"application/octet-stream".

Closes-Bug: 2009631
Signed-off-by: Joao Victor Portal <Joao.VictorPortal@windriver.com>
Change-Id: I5a71239b09ef1124449dc66f86ef790e1f23222c
2023-03-07 18:09:17 -03:00
Gabriel de Araújo Cabral b1f44a16d0 Remove metadata from apps_metadata dict when deleting app
When an application is deleted, its information remains
cached in all dicts and lists from apps_metadata, which has this
structure:

    apps_metadata =
        {apps: {},
        platform_managed_apps_list: {},
        desired_states: {},
        ordered_apps: []}

The mentioned behavior was causing a bug after an upgrade from
stx 6 to stx 8, because an app that has no support in stx 8 is
deleted but continues to be cached in apps_metadata. When
k8s_application_audit() runs after the upgrade is complete,
the system uses the cached list to try to upload this app until
the upload-failed state appears in the alarms list.

Now a new method that clears this metadata is called right after
deleting an application, so the app is completely removed from
the system.
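
A minimal sketch of such a cleanup, assuming the keys shown in the
structure above are plain strings. The function name is hypothetical and
the real keys may be constants.

```python
def remove_app_metadata(apps_metadata, app_name):
    """Drop every cached reference to app_name from apps_metadata."""
    for key in ("apps", "platform_managed_apps_list", "desired_states"):
        # pop() with a default is a no-op when the app is not cached
        apps_metadata.get(key, {}).pop(app_name, None)
    ordered = apps_metadata.get("ordered_apps", [])
    # Filter in place so other references to the list see the change
    ordered[:] = [name for name in ordered if name != app_name]
```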

Test-Plan:
  PASS: Upgrade from stx 6.0 to 8.0 in a standard config lab.
  PASS: During the upgrade, an app from the previous version
        which has no support in the next version is automatically
        deleted.
  PASS: Validate that the deleted app is not in any
        collection from apps_metadata.
  PASS: Validate that the upload attempt for the deleted app no
        longer happens.

Closes-Bug: 2009025

Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
Change-Id: I6a54218b398493acc931c5eca34b800383b16cc0
2023-03-07 12:15:29 -05:00
Igor Soares a26ff56776 Promote apps desired state on application apply
Promote application desired state from "uploaded" to "applied" during
application apply and demote during application removal.

If the original application desired state is "uploaded" the
corresponding application metadata field on the database is updated
to "applied" during the apply operation. Correspondingly, the desired
state is reinstated to "uploaded" during removal.

This ensures that applications will remain applied across sysinv
restarts, fixing a bug that caused apps to return to uploaded state
if the conductor was restarted during a reapply operation.

Test Plan:
PASS: build-pkgs
PASS: build-image
PASS: AIO-SX full deployment
PASS: upload/apply/remove/delete platform-integ-apps
PASS: upload/apply/remove/delete snmp
PASS: upload/apply/remove/delete cert-manager with "desired_state"
metadata set as "uploaded"
PASS: restart sysinv-conductor during cert-manager apply/reapply operations

Closes-Bug: 2008014
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: I6b64d3d3983a0571014168124cd61416779d598f
2023-02-24 14:35:25 -05:00
Igor Soares c0f09ff4c0 Check FluxCD pods status before Helm operations
Check helm-controller and source-controller FluxCD pod status
before carrying out helm operations.

If either helm-controller or source-controller pods are not ready
an error message is logged. This aims to provide more information
for troubleshooting app installation issues.

In addition, audit is deferred if FluxCD pods are not ready
replacing the corresponding old logic for Armada.
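
The readiness check can be sketched as below. The input shape (a list of
dicts with "name" and "ready") and the function names are assumptions;
the real code queries Kubernetes for the pod status.

```python
REQUIRED_CONTROLLERS = ("helm-controller", "source-controller")


def not_ready_controllers(pods):
    """Return the required FluxCD controllers with no fully ready pods.

    pods: list of dicts such as {"name": "helm-controller-abc", "ready": True}
    (hypothetical shape used only for this sketch).
    """
    missing = []
    for controller in REQUIRED_CONTROLLERS:
        matching = [p for p in pods if p["name"].startswith(controller)]
        if not matching or not all(p["ready"] for p in matching):
            missing.append(controller)
    return missing


def should_defer_audit(pods):
    """Defer the audit while any required controller is not ready."""
    return bool(not_ready_controllers(pods))
```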

Test Plan:
PASS: build-pkgs
PASS: build-image
PASS: AIO-SX deployment
PASS: Reapply platform-integ-apps
PASS: Apply SNMP app
PASS: Set invalid image for helm-controller and check if audit is deferred
PASS: Set invalid image for source-controller and check if audit is deferred
PASS: Delete helm-controller deployment and check if audit is deferred
PASS: Delete source-controller deployment and check if audit is deferred
PASS: Restore helm-controller and source-controller original deployments
and check if audit proceeds

Story: 2009303
Task: 47244

Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: I21569ec1a20fd86d336fc6e50d5372cb783cf412
2023-02-10 07:47:36 -05:00
Dan Voiculeasa 500fbaa133 AppFwk: Load metadata before clearing stuck application
Restarting sysinv while an application is applying will result in a
wrong reset status. For example cert-manager status is reset to
'apply-failed' instead of 'uploaded'.

When sysinv is restarted, app operations that are in progress are
reset. When apps were decoupled from sysinv [1], a requirement to
have the app metadata loaded was introduced.

Tests on AIO-SX:
PASS: deploy, unlocked enabled available
PASS: forced 'cert-manager' to be 'applying', forced sysinv conductor
restart, observed status was reset to 'uploaded'.

[1]: https://review.opendev.org/c/starlingx/config/+/774292/10/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py#333
Partial-Bug: 2003198
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Ibefc6362c7a7f03571be3cf35b6592cf0c68bca3
2023-01-18 15:48:42 +02:00
Dan Voiculeasa 0127278c09 AppFwk: Fix armada operation during upgrades
An error is seen in the AppFramework during upgrades from N to N+2
side.

Specifically during [1], cert-manager is not properly removed,
preventing the new version apply:
ERROR sysinv.conductor.kube_app [-] Unsupported armada request: remove.

When [2] was introduced one request for app removal to armada/fluxcd
was renamed in one place from APP_DELETE_OP to APP_REMOVE_OP. This
needs to be corrected for armada case to support N to N+2 upgrades.

Allow the armada operation to be named APP_DELETE_OP as it was before.

Tested on AIO-SX upgrade from stx.6.0 to master(soon to be stx.8.0).
Had other patches applied to the system, but will address those
issues later.
PASS: cert-manager updated during upgrade script 64-.

[1]: 09981f9d90/controllerconfig/controllerconfig/upgrade-scripts/64-upgrade-cert-manager.sh (L168)
[2]: https://review.opendev.org/c/starlingx/config/+/866200/5/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py#3346
Story: 2009303
Task: 47135
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Ia57980de2acac7d510e01903c16596b90bee3b4c
2023-01-16 11:29:30 +02:00
Igor Soares 8bcd48f728 Anticipate failure for retries exhausted timeout
Anticipate failure for corner cases in which application apply
operations timeout due to another operation in progress.

Helm resource statuses are parsed in order to match that
specific case and report a failure before the timeout is reached.

Test Plan:
PASS: AIO-SX full build and deployment
PASS: Apply app with no exceptions

Closes-Bug: 2002311
Signed-off-by: Igor Soares <igor.piressoares@windriver.com>
Change-Id: Idd145fe10a9b6b5705f42a2726a42143aa46faed
2023-01-09 12:43:32 -03:00
Igor Soares a3a3319569 Add chart name to FluxCD application apply progress info
Add the name of the last chart applied to FluxCD application apply output
and to the log.

The last chart applied is based on the most recent successful status for
a given release.

Test Plan:
PASS: AIO-SX full deployment
PASS: platform-integ-apps removal and apply
PASS: cert-manager removal and apply

Story: 2009138
Task: 47062
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: I85fa375e11fda78d95ff34857e51cef59eb0fdb4
2022-12-22 13:56:34 -05:00
Dan Voiculeasa 85e3b47912 AppFwk: Remove recovery logic based on spec.suspend
Because of the FluxCD upversion described in [1], we don't need the recovery
logic flipping spec.suspend. Flux is supposed to properly
reconcile the resources.

Remove recovery logic concerned with flipping spec.suspend.
Remove optimizations for triggering reconciliation by flipping
spec.suspend.

Disclaimer for tests:
1) This was applied on top of [1].
2) cert-manager, nginx-ingress-controller, platform-integ-apps had the
reconciliation interval decreased to 1m to allow Flux to manage the
resources by itself in a reasonable time interval.
There will be future commits per app updating reconciliation interval.

Tests on AIO-SX:
PASS: bootstrap
PASS: unlocked enabled available
PASS: apps applied
PASS: inspect flux pod logs for errors
PASS: re-test known trigger for 1996747 and 1995748
PASS: re-test known trigger 1997368

[1]: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
Related-Bug: 1995748
Related-Bug: 1996747
Related-Bug: 1997368
Partial-Bug: 1999032
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I932d85d8b366479b2c1d2c88a0acf7fad219b131
2022-12-07 16:36:51 +02:00
Zuul b1715439ec Merge "appfwk: fix app remove stuck after apply-failed" 2022-12-06 16:13:31 +00:00
Leonardo Fagundes Luz Serrano f5efbfe46d appfwk: fix app remove stuck after apply-failed
At the moment, flux doesn't delete helm releases if they
have a running operation (eg. HR in 'pending-install' status).
That causes the app remove operation to get stuck and timeout due
to some resources not terminating, most commonly after a failed
application apply operation.

Until this flux behaviour gets changed, we need to uninstall
these 'stuck' releases in helm directly.
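
Selecting the releases to uninstall directly can be sketched as follows.
The pending status names are Helm release statuses; the input shape (a
list of dicts with "name" and "status") and the function name are
assumptions for this example.

```python
# Helm release statuses that indicate an operation still in progress
PENDING_STATUSES = frozenset(
    {"pending-install", "pending-upgrade", "pending-rollback"}
)


def stuck_releases(releases):
    """Return the names of releases that flux would not delete."""
    return [r["name"] for r in releases if r.get("status") in PENDING_STATUSES]
```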

Test Plan:
PASS Cause a HR to get stuck by tainting the node before applying
     the app, then successfully remove the app

Closes-Bug: 1998384

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I7466d61f79129b8f70f8a97f8968549f5823d811
2022-12-05 17:57:37 -03:00
Zuul 7879143464 Merge "Add support to preserve app attributes when updating app" 2022-12-05 19:03:28 +00:00
Fabricio Henrique Ramos 1e2073bb1a Add support to preserve app attributes when updating app
Add support to preserve the app attributes from the old version
when updating app to a new version.

The key "maintain_attributes" in the application metadata file
indicates whether the app attributes will be reused during
update. The user can specify --reuse-attributes <true/false>
to override the metadata preference specified by the application.
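
The precedence described above can be sketched as below. The function
name is hypothetical, and the fallback used when the metadata key is
absent is an assumption, not something stated in this commit.

```python
def resolve_reuse_attributes(metadata, cli_value=None, default=False):
    """Decide whether the old app attributes are reused on update.

    cli_value: value of --reuse-attributes, or None if not given.
    default: assumed fallback when the metadata has no preference.
    """
    if cli_value is not None:
        # An explicit user choice overrides the application metadata
        return cli_value
    return bool(metadata.get("maintain_attributes", default))
```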

The database column which stores the app attributes is called
system_overrides (table helm_overrides). When attributes are mentioned
in the code, they refer to the property stored in column system_overrides
in the database. That property is shown to the user as attributes.
The naming confusion will be fixed later.

Test Plan:

PASS: Update app without specifying --reuse-attributes
PASS: Update app without specifying --reuse-attributes, app metadata
      defaults to maintain_attributes=true
PASS: Update app specifying --reuse-attributes false
PASS: Update app specifying --reuse-attributes true
PASS: Disabled helm chart stays disabled with update

Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1998499
Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: I0f9c5c7314deb10f89853c9e5c8e15daf99580ed
2022-12-01 12:23:12 -03:00
Zuul 884a31788a Merge "AppFwk: Add FluxCD recovery logic for apply operation [2]" 2022-11-29 21:09:32 +00:00
Dan Voiculeasa b83b0e70fe AppFwk: Add FluxCD recovery logic for apply operation [2]
Add some robustness to the app framework. It is observed that the
framework can reach a state where helm charts are not uploaded to
HelmRepository. This leads to the app framework waiting for reconciliation
of HelmRepository to be fired. Currently the reconciliation interval
is set to 60 minutes for every app checked.

The issue becomes obvious when updating the app to use newer HelmCharts.
The observed HelmChart status is '''chart pull error: failed to get chart
version for remote reference: no chart name found''', which is a
string the recovery logic will attempt to recover from.

Update recovery logic to trigger a HelmRepository reconciliation
before a HelmChart reconciliation.
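
The ordering can be sketched as below, assuming reconciliation is
requested by setting the reconcile.fluxcd.io/requestedAt annotation on
each resource. The function names and the returned tuple layout are
hypothetical; the sketch only builds the patches and does not call
Kubernetes.

```python
from datetime import datetime, timezone

RECONCILE_ANNOTATION = "reconcile.fluxcd.io/requestedAt"


def reconcile_patch(now=None):
    """Build a metadata patch that requests a reconciliation."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {"metadata": {"annotations": {RECONCILE_ANNOTATION: now.isoformat()}}}


def recovery_steps(repo_name, chart_name, now=None):
    """Return the patches in order: HelmRepository first, then HelmChart."""
    patch = reconcile_patch(now)
    return [
        ("HelmRepository", repo_name, patch),
        ("HelmChart", chart_name, patch),
    ]
```

Reconciling the repository first means its index is refreshed before the
chart is looked up again.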

Skip CentOS testing because we use the same fluxcd and kubernetes.
The only difference is the python kubernetes library, but the
implementation does not use any new API calls.

Tests on AIO-SX Debian:
PASS: AIO-SX unlocked enabled available
PASS: inspect logs to see HelmRepository
      reconciliation is triggered by the recovery logic.

Closes-Bug: 1995748
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I34ae586a5a267b636164d011b5fa5d44ce8c9a6c
2022-11-28 23:52:14 +02:00
Zuul aede2f1492 Merge "AppFwk: Recover apply from helm operation in progress" 2022-11-24 15:56:56 +00:00
Dan Voiculeasa 00c2129a16 AppFwk: Recover apply from helm operation in progress
It is observed that when a helm release is in pending state, another
helm release can't be started by FluxCD. FluxCD will not try to
do steps to apply the newer helm release, but will just error.

This prevents us from applying a new helm release over a release with
pods stuck in Pending state (just an example).

When the specific message for helm operation in progress is detected,
attempt to recover by moving the older releases to failed state.
Move inspired by [1].
To do so, patch the helm secret for the specific release.
As an optimization, trigger the FluxCD HelmRelease reconciliation right
after.
One future optimization we can do is run an audit to delete the helm
releases for which metadata status is a pending operation, but release
data is failed (resource that we patched in this commit).

Refactor HelmRelease resource reconciliation trigger, smaller size.

There are upstream references related to this bug, see [2] and [3].

Tests on Debian AIO-SX:
PASS: unlocked enabled available
PASS: platform-integ-apps applied
after reproducing error:
PASS: inspect sysinv logs, see recovery is attempted
PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
triggered part of recovery

[1]: https://github.com/porter-dev/porter/pull/1685/files
[2]: https://github.com/helm/helm/issues/8987
[3]: https://github.com/helm/helm/issues/4558
Closes-Bug: 1997368
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
2022-11-24 12:21:48 +02:00