Commit Graph

598 Commits

Author SHA1 Message Date
Zuul 37beadd020 Merge "Introduce multi-version auto downgrade for apps" 2024-04-18 17:53:43 +00:00
amantri cca5becb65 Implement new certificate APIs
Add an API /v1/certificate/get_all_certs to retrieve all the
platform certs (oidc, wra, adminep, etcd,
service account certs, system-restapi-gui-certificate,
open-ldap, openstack, system-registry-local-certificate,
k8s certs) in a JSON response and use this response to format
the "system certificate-list" output like the "show-certs.sh" output.

Add an API /v1/certificate/get_all_k8s_certs to retrieve all the
tls and opaque certs in a JSON response and use this response to
format the "system k8s-certificate-list" output like the
"show-certs.sh -k" output.

Implement "system certificate-show <cert name>",
"system k8s-certificate-show <cert name>" to show the full
details of the certificate.

Implement filters in the api and cli to show expired and
soon-to-expire certificates.
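
The filter behaviour can be sketched as a tiny URL builder (illustrative only; cert_list_url is not a sysinv function, but the paths and query parameters are the ones exercised in the test cases):

```python
# Hypothetical helper (not part of sysinv) that composes the query URLs
# for the new certificate endpoints and their expiry filters.

BASE = "/v1/certificate"

def cert_list_url(k8s=False, expired=False, soon_to_expiry=None):
    """Build the REST path for get_all_certs / get_all_k8s_certs with
    the optional ?expired=True and ?soon_to_expiry=<N> filters."""
    path = BASE + ("/get_all_k8s_certs" if k8s else "/get_all_certs")
    params = []
    if expired:
        params.append("expired=True")
    if soon_to_expiry is not None:
        params.append("soon_to_expiry=%d" % soon_to_expiry)
    return path + ("?" + "&".join(params) if params else "")
```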

Testcases:
PASS: Verify all the cert values (Residual Time, Issue Date, Expiry Date,
      Issuer, Subject, filename, Renewal) are showing fine for all the
      following cert paths when "system certificate-list" is executed
	  /etc/kubernetes/pki/apiserver-etcd-client.crt
	  /etc/kubernetes/pki/apiserver-kubelet-client.crt
	  /etc/pki/ca-trust/source/anchors/dc-adminep-root-ca.crt
	  /etc/ssl/private/admin-ep-cert.pem
	  /etc/etcd/etcd-client.crt
	  /etc/etcd/etcd-server.crt
	  /etc/kubernetes/pki/front-proxy-ca.crt
	  /etc/kubernetes/pki/front-proxy-client.crt
	  /var/lib/kubelet/pki/kubelet-client-current.pem
	  /etc/kubernetes/pki/ca.crt
	  /etc/ldap/certs/openldap-cert.crt
	  /etc/ssl/private/registry-cert.crt
	  /etc/ssl/private/server-cert.pem
PASS: Verify all the cert values (Residual Time, Issue Date, Expiry Date,
      Issuer, Subject, filename, Renewal) are showing fine for all the
       service accts when "system certificate-list" is executed
          /etc/kubernetes/scheduler.conf
          /etc/kubernetes/admin.conf
	  /etc/kubernetes/controller-manager.conf
PASS: Verify the system-local-ca secret is shown in the output of
      "system certificate-list"
PASS: List ns,secret name in the output of ssl,docker certs if the
      system-restapi-gui-certificate, system-registry-local-certificate
      exist on the system when "system certificate-list" executed
PASS: Apply oidc app verify that in "system certificate-list" output
      "oidc-auth-apps-certificate", oidc ca issuer and wad cert are
      shown with all proper values
PASS: Deploy WRA app verify that "mon-elastic-services-ca-crt",
      "mon-elastic-services-extca-crt" secrets are showing in the
      "system certificate-list" output and also kibana,
      elastic-services cert from mon-elastic-services-secrets secret
PASS: Verify all the cert values (Residual Time, Issue Date, Expiry Date,
      Issuer, Subject, filename, Renewal) are showing fine for all the
      Opaque,tls type secrets when "system k8s-certificate-list" is
      executed
PASS: Execute "system certificate-show <cert name>" for each
      cert in the "system certificate-list" output and
      check all details of it
PASS: Execute "system certificate-list --expired" shows the
      certificates which are expired
PASS: Execute "system certificate-list --soon_to_expiry <N>"
      shows the expiring certificates within the specified
      N days
PASS: Execute "system k8s-certificate-list --expired" shows the
      certificates which are expired
PASS: Execute "system k8s-certificate-list --soon_to_expiry <N>"
      shows the expiring certificates within the specified
      N days
PASS: On DC system verify that admin endpoint certificates are
      shown with all values when "system certificate-list" is
      executed
PASS: Verify the following apis
	/v1/certificate/get_all_certs
        /v1/certificate/get_all_k8s_certs
        /v1/certificate/get_all_certs?soon_to_expiry=<no of days>
        /v1/certificate/get_all_k8s_certs?soon_to_expiry=<no of days>
        /v1/certificate/get_all_certs?expired=True
        /v1/certificate/get_all_k8s_certs?expired=True

Story: 2010848
Task: 48730
Task: 48785
Task: 48786

Change-Id: Ia281fe1610348596ccc1e3fad7816fe577c836d1
Signed-off-by: amantri <ayyappa.mantri@windriver.com>
2024-04-17 14:18:21 -04:00
Zuul a4ab746619 Merge "Update network interface puppet resource gen to support dual-stack" 2024-04-17 15:29:33 +00:00
Zuul 5f4e3a3378 Merge "Adding QAT devices support in sysinv" 2024-04-17 13:49:58 +00:00
Lucas Ratusznei Fonseca ff3a5d2341 Update network interface puppet resource gen to support dual-stack
This change updates the puppet resource generation logic for network
interfaces to support dual-stack.

Change summary
==============

- Aliases / labels
    Previously, each alias was associated with a specific network. Now,
    since more than one address can be associated with the same network,
    aliases are also associated with addresses. The label name is
    now :<network_id>-<address_id>. The network_id is 0 if there's no
    network associated with the alias, which is the case for the base
    interface config or when the address is not associated with a
    network. The address_id is 0 if there's no address associated with
    the alias, which is the case for the base config and for when
    there's no static address associated with the network, i.e. the
    method is DHCP.

- Static addresses
    Previously, interfaces with more than one static address not
    associated with pools would be assigned just the first one. Now,
    an alias config is generated for each address.

- CentOS compatibility
    All the code related to CentOS was removed.

- Duplex-direct mode
    Duplex-direct systems must have DAD disabled for management and
    cluster-host interfaces. The disable DAD command is now generated
    only in the base interface config for all types of interfaces.

- Address pool names
    The change assumes a new standard for address pool names, they will
    be formed by the old names with the suffixes '-ipv4' or '-ipv6'.
    For example: management-ipv4, management-ipv6. Since other systems
    that rely on the previous standard are not yet upgraded to
    dual-stack, the constant DUAL_STACK_COMPATIBILITY_MODE was
    introduced to control resource generation and validation logic in a
    way that assures compatibility. The constant and the conditionals
    will be removed once the other modules are updated. The
    conditionals were implemented more as a way to highlight which
    parts of the code are affected and make the changes easier in the
    future.

- Tests / DB Base
    The base class for tests was updated to generate more consistent
    database states. Mixins for dual-stack cases were also created.

- Tests / Interface
    Most of the test functions in the class InterfaceTestCase caused
    unnecessary updates to the database and the context. The class
    was split in two, the first one containing the tests that only
    need the basic database setup (controller, one interface
    associated with the mgmt network), and the other one for the tests
    that need different setups.
    A new fixture was created to test multiple system configs (IPv4,
    IPv6, dual-stack), which inspects in detail the generated
    hieradata. The tests associated with the InterfaceHostV6TestCase
    were moved to the new fixture, and new ones were introduced.
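
The alias label scheme described above can be sketched as follows (a hedged illustration; alias_label is not the actual puppet resource code):

```python
# Illustrative sketch of the alias label naming described above:
# ":<network_id>-<address_id>", where id 0 means "no network" (base
# config, or address without a network) or "no address" (base config,
# or DHCP method with no static address).

def alias_label(network_id=0, address_id=0):
    return ":%d-%d" % (network_id, address_id)

# Base interface config: no network and no address associated.
BASE_CONFIG_LABEL = alias_label()
```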

Test plan
=========

Online setup tests
------------------

System: STANDARD (2 Controllers, 2 Storages, 1 Worker)

Stack setups:
  - Single stack IPv4
  - Single stack IPv6
  - Dual stack, primary IPv4
  - Dual stack, primary IPv6

[PASS] TC1 - Online setup, regular ethernet
    mgmt0 (Ethernet) -> PXEBOOT, MGMT, CLUSTER_HOST

[PASS] TC2 - Online setup, VLAN over ethernet
    pxe0 (Ethernet) -> PXEBOOT
    mgmt0 (VLAN over pxe0) -> MGMT, CLUSTER_HOST

[PASS] TC3 - Online setup, bonding
    mgmt0 (Bond) -> PXEBOOT, MGMT, CLUSTER_HOST

[PASS] TC4 - Online setup, VLAN over bonding
    pxe0 (Bond) -> PXEBOOT
    mgmt0 (VLAN over pxe0) -> MGMT, CLUSTER_HOST

Installation tests
------------------

Systems:
  - AIO-SX
  - AIO-DX
  - Standard (2 Controllers, 2 Storages, 1 Worker)

[PASS] TC5 - Regular installation on VirtualBox, IPv4

[PASS] TC6 - Regular installation on VirtualBox, IPv6

Data interface tests
--------------------

System: AIO-DX

Setup:
    data0 -> Ethernet, ipv4_mode=static, ipv6_mode=static
    data1 -> VLAN on top of data0, ipv4_mode=static, ipv6_mode=static

For both interfaces, the following was performed:

[PASS] TC7 - Add static IPv4 address
[PASS] TC8 - Add static IPv6 address
[PASS] TC9 - Add IPv4 route
[PASS] TC10 - Add IPv6 route
[PASS] TC11 - Remove IPv4 route
[PASS] TC12 - Remove IPv6 route
[PASS] TC13 - Remove static IPv4 address
[PASS] TC14 - Remove static IPv6 address

Story: 2011027
Task: 49815
Change-Id: Ib9603cbd444b21aefbcd417780a12c079f3d0b0f
Signed-off-by: Lucas Ratusznei Fonseca <lucas.ratuszneifonseca@windriver.com>
2024-04-16 16:23:15 -03:00
Igor Soares 1d228bab28 Introduce multi-version auto downgrade for apps
Introduce automatic downgrade of StarlingX applications to the
multiple application version feature.

Auto downgrades are triggered by default in scenarios in which the
applied application bundle is no longer available under the
applications folder but an older version of the same app is. For
instance, when platform patches are removed and a previously available
ostree is deployed, thus restoring the old set of available apps under
the /usr/local/share/applications/helm/ directory.

A new section called 'downgrades' can be added to the metadata.yaml file
to disable the default behavior. For example:

downgrades:
  auto_downgrade: false

When auto downgrades are disabled the current applied version remains
unchanged.
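
The decision reduces to a one-line check over the parsed metadata (hypothetical function name; metadata shown as an already-parsed dict):

```python
# Auto downgrade is the default; it is suppressed only when the
# bundle's metadata.yaml carries downgrades.auto_downgrade: false.

def auto_downgrade_allowed(metadata):
    return bool(metadata.get("downgrades", {}).get("auto_downgrade", True))
```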

Test plan:
PASS: build-pkgs -a && build-image
PASS: AIO-SX fresh install.
PASS: Apply platform-integ-apps.
      Update platform-integ-apps using a tarball that is not available
      under /usr/local/share/applications/helm/ and that does not
      contain the downgrade section.
      Confirm that platform-integ-apps is downgraded.
PASS: Apply platform-integ-apps.
      Update platform-integ-apps using a tarball that is not available
      under /usr/local/share/applications/helm/ and that has the
      auto_downgrade metadata option set to 'true'.
      Confirm that platform-integ-apps is downgraded.
PASS: Apply platform-integ-apps.
      Update platform-integ-apps using a tarball that is not available
      under /usr/local/share/applications/helm/ and that has the
      auto_downgrade metadata option set to 'false'.
      Confirm that the originally applied platform-integ-apps version
      remains unchanged.
PASS: Run a kubernetes upgrade with apps to be pre and post updated.
      Confirm that apps are successfully updated and not downgraded
      after the Kubernetes upgrade has finished.

Story: 2010929
Task: 49847

Change-Id: I33f0e0a5b8db128aef76fb93ba322364881097cf
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2024-04-15 12:56:05 -03:00
Md Irshad Sheikh 463165eca8 Adding QAT devices support in sysinv
This commit adds code to auto-discover QAT devices with ids 4940 & 4942
and list them as part of the system host-device-list command.

Also, the host-device-modify command has been modified to not allow
any QAT device configuration due to upstream qat_service code
limitations. QAT devices are already initialized with the maximum VF
number and other default configurations during bootstrap, so
no further modification is required.

TEST CASES:

PASSED: The development iso should be successfully deployed.
        And QAT devices should get listed using
        host-device-list command.

PASSED: system host-device-modify command should raise an error
        when trying to edit any QAT configuration.

PASSED: system host-device-show command should show all default
        QAT device configurations.

Story: 2010604
Task: 49701

Change-Id: Id6b00b9e69b233d513e42375d5f8196ddd745e20
Signed-off-by: Md Irshad Sheikh <mdirshad.sheikh@windriver.com>
2024-04-03 07:51:28 -04:00
Tara Subedi 933d3a3a73 Report port and device inventory after the worker manifest
This is incremental fix of bug:2053149.
Upon network boot (first boot) of a worker node, the agent manager is
supposed to report ports/devices without waiting for the worker
manifest, as that never runs on first boot. Without this, after a
system restore, the compute node cannot be unlocked due to the sriov
config update.

The kickstart records first boot as "/etc/platform/.first_boot", and
the agent manager deletes this file. If the agent manager crashes, it
starts again; this time it does not see the .first_boot file, does not
know this is still the first boot, and does not report inventory for
the worker node.

This commit fixes this issue by creating the volatile file
"/var/run/.first_boot" before deleting "/etc/platform/.first_boot";
the agent relies on both files to figure out whether this is the first
boot. This preserves the same logic across multiple crashes/restarts
of the agent manager.
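
The two-flag handshake can be sketched like this (an in-memory set stands in for the filesystem; helper names are illustrative):

```python
PERSISTENT = "/etc/platform/.first_boot"   # written by kickstart
VOLATILE = "/var/run/.first_boot"          # created by the agent

def consume_first_boot(files):
    """Create the volatile marker *before* deleting the persistent one,
    so a crashed and restarted agent still sees first boot."""
    if PERSISTENT in files:
        files.add(VOLATILE)
        files.discard(PERSISTENT)

def is_first_boot(files):
    return PERSISTENT in files or VOLATILE in files
```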

TEST PLAN:
PASS: AIO-DX bootstrap has no issues. lock/unlock has no issues.
PASS: Network-boot worker node, before doing unlock, restart agent
      manager (sysinv-agent), check sysinv.log to see ports are reported.

Closes-Bug: 2053149
Change-Id: Iace5576575388a6ed3403590dbeec545c25fc0e0
Signed-off-by: Tara Nath Subedi <tara.subedi@windriver.com>
2024-03-26 10:37:56 -04:00
Saba Touheed Mujawar 4c42927040 Add retry robustness for Kubernetes upgrade control plane
A rare intermittent failure behaviour can occur during the upgrading
control plane step, where puppet hits its timeout before the upgrade
is completed or kubeadm hits its own Upgrade Manifest timeout (at 5m).

This change retries the process by reporting failure to the conductor
when the puppet manifest apply fails. Since it uses RPC to send
messages with options, we don't get the return code directly and
hence cannot use a retry decorator. So we use the sysinv report
callback feature to handle the success/failure path.
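
A minimal sketch of the callback-driven retry (names and retry count are illustrative; the real state lives in the conductor):

```python
class ControlPlaneUpgradeRetry:
    """Track manifest-apply failures reported back to the conductor and
    decide between retrying and marking the upgrade step failed."""
    MAX_RETRIES = 2

    def __init__(self):
        self.attempts = 0

    def on_report(self, success):
        # Invoked from the sysinv report callback, since the RPC cast
        # that applies the manifest returns no status code directly.
        if success:
            return "done"
        if self.attempts < self.MAX_RETRIES:
            self.attempts += 1
            return "retry"
        return "failed"
```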

TEST PLAN:
PASS: Perform simplex and duplex k8s upgrade successfully.
PASS: Install iso successfully.
PASS: Manually send STOP signal to pause the process so that
      puppet manifest timeout and check whether retry code works
      and in retry attempts the upgrade completes.
PASS: Manually decrease the puppet timeout to very low number
      and verify that code retries 2 times and updates failure
      state
PASS: Perform orchestrated k8s upgrade, Manually send STOP
      signal to pause the kubeadm process during step
      upgrading-first-master and perform system kube-upgrade-abort.
      Verify that upgrade-aborted successfully and also verify
      that code does not try the retry mechanism for
      k8s upgrade control-plane as it is not in desired
      KUBE_UPGRADING_FIRST_MASTER or KUBE_UPGRADING_SECOND_MASTER
      state
PASS: Perform manual k8s upgrade, for k8s upgrade control-plane
      failure perform manual upgrade-abort successfully.
      Perform Orchestrated k8s upgrade, for k8s upgrade control-plane
      failure after retries nfv aborts automatically.

Closes-Bug: 2056326

Depends-on: https://review.opendev.org/c/starlingx/nfv/+/912806
            https://review.opendev.org/c/starlingx/stx-puppet/+/911945
            https://review.opendev.org/c/starlingx/integ/+/913422

Change-Id: I5dc3b87530be89d623b40da650b7ff04c69f1cc5
Signed-off-by: Saba Touheed Mujawar <sabatouheed.mujawar@windriver.com>
2024-03-19 08:49:36 -04:00
Zuul a396dff37c Merge "Prevent configuring the Dell Minerva NIC VFs" 2024-03-11 17:28:01 +00:00
Zuul 6c3df45f05 Merge "Report port and device inventory after the worker manifest" 2024-03-11 16:19:09 +00:00
Zuul b9ab073997 Merge "Upgrade changes to support MGMT FQDN" 2024-03-08 19:56:49 +00:00
Zuul c24d0950bc Merge "Fix delete process to apps that have charts disabled" 2024-03-07 13:43:22 +00:00
Zuul 40168bb769 Merge "Add mgmt_ipsec flag handling" 2024-03-06 18:55:18 +00:00
Leonardo Mendes 31ee720a54 Add mgmt_ipsec flag handling
This commit adds mgmt_ipsec flag handling to the IPSec Auth Server
for successful and failed negotiations, following the requirements
below:

- If the negotiation succeeds, the flag needs to be set to "enabled",
  which can then be checked during certificate renewal operation.
- If the negotiation fails, the flag needs to be removed from host
  in negotiation, so that the host can retry the negotiation.
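
The flag transitions can be sketched as follows (illustrative helpers; the real flag is stored in the i_host table's capabilities column):

```python
def start_negotiation(capabilities):
    capabilities["mgmt_ipsec"] = "enabling"

def finish_negotiation(capabilities, success):
    if success:
        # Checked later during certificate renewal operations.
        capabilities["mgmt_ipsec"] = "enabled"
    else:
        # Drop the flag so the host can retry the negotiation.
        capabilities.pop("mgmt_ipsec", None)
```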

Test Plan:
PASS: Full build, system install, bootstrap and unlock DX system w/
      unlocked enabled available status.
PASS: Execute "sudo ipsec-client pxecontroller" command. Open another
      terminal and execute the command "echo "Li69nux*" | sudo -S -u
      postgres psql -d sysinv -c "select capabilities from i_host;""
      to see mgmt_ipsec flag being updated first to "enabling" and
      then updated to "enabled" at the end of operation.
PASS: To simulate a flag removal, first execute the command
      "kubectl delete clusterissuer system-local-ca" and then repeat
      the first test. Observe that the flag will be updated to
      "enabling" and will be removed when an error occurs during
      the process.

Story: 2010940
Task: 49659

Change-Id: I0746cc890b4bf6d3c9722d096b62247652e164d4
Signed-off-by: Leonardo Mendes <Leonardo.MendesSantana@windriver.com>
2024-03-06 11:45:13 -03:00
David Bastos c9b71ebd65 Fix delete process to apps that have charts disabled
When deleting an application that has one chart or more disabled,
the app framework was not able to correctly delete the disabled
charts from the helm repository.

If, after deleting an app, an attempt was made to upload that same
app, a failure would occur, informing that the charts were already
in the helm repository.

The correction consists of using the kustomization-orig.yaml file
instead of kustomization.yaml in the deletion process to list the
enabled and disabled charts.

Another fix was made for the case where an application has the status
"upload failed" and an attempt is made to delete another app. This
caused a Python runtime error because the get_chart_tarball_path
function tried to access a dictionary key that wasn't there.

The solution was to check if the key for that chart exists and only
then try to access it. New logs are added to alert the user if the
chart does not exist.
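
The guard can be sketched as follows (hypothetical signature; the real get_chart_tarball_path operates on the framework's own structures):

```python
import logging

LOG = logging.getLogger(__name__)

def get_chart_tarball_path(chart_paths, chart_name):
    """Return the tarball path for chart_name, or None (with a log
    message) instead of raising KeyError when the chart is absent."""
    if chart_name not in chart_paths:
        LOG.info("Chart %s not found in the helm repository", chart_name)
        return None
    return chart_paths[chart_name]
```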

Test Plan:
PASS: Build-pkgs
PASS: Upload, apply, remove and delete dell-storage
PASS: Upload, apply, remove and delete oidc-auth-apps
PASS: upload, apply, remove and delete metrics-server
PASS: Deletes app that has charts disabled and all charts are
      deleted from the helm repository correctly.
PASS: After deleting and trying to upload the same app, no error
      occurs and the upload and apply process is completed
      successfully.
PASS: Deleting an app with another app with "upload failed"
      status and no Python runtime error occurs

Closes-Bug: 2055697

Change-Id: I22de414e8780fe3691d06bdd015e4c927dcc10f0
Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
2024-03-05 17:20:31 -03:00
Fabiano Correa Mercer d449622f4a Upgrade changes to support MGMT FQDN
The release stx.9 with FQDN support for the MGMT network
uses hieradata with the new pattern:
<hostname>.yaml
But the release stx.8 is still using the old name:
<mgmt_ip>.yaml
During an upgrade, controller-0 wants to update
the <mgmt_ip>.yaml while controller-1 wants to use
the <hostname>.yaml, so it is necessary to change
the code to use/update the right hieradata.
Additionally, during an upgrade the active
controller running the old release can't resolve
the FQDN (i.e. controller.internal); for this
reason the FQDN cannot be used during the
controller-1 upgrade.
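
The file-name selection can be sketched as follows (illustrative; release detection is reduced to a boolean):

```python
def hieradata_file(hostname, mgmt_ip, on_new_release):
    """stx.9 (FQDN-aware) hosts use <hostname>.yaml; hosts still on
    stx.8 keep the old <mgmt_ip>.yaml name during the upgrade."""
    return ("%s.yaml" % hostname) if on_new_release else ("%s.yaml" % mgmt_ip)
```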

Test Plan:
IPv6 AIO-SX fresh install
IPv6 AIO-DX fresh install
IPv4 AIO-SX upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv4 AIO-DX upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv4 STANDARD upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv6 AIO-DX upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv6 DC lab upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)

Story: 2010722
Task: 48609

Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
Change-Id: I555185bea7fadb772a4023b6ecb4379e01e0f16c
2024-03-05 12:42:21 -03:00
Tara Subedi 9c3bf050cd Report port and device inventory after the worker manifest
The SR-IOV configuration of a device is not retained across reboots,
until puppet manifests bind/enable completes. The sysinv-agent should
not report device inventory at any time after it is started, it should
wait until puppet worker manifest completes. Though during bootstrap
(fresh install), restore, network-boot and subsequent reboots in case
of non-worker roles (controller, storage) sysinv-agent can report at
any time it is started.

Upon reboot, SR-IOV configuration (of ACC100) (sriov_numvfs=0) is
updated to intended configuration by puppet worker manifest. In this
case, there is a small chance that the sysinv-agent audit (every 60
seconds) will run before the driver configuration. Since the agent will
only actually report the port and device inventory once, the SR-IOV
configuration data is not accurately reflected in the db, thus
requiring additional lock/unlock(s) to force correction.

After fresh install/restore/network boot and reboot, the
/etc/platform/.initial_worker_config_complete and
/var/run/.worker_config_complete files do not exist until the puppet
worker manifest completes. The sysinv-agent audit happened to read
device inventory before the driver configuration (i.e. before the
worker manifest completed), so it was not accurately reflected in the
db.

This commit fixes this so that port and device configuration are only
reported after the worker manifest has completed, in case the host is
being configured with a worker subfunction.
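
The reporting gate can be sketched as a pure function (illustrative names; the flag-file checks are reduced to booleans):

```python
def can_report_inventory(subfunctions, first_boot, worker_manifest_done):
    """First boot always reports (the worker manifest never runs there);
    non-worker hosts report whenever the agent starts; worker hosts
    otherwise wait for the worker manifest to complete."""
    if first_boot or "worker" not in subfunctions:
        return True
    return worker_manifest_done
```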

TEST PLAN:
   PASS: Fresh install node (that has ACC100 device) AIO, check
         host-device-list/show (before config/unlock) to see
         ACC100 device config:: driver:None, vf-driver:None, N:0.

   PASS: After above, update config (ACC100 device config::
         driver:igb_uio, vf-driver:igb_uio, N:1) and also use
         host-label-assign as sriovdp=enabled and unlock, for
         subsequent reboots validate device config as
         (driver:igb_uio, vf-driver:igb_uio, N:1) and validate
         content of /etc/pcidp/config.json.

   PASS: Restore node from backup (ACC100 device config::
         driver:igb_uio, vf-driver:igb_uio, N:1 and also
          host-label-assign as sriovdp=enabled), once node
         come back up, check host-device-list/show for after-boot
         update time and num_vfs = 1. Also validate content of
         /etc/pcidp/config.json.

    PASS: In AIO-DX setup, ports and devices can be listed and
         the second worker node can be unlocked after the
         network-boot.

Closes-Bug: 2053149
Change-Id: I69d483041bd75ea0abbd68cedccfbc5f10062c75
Signed-off-by: Tara Nath Subedi <tara.subedi@windriver.com>
2024-03-01 09:21:21 -05:00
Caio Bruchert 46fc50f419 Prevent configuring the Dell Minerva NIC VFs
Since the Dell Operator will be responsible for VF configuration, the
configuration of this NIC using sysinv is being blocked to prevent user
mistakes. This is valid for the Dell Minerva NICs using Marvell CNF105xx
family devices.

The CNF105xx device IDs were found in the octeon_ep driver source code:
https://github.com/MarvellEmbeddedProcessors/pcie_ep_octeon_host
/drivers/octeon_ep/octep_main.h
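
The check can be sketched as follows (the blocked device-ID set below is a placeholder; the real IDs come from the octeon_ep header cited above):

```python
# Hypothetical sketch: reject pci-sriov for the Marvell CNF105xx family
# while leaving pci-passthrough and other devices untouched.

BLOCKED_SRIOV_DEVICE_IDS = {"cnf105xx-id-placeholder"}

def validate_ifclass_change(ifclass, device_id):
    if ifclass == "pci-sriov" and device_id in BLOCKED_SRIOV_DEVICE_IDS:
        raise ValueError(
            "VF configuration for this NIC is managed by the Dell Operator")
    return True
```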

Test Plan:
PASS: host-if-modify class from none to pci-passthrough: allowed
PASS: host-if-modify class from pci-passthrough to none: allowed
PASS: host-if-modify class from none to pci-sriov w/ VFs: blocked
PASS: host-if-modify class from none to pci-sriov w/ VFs for other
      devices: allowed

Story: 2010047
Task: 49650

Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com>
Change-Id: Ib6a20952060331ac230b01813be28116fbceef36
2024-02-29 13:46:35 -03:00
Zuul 16589b9828 Merge "Remove support for ignoring k8s isolated CPUs in sysinv" 2024-02-27 20:47:11 +00:00
Zuul e6610a898a Merge "Kubernetes periodic audit for cluster health" 2024-02-27 06:33:01 +00:00
Zuul 5815c70a88 Merge "Use mgmt_ipsec in sysinv for ipsec request check" 2024-02-23 17:02:30 +00:00
rakshith mr 1dc7a93f82 Kubernetes periodic audit for cluster health
A periodic audit for the K8S cluster checks the health of the
endpoints: APISERVER, SCHEDULER, CONTROLLER and KUBELET.

The audit will set/clear K8S cluster health alarm every
3 minutes.

Test Plan:
PASS: Trigger K8S cluster down alarm by manually modifying
      /etc/kubernetes/manifests/kube-apiserver.yaml configuration
      file to break the K8s cluster.
      Verify that alarm is raised within 3 minutes.
PASS: Restore the manually modified configuration. This will
      restore K8S service.
      Expect to see the alarm cleared within 3 minutes.
PASS: Fresh install on AIO-SX, and checked alarm audit log.
PASS: With K8S cluster down (for several minutes), initiate platform
      upgrade.
      Verify that k8s health check blocks the upgrade due to 850.002
      alarm.

Story: 2011037
Task: 49534

Depends-On: https://review.opendev.org/c/starlingx/fault/+/907054
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/907345

Change-Id: I958dbb46f151df602030bd2d7576b3b3705b8ca2
Signed-off-by: rakshith mr <rakshith.mr@windriver.com>
2024-02-21 01:56:21 -05:00
Andy Ning a79dae2db6 Use mgmt_ipsec in sysinv for ipsec request check
Currently the ipsec server uses inv_state in sysinv to validate the
auth request from the client. This was found to be problematic. This
change updates the ipsec server to use the "mgmt_ipsec:enabling|enabled"
flag in the capabilities of the i_host table for the validation check.

Also, a minor refactoring moves the flag setting into a function in
the utils module, since the flag setting will eventually be used in
multiple places.

Test Plan:
PASS: DX deployment in VBox. Verify controller-0 and controller-1
      are installed, bootstrap and unlocked successfully with IPSec
      configured and enabled.

Story: 2010940
Task: 49558

Change-Id: I397ea29d73ad8a3b8b8ce5500a4501c7bc2fbfbc
Signed-off-by: Andy Ning <andy.ning@windriver.com>
2024-02-20 15:37:23 -05:00
Zuul 767b30be38 Merge "Add coredump default service parameters" 2024-02-14 21:33:32 +00:00
Heron Vieira 50a658cedd Add coredump default service parameters
Adding coredump process_size_max, external_size_max and
keep_free default service parameters so coredump service is configured
with default values from the start, keeping it explicit for the user
what is configured on a fresh install.

Test plan
PASS: AIO-SX install, bootstrap and initial unlock
PASS: Verify if coredump service parameters are added after initial
      unlock.
PASS: Verify if coredump config file is changed after initial unlock

Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/897856

Closes-bug: 2039064

Change-Id: I13b1c1e0d6c34b34cf6ed3f1cb86c8511ac24b44
Signed-off-by: Heron Vieira <heron.vieira@windriver.com>
2024-02-14 18:41:57 +00:00
Kaustubh Dhokte e276cac428 Remove support for ignoring k8s isolated CPUs in sysinv
As we no longer have any users for this feature, we remove
support for ignoring isolated CPUs. This change removes code
that supports this feature in sysinv.

Test Plan:
AIO-SX:
PASS: Manually create /etc/kubernetes/ignore_isolcpus,
      assign host label 'kube-ignore-isol-cpus=enabled',
      yet a test pod is allocated to the application-isolated
      CPUs.

Story: 2010878
Task: 49571

Change-Id: I21d3319bd967a7a0524e922295fbcc75770a02e6
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
2024-02-14 17:47:15 +00:00
Zuul 15dc296f4a Merge "Optimizing image downloads" 2024-01-16 17:11:37 +00:00
Zuul 9e0c55868d Merge "Introduce support for multiple application bundles" 2024-01-16 14:10:31 +00:00
Thiago Miranda caf9de1603 Optimizing image downloads
In this commit, we obtain a list of images already present in
containerd to avoid unnecessary checks and pulls, reducing CPU
consumption.
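
The skip logic amounts to a set difference (illustrative helper; the real code queries containerd for its image list):

```python
def images_to_pull(required, present):
    """Return only the required images not already in containerd,
    preserving order, so only those are checked and pulled."""
    present_set = set(present)
    return [img for img in required if img not in present_set]
```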

TEST PLAN:
PASS: Lock/Unlock controllers
PASS: Successfully swact between controllers
PASS: Successfully recover after power down and up both controllers
PASS: Successfully bootstrap (Simplex and Duplex)
PASS: Successfully recover after active controller goes down
PASS: Successfully application lifecycle

Story: 2010985
Task: 49228

Change-Id: I58dd11c8d590b60ab100f79a03e17c5921e3721b
Signed-off-by: Thiago Miranda <tmarques@windriver.com>
Co-authored-by: Eduardo Juliano Alberti <eduardo.alberti@windriver.com>
2024-01-16 12:43:29 +00:00
Igor Soares ea00765271 Introduce support for multiple application bundles
Parse the metadata of all application bundles under the helm application
folder and save it to the kube_app_bundle table. This is done during
sysinv startup and when a new ostree commit is deployed.

The auto update logic was changed to enable retrieving metadata from
the database for all available bundles of a given app and compute which
bundle should be used to carry out the upgrade.

The bundle choice is done based on the minimum and maximum Kubernetes
versions supported by the application. If multiple bundles fit that
criteria then the application with the highest version number is chosen.
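
The selection rule can be sketched as follows (hypothetical data shape; versions shown as comparable tuples):

```python
def choose_bundle(bundles, k8s_version):
    """bundles: iterable of (app_version, k8s_min, k8s_max) tuples.
    Keep bundles whose supported range admits the running Kubernetes
    version, then pick the highest application version."""
    eligible = [b for b in bundles if b[1] <= k8s_version <= b[2]]
    return max(eligible, key=lambda b: b[0]) if eligible else None
```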

The 65-k8s-app-upgrade.sh script also takes into account multiple
bundles during the platform upgrade activation step, prioritizing
lowest versions available to ensure compatibility with the Kubernetes
version carried over from the N release. A follow-up change will improve
this mechanism to discover specific app versions.

When platform patches are applied and the ostree is changed then the
content of the helm application folder is reevaluated and the database
updated accordingly if there are new or removed bundles.

Test plan:
PASS: build-pkgs -a && build-image
PASS: Fresh AIO-SX install.
PASS: Fresh AIO-DX install.
PASS: Manually place multiple tarballs of one application with
      different versions under /usr/local/share/applications/helm/
      and check if the app is updated correctly.
PASS: Build a reboot required patch that removes the istio
      bundle and adds a new metrics-server version.
      Apply the reboot required patch.
      Check if istio was removed from the kube_app_bundle table.
      Check if the metrics-server previous version was removed from the
      kube_app_bundle table.
      Check if the metrics-server new version was added to the
      kube_app_bundle table.
      Check if metrics-server was updated to the new version added
      to the database.
PASS: Build a no reboot required patch that does not restart
      sysinv, removes the istio bundle and adds a new metrics-server
      version.
      Apply the no reboot required patch.
      Check if istio was removed from the kube_app_bundle table.
      Check if the metrics-server previous version was removed from the
      kube_app_bundle table.
      Check if the metrics-server new version was added to the
      kube_app_bundle table.
      Check if metrics-server was updated to the new version added
      to the database.
PASS: Build a no reboot required patch that restarts sysinv,
      removes the istio bundle and adds a new metrics-server version.
      Apply the no reboot required patch.
      Check if istio was removed from the kube_app_bundle table.
      Check if the metrics-server previous version was removed from the
      kube_app_bundle table.
      Check if the metrics-server new version was added to the
      kube_app_bundle table and was updated.
      Check if metrics-server was updated to the new version added
      to the database.
PASS: Install power-metrics on stx-8.
      Run platform upgrade from stx-8 placing two different versions of
      metrics-server under /usr/local/share/applications/helm/.
      Check if default apps and metrics-server were properly updated
      during upgrade-activate step.
      Check if power-metrics was auto updated after upgrade-complete
      step.

Story: 2010929
Task: 49097

Change-Id: I46f7cb6ebc59ad49157e9044a4937a406313671e
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2024-01-15 17:49:29 -03:00
Zuul 68859e37db Merge "Create kube_app_bundle table" 2024-01-15 15:19:47 +00:00
Zuul d65514be34 Merge "Steps for kube-upgrade-storage" 2024-01-09 22:25:06 +00:00
Igor Soares ab469de093 Create kube_app_bundle table
This commit creates a new table called kube_app_bundle. This table will
be used to store metadata extracted from StarlingX application bundles.

Database API methods were created to allow bulk inserts into the
table, checking whether it is empty, retrieving entries by
application name, and pruning all data.

A follow-up commit will enable the Application Framework to populate and
retrieve data from the table.
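A minimal sketch of those four access patterns, using an in-memory SQLite table as a stand-in for the real sysinv database layer; the column set and function names are assumptions, not the actual schema.

```python
# Hedged sketch of the kube_app_bundle API methods: bulk insert,
# emptiness check, retrieval by application name, and pruning.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kube_app_bundle (
                    name TEXT, version TEXT, file_path TEXT)""")

def bundle_bulk_insert(rows):
    conn.executemany("INSERT INTO kube_app_bundle VALUES (?, ?, ?)", rows)

def bundle_is_empty():
    return conn.execute(
        "SELECT COUNT(*) FROM kube_app_bundle").fetchone()[0] == 0

def bundle_get_by_name(name):
    return conn.execute(
        "SELECT name, version, file_path FROM kube_app_bundle"
        " WHERE name = ?", (name,)).fetchall()

def bundle_prune_all():
    conn.execute("DELETE FROM kube_app_bundle")
```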

Test plan:
PASS: build-pkgs -a && build-image
PASS: AIO-SX fresh install
      Check if the kube_app_bundle table was created as expected
PASS: AIO-DX fresh install
      Check if the kube_app_bundle table was created as expected
PASS: upgrade from stx-8
      Check if the kube_app_bundle table was created as expected

Story: 2010929
Task: 49097

Change-Id: Ifd10f9e5e4a2d26c42d2b83084e073c7834cd75a
Signed-off-by: Igor Soares <igor.piressoares@windriver.com>
2024-01-08 16:53:40 -03:00
Zuul 4189d9a116 Merge "Avoid self-signed cert creation for HTTPS" 2023-12-18 14:34:54 +00:00
Zuul 2032f761ce Merge "Handling Luks filesystem" 2023-12-12 21:32:37 +00:00
Rahul Roshan Kachchap dea0a20af2 Handling Luks filesystem
Update the is_system_usable_block_device() method to make sure
that LUKS filesystems are ignored by sysinv-agent when detecting
partition-able block devices.
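A hedged sketch of that filter, assuming the agent looks at the filesystem type reported for each block device (as shown by, e.g., `lsblk -o NAME,FSTYPE`); the function name and input shape are illustrative, not the sysinv-agent code.

```python
# LUKS containers report the filesystem type "crypto_LUKS"; such
# devices must not be offered as partition-able disks.
def filter_usable_devices(blockdevices):
    """Keep devices whose fstype is anything other than crypto_LUKS."""
    return [dev["name"] for dev in blockdevices
            if dev.get("fstype") != "crypto_LUKS"]
```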

Depends on: https://review.opendev.org/c/starlingx/config-files/+/903438

Test Plan:
PASS: build-pkgs -c -p sysinv-agent
PASS: build-image
PASS: AIO-DX bootstrap
PASS: No LOG.errors in the sysinv logs stemming
      from the sysinv-agent
PASS: No LUKS filesystem device reported in
      system host-disk-list

Story: 2010872
Task: 49234

Change-Id: I9c8afbb203fbc914021ed25593ab9124df00d599
Signed-off-by: Rahul Roshan Kachchap <rahulroshan.kachchap@windriver.com>
2023-12-12 20:22:24 +00:00
Fabiano Correa Mercer 661ab6480a Updates after the mgmt network reconfiguration
Updates the no_proxy list in the
service-parameter-list during the management
network reconfiguration.

In the first reboot after the management network
reconfiguration, the system will use the new
management IPs and some files, like /etc/hosts,
will be updated.
It is necessary to update the following paths
with the new values:

/opt/platform/sysinv
/opt/platform/config

Additionally, during the first reboot the
system is still using the old mgmt IPs until
apply_network_config.sh and the puppet code update
the system.
The sw-patch services start before or at the same
time as these operations and can use the old
MGMT IPs and fail to answer audit requests.
For this reason it is necessary to restart
these services.

Tests:
IPv6 mgmt network reconfig in subcloud AIO-SX
IPv4 mgmt network reconfig in standalone AIO-SX
AIO-DX Fresh install
AIO-SX Fresh install
AIO-SX IPv4 apply patch after mgmt reconfig

Story: 2010722
Task: 49203

Change-Id: I8a17f50c229a53965e13c889f0ea6ff8efd687c3
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2023-12-07 10:58:18 -03:00
Marcelo Loebens f23b3f1a89 Avoid self-signed cert creation for HTTPS
REST API & Web Server TLS certificate (system-restapi-gui-certificate)
is now installed at bootstrap in the filesystem to be used for HTTPS.

This guarantees that the server-cert.pem is already present upon the
first unlock of the system, removing the need to create a self-signed
cert.

The self-signed cert will only be created if the
system-restapi-gui-certificate does not exist (test scenarios), to
avoid hard failures when switching to HTTPS.

Test plan:
PASS: Deploy an AIO-SX. Verify:
      - system-restapi-gui-certificate TLS cert is correctly installed
        in /etc/ssl/private/server-cert.pem before unlocking the
        controller.
      - HTTPS is enabled and openstack public endpoints change into it
        after unlocking the controller.
      - The target certificates are issued by 'system-local-ca', and
        are managed by cert-manager.
      - The certificates in /etc/ssl/private are correct.
      - It's possible to log into the local Docker Registry.
      - Horizon is working as expected.

PASS: Deploy an AIO-DX. After unlocking controller-1, SSH to it and
      verify that the Rest API / GUI certificate created during
      bootstrap is installed as the file
      '/etc/ssl/private/server-cert.pem'.

Story: 2009811
Task: 48976

Depends-on: https://review.opendev.org/c/starlingx/ansible-playbooks/+/902088

Change-Id: If9aa644898b179fbae2b5248c84c764199bb9b7c
Signed-off-by: Marcelo Loebens <Marcelo.DeCastroLoebens@windriver.com>
2023-12-04 16:29:02 -04:00
Jagatguru Prasad Mishra 0fb91eb62a Block host-unlock till apparmor manifest completes
If the following commands are issued in quick succession,
1. system host-update controller-0 apparmor=enabled
2. system host-unlock controller-0

The puppet runtime manifest, which is executed asynchronously,
will not have enough time to run and the apparmor module won't
get loaded after unlock.

This feature adds reporting of the apparmor runtime
manifest status. The 'in progress' status is persisted
in the i_host table and used to validate host-unlock.
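A hedged sketch of that unlock validation: the apparmor status persisted in the i_host table rejects host-unlock while still 'in progress'. The field name is an assumption; the rejection message follows the test plan below.

```python
# Hypothetical gate: block host-unlock until the asynchronous
# apparmor runtime manifest reports completion.
APPARMOR_IN_PROGRESS = "in progress"

def check_unlock_allowed(host):
    if host.get("apparmor_config_status") == APPARMOR_IN_PROGRESS:
        raise Exception("Can not unlock %s apparmor configuration "
                        "in progress." % host["hostname"])
    return True
```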

Closes-Bug: 2042926

Test plan:
PASS: AIO-DX: Issue host-unlock command soon after
      'system host-update <host> apparmor=enabled' command.
      Verify that host-unlock fails with message 'Can not unlock
      <hostname> apparmor configuration in progress.'
PASS: AIO-DX: Enable/disable the apparmor module on a host using
      host-update command and verify if it is enabled/disabled
      respectively after reboot
PASS: AIO-SX: Enable/disable the apparmor module on a host using
      host-update command and verify if it is enabled/disabled
      respectively after reboot

Change-Id: I8f13ad4316e4edd4a6c73648ee4b06eb379ebe76
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-11-16 02:49:42 -05:00
Zuul 171b6b99ff Merge "Use FQDN for MGMT network" 2023-11-02 19:37:55 +00:00
Zuul 4657131748 Merge "Fix the condition to delete a stuck partition in the database" 2023-11-01 21:11:30 +00:00
Zuul 23578c3f71 Merge "Additional mechanism for unsafe force" 2023-11-01 18:30:19 +00:00
Zuul b4872623a4 Merge "Introduce Kubernetes upgrade metadata for stx apps" 2023-11-01 16:52:42 +00:00
Zuul 1d6ef90409 Merge "Create runtime_config table" 2023-11-01 16:39:03 +00:00
Gabriel de Araújo Cabral 2a39372b51 Fix the condition to delete a stuck partition in the database
The changes in review [1] introduced a condition to delete a
partition from the database when it doesn't exist in the agent
report and, at the same moment, no puppet from the
'platform::partitions::runtime' class is running.

A partition with the status "Creating on unlock" satisfies both
conditions, because the agent won't report it and the puppet that
creates partitions won't be running. This commit changes the
behavior to not delete a partition with this status, because it
will still be created during unlock.

Additionally, a failure was also identified in the check condition
when puppet is running, which was causing the partition to be
deleted incorrectly. To fix this, an in-file flag was implemented
to identify puppet manifest execution.
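The corrected deletion condition can be sketched as below; the names, status string, and the `manifest_applying` parameter (standing in for the in-file flag) are assumptions for illustration.

```python
# Hedged sketch: a DB partition missing from the agent report is only
# pruned when no partition manifest is applying and it is not still
# pending creation on unlock.
STATUS_CREATING_ON_UNLOCK = "Creating on unlock"

def should_delete_partition(partition, agent_reported_uuids,
                            manifest_applying):
    """manifest_applying mirrors the in-file flag set while the
    'platform::partitions::runtime' manifest executes."""
    if partition["uuid"] in agent_reported_uuids:
        return False   # agent still reports it; nothing stuck
    if manifest_applying:
        return False   # puppet may be creating it right now
    if partition["status"] == STATUS_CREATING_ON_UNLOCK:
        return False   # will only exist after the unlock
    return True        # stale row: safe to prune
```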

[1] https://review.opendev.org/c/starlingx/config/+/889090

Test-Plan:
  PASS: AIO-SX fresh install
  PASS: AIO-DX fresh install
  PASS: create/modify/delete a partition in the
        controller-0|1 followed by a reboot and check the status
        with 'system host-disk-partition-list'.
  PASS: Restart of sysinv-conductor and/or sysinv-agent services
        during puppet manifest applying.
  PASS: AIO-SX upgrade stx 7.0 to stx 8.0
  PASS: AIO-SX Backup and Restore

Closes-Bug: 2028254

Change-Id: I2024ab841ca3edbcc140de9b4ea0fbea12044791
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
Signed-off-by: Erickson Silva <Erickson.SilvadeOliveira@windriver.com>
2023-11-01 12:31:55 -03:00
Fabiano Correa Mercer a06a299c84 Use FQDN for MGMT network
The management network is used extensively for all internal
communication.
Since the network was originally private before it was exposed
for external communication in a distributed cloud configuration,
it was never designed to be reconfigured.
To support MGMT network reconfiguration, the idea is to configure
the applications to use the hostname/FQDN instead of a static
MGMT IP address.
In this way the MGMT network can be changed and the services and
applications will still work, since they use the hostname/FQDN
and the DNS is responsible for translating it to the current
MGMT IP address.
The use of FQDN will be applied for all installation modes: AIO-SX,
AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the
complexities of supporting the multi-host reconfiguration,
the MGMT network reconfiguration will focus on support for AIO-SX
only.
The DNSMASQ service must start as soon as possible to translate
the FQDN to IP address.
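A minimal illustration of the design choice above, assuming a helper that builds service endpoints (the hostname and port are illustrative): with endpoints keyed on the management FQDN, a reconfiguration only has to update the DNS records served by dnsmasq, not every consumer.

```python
# Hypothetical sketch: endpoints reference the management FQDN, never
# a literal MGMT IP, so changing the MGMT subnet only changes DNS data.
def build_endpoint(scheme, mgmt_fqdn, port, path=""):
    # DNS (dnsmasq) resolves mgmt_fqdn to whatever the current MGMT
    # IP is, for IPv4 or IPv6 alike.
    return "%s://%s:%d%s" % (scheme, mgmt_fqdn, port, path)
```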
Test plan (Debian only)
 - AIO-SX and AIO-DX virtualbox installation IPv4/IPv6
 - Standard virtualbox installation IPv6
 - DC virtualbox installation IPv4 ( AIO-SX/DX subclouds )
 - AIO-SX and AIO-DX installation IPv4/IPv6
 - AIO-DX plus installation IPv6
 - DC IPv6 and subcloud AIO-SX
 - AIO-DX host-swact
 - DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX
 - AIO-SX to AIO-DX migration
 - netstat -tupl ( no services are using the MGMT IP address )
 - Ran sanity/regression tests
 - Backup and Restore for AIO-SX/AIO-DX

Story: 2010722
Task: 48241

Change-Id: If340354755ec401dac1b0da2c93e278e390f81a9
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2023-10-31 20:45:40 -04:00
Matheus Guilhermino b73ab54bdd Additional mechanism for unsafe force
In some scenarios, a force operation should not override a
protective semantic check, even when --force is used.
To provide a way to bypass those semantic checks completely,
a new "--unsafe" option is introduced.

Whenever an unsafe scenario is identified, with or without using
--force, the following message is displayed in addition to the
specific warning:

"Use --force --unsafe if you wish to lock anyway."

This change includes a bypass for the following scenario (only
one identified so far):

3 hosts in the quorum:
controller-0 unlocked and enabled
controller-1 unlocked and enabled
storage-0 unlocked and enabled
Expected behavior:
Storage-0 is locked
Attempt to lock controller-1 (which is rejected)
Attempt to --force lock controller-1 (which should be rejected)
Attempt to --force --unsafe lock controller-1 (which is allowed)
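The two-flag gate described above can be sketched as follows; the function name, the quorum floor, and the monitor-count parameter are assumptions inferred from the test plan, while the rejection message follows the commit text.

```python
# Hedged sketch: --force alone no longer bypasses the protective
# semantic check; only --force together with --unsafe does.
MIN_STORAGE_MONITORS = 2  # quorum floor assumed from the test plan

def check_lock_allowed(force, unsafe, available_monitors):
    """Reject the lock when it would leave fewer than 2 storage
    monitors, unless both --force and --unsafe were given."""
    if available_monitors - 1 < MIN_STORAGE_MONITORS:
        if force and unsafe:
            return True  # operator explicitly accepted the risk
        raise Exception("Use --force --unsafe if you wish to "
                        "lock anyway.")
    return True
```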

Test Plan:
PASS: Fresh Install and Bootstrap (AIO-SX and Storage)
PASS: Can't lock a controller when only 2 storage
      monitors are available
PASS: Can't force lock a controller when only 2 storage
      monitors are available
PASS: Successfully unsafe lock a controller when only 2 storage
      monitors are available

Closes-bug: 2027685

Change-Id: I1d9a57c472d888b9ffc9bbe3acd87fd77f84fa52
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
2023-10-27 17:12:04 -03:00
Igor Soares 3511174f95 Introduce Kubernetes upgrade metadata for stx apps
This commit handles Kubernetes upgrade related metadata for StarlingX
applications. The metadata retrieved is parsed, validated and
stored into the appropriate variables for future use.

The new metadata section introduced has the following form:
k8s_upgrades:
  auto_update: true/false
  timing: pre/post

This new block aims to inform the Application Framework whether apps
should be automatically updated (auto_update: true/false) if a
Kubernetes upgrade is taking place. It also informs when applications
should be updated, either during kube-upgrade-start (timing: pre) or
during kube-upgrade-complete (timing: post).

In addition, improvements were made to the already existing metadata
section:
supported_k8s_version:
  minimum: <version>
  maximum: <version>

A bug was found on the existing method that checks the supported
Kubernetes version. An exception was being raised when comparing
different formats such as 'v1.0.0' and '1.0.0'. This bug was fixed by
standardizing the formats on the comparison code.
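A sketch of that standardization, assuming the comparison works on numeric tuples after stripping the optional 'v' prefix; the helper names are illustrative, not the actual sysinv code.

```python
# Hedged sketch: normalize 'v1.24.4' and '1.24.4' to the same tuple
# before comparing, so mixed formats no longer raise an exception.
def normalize_k8s_version(version):
    return tuple(int(part) for part in version.lstrip("v").split("."))

def version_in_supported_range(active, minimum, maximum):
    return (normalize_k8s_version(minimum)
            <= normalize_k8s_version(active)
            <= normalize_k8s_version(maximum))
```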

It is not the goal of this commit to implement the logic to check
whether an app should be updated based on the active Kubernetes version.

Test plan:
PASS: Create a test application containing new valid metadata
      Upload the test application
      Apply the test application
PASS: Create a test application without supported_k8s_version:minimum
      Upload the test application
      Check if a warning message was raised on the logs
PASS: Create a test application without supported_k8s_version:minimum
      Move test application tarball to Helm applications folder
      Wait for the auto update process to start
      Check if a warning message was raised on the logs
      Check if the application was successfully updated
PASS: Create a test application without the k8s_upgrades section
      Check if k8s_upgrades:auto_update defaults to true
      Check if k8s_upgrades:timing defaults to false
      Check if a warning message was raised on the logs
      Check if the application was successfully updated

Story: 2010929
Task: 48929

Change-Id: I54362b036b25b6f42a18a2a29e43e2936a8a328d
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
2023-10-25 14:10:45 +00:00
Kyale, Eliud 703592fa1a Block host-unlock till kernel manifest completes
If the following commands are issued in quick succession,
1. system host-kernel-modify controller-0 lowlatency
2. system host-unlock controller-0

The puppet runtime manifest, which is executed asynchronously,
will not have enough time to run and will end up being run
on the next reboot, leading to alarms being raised.

This feature adds reporting of the kernel runtime
manifest status. The 'in progress' status is persisted
in the ihost table and used to validate host-unlock.

Story: 2010731
Task: 48684

Test plan:

PASS - AIO-SX: DM config with kernel: lowlatency
               Verify no kernel config alarms raised
               and lowlatency kernel is running

PASS - AIO-DX: DM config with kernel: lowlatency
               Verify no kernel config alarms raised
               and lowlatency kernel is running

PASS - AIO-DX: Test really fast unlock
               Verify unlock is blocked

Change-Id: I5f30e6f94eae3b287b402a15d1739d61b7d20ca9
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2023-10-18 14:42:50 -04:00