Commit Graph

1028 Commits

Author SHA1 Message Date
Zuul cc679681e2 Merge "Show installed software release in the subcloud" 2024-03-25 15:07:11 +00:00
Zuul 8111d098fc Merge "Copy USM scripts to software dir on fresh install" 2024-03-22 19:59:24 +00:00
Heitor Matsui 3d66eb4c49 Copy USM scripts to software dir on fresh install
This commit copies the USM scripts to a versioned script directory
under /opt/software during a fresh install. In the future, when deploy
precheck is supported for patches, a patch that does not contain a new
precheck script can then fall back to the GA major release precheck.
The versioned scripts also cover the scenario where a user wants to
remove all the patches from the system, reverting to the GA release.
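
A minimal sketch of the copy step in shell, assuming illustrative
variable names and an assumed source path for the packaged scripts:

  # illustrative only: actual variable names and source path may differ
  SW_VERSION="24.03"
  SCRIPT_DIR="/opt/software/rel-${SW_VERSION}/bin"
  mkdir -p "${SCRIPT_DIR}"
  cp -a /usr/sbin/software-deploy/. "${SCRIPT_DIR}/"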

Note: the file for subcloud install, miniboot.cfg, will be covered
      on another commit.

Test Plan
PASS: run "software deploy precheck" for current running release
PASS: run "software deploy precheck" for a patch that does not
      include a precheck script and that uses the GA script instead

Story: 2010676
Task: 49681

Change-Id: I1e6313789204107f56fce8da09f6b785994817ee
Signed-off-by: Heitor Matsui <heitorvieira.matsui@windriver.com>
2024-03-20 18:29:22 -03:00
Luis Eduardo Bonatti dcbeade4d9 Show installed software release in the subcloud
The command "software list" is not returning the correct software
release on subclouds; it returns an empty list instead.

This commit fixes the issue by creating the USM metadata directories
under /opt/software for subclouds as well.

Test Plan:
PASS: Software list returning proper data.

Story: 2010676
Task: 49746

Relates-to: e37f69765e
6c863b3828

Change-Id: I55554709f377b725db535bc32d316a9586996693
Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>
2024-03-20 17:07:31 +00:00
Zuul f8d1d96e75 Merge "Avoid creating non-volatile node locked file while in simplex mode" 2024-03-12 17:01:01 +00:00
Eric MacDonald 3c94b0e552 Avoid creating non-volatile node locked file while in simplex mode
It is possible to lock controller-0 on a DX system before controller-1
has been configured/enabled. Due to the following recent updates, this
can lead to SM disabling all controller services on the now-locked
controller-0, thereby preventing any subsequent controller-0 unlock
attempts.

https://review.opendev.org/c/starlingx/metal/+/907620
https://review.opendev.org/c/starlingx/ha/+/910227

This update modifies the mtce node locked flag file management so that
the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only
created on a locked host after controller-1 is installed, provisioned
and configured.

This prevents SM from shutting down if the administrator locks
controller-0 before controller-1 is configured.
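
A minimal sketch of the gating condition in shell form (the real
implementation lives in the mtce code; the file paths are taken from
this commit message):

  # always create the volatile flag on a locked host
  touch /var/run/.node_locked
  # only persist the non-volatile flag once the system is no longer
  # simplex, i.e. after controller-1 is installed and configured
  if [ ! -f /etc/platform/simplex ]; then
      touch /etc/mtc/tmp/.node_locked
  fi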

Test Plan:

PASS: Verify AIO DX Install.
PASS: Verify Standard System Install.
PASS: Verify Swact back and forth.
PASS: Verify lock/unlock of controller-0 prior to controller-1 config
PASS: Verify the non-volatile node locked flag file is not created
      while the /etc/platform/simplex file exists on the active
      controller.
PASS: Verify lock and delete of controller-1 puts the system back
      into simplex mode where the non-volatile node locked flag file
      is once again not created if controller-0 is then unlocked.
PASS: Verify an existing non-volatile node locked flag file is removed
      if present on a node that is locked without new persist option.
PASS: Verify original reported issue is resolved for DX systems.

Closes-Bug: 2051578
Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-09 12:45:54 +00:00
Zuul a95b4ee8d3 Merge "Fix IPv6 resolution during Upgrade" 2024-03-08 20:32:58 +00:00
Fabiano Correa Mercer c6c308e982 Fix IPv6 resolution during Upgrade
During an upgrade, controller.internal may not
be defined because the active controller is running
an old release, so the IPv6 DNS resolution fails
even on an IPv6 deployment.
An additional IPv6 DNS resolution of 'controller' is
necessary to cover the upgrade scenario.
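
A minimal sketch of the fallback lookup, assuming getent is available
in the kickstart environment and using an illustrative variable name:

  # prefer the new FQDN, fall back to the legacy name during upgrade
  CONTROLLER_IP=$(getent ahostsv6 controller.internal | awk '{print $1; exit}')
  if [ -z "${CONTROLLER_IP}" ]; then
      CONTROLLER_IP=$(getent ahostsv6 controller | awk '{print $1; exit}')
  fi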

Tests done:

IPv6 AIO-SX fresh install
IPv6 AIO-DX fresh install
IPv4 AIO-DX upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv4 STANDARD upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv6 AIO-DX upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)
IPv6 DC lab upgrade from previous release
    without story 2010722 to new release
    that has the story 2010722 (not master)

Story: 2010722
Task: 49644

Depends-On: https://review.opendev.org/c/starlingx/config/+/909866

Change-Id: Ic4acc7432f351a65ba951d8ba8790b0ab90e32bc
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2024-02-29 17:12:07 -03:00
Lucas Ratusznei Fonseca 33a4be37bc Fix reference to LLDP neighbors API path in the documentation
This commit fixes the documentation so that the API path to get LLDP
neighbors reflects the spelling used in the code. The path in the code
is lldp_neighbours, while the documentation used lldp_neighbors.

Test plan

[PASS] API paths in documentation are shown as */lldp_neighbours/*
       instead of */lldp_neighbors/*

Partial-Bug: #2054707
Change-Id: If1b1dc205b398cf050585c995bbc2ce985b29c40
Signed-off-by: Lucas Ratusznei Fonseca <lucas.ratuszneifonseca@windriver.com>
2024-02-22 16:32:47 -03:00
Zuul 880df77a25 Merge "Add ipsec auth server pmon configuration" 2024-02-20 17:35:21 +00:00
Eric MacDonald d9982a3b7e Mtce: Create non-volatile backup of node locked flag file
The existing /var/run/.node_locked flag file is volatile, meaning it
is lost over a host reboot, which has DOR (Dead Office Recovery)
implications.

Service Management (SM) sometimes selects and activates services
on a locked controller following a DOR.

This update is part one of a two-part update that solves both
of the above problems. Part two is a change to SM in the ha git.
This update can be merged without part two.

This update maintains the existing volatile node locked file because
it is looked at by other system services. So to minimize the change
and therefore patchback impact, a new non-volatile 'backup' of the
existing node locked flag file is created.

This update incorporates modifications to the mtcAgent and mtcClient,
introducing a new backup file and ensuring their synchronized
management to guarantee their simultaneous presence or absence.

Note: A design choice was made to manage two real files rather than
      symlink one to the other, to avoid adding symlink management
      support to the code. This approach was chosen for its simplicity
      and reliability in directly managing both files. At some point in
      the future the volatile file could be deprecated, contingent upon
      identifying and updating all services that directly reference it.

This update also removes some dead code that was adjacent to my update.

Test Plan: This test plan covers the maintenance management of
           both files to ensure they always align and the expected
           behavior exists.

PASS: Verify AIO DX Install.
PASS: Verify Storage System Install.
PASS: Verify Swact back and forth.
PASS: Verify mtcClient and mtcAgent logging.
PASS: Verify node lock/unlock soak.

Non-volatile (Nv) node locked management test cases:

PASS: Verify Nv node locked file is present when a node is locked.
      Confirmed on all node types.
PASS: Verify any system node install comes up locked with both node
      locked flag files present.
PASS: Verify mtcClient logs when a node is locked and unlocked.
PASS: Verify Nv node locked file present/absent state mirrors the
      already existing /var/run/.node_locked flag file.
PASS: Verify node locked file is present on controller-0 during
      ansible run following initial install and removed as part
      of the self-unlock.
PASS: Verify the Nv node locked file is removed over the unlock
      along with the administrative state change prior to the
      unlock reboot.
PASS: Verify both node locked files are always present or absent
      together.
PASS: Verify node locked file management while the management
      interface is down. File is still managed over cluster network.
PASS: Verify node locked file management while the cluster interface
      is down. File is still managed over management network.
PASS: Verify behavior if the new unlocked message is received by an
      mtcClient process that does not support it; an unknown command
      log is produced.
PASS: Verify a node locked state is auto corrected while not in a
      locked/unlocked action change state.
      ... Manually remove either file on locked node and verify
          they are both recreated within 5 seconds.
      ... Manually create either node locked file on unlocked worker
          or storage node and verify the created files are removed
          within 5 seconds.
          Note: doing this to the new backup file on the active
                controller will cause SM to shutdown as expected.
PASS: Verify Nv node locked file is auto created on a node that
      spontaneously rebooted while it was unlocked. During the
      reboot the node was administratively locked.
      The node should come online with both node locked files present.

Partial-Bug: 2051578
Change-Id: I0c279b92491e526682d43d78c66f8736934221de
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 00:54:11 +00:00
Zuul 739c508e92 Merge "Add a wait time between http request retries" 2024-02-13 16:45:26 +00:00
Andy Ning 1f507e0e62 Add ipsec auth server pmon configuration
This update adds the ipsec auth server pmon configuration file to the
mtce-control package. The pmon configuration file is only needed on
controller nodes, as ipsec-server runs on controllers only.
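
A hypothetical illustration of the kind of entries such a pmon
configuration file contains; the actual keys and values are defined by
the real file shipped in mtce-control and may differ:

  [process]
  process  = ipsec-server   ; monitored process name (assumed)
  service  = ipsec-server   ; service to restart on failure (assumed)
  severity = major          ; alarm severity (assumed)
  restarts = 3              ; restart attempts before degrade (assumed)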

Test Plan:
PASS: In a deployed system, verify ipsec-server is running
PASS: kill the ipsec-server process, verify that it is started
      by pmon.

Story: 2010940
Task: 49484

Co-Authored-By: Andy Ning <andy.ning@windriver.com>

Change-Id: Iadb9ca6f086640d008880a21cfd97256b00ab7ab
Signed-off-by: Leonardo Mendes <Leonardo.MendesSantana@windriver.com>
2024-02-09 16:05:18 -03:00
Eric MacDonald 191c0aa6a8 Add a wait time between http request retries
Maintenance interfaces with sysinv, sm and the vim using http requests.
Request timeouts have an implicit delay between retries. However,
command failures or outright connection failures don't.

This has only become obvious in mtce's communication with the vim,
where a process startup timing change appears to leave the vim not yet
ready to handle commands by the time the mtcAgent, started as part of
a platform services group startup by sm, begins sending them.

This update adds a 10 second http retry wait as a configuration option
to mtc.conf. The mtcAgent loads this value at startup and uses it
in a new HTTP__RETRY_WAIT state of the http request work FSM.

The number of retries remains unchanged. This update is only forcing
a minimum wait time between retries, regardless of cause.
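
A sketch of how the new option might appear in mtc.conf; the actual
option name is not stated here and is assumed:

  ; assumed mtc.conf fragment
  http_retry_wait = 10   ; seconds to wait between http request retries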

Failure path testing was done using Fault Insertion Testing (FIT).

Test Plan:

PASS: Verify the reported issue is resolved by this update.
PASS: Verify http retry config value load on process startup.
PASS: Verify updated value is used over a process -sighup.
PASS: Verify default value if new mtc.conf config value is not found.
PASS: Verify http connection failure http retry handling.
PASS: Verify http request timeout failure retry handling.
PASS: Verify http request operation failure retry handling.

Regression:

PASS: Build and install ISO - Standard and AIO DX.
PASS: Verify http failures do not fail a lock operation.
PASS: Verify host unlock fails if its http done queue shows failures.
PASS: Verify host swact.
PASS: Verify handling of random and persistent http errors involving
      the need for retries.

Closes-Bug: 2047958
Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-07 20:33:01 +00:00
Andy Ning 4af50a10bd Create a first controller flag during kickstart
This change updates the kickstart to create a .first_controller flag.
The flag is checked by controller_config to skip IPSec config and
enablement on the very first controller, where IPSec is instead
configured and enabled by the ansible bootstrap.
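
A minimal sketch of the two sides of the flag, assuming a hypothetical
flag path (the exact location is not stated in this commit message):

  # kickstart side: mark the very first controller
  touch /etc/platform/.first_controller        # assumed path

  # controller_config side: skip IPSec setup on that controller
  if [ -f /etc/platform/.first_controller ]; then
      echo "first controller: IPSec handled by ansible bootstrap"
  else
      configure_and_enable_ipsec               # placeholder for real steps
  fi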

Test Plan:
PASS: Install an AIO-DX system, verify the .first_controller file flag
      is created for the first controller during kickstart, but not for
      the other controller.
PASS: Verify that IPSec config and enablement is skipped for first
      controller and IPSec is configured and enabled for the second
      controller.

Story: 2010940
Task: 49547

Change-Id: Iea78885483a358dc3d8a296312c0a40c431b7ea5
Signed-off-by: Andy Ning <andy.ning@windriver.com>
2024-02-07 13:54:21 -05:00
Eric MacDonald 19346c232a Add sysinv query for host's mgmt_ip and hostname based on bootif mac
The newly introduced unauthenticated 'mgmt_ip' API in sysinv,
referenced by the 'Depends-On' link below, is used to query the
current host's management IP and hostname.

This information is needed by the kickstart to create a static
'ifcfg-<mgmt interface>' file so that the management interface
is automatically setup on the first boot following a fresh install.

The significance of this step lies in enabling a pre-configuration of
IPsec on the management network before host configuration takes place.

This update also
 - creates a dhcp ifcfg-pxeboot file to automatically update the
   management interface with a pxeboot dhcp address.
 - creates an initial /etc/resolv.conf file with the active
   controller's floating ip and the pxecontroller ip addresses
   which are also needed by the IPSec setup pre-configuration
   following the first reboot after node install.
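
A minimal sketch of the files the kickstart ends up writing, using
illustrative interface names and addresses (the sysinv query itself is
omitted since its endpoint path is not spelled out here):

  # static management interface config from the mgmt_ip query result
  {
      echo "auto enp0s8"
      echo "iface enp0s8 inet static"
      echo "    address 192.168.204.2"
      echo "    netmask 255.255.255.0"
  } > /etc/network/interfaces.d/ifcfg-enp0s8

  # initial resolv.conf needed by the IPsec pre-configuration
  {
      echo "nameserver 192.168.204.1"    # controller floating ip (example)
      echo "nameserver 169.254.202.1"    # pxecontroller ip (example)
  } > /etc/resolv.conf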

Additionally, this update refines terminology: the variable
previously named 'mgmt_dev' has been renamed 'boot_dev' to more
accurately reflect the interface it refers to.

Test Plan:

PASS: Verify AIO DX install of IPV4 and IPV6 with and without vlans
PASS: Verify IPV4 worker node install
PASS: Verify management interface 'ifcfg-<mgmt-if>' file is created
      - cases: with and without vlan
PASS: Verify pxeboot interface 'ifcfg-pxeboot' is created (ipv4 only)
PASS: Verify kickstart fails the install if it receives a badly
      formatted sysinv mgmt_ip query response or None.
PASS: Verify resolv.conf updated with nameserver <floating ip>
PASS: Verify resolv.conf updated with nameserver <pxecontroller ip>
PASS: Verify kickstart logging

Depends-On: https://review.opendev.org/c/starlingx/config/+/901981

Story: 2010940
Task: 49162
Change-Id: I429522305fcff66e5c78195f4bf3c5b82826c1d8
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-06 18:30:08 +00:00
Eric MacDonald 5a3a5ce8ea Stop creating guestServer.conf /etc/pmon.d link
The maintenance guestServer daemon has been deprecated for quite
some time.

However, that deprecation process left the kickstarts creating a
dangling link to a missing guestServer.conf in the /etc/pmon.d
directory.

Pmon just ignores the missing process so there is no service impact.

This update cleans this up by removing the code in the kickstarts
that creates the dangling link.

Test Plan:

PASS: Verify the guestServer.conf dangling link no longer exists
      in worker nodes.
PASS: Verify pmond.log makes no mention of the deprecated guestServer
      process.

Closes-Bug: 2051389
Change-Id: I89a62d939194c65c86e3cf71b238698eb2ee97ed
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-01-30 14:42:03 +00:00
Zuul 25bb6a1dbf Merge "Improve maintenance power/reset control command retry handling" 2024-01-26 14:43:11 +00:00
Eric Macdonald 50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are with
the immediate.

This update also removes a power-on handling case that could
have led to a stuck state. This case was virtually impossible
to hit given the required sequence of intermittent command
failures, but that scenario handling was fixed up anyway.

Issues have been seen with the power-off handling on some servers,
which are suspected to need more time to power off. This update
therefore introduces a 30 second delay following a power-off command,
before issuing the power status query, to give the server time to
power off before the power-off command is retried.
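
A rough sketch of the retry pattern in shell, using ipmitool as an
example backend (BMC host and credentials omitted; the real handler is
the maintenance FSM and also supports Redfish):

  MAX_RETRIES=10
  for try in $(seq 1 ${MAX_RETRIES}); do
      if [ "${try}" -le 5 ]; then
          ipmitool chassis power soft    # graceful for the first 5 tries
      else
          ipmitool chassis power off     # immediate for the last 5 tries
      fi
      sleep 30                           # give the server time to power off
      ipmitool chassis power status | grep -q "is off" && break
  done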

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling  with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
Mohammad Issa efe618fae3 IPv4 subcloud failure due to miniboot config
systemd was unable to bring up the OAM bonded VLAN interface due to
the miniboot config.

The autoconf and accept_ra sysctl parameters do not exist for IPv4.
The accept_redirects parameter does, but in normal operation these
are only set for IPv6.

Testing:
- Build successful
- Successfully bring up IPv4 System Controller & Subcloud
- Verify that ifcfg-<ifname> under interfaces.d shows
  that the sysctl parameters are set to ipv6
- Verify these tests using a miniboot installed IPv4 subcloud

Closes-Bug: 2049683

Change-Id: Ib27f86f707796682a3db6641824f9fdbe2f53534
Signed-off-by: Mohammad Issa <mohammad.issa@windriver.com>
2024-01-22 16:00:56 +00:00
Zuul c5b611f510 Merge "Remove qemu dependency from mtce-compute and mtce-control" 2024-01-08 16:07:52 +00:00
Zuul 1df22635fd Merge "Copy luks.conf to '/etc/pmon.d'" 2023-12-18 21:31:39 +00:00
Jagatguru Prasad Mishra 7436ec965f Copy luks.conf to '/etc/pmon.d'
luks.conf contains the configuration used by pmon to monitor the
luks-fs-mgr service. This change links /usr/share/starlingx/pmon.d
to /etc/pmon.d. Once this is done, pmon starts monitoring the
process: it creates and clears an alarm, tries to restart the service,
and degrades the host in case of multiple failures.
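
A minimal sketch of the linking step as it could appear in a kickstart
(whether the directory or the individual file is linked is an
implementation detail not shown here):

  ln -sf /usr/share/starlingx/pmon.d/luks.conf /etc/pmon.d/luks.conf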

Test Plan:
PASS: build-pkgs -c -p platform-kickstarts
PASS: build-image
PASS: AIO-SX verify if luks.conf is present at /etc/pmon.d
PASS: AIO-DX verify if luks.conf is present at /etc/pmon.d
PASS: AIO-DX alarm should be created if pmon is unable to start the
      service
PASS: AIO-DX pmon.log should contain error messages if service is down.
PASS: Standard: pmon monitoring should not be enabled on compute and
      storage nodes. Ensure the conf file is not present at /etc/pmon.d
      on compute and storage hosts. On controller hosts this file
      should be present at /etc/pmon.d.
PASS: AIO-DX alarm should be cleared and host availability should be
      'available' once service starts running.
PASS: AIO-DX: if the service is unrecoverable, the host is reported to
      the host watchdog.

Story: 2010872
Task: 49250

Depends-On: https://review.opendev.org/c/starlingx/integ/+/903612

Change-Id: I4eeb4b2f79b1bb017a0fe2af34f35854c89dee82
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-12-14 23:01:13 -05:00
Zuul 125601c2f9 Merge "Failure case handling of LUKS service" 2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra 1210ed450a Failure case handling of LUKS service
The luks-fs-mgr service creates and unseals the LUKS volume used to
store keys/secrets. This change handles the failure case where this
essential service is inactive. It introduces an alarm, LUKS_ALARM_ID,
which is raised when the service is inactive, implying an issue in
creating or unsealing the LUKS volume.
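
A minimal sketch of the inactive-service detection in shell form (the
real check lives in the mtce in-service test handler; alarm id 200.016
is taken from the test plan below):

  if ! systemctl is-active --quiet luks-fs-mgr; then
      echo "luks-fs-mgr inactive: raise critical alarm 200.016"
      # the real code raises LUKS_ALARM_ID through the FM interface
  fi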

Test Plan:
PASS: build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
      volume status active. There should not be any alarm and node
      status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
      is inactive. Node availability should be failed. A critical
      alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on the controller-0. Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list'
PASS: Standard- node enable failure on the node (controller-0,
      controller-1, storage-0, compute-1). Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In service volume inactive should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-SX In service volume inactive  status should be detected
      and a critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service
      volume inactive status should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-DX In service: If volume becomes active and a LUKS alarm
      is active, alarm should be cleared. Node availability should
      be changed to available.
PASS: AIO-SX In service: If volume becomes active and a  LUKS alarm is
      active, alarm should be cleared. Node availability should be
      changed to available.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service:
      If volume becomes active and a LUKS alarm is active, alarm
      should be cleared. Node availability should be changed to
      available.
PASS: AIO-SX, AIO-DX, Standard: if the in-service test fails and node
      availability is 'failed', then after fixing the volume issue a
      lock/unlock should make the node available.

Story: 2010872
Task: 49108

Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-12-06 00:34:02 -05:00
Zuul 03f7a1d7ee Merge "Move GA metadata to deployed status" 2023-12-04 18:25:34 +00:00
Davi Frossard dd0eb34208 Remove qemu dependency from mtce-compute and mtce-control
Dependency is necessary only on centos packaging.

Test Plan:
PASS - Build packages.
PASS - Build/install image on AIO-SX.

Depends-On: https://review.opendev.org/c/starlingx/virt/+/885342

Story: 2010781
Task: 48183

Change-Id: I5a6e4a7ba12c83372dd3171e054bf612c1484f7e
Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
2023-12-04 14:19:28 +00:00
Zuul 1332ebb7a7 Merge "Replace a file test from fsmond" 2023-12-04 14:03:11 +00:00
Heitor Matsui 6c863b3828 Move GA metadata to deployed status
This commit changes the GA metadata status on fresh install
to "deployed", following recent changes in technical direction.

Test Plan:
PASS: build and install iso, verify the correct output with
      "software list"

Story: 2010676
Task: 49166

Change-Id: Idbab8655f9f2e4e080f389fa7823f5e6744c4c74
Signed-off-by: Heitor Matsui <heitorvieira.matsui@windriver.com>
2023-11-29 11:59:17 -03:00
Teresa Ho 36814db843 Increase timeout for runtime manifest
In management network reconfiguration for AIO-SX, the runtime manifest
executed during host unlock could take more than five minutes to complete.
This commit extends the timeout period from five minutes to eight
minutes.

Test Plan:
PASS: AIO-SX subcloud mgmt network reconfiguration

Story: 2010722
Task: 49133

Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2023-11-23 16:51:22 -05:00
Heitor Matsui e37f69765e Copy GA metadata file to USM location
This commit switches the GA metadata file copy from
/opt/patching/metadata to /opt/software/metadata.

Test Plan
PASS: build iso, install and verify that "software list"
      lists the GA release

Story: 2010676
Task: 49112

Change-Id: I75b8cd6ae41a9cf9b5af0225ebcaaf0d9e0ddb4e
Signed-off-by: Heitor Matsui <heitorvieira.matsui@windriver.com>
2023-11-21 12:31:19 -03:00
Erickson Silva de Oliveira 16181a2ce8 Replace a file test from fsmond
fsmond tries to create a test file at "/.fs-test", but
this is not possible because "/" is read-only under ostree.

The fix is to replace this path in fsmond monitoring
with /sysroot/.fs_test.

Below is a comparison of the logs:
  - Before change:
  ( 196) fsmon_service : Warn : File (/.fs-test) test failed

  - After change:
  ( 201) fsmon_service : Info : tests passed

Test Plan:
  - PASS: Build mtce package
  - PASS: Replace fsmond binary on AIO-SX
  - PASS: Check fsmond.log output

Closes-Bug: 2043712

Change-Id: Ib4bad73448735bce1dff598151fce86f867f4db7
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2023-11-17 08:15:28 -03:00
Zuul 8d2883aa68 Merge "After executing PXE boot install, turn off IPv6 autoconf" 2023-11-14 20:20:46 +00:00
Andre Kantek 97052df958 After executing PXE boot install, turn off IPv6 autoconf
It was detected that after the PXE boot install the IPv6 autoconf is
left turned on due to an error in the network config file for the PXE
interface: instead of applying the config to the interface, it
configures the loopback.

With autoconf left on, the interface can receive unwanted address
configuration, which can cause errors during the ansible playbook
execution that follows.
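
A minimal sketch of the intended per-interface settings, with an
illustrative interface name:

  # disable stateless address autoconfiguration on the PXE interface
  sysctl -w net.ipv6.conf.enp0s3.autoconf=0
  sysctl -w net.ipv6.conf.enp0s3.accept_ra=0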

Closes-Bug: 2043509

Change-Id: I48584dc6b92fca02205c4774c4624410b6a29ba8
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2023-11-14 16:13:33 -03:00
Teresa Ho e616a4495d Use FQDN for MGMT network in kickstart
With the introduction of the FQDN for MGMT network feature, the DNS
lookup of 'controller' resolves to 'controller.internal'.
The kickstart script uses the DNS lookup of 'controller' to determine
whether the system is using IPv6 or IPv4; the lookup now returns a
string instead of an IP address or a 0 return code. This causes a
problem when installing IPv4 nodes with the management interface
configured over a vlan.

The fix is to use the FQDN controller.internal.

Test plan:
PASS: Install IPv4 AIO-DX with mgmt vlan
PASS: Install IPv6 AIO-DX with mgmt vlan

Story: 2010722
Task: 48682
Closes-Bug: 2042953

Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Change-Id: I5377587c8bc8c62a62f03123cabef7366df3dd94
2023-11-09 16:12:13 +00:00
Eric MacDonald 79d8644b1e Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat failing node early in
          its startup sequence, say over a swact or an SM service
          restart and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC then
          the one-time-reset request is skipped.

Issue #2: When the issue #1 race condition does not occur before BMC
          access is established, the mtcAgent will issue its one-time
          reset to a node. If this occurs as a result of a crashdump
          then this one-time reset can interrupt the collection of
          the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.

To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
almost done recovery. The number of seconds till the pending reset
countdown is logged periodically.

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs then, although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node
recovery-to-online time is longer than the bmc reset delay.

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.

Some consistency driven logging improvements were also implemented.

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
      reboot before bmc reset delay and still seeing mtcAlive on one
      or more links. This handles the cluster-host only heartbeat loss
      case. The node is still rebooted with the bmc reset delay as
      backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-11-02 20:58:00 +00:00
Zuul 3645d5db93 Merge "Update crashDumpMgr to source config from envfile" 2023-10-19 21:01:45 +00:00
Kyle MacLeod e81d0bf4e7 Prestaged ISO: copy ostree_repo to versioned platform-backup
This commit applies to the prestaged ISO install. The kickstart.cfg is
updated to copy the prestaged ostree_repo into the release-specific
/opt/platform-backup/<release> location.
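
A minimal sketch of the copy step, with an illustrative release id and
an assumed source mount for the prestaged ISO content:

  RELEASE="24.03"                              # illustrative
  mkdir -p "/opt/platform-backup/${RELEASE}"
  cp -a /mnt/install/ostree_repo "/opt/platform-backup/${RELEASE}/"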

A minor change is also included in miniboot.cfg to sync the patching
metadata for prepatched ISOs. This fills a potential hole in the
patching metadata sync behaviour identified during testing.
Normally the patching metadata is synchronized from the system
controller down to the subcloud. For the prestaged ISO case, this change
is necessary to ensure the patching metadata is seeded from the
prepatched ISO created via gen-prestaged-iso.sh.

Test Plan
PASS:
- Build prestaged ISO, including container images and a patch
    - Install subcloud using prestaged ISO
    - Verify contents of /opt/platform-backup/<release> are properly
      populated.
    - Verify subcloud is installed using prestaged data from
      /opt/platform-backup/<release>
    - Verify that included container images are installed
- Build prestaged ISO using a pre-patched ISO. Install subcloud, ensure
  that patching metadata is properly synchronized on installation.

Out of scope failure:
- A new bug to be raised for the following:
    - Verify that the included patch is installed on the subcloud
      - It appears that this has never worked in Debian. The --patch
        option makes sense for a Debian installation, since the patches
        are contained in ostree commits. To fully support this
        functionality we need to implement a new mechanism to do a
        sw-patch upload and apply at some point during the installation.
      - Support for the gen-prestaged-iso.sh --patch option will be
        added in a future commit

Closes-Bug: 2039282
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I973f4704eae09634a0c3fe2f7fbc31ac1835fcf8
2023-10-16 10:04:35 -04:00
Zuul 61e01d2000 Merge "Fix kickstarts patching" 2023-10-11 15:10:19 +00:00
Salman Rana d24e48e490 Fix kickstarts patching
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable [1]. As a result, the kickstart install dir
/var/www/pages/feed/rel-${platform_release}/kickstart
is not updated according to patch changes. This directory
is currently only used for PXE boot installs.
Subcloud remote installations use the miniboot.cfg
kickstart from the load-imported ISO
(we may want to change this in some future commit).

This commit adds kickstart update support to
pxeboot-feed.service (pxeboot_feed.sh) so that
/var/www/pages/feed/rel-${platform_release}/kickstarts
is refreshed based on the kickstart dir from
/ostree (i.e., the patched changes).
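
A minimal sketch of the refresh such a script might perform, with an
assumed variable for the kickstart dir inside the ostree deployment:

  SRC="${OSTREE_KICKSTART_DIR}"   # path inside /ostree deployment (assumed)
  DST="/var/www/pages/feed/rel-${platform_release}/kickstart"
  rsync -a --delete "${SRC}/" "${DST}/"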

[1] https://review.opendev.org/c/starlingx/ha/+/890918

Test Plan:
1. PASS: Verify Debian build and DC system install
         (virtual lab - disk and pxe installs)
2. PASS: Verify pxe install (DC remote install) with
         patched kickstart
3. PASS: Create a patch with changes to kickstart feed:
          - modify an existing kickstart
          - create a new kickstart file
          - delete an existing file
          - create a new kickstart sub-directory
          - modify centos subdir
	 verify patch apply, ensure that changes are
         correctly applied to:
         /var/www/pages/feed/rel-${platform_release}/kickstarts
4. PASS: Revert the patch from test #3 and ensure changes
         are correctly undone in the feed dir

Closes-Bug: 2034753

Change-Id: I74804bff23a74512db6a95fa514c84a1a6ea54a8
Signed-off-by: Salman Rana <salman.rana@windriver.com>
2023-10-11 14:40:38 +00:00
Enzo Candotti 23143abbca Update crashDumpMgr to source config from envfile
This commit updates the crashDumpMgr service in order to:
- Cleanup of current service naming and packaging to follow the
  standard Linux naming convention:
    - Repackage /etc/init.d/crashDumpMgr to
      /usr/sbin/crash-dump-manager
    - Rename crashDumpMgr.service to crash-dump-manager.service
- Add EnvironmentFile to crash-dump-manager service file to source
  configuration from /etc/default/crash-dump-manager.
- Update ExecStart of crash-dump-manager service to use parameters
  from EnvironmentFile
- Update crash-dump-manager service dependencies to run after
  config.service.
- Update the logrotate configuration to support the retention policy
  for the maximum number of files. The "rotate 1" option was removed
  to let crash-dump-manager manage pruning of old files.
- Modify the crash-dump-manager script to allow the max_files
  parameter to be lowered. If there are currently more files than the
  new max_files value, the oldest files will be deleted the next time
  a crash dump file needs to be stored, thus adhering to the new
  max_files value.
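
A sketch of how the EnvironmentFile wiring might look; the variable
names, values and ExecStart options are illustrative, not the actual
packaged ones:

  # /etc/default/crash-dump-manager (illustrative)
  MAX_FILES=4
  MAX_SIZE=5G

  # crash-dump-manager.service excerpt (illustrative)
  # [Service]
  # EnvironmentFile=-/etc/default/crash-dump-manager
  # ExecStart=/usr/sbin/crash-dump-manager --max-files ${MAX_FILES} --max-size ${MAX_SIZE}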

Test Plan:

PASS: Build ISO and perform a fresh install. Verify the new
crash-dump-manager service is enabled and working as expected.
PASS: Add and apply new crashdump service parameters and force a kernel
panic. Verify that after the reboot, the max_files, max_used,
min_available and max_size values are updated according to the service
parameter values.
PASS: Verify that the crashdump files are rotated as expected.

Story: 2010893
Task: 48910

Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-10-06 23:06:54 +00:00
Zuul df8989e2a1 Merge "Set longer shutdown time and fix power state error log" 2023-10-05 21:29:12 +00:00
Li Zhu bfbaba5731 Set longer shutdown time and fix power state error log
1. Extended the timeout to 14 minutes to accommodate the longer
   shutdown time.
2. Fixed the power state error log so that it logs the requested state
   instead of the current power_state.

Test Plan:

PASS: Verify logged version is 2.2
PASS: Verify success path with no FIT delay ; HP and ZT servers
PASS: Verify timing of the loop with timeout of 14 minutes
PASS: Verify shutdown timeout handling when shutdown exceeds 14
      minutes.
PASS: Verify install completes successfully when Power Off takes
      close to but less than 14 minutes
PASS: Verify power state failure log reports proper state

Closes-Bug: 2038484

Signed-off-by: Li Zhu <li.zhu@windriver.com>
Change-Id: Ic99a06dca9962fcae43b20e00d8ebcb127a80560
2023-10-05 17:12:19 -04:00
Zuul f6ab5912b3 Merge "Wipe all LVs during kickstart" 2023-09-27 16:27:51 +00:00
Gustavo Ornaghi Antunes 00b313de49 Wipe all LVs during kickstart
Backup and restore is not completing because the manifest is
not applied when drbd-cephmon tries to become primary.
This is occurring because the LVs are not being wiped before
being removed, so leftover data prevents drbd-cephmon from
becoming primary and causes the manifest apply to fail.

To ensure that drbd-cephmon becomes primary on the first unlock,
the LVs are wiped before being recreated during the kickstart
procedure.
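
A minimal sketch of the wipe step, assuming an illustrative volume
group name (the real kickstart iterates over the existing LVs):

  # wipe each logical volume before removing it so no stale metadata
  # survives into the re-created volumes
  for lv in $(lvs --noheadings -o lv_path cgts-vg 2>/dev/null); do
      wipefs -a "${lv}"
      dd if=/dev/zero of="${lv}" bs=1M count=10 conv=fsync
      lvremove -f "${lv}"
  done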

Test Plan:
PASS: Backup and restore on AIO-DX
PASS: Install AIO-SX over the previous installation without
wiping the disks and check install.log to verify
that the disks are wiped during kickstart.
PASS: Install AIO-DX, reinstall Controller-1, and check
install.log to verify that the disks were wiped during kickstart.

Closes-Bug: #2031542

Change-Id: Ib00d77fbc9dfd62e9c94f418e29f2805f8a0c036
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2023-09-27 14:20:01 +00:00
Zuul 8c2e1c395a Merge "Remove machine-id generated from build from subcloud install" 2023-09-27 12:58:00 +00:00
Zuul 182547b31f Merge "Revert "Fix kickstarts patching"" 2023-09-27 00:18:14 +00:00
Bruce Jones 4e09b61d0d Revert "Fix kickstarts patching"
This reverts commit 0366f8552d.

Reason for revert: breaks sanity

Change-Id: Ie580ae328a80abfc2a1964157ac1b14b70dc98e9
2023-09-26 22:28:49 +00:00
Andre Kantek 7d88382c9e Remove machine-id generated from build from subcloud install
As was done in the previous change for local installation,
https://review.opendev.org/c/starlingx/metal/+/863322,
this change removes the ISO-embedded machine-id file so that the
value is regenerated after the first boot post-install for subclouds
that are added to a system controller using the redfish protocol.
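
A minimal sketch of the idea; systemd regenerates the machine id on
first boot when /etc/machine-id is empty or missing:

  # drop the build-time machine id from the image so each subcloud
  # generates its own on first boot
  rm -f /var/lib/dbus/machine-id
  : > /etc/machine-id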

Test Plan
[PASS] install 2 subclouds from the system controller containing the
        patch and check that the values in /etc/machine-id and
        /var/lib/dbus/machine-id are unique for each subcloud

Closes-Bug: 2037434

Change-Id: If7a631b5769cb499956a7e5ee33e3361a6230452
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2023-09-26 12:06:39 -03:00
Zuul 495bb4ab1a Merge "Fix kickstarts patching" 2023-09-22 14:16:29 +00:00