433 Commits

Author SHA1 Message Date
junfeng-li fc8b75c6f9 Fix USM endpoint append error
This fixes the USM endpoint URL not being appended properly
when using the .join() function.

The way .join() was used only appended the endpoint resource
to an empty string.
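
The commit message does not show the affected code, so the following is only a minimal sketch of how this kind of bug can arise with Python's str.join(). The function names, host and resource path are hypothetical and chosen for illustration.

```python
def build_url_buggy(base_url: str, resource: str) -> str:
    # "".join(x) joins the characters of x with an empty separator,
    # so the result is just the resource; base_url is never used.
    url = ""
    return url.join(resource)


def build_url_fixed(base_url: str, resource: str) -> str:
    # Join base and resource with exactly one slash between them.
    return base_url.rstrip("/") + "/" + resource.lstrip("/")


base = "http://usm.example:8080"        # hypothetical host and port
resource = "/v1/hypothetical/resource"  # hypothetical resource path

print(build_url_buggy(base, resource))
print(build_url_fixed(base, resource))
```

The buggy variant returns only the resource path, which matches the symptom described above: the resource ends up appended to an empty string rather than to the endpoint.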

Test Plan:

PASS: Run the host-swact

Depends-on: https://review.opendev.org/c/starlingx/update/+/911003

Task: 49660
Story: 2010676
Change-Id: Icf47494843d7ea6c9fcd73e9256d9be352d8f76f
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
2024-03-04 21:49:08 +00:00
Zuul 95b6310ac1 Merge "Deploy state sync on swact" 2024-02-29 18:51:50 +00:00
junfeng-li 23f48bd545 Deploy state sync on swact
This commit ensures that both controllers' deployment
state is in sync before a host swact during a platform
upgrade.

If the USM deploy has not started, this host swact
pre-check always passes.

During the pre-swact check, SM calls the
USM REST API endpoint to get the controller
sync status. If the controllers' deployment state
is not in sync, the host swact is stopped.
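
The decision described above can be sketched as a small predicate. This is an illustration only: the function and parameter names are hypothetical, and in the real system the sync status comes from the USM REST API rather than from a boolean argument.

```python
def swact_precheck(deploy_started: bool, controllers_in_sync: bool) -> bool:
    """Return True if the host swact may proceed (hypothetical helper)."""
    if not deploy_started:
        # No USM deploy in progress: the pre-check always passes.
        return True
    # Deploy in progress: allow the swact only when both controllers agree.
    return controllers_in_sync


for started in (False, True):
    for in_sync in (False, True):
        print(started, in_sync, swact_precheck(started, in_sync))
```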

Depends-on: https://review.opendev.org/c/starlingx/update/+/906005

Test Plan:

PASS: executed host swact when controllers are in sync
PASS: executed host swact when controllers are not in sync

Task: 49425
Story: 2010676

Change-Id: I8d262a731583f691fd0d85a33ddebcbb12f549e8
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
2024-02-28 20:07:48 +00:00
Eric MacDonald 91fa44188c Add node locked gate to SM enable for DX systems
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if its
file is not present.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.
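
As a rough illustration of the gate logic, the sketch below checks both flag files. It is not SM's actual implementation (SM is not written in Python). The node locked path is the one quoted above; the goenabled path is passed in as a parameter because this message does not state it, and the simplex bypass reflects the AIO SX test case listed in the test plan.

```python
import os
import tempfile

NODE_LOCKED_FILE = "/etc/mtc/tmp/.node_locked"  # path quoted in the commit message


def enable_allowed(goenabled_file: str,
                   node_locked_file: str = NODE_LOCKED_FILE,
                   simplex: bool = False) -> bool:
    """Hypothetical gate: True if SM may proceed with enable."""
    if not os.path.exists(goenabled_file):
        return False  # goenabled gate: block when the file is absent
    if not simplex and os.path.exists(node_locked_file):
        return False  # node locked gate: block when the file is present
    return True


# Demo in a temporary directory so no real system paths are touched.
tmp = tempfile.mkdtemp()
goenabled = os.path.join(tmp, "goenabled")  # stand-in file name
locked = os.path.join(tmp, ".node_locked")
open(goenabled, "w").close()

unlocked_result = enable_allowed(goenabled, locked)
open(locked, "w").close()
locked_result = enable_allowed(goenabled, locked)
simplex_result = enable_allowed(goenabled, locked, simplex=True)
print(unlocked_result, locked_result, simplex_result)
```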

This update also cleans up some extra whitespace in the changed file.

Test Plan:

PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shutdown until the manually created Nv node locked
          file is removed. At which point SM proceeds to activate
          services on that controller again.

PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging

Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f)
2024-02-26 22:15:03 +00:00
Zuul 338161f443 Merge "Revert "Add node locked gate to SM enable"" 2024-02-23 14:26:53 +00:00
Eric MacDonald 1e62ab86f1 Revert "Add node locked gate to SM enable"
This reverts commit 23d0d8ab2f.

Reason for revert: Breaks AIO SX Enable

Change-Id: I662b8732e723f4ce5b748ef00a184ae5b8db523c
2024-02-23 14:06:10 +00:00
Zuul 031c2e223d Merge "Add node locked gate to SM enable" 2024-02-16 16:11:22 +00:00
Zuul 9367d45672 Merge "Avoid potential blocking of heartbeat thread" 2024-02-14 21:35:21 +00:00
Eric MacDonald 23d0d8ab2f Add node locked gate to SM enable
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if its
file is not present.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.

This update also cleans up some extra whitespace in the changed file.

Test Plan:

PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shutdown until the manually created Nv node locked
          file is removed. At which point SM proceeds to activate
          services on that controller again.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.

Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 13:01:03 +00:00
Zuul 2fd5ebc6e6 Merge "sm-common: add support for arm64" 2024-01-17 16:02:51 +00:00
Zuul 56b60d15a5 Merge "sm: fix the hardcoded includes for arm64" 2024-01-17 15:52:16 +00:00
Kyale, Eliud 0db57d60be Add service dependancy haproxy dnsmasq
haproxy uses DNS resolution.
Add a service dependency to the SM database
to ensure that the dnsmasq service is started before haproxy
and that dnsmasq is disabled after haproxy is disabled.
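
The ordering this dependency is meant to produce can be sketched with a topological sort. This is only an illustration using Python's standard graphlib module (Python 3.9+); SM stores its dependencies in its own database and does not use this code.

```python
from graphlib import TopologicalSorter

# Map each service to the services that must be enabled before it.
enable_deps = {"haproxy": {"dnsmasq"}}

# Enable order: dependencies first.
enable_order = list(TopologicalSorter(enable_deps).static_order())

# Disable order: the reverse, so dnsmasq goes down after haproxy.
disable_order = list(reversed(enable_order))

print("enable: ", enable_order)
print("disable:", disable_order)
```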

Test plan:

PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test

Closes-Bug: #2043506

Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-01-16 09:32:53 -05:00
Jackie Huang 35d8d23563 sm-common: add support for arm64
Add support for aarch64 in sm_trap_thread_log.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Jackie Huang 15a8ffeee0 sm: fix the hardcoded includes for arm64
The include path in the Makefile is hardcoded with x86_64.
Use dpkg-architecture to check the host arch and replace
the hardcoded name.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Bin Qian d91b069daf Avoid potential blocking of heartbeat thread
This avoids waiting on the hbs cluster query before sending the SM
alive pulse. When a hbs cluster query or alive pulse is being sent, do
not queue the subsequent alive pulse, as the request currently being
sent is good enough to update the hbs agent.
Also move the function that retrieves the socket address from inside
the query sending procedure to initialization. This avoids getaddrinfo
indirectly calling malloc, which invokes malloc_atfork, a potentially
blocking call.

TCs:
   This improves behavior in extreme situations only; regression passed.

Closes-bug: 2025504

Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2023-11-17 21:01:08 +00:00
Zuul 4b800442ed Merge "Use FQDN for MGMT network" 2023-11-02 20:22:20 +00:00
Steven Webster 6793f3840f Use controller-services service group for admin-ip
This commit resolves an issue seen when attempting to upgrade
/ patch from a previous StarlingX release to the current
StarlingX development load in which the distributed cloud
admin network feature is present.

1. Currently, the admin-services SM service group is created
   if the system is detected to be a subcloud.  This presents
   a problem during upgrade, because the <N> system has no
   concept of the admin-services group, while the upgraded
   <N+1> system does.  During upgrade when the system is
   swacted to the <N+1> system, an alarm will be raised as
   the <N> system has no admin-services group (designed to be
   an N+M redundancy model).

2. A potential solution to the above problem is to only provision
   the admin-services group if an operator has actually configured
   an admin interface / network after the full upgrade has completed.
   However, this presents the same sort of problem, as an
   interface-network association is done on a host-by-host basis,
   requiring a lock of each host to provision a new admin interface.
   Since there is no mechanism to provision a new SM service group
   at runtime, this leads to the same situation of one host being
   aware of the admin-services group, while the other is not.
   That is, a user might configure an admin interface-network on
   host <X>, but the service group would not be present on host <Y>,
   leading to the same alarm.

The solution here is to do away with the new admin-services service
group, and leverage the existing controller-services group for the
admin-ip service.

Test-Plan:

- Ensure no alarms when upgrading / patching a subcloud implementing
  the admin network feature.

- Ensure a system utilizing the admin network can become online /
  in-sync.

- Swact between controllers utilizing the admin-ip SM service to
  ensure the floating IP is correctly assigned to the active
  controller (using the controller-services service group).

Regression:

- Install a DC system using the management network.  Create the
  admin interface, network on the subcloud and ensure a user can
  update the subcloud to use the admin network.

Story: 2010319
Task: 47278

Change-Id: Ic36e83622c6ab5d15fd537be69d3314cb675c724
Signed-off-by: Steven Webster <steven.webster@windriver.com>
2023-10-30 01:52:58 -04:00
Fabiano Correa Mercer c5fb81828a Use FQDN for MGMT network
The management network is used extensively for all internal
communication.
Since the original use of the network was a private network before
it was exposed for external communication in a distributed cloud
configuration, it was never designed to be reconfigured.

To support MGMT network reconfiguration the idea is to configure the
applications to use the hostname/FQDN instead of a static MGMT IP
address.

In this way the MGMT network can be changed and the services and
applications will still work, since they use the hostname/FQDN
and DNS is responsible for translating it to the current MGMT
IP address.

The use of FQDN will be applied for all installation modes: AIO-SX,
AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the
complexities of supporting the multi-host reconfiguration,
the MGMT network reconfiguration will focus on support for AIO-SX
only.

The DNSMASQ service must start as soon as possible to translate
the FQDN to an IP address. For this reason, dnsmasq will start
as soon as the management-ip is ready.

Test plan ( Debian only )
 - AIO-SX and AIO-DX virtualbox installation IPv4/IPv6
 - Standard virtualbox installation IPv6
 - DC virtualbox installation IPv4 ( AIO-SX/DX subclouds )
 - AIO-SX and AIO-DX installation IPv4/IPv6
 - AIO-DX plus installation IPv6
 - DC IPv6 and subcloud AIO-SX
 - AIO-DX host-swact
 - DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX
 - AIO-SX to AIO-DX migration
 - netstat -tupl ( no services are using the MGMT IP address )
 - Ran sanity/regression tests

Story: 2010722
Task: 48889
Depends-On: https://review.opendev.org/c/starlingx/config/+/886208

Change-Id: If118132410a5a3db4c3a9d0ba029f4d45521574d
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2023-10-26 12:01:08 -03:00
Kyale, Eliud efe4a7a370 IF_STATE_MASK fix for SM_FAILOVER_HEARTBEAT_ALIVE
The change of SM_FAILOVER_IF_STATE_MASK from 0xF to 0x3F
was clearing the HEARTBEAT ALIVE flag.

SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16
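
The bit arithmetic behind this can be checked with a short sketch. The two constants are taken from the text above; how SM actually applies the mask is not shown in this message, so the "clear the bits under the mask" operation below is an assumption used only to illustrate why the wider mask can wipe the flag.

```python
SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4  # 16, as quoted above
OLD_MASK = 0xF    # bits 0-3
NEW_MASK = 0x3F   # bits 0-5, which includes bit 4


def clear_masked_bits(state: int, mask: int) -> int:
    # Assumed usage: reset the interface-state bits, keep everything else.
    return state & ~mask


state = SM_FAILOVER_HEARTBEAT_ALIVE | 0x1  # flag plus one low state bit

after_old = clear_masked_bits(state, OLD_MASK)
after_new = clear_masked_bits(state, NEW_MASK)

print(bool(after_old & SM_FAILOVER_HEARTBEAT_ALIVE))  # flag survives 0xF
print(bool(after_new & SM_FAILOVER_HEARTBEAT_ALIVE))  # flag is lost with 0x3F
```

Because bit 4 lies inside 0x3F but outside 0xF, any code path that resets the masked bits also resets the heartbeat alive flag once the mask is widened.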

This change restores the previous system behavior. A tester performed
a cable pull on the oam ports. The expected behavior was an alarm
being raised. Instead, the standby controller ended up getting rebooted.

oam interface testing was simulated by bringing the ip link down for 1
second.

For example:

sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up

-----------------
Before change
-----------------
- Heartbeat loss on oam interface resulted in standby controller reboot

-----------------
After change:
-----------------

- Heartbeat loss on oam interface resulted in alarm raised
- Logs indicate the health score of controller-1 drops by 1 point

Test plan:

PASS - AIO-SX: iso install

PASS - AIO-DX: iso install
               drop oam interface on standby
               verify standby controller-1 is not rebooted
               by active controller-0
               restore oam interface

PASS - AIO-DX: system host-swact . swact back and forth

Closes-Bug: 2037579

Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2023-10-02 07:45:36 -04:00
Matheus Guilhermino fd75850f12 Deployment Optimizations: SM throttling disable
SM throttling is a feature that limits the number of
service-enabling processes running in parallel. The SM throttling
mechanism is always ON, during and after startup. It takes part in
the process whenever SM-controlled services transition to enabled
status.

By default, the SM throttling allows a maximum of 2 parallel service
enabling processes at a time, but the throttling size is configurable,
by means of the field ENABLING_THROTTLE from the CONFIGURATION table in
the SM-DB database. Hence, it is possible to disable the SM throttling
by increasing the throttling size to a sufficiently large value, so
that services can be enabled in parallel at full capacity.
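
To see why the throttle size matters, the sketch below simulates enabling a batch of services with at most `throttle` running at once. It is a toy model, not SM's scheduler: the service count and the one-second durations are made up for illustration, and dependencies between services are ignored.

```python
import heapq


def total_enable_time(durations, throttle):
    """Simulated wall-clock time to enable all services (toy model)."""
    running = []  # min-heap of finish times for in-flight enables
    for d in durations:
        if len(running) < throttle:
            start = 0.0                      # a free slot is available at t=0
        else:
            start = heapq.heappop(running)   # wait for the earliest finisher
        heapq.heappush(running, start + d)
    return max(running) if running else 0.0


durations = [1.0] * 6  # six hypothetical services, one second each

throttled = total_enable_time(durations, throttle=2)
unthrottled = total_enable_time(durations, throttle=1000)
print(throttled, unthrottled)
```

With a throttle of 2 the six services finish in three rounds; with a throttle far above the number of services they all start at once, which is the effect of raising the value to disable throttling.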

This commit improves system performance by disabling the SM throttling
for AIO-SX systems, while still keeping the SM throttling mechanism,
should it ever be needed as a fallback, for robustness reasons.

In order to evaluate the SM throttling feature, and its cost in terms
of performance, the throttling size was changed from 2 to 1000 so as
to disable the feature, and a series of tests were conducted to
evaluate stability and performance benefits.

Test Plan:
  - Fresh Install and bootstrap (PASS)
  - Lock/Unlock (PASS)
  - Restart SM service (PASS)

Story: 2010802
Task: 48312

Change-Id: Ie96115293049e9939bc43feb2ad11432dd318323
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
2023-09-14 14:31:20 -03:00
Luis Eduardo Bonatti e35510e1cc Fix /var are not updated according to the patch changes
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable. This commit changes the sm-patch.sql deploy path
to a place that ostree handles, /usr/share/sm/patches in this
case, and symlinks it to /var/lib/sm/patches/sm-patch.sql.
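
The effect of the symlink can be sketched in a scratch directory. This is an illustration only: the temporary root stands in for "/", the file name under /usr/share/sm/patches is assumed to be sm-patch.sql, and the file contents are placeholders.

```python
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for "/" so no real system paths are touched
managed_dir = os.path.join(root, "usr/share/sm/patches")  # ostree-managed location
var_dir = os.path.join(root, "var/lib/sm/patches")        # not managed by ostree
os.makedirs(managed_dir)
os.makedirs(var_dir)

managed_file = os.path.join(managed_dir, "sm-patch.sql")  # assumed file name
with open(managed_file, "w") as f:
    f.write("-- patch v1\n")

# The /var path is only a symlink to the managed file.
link = os.path.join(var_dir, "sm-patch.sql")
os.symlink(managed_file, link)

# A later patch rewrites the managed file...
with open(managed_file, "w") as f:
    f.write("-- patch v2\n")

# ...and readers of the /var path see the new content through the link.
with open(link) as f:
    content_via_var = f.read()
print(content_via_var)
```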

Test Plan:
PASS: ISO install symlink created
PASS: sm-patch.sql installed to /usr/share/sm/patches
PASS: PATCH apply and changes applied to
      /var/lib/sm/patches/sm-patch.sql on stx8

Closes-Bug: 2030890

Change-Id: I07047e5383e8ae9e57687cd1e852c2efc0eb755f
Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>
2023-08-10 14:52:02 +00:00
Steven Webster 4a96509146 Disable admin network failover behaviour
A requirement for a subcloud's admin network is that its
subnet information be able to be updated without host
lock / unlock.

Accordingly, the service domain interface and admin-ip
service in SM must be provisioned / deprovisioned at
runtime.

In an AIO-DX system this can cause issues in certain
circumstances, as the disablement / enablement must be
done via puppet and can be affected by the order in which
a user performs each action as well as the timing of the
currently running manifests on each host.

This commit disables the failover behaviour for the admin
network, as link flapping and heartbeat losses are expected
as the service domain interface is provisioned/deprovisioned.

This commit also disables heartbeat messages on service
domain interface de-provision to prevent log spamming,
and fixes a couple of other minor issues that were
found while testing.

Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/889872

Test plan:

- No uncontrolled swacts while re-configuring admin subnets
  or reverting to the management subnet (deleting the admin
  address pool) dozens of times.

- Alarms still generated on interface down / heartbeat loss

- Switching back and forth between admin network / mgmt
  network via dcmanager.

Story: 2010319
Task: 47707

Change-Id: I761b5b20b6de198ef763b2d3480e6f7cd380f952
Signed-off-by: Steven Webster <steven.webster@windriver.com>
2023-08-01 12:08:56 -04:00
Roger Ferraz ac8f60b120 starlingx/ha README improvement
This story updates the README files of a few of the most used
StarlingX repos.

Test Plan: N/A

Story: 2010814
Task: 48377

Change-Id: I34b14275e6b61e1be6d659701be568e208e380e4
Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>
2023-07-19 12:28:24 -03:00
Lucas Borges 42fb99a393 Removing sysinv-conductor dependency from rabbitmq
The Sysinv components use ZeroMQ to communicate with
each other. The service dependency between
sysinv-conductor and rabbit is removed in Service Manager,
improving the time it takes to swact by 20%.

Test Plan:

PASS: Bootstrap AIO-SX, AIO-DX and Standard.
PASS: Lock/unlock/swact in all environments
PASS: Bootstrap DC and perform lock/unlock/swact
PASS: Add subcloud in DC environment
PASS: Test DC orchestration (dcmanager dcorch)
PASS: Restart sysinv only
PASS: Restart sysinv, then dcorch and dcmanager
PASS: After restarting sysinv, dcorch and dcmanager, run manage/unmanage
PASS: After restarting sysinv, run subcloud backup
PASS: After all service restarts, verify the whole system is working

Refer to review
https://review.opendev.org/c/starlingx/config/+/859571
for sysinv/ZeroMQ change details

Closes-bug: 2022083
Signed-off-by: Lucas Borges <lucas.borges@windriver.com>
Change-Id: I4014c05c914fc946946b14519f28a85067b06b34
2023-06-02 16:37:36 -03:00
Zuul fc1936a70f Merge "Shorten rabbit failure recovery delay" 2023-05-09 18:59:27 +00:00
Bin Qian a85ffc695e Shorten rabbit failure recovery delay
In rare cases, when the system is running slowly with significant
scheduling delay, the rabbit disable action times out continually. As
a final resort, sm reboots the impacted controller for recovery after
the failure count reaches MAX_TRANSITION_FAILURES. As the rabbit
service disable timeout is set to 60 seconds, this results in a
significant delay before the reboot for recovery.

This change updates MAX_TRANSITION_FAILURES of the rabbit service from
16 to 5 to reduce the recovery delay after a rabbit failure.
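
A back-of-the-envelope estimate of the saving, as a sketch. It assumes each failed disable attempt runs to the full 60-second timeout and that no extra wait is added between attempts; the real escalation path may include other delays.

```python
DISABLE_TIMEOUT_SECS = 60  # rabbit disable timeout from the commit message


def worst_case_delay(max_failures: int, timeout: int = DISABLE_TIMEOUT_SECS) -> int:
    # Assumption: every attempt times out, attempts run back to back.
    return max_failures * timeout


before = worst_case_delay(16)
after = worst_case_delay(5)
print(f"before: {before}s, after: {after}s, saved: {before - after}s")
```

Under these assumptions the delay before the recovery reboot drops from 16 minutes to 5 minutes.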

TCs passed:
    Install a DX system
    Observed service group recovery escalated to reboot after 5 forced
    rabbit disable failures.

Closes-bug: 2016168
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
2023-05-09 03:48:48 +00:00
Davlet Panech e601f7ce3e Fix github mirroring for this repo
Updating the rsa ssh host key based on:
https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/

Note: In the future, StarlingX should have a zuul job and
secret setup for all repos so we do not need to do this
for every repo.

Needed to rename the secret, because zuul fails if like-named
secrets have different values in different branches of the same
repo.

Partial-Bug: #2015246
Change-Id: Iedfe334611d14e7e6b5a3b2108501d0b2fdf1e13
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
2023-04-28 12:38:51 -04:00
Kyale, Eliud 9e2ff82411 Add failover state of peer to heartbeat msg
- add failover state to heartbeat message ( 4 bits )
- add logic to survived_state to use peer's
  failover state to determine whether to exit survived state
  and enter normal state
- throttle "peer is normal" events with a threshold of 10,
  used to ensure the peer is normal and stable
- change fsm->send_event() log to debug from info log level
- a few logging improvements; debug send_event logs
- update copyright year 2023
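
The throttling of "peer is normal" events listed above can be sketched as a simple counter. This is a guess at the general shape, not the SM implementation: the class name is hypothetical, and the choice to count consecutive reports and reset on any non-normal report is an assumption.

```python
class PeerNormalThrottle:
    """Hypothetical helper: report stable only after enough normal heartbeats."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold  # threshold of 10 from the commit message
        self.count = 0

    def on_peer_state(self, peer_is_normal: bool) -> bool:
        # Assumption: any non-normal report restarts the count.
        self.count = self.count + 1 if peer_is_normal else 0
        return self.count >= self.threshold


throttle = PeerNormalThrottle()
first_nine = [throttle.on_peer_state(True) for _ in range(9)]
tenth = throttle.on_peer_state(True)
after_glitch = throttle.on_peer_state(False)
print(any(first_nine), tenth, after_glitch)
```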

Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: crash the sm as indicated in bug
               and observe swact to standby
PASS - AIO-DX: manual swact
PASS - AIO-DX: power off active controller
PASS - AIO-SX: install and basic sanity check
PASS - AIO-SX: upgrade test to verify sm heartbeat
               messages changes still function when
               controllers are running different loads

Closes-Bug: 2012519

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
2023-04-14 08:14:07 -04:00
Zuul 002f5a79db Merge "Update rule of disable & standby dependency" 2023-04-04 15:43:16 +00:00
Bin Qian c81032a572 Update rule of disable & standby dependency
This change updates the dependency check for services that are
being disabled or going standby.
The 2 specific rules are
1. If "service a" has a disable action dependency on "service b" with
   a targeted "service b" state of disabled, the disable action of
   "service a" is considered "dependency met" only when "service b"
   is in the disabled state or the enabled-standby state.
2. If "service a" has a go-standby action dependency on "service b"
   with a targeted "service b" state of disabled, the go-standby
   action of "service a" is considered "dependency met" only when
   "service b" is in the disabled state or the enabled-standby state.
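
The two rules can be expressed as one small predicate. This is a sketch for illustration: the function name is hypothetical, and the state strings other than disabled and enabled-standby are assumed examples of states a service might be in.

```python
DISABLED = "disabled"
ENABLED_STANDBY = "enabled-standby"


def dependency_met(action: str, service_b_state: str) -> bool:
    """Hypothetical check for a dependency whose target state is disabled."""
    if action not in ("disable", "go-standby"):
        raise ValueError(f"rule not covered by this sketch: {action}")
    # Both rules accept the same two states for "service b".
    return service_b_state in (DISABLED, ENABLED_STANDBY)


for state in ("enabled-active", "disabling", ENABLED_STANDBY, DISABLED):
    print(state, dependency_met("disable", state))
```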

TCs:
   passed: Performed host-swact operations repeatedly, with a long
           delay added to the disable action of the xxx-fs ocf-script;
           observed that all xxx-fs services are disabled before
           drbd-xxx services start disabling.

Closes-Bug: 2012570

Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: Ie9717d3b2b73dc7d623e1b980b3387c6c4e6d991
2023-03-27 19:39:09 +00:00
Fabiano Mercer 3d1d82b0a2 Keep platform-nfs-ip for upgrade process
The platform-nfs-ip service is not necessary for fresh installs
because it is just an alias for the controller IP.
But for old releases like StarlingX rel. 6 or 7, the
platform-nfs-ip uses a specific IP. If for some reason an error
occurs during the upgrade process, the upgrade will be aborted
and the nodes will downgrade to the old release again.
At that point the nodes will try to communicate with the
previous platform-nfs-ip configured in /etc/hosts.
But if the active controller is using the new release,
this IP doesn't exist anymore and the downgrade will fail.
For this reason the platform-nfs-ip service will be available
just for upgrade operations and will be deprovisioned for fresh
installs or at the end of the upgrade process
( upgrade-activate phase ).

Test plan
PASS Fresh install on AIO-SX
     Fresh install on AIO-DX
PASS Upgrade AIO-DX system from CENTOS Rel 7 to DEBIAN Rel 8
PASS Reboot controller-0 during upgrade of AIO-DX
     controller-1 was the active one with the new release ( Rel 8 )
     controller-0 using old release.
     reboot controller-0 and check if it could connect to
     controller-1 using old platform-nfs-ip.
PASS Upgrade-abort during AIO-DX upgrade
     controller-1 was the active controller and already upgraded
     controller-0 was upgraded but locked.
     Abort the upgrade and downgrade to old release ( Rel 7 )

Partial-Bug: #2012387

Signed-off-by: Fabiano Mercer <fabiano.correamercer@windriver.com>
Change-Id: I704e15fffc6e7efa7b1fea56164a21af02222dd6
2023-03-22 14:53:01 -03:00
Kyale, Eliud b65eb7b2f6 Add PTHREAD_PRIO_PROTECT to sm mutexes
- rename mutexes from generic name '_mutex'
- create common util functions for initializing and destroying mutexes
- add mutex initialize/finalize functions
  and run them in sm_main_process_initialize
- update copyright info to 2023

Test plan:

PASS - AIO-DX: iso install
PASS - AIO-DX: verify ha swact
PASS - AIO-DX: failover swact test

PASS - AIO-DX: run pi_stress (rt-tests)
               to confirm priority inheritance POSIX attribute is working
-------------------------------------------------------------------
sysadmin@localhost:~$ sudo pi_stress --uniprocessor --duration=10s
Starting PI Stress Test
Number of thread groups: 3
Duration of test run: 10 seconds
Number of inversions per group: unlimited
     Admin thread SCHED_FIFO priority 4
3 groups of 3 threads will be created
      High thread SCHED_FIFO priority 3
       Med thread SCHED_FIFO priority 2
       Low thread SCHED_FIFO priority 1
Current Inversions: 992034
Stopping test
Total inversion performed: 992038
Test Duration: 0 days, 0 hours, 0 minutes, 11 seconds
-------------------------------------------------------------------

PASS - AIO-DX: valgrind helgrind
                * test for inconsistent lock ordering
                * test race conditions
                  [ detected outside scope of jira ]
                * test for deadlocks

Task: 47503
Story: 2010609

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: Ic77f08ca7c3a687b1cc219ac9cba5711979206e8
2023-03-02 20:49:17 +00:00
Zuul 48ec1759ed Merge "Add admin network support to SM" 2023-02-16 21:02:00 +00:00
Zuul 32927b3139 Merge "Update debian package versions to use git commits" 2023-02-15 16:10:38 +00:00
Al Bailey 4a424d77cc Fix zuul pep8 failures related to bugbear
A new version of bugbear, released Feb 13, raises new
error codes.  We are setting an upper limit for bugbear
to be from before that version was released.

bandit is also reporting a hashlib error.  The bandit
job is non-voting, but the error is now being suppressed.

Story: 2010531
Task: 47313
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: Iffb790c67e658c7a40c697e364cb34f4c4f9ec6c
2023-02-14 16:47:42 +00:00
Steven Webster db1eea124d Add admin network support to SM
Add SM support for the DC admin network

This commit adds SM support for the DC admin network.

The admin network is intended to be used between a subcloud
and system controller. Because the (existing) management network
is so embedded in other parts of the StarlingX system, it makes
it prohibitively hard to re-configure this network after initial
installation.  The admin network is intended to be isolated from
the management network, allowing re-configuration of the network
parameters in the case that the physical network between subcloud
and system controller has been changed.

In the case of admin network usage, the management network still
exists but is a private network in the context of a subcloud.

This specific commit provides for admin-ip and admin-interface
services to be added to the SM database and be recognized in
processing similar to the management, cluster-host, oam, etc
networks.

Since there is a requirement for the admin IP subnet information
to be allowed to change at runtime, in-service updating of SM
information relating to the admin-ip service (floating IP), as
well as unicast heartbeating between peers is also added in this
commit.

Testing:

AIO-SX:
    - admin-ip service is enabled when the admin network is
      created.
    - admin-ip service is not enabled when the admin network is
      not created.
    - floating-ip is updated on the admin interface when admin
      addr-pool information is changed.
AIO-DX:
    - admin-ip service is enabled when the admin network is
      created.
    - admin-ip service is not enabled when the admin network
      is not created.
    - floating-ip is updated on the active-controller when the
      admin addr-pool information is changed.
    - When a peer admin interface is down, an alarm is raised.
    - When a peer admin IP is not correct (changed), an alarm
      is raised.
    - Swact between controllers.
    - Inactive controller admin interface goes down
        Result: A 400.005 major communication loss fault is generated
                for the inactive controller entity
    - Inactive controller admin interface comes back up
        Result: The fault is cleared
    - Inactive controller admin IP address is removed/changed
        Result: Two 400.005 major communication loss faults are
                generated for both controller entities
    - Inactive controller admin node IP address is re-applied
        Result: The faults are cleared
    - Active admin interface goes down
        Result: A 400.005 major communication loss fault is generated
                for the inactive controller entity.  A swact is not
                issued.
    - Active admin interface comes back up
        Result: The fault is cleared
    - Active admin node IP address is removed/changed
        Result: Two 400.005 major communication loss faults are
                generated for both controller entities.
                A swact is not issued.
    - Active admin floating IP address is removed/changed
        Result: A 400.001 critical admin-services / admin-ip alarm
                is raised.
                A swact occurs.
                The floating admin IP is applied to the newly active
                controller. Alarms are cleared.
    - After the above test, the newly active controller swacts back
      to the previously active controller.
        Result: No alarms.
                The floating IP is applied to the newly active
                controller.
    - The cable for the management interface on the active controller
      is pulled
        Result: A swact occurs
    - The cable for the OAM interface on the active controller
      is pulled
        Result: A swact occurs
    - The cable for the Admin interface on the active controller
      is pulled
        Result: A swact occurs. 400.005 alarms are raised.
    - The mgmt, cluster-host, oam interfaces are all brought down/up at
      the same time.  The admin interface is also brought down,
      but not brought back up.
        Result: A swact occurs, with multiple controller-services
                related to the mgmt interface being in degraded state.

Story: 2010319
Task: 47278

Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I65df52600f4d5c499dceed32739cab414d36847a
2023-02-14 15:14:28 +00:00
Luis Sampaio 5d4ba83910 Update debian package versions to use git commits
The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
PASS: build-pkgs -p sm-common
PASS: build-pkgs -p sm-db
PASS: build-pkgs -p sm
PASS: build-pkgs -p sm-api
PASS: build-pkgs -p sm-client
PASS: build-pkgs -p sm-tools
PASS: build-pkgs -p stx-ocf-scripts

Story: 2010550
Task: 47341
Signed-off-by: Luis Sampaio <luis.sampaio@windriver.com>
Change-Id: I54cde0fe252c3bcef669969a1b0675a2df8b3d69
2023-02-10 10:14:48 -08:00
Al Bailey 88b1d3b5c6 Update zuul jobs from python2 to python3
- removed the old and unused devstack job.
- needed to update pylint to python3
- needed to enable bindep to install the packages
  required for installing mysqlclient

No test plan is provided, because the purpose of this
change is to determine if zuul is broken for this repo,
and then fix it.

Story: 2010531
Task: 47313

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I6655792d692644f2c6320b0ef1d66283e17d7bb3
2023-02-07 20:20:57 +00:00
Alyson Deives Pereira da832e0ad6 Remove disable dependence between ceph-manager and sysinv-conductor
When the system is unstable, using a lot of CPU, it takes
more time for the communication between the components to happen,
such as the communication between mgr-restful-plugin and ceph-mgr.
This communication failure may result in a failed audit by SM,
which then restarts the mgr-restful-plugin.

Change [1] resolved this issue by increasing the timeout and retries
of mgr-restful-plugin in SM database.

However, there is a disable dependence chain between mgr-restful,
ceph-manager, and sysinv-conductor which results in sysinv-conductor
being restarted if mgr-restful-plugin or ceph-manager is also disabled
by SM. This can impact platform-integ-apps apply or any other action
being executed by sysinv-conductor.

The ceph manager -> sysinv-conductor dependence is not necessary
anymore after the changes [2] and [3] were merged, thus this change
removes this dependence.

TEST PLAN:
PASS: AIO-SX: bootstrap, unlock and apply platform-integ-apps
PASS: Force ceph-manager to be restarted by SM, and verify that
      sysinv-conductor keeps running

Related-Bug: 2000080

[1] https://review.opendev.org/c/starlingx/ha/+/868118
[2] https://review.opendev.org/c/starlingx/utilities/+/856320
[3] https://review.opendev.org/c/starlingx/utilities/+/860570

Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: I949ccebd509b8099870b3dfda252a60b6b423715
2023-01-30 19:22:18 +00:00
Zuul 9ccc492135 Merge "Increase retries and timeouts on "audit-enabled" of mgr-restful-plugin in SM database" 2023-01-25 15:49:33 +00:00
Rafael Falcao d312729809 Remove guest-agent related queries from sm database
The guest-agent service is currently activated in setups
where stx-openstack is applied, but it has not been used
since the move to containerized openstack.
Since this service is no longer used, this change removes
the service and all related queries from the
create_sm_db file.

Test Plan:
PASS: Perform a fresh install on a duplex environment and
check that no error log related to guest-agent is appearing
in the sm log file and no 400.02 alarm was raised on
fm alarm-list.

Closes-Bug: 2003117

Signed-off-by: Rafael Falcao <rafael.vieirafalcao@windriver.com>
Change-Id: I145bd8a45c12319facc4d1eff90b785a33a1d2c0
2023-01-17 19:46:49 -03:00
Erickson Silva de Oliveira b4fb57c610 Increase retries and timeouts on "audit-enabled" of mgr-restful-plugin in SM database

When the system is unstable, using a lot of CPU, it takes
more time for the communication between the components to happen.

So it's necessary to increase the maximum number of retries and the
timeouts in the "audit-enabled" action of the mgr-restful-plugin to
prevent errors from happening.

Test Plan:
PASS: mgr-restful-plugin restarted by SM (AIO-SX)

Closes-bug: 2000080

Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
Change-Id: I0f8462fef20196a3bb913fa7d374a86a2c6565f1
2023-01-05 18:17:54 +00:00
Zuul cba72cdef0 Merge "Set max open file limit for sm service" 2023-01-05 16:49:13 +00:00
Alyson Deives Pereira 25ddaa31f9 Set max open file limit for sm service
The sm process initializes with the default limit
for the maximum number of open files. If somehow this default value is
set to a high value, performance degradation can occur. For instance,
the sm_service_action_run method from sm_service_action.c contains a
for loop that closes file descriptors up to this limit.

To avoid performance degradation, this change sets the limit for
the maximum open files to 1024 using the systemd property LimitNOFILE.

TEST PLAN:
PASS: Confirm that 1024 is the max open file limit in
  /proc/<sm_pid>/limits.

Story: 2010087
Task: 47102

Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: Iff848da7461a8b644057e9f58c69fa2d78499226
2023-01-03 17:51:01 -03:00
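
The cost described in the commit above can be illustrated with a minimal Python sketch. The real loop lives in C in sm_service_action.c and is not reproduced here; the helper names below are hypothetical, and the assumption that the loop starts at descriptor 3 (just past stdin, stdout, and stderr) is ours, not the source's. The sketch only counts how many descriptors a "close everything up to the limit" loop would visit for a given soft limit.

```python
import resource


def fds_to_close(soft_limit, first_fd=3):
    """Return the descriptors a close-all loop would visit.

    Hypothetical model: the loop walks every descriptor from first_fd
    up to (but not including) the soft open-file limit. The starting
    value of 3 is an assumption, not taken from the SM source.
    """
    return range(first_fd, soft_limit)


def current_soft_limit():
    """Read this process's soft RLIMIT_NOFILE (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    # Compare the capped limit from the commit with a much larger one.
    for limit in (1024, 1048576):
        print(limit, "->", len(fds_to_close(limit)), "close attempts")
    print("soft limit of this process:", current_soft_limit())
```

With the limit capped at 1024, the modeled loop makes about a thousand close attempts per service action; with a limit in the millions, it makes about a thousand times more, which is the degradation the commit aims to avoid.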
Al Bailey 3608cdb14e Update tox.ini to work with tox 4
This change will allow this repo to pass zuul now
that this has merged:
https://review.opendev.org/c/zuul/zuul-jobs/+/866943

Tox 4 deprecated whitelist_externals.
Replace whitelist_externals with allowlist_externals

Partial-Bug: #2000399

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I04352e372c8ea2008f7ab0e7cdfa1d8b975e4251
2022-12-26 22:09:27 +00:00
Zuul cf4015918c Merge "Update SM lsb script for quick start" 2022-12-05 20:00:06 +00:00
Bin Qian 88aeba251b Update SM lsb script for quick start
The pidof command returns a subprocess ID when the SM main process
terminates. This results in a false positive that SM is already
running, so the start action is skipped.

Make changes to the SM lsb script to distinguish whether a subprocess
ID is returned, and attempt to kill it to speed up recovery of SM.

Revert the change to extend startuptime to 15 seconds back to 5.

Test Cases:
    kill SM process, observe SM process starts immediately after the
    subprocess is killed. SM is recovered within 2 seconds.
    (calculated by last and first logging of SM)

Change-Id: Ida834e7dd31a493ee6193b4d8ee73ebd97513de2
Closes-Bug: 1998349
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2022-12-05 16:10:55 +00:00
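
The commit above does not say how the lsb script tells the main process apart from a leftover subprocess. One possible approach, sketched here in Python rather than shell, is to compare the PIDs reported by pidof against the PID recorded in the PID file. The function names and the PID-file comparison are assumptions made for illustration, not the script's actual logic.

```python
def classify_pids(pidof_pids, pidfile_pid):
    """Split pidof results into the main process and leftover subprocesses.

    Assumption: the main process is the one whose PID matches the PID
    file; any other PID reported under the same name is a subprocess.
    """
    main = [pid for pid in pidof_pids if pid == pidfile_pid]
    leftovers = [pid for pid in pidof_pids if pid != pidfile_pid]
    return main, leftovers


def plan_start(pidof_pids, pidfile_pid):
    """Decide whether to start the service and which PIDs to kill first.

    Returns (should_start, pids_to_kill). If the main process is still
    alive, nothing is started or killed.
    """
    main, leftovers = classify_pids(pidof_pids, pidfile_pid)
    if main:
        return False, []
    return True, leftovers


if __name__ == "__main__":
    # Main process (PID 100) is gone, but a subprocess (PID 137) remains.
    print(plan_start([137], 100))
    # Main process is still running alongside its subprocess.
    print(plan_start([100, 137], 100))
```

In this model, a pidof result that contains only non-matching PIDs no longer counts as "already running": the leftovers are killed and the start action proceeds.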
Zuul 8cd67d8663 Merge "Increase SM rabbit enable timeout" 2022-12-01 22:30:08 +00:00
Victor Romano b3c3c5a71e Increase SM rabbit enable timeout
On DC with a standard central cloud, RabbitMQ sometimes
doesn't start properly and is repeatedly restarted
by SM. Part of the solution is increasing the startup time from
60 to 90, so that SM doesn't restart the service while
the service is still initializing. This, alongside [1],
fixes the problem.

Test Plan:
PASS - Run sm-restart service rabbit successfully and check that rabbit
  was running as expected. This test was used to recreate the bug.
PASS - Reboot the host successfully and check rabbit was running as
  expected.
PASS - Lock/unlock and check if rabbit was running as expected.

Regression test:
PASS - Install and bootstrap DC
PASS - Install and bootstrap DX

[1]
https://review.opendev.org/c/starlingx/config-files/+/865693

Closes-bug: 1997966

Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
Change-Id: I1425324c72ae41c66558709dead6f1ff92a3230a
2022-12-01 13:45:46 +00:00
Bin Qian 8b5ee400b5 Update pmon SM wait time to 15 seconds
SM's pmon.conf file has the startuptime = 5 seconds but in Debian
it's frequently taking longer than that for the SM process to
start and produce its PID file.
In order to make SM recovery process smooth, increase this timeout
to 15 seconds.

Closes-bug: 1998349

Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I64476e394e346c9b8cf5b5aca2ad04ba463b9728
2022-11-30 15:39:20 +00:00