Commit Graph

252 Commits

Author SHA1 Message Date
Eric MacDonald 91fa44188c Add node locked gate to SM enable for DX systems
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler to
block enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if
goenabled is not present.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.

This update also cleans up some extra whitespace in the changed file.
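
The gate described in this commit message can be sketched as follows. This is a minimal illustration in Python, not SM's actual enable handler: the function name `enable_allowed`, the `goenabled_file` parameter, the `simplex` flag and the injectable `exists` callable are all hypothetical. Only the `/etc/mtc/tmp/.node_locked` path comes from the message itself.

```python
import os

# Path taken from the commit message above.
NODE_LOCKED_FILE = "/etc/mtc/tmp/.node_locked"


def enable_allowed(goenabled_file, simplex=False,
                   node_locked_file=NODE_LOCKED_FILE,
                   exists=os.path.exists):
    """Return True when enable may proceed (illustrative names only)."""
    # Existing gate: block enable while the goenabled file is absent.
    if not exists(goenabled_file):
        return False
    # New gate: block enable while the node locked file is present.
    # Assumption: simplex (AIO SX) systems skip this check.
    if not simplex and exists(node_locked_file):
        return False
    return True
```

The `exists` parameter is only there so the logic can be exercised without touching the real filesystem.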

Test Plan:

PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shut down until the manually created Nv node locked
          file is removed, at which point SM proceeds to activate
          services on that controller again.

PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging

Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f)
2024-02-26 22:15:03 +00:00
Zuul 338161f443 Merge "Revert "Add node locked gate to SM enable"" 2024-02-23 14:26:53 +00:00
Eric MacDonald 1e62ab86f1 Revert "Add node locked gate to SM enable"
This reverts commit 23d0d8ab2f.

Reason for revert: Breaks AIO SX Enable

Change-Id: I662b8732e723f4ce5b748ef00a184ae5b8db523c
2024-02-23 14:06:10 +00:00
Zuul 031c2e223d Merge "Add node locked gate to SM enable" 2024-02-16 16:11:22 +00:00
Zuul 9367d45672 Merge "Avoid potential blocking of heartbeat thread" 2024-02-14 21:35:21 +00:00
Eric MacDonald 23d0d8ab2f Add node locked gate to SM enable
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler to
block enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if
goenabled is not present.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.

This update also cleans up some extra whitespace in the changed file.

Test Plan:

PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shut down until the manually created Nv node locked
          file is removed, at which point SM proceeds to activate
          services on that controller again.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.

Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 13:01:03 +00:00
Zuul 2fd5ebc6e6 Merge "sm-common: add support for arm64" 2024-01-17 16:02:51 +00:00
Zuul 56b60d15a5 Merge "sm: fix the hardcoded includes for arm64" 2024-01-17 15:52:16 +00:00
Kyale, Eliud 0db57d60be Add service dependency haproxy dnsmasq
haproxy uses DNS resolution.
Add a service dependency to the SM database
to ensure that the dnsmasq service is started before haproxy
and that dnsmasq is disabled only after haproxy is disabled.
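
The ordering this dependency is meant to enforce can be sketched as below. The dictionary is a hypothetical in-memory stand-in for the dependency rows in the SM database, and the function names are illustrative. SM's real scheduling is not shown in this log.

```python
# Hypothetical stand-in for the SM database dependency rows:
# each service maps to the services that must be enabled before it.
ENABLE_DEPENDS_ON = {
    "haproxy": ["dnsmasq"],
    "dnsmasq": [],
}


def enable_order(depends_on):
    """Return services ordered so that dependencies are enabled first."""
    ordered = []
    visiting = set()

    def visit(service):
        if service in ordered:
            return
        if service in visiting:
            raise ValueError("dependency cycle at " + service)
        visiting.add(service)
        for dep in depends_on.get(service, []):
            visit(dep)
        visiting.discard(service)
        ordered.append(service)

    for service in sorted(depends_on):
        visit(service)
    return ordered


def disable_order(depends_on):
    """Disable in reverse: dependents go down before what they rely on."""
    return list(reversed(enable_order(depends_on)))
```

With the table above, `enable_order` yields dnsmasq then haproxy, and `disable_order` yields haproxy then dnsmasq.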

Test plan:

PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test

Closes-Bug: #2043506

Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-01-16 09:32:53 -05:00
Jackie Huang 35d8d23563 sm-common: add support for arm64
Add support for aarch64 in sm_trap_thread_log.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Jackie Huang 15a8ffeee0 sm: fix the hardcoded includes for arm64
The includes path in Makefile is hardcoded with x86_64,
use dpkg-architecture to check the host arch and replace
the hardcoded name.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Bin Qian d91b069daf Avoid potential blocking of heartbeat thread
This avoids waiting on an hbs cluster query before sending the SM alive
pulse. When an hbs cluster query or alive pulse is being sent, do not
queue the subsequent alive pulse, as the request currently being sent
is good enough to update the hbs agent.
Also move the function that retrieves the sock address out of the query
sending procedure and into initialization. This avoids calling
getaddrinfo there, which indirectly calls malloc, which in turn invokes
malloc_atfork, a potentially blocking call.
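
The "do not queue the subsequent alive pulse" rule can be sketched as a simple in-flight flag. The class and method names below are hypothetical and only illustrate the coalescing idea, not SM's heartbeat thread.

```python
class AlivePulseSender:
    """Illustrative only: drop alive pulses while a request is in flight."""

    def __init__(self):
        self.in_flight = False   # a cluster query or alive pulse is being sent
        self.sent = 0            # number of requests actually started

    def request_alive_pulse(self):
        # If something is already being sent, it will update the hbs agent,
        # so the new pulse is dropped rather than queued.
        if self.in_flight:
            return False
        self.in_flight = True
        self.sent += 1
        return True

    def on_send_complete(self):
        # Called when the outstanding request finishes.
        self.in_flight = False
```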

TCs:
   This change could only improve behavior in extreme situations;
   regression passed.

Closes-bug: 2025504

Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2023-11-17 21:01:08 +00:00
Zuul 4b800442ed Merge "Use FQDN for MGMT network" 2023-11-02 20:22:20 +00:00
Steven Webster 6793f3840f Use controller-services service group for admin-ip
This commit resolves an issue seen when attempting to upgrade
/ patch from a previous StarlingX release to the current
StarlingX development load in which the distributed cloud
admin network feature is present.

1. Currently, the admin-services SM service group is created
   if the system is detected to be a subcloud.  This presents
   a problem during upgrade, because the <N> system has no
   concept of the admin-services group, while the upgraded
   <N+1> system does.  During upgrade when the system is
   swacted to the <N+1> system, an alarm will be raised as
   the <N> system has no admin-services group (designed to be
   an N+M redundancy model).

2. A potential solution to the above problem is to only provision
   the admin-services group if an operator has actually configured
   an admin interface / network after the full upgrade has completed.
   However, this presents the same sort of problem, as an
   interface-network association is done on a host-by-host basis,
   requiring a lock of each host to provision a new admin interface.
   Since there is no mechanism to provision a new SM service group
   at runtime, this leads to the same situation of one host being
   aware of the admin-services group, while the other is not.
   That is, a user might configure an admin interface-network on
   host <X>, but the service group would not be present on host <Y>,
   leading to the same alarm.

The solution here is to do away with the new admin-services service
group, and leverage the existing controller-services group for the
admin-ip service.

Test-Plan:

- Ensure no alarms when upgrading / patching a subcloud implementing
  the admin network feature.

- Ensure a system utilizing the admin network can become online /
  in-sync.

- Swact between controllers utilizing the admin-ip SM service to
  ensure the floating IP is correctly assigned to the active
  controller (using the controller-services service group).

Regression:

- Install a DC system using the management network.  Create the
  admin interface, network on the subcloud and ensure a user can
  update the subcloud to use the admin network.

Story: 2010319
Task: 47278

Change-Id: Ic36e83622c6ab5d15fd537be69d3314cb675c724
Signed-off-by: Steven Webster <steven.webster@windriver.com>
2023-10-30 01:52:58 -04:00
Fabiano Correa Mercer c5fb81828a Use FQDN for MGMT network
The management network is used extensively for all internal
communication.
Since the original use of the network was a private network before
it was exposed for external communication in a distributed cloud
configuration, it was never designed to be reconfigured.

To support MGMT network reconfiguration the idea is to configure the
applications to use the hostname/FQDN instead of a static MGMT IP
address.

In this way the MGMT network can be changed and the services and
applications will still work since they are using the hostname/FQDN
and the DNS will be responsible for translating it to the current
MGMT IP address.

The use of FQDN will be applied for all installation modes: AIO-SX,
AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the
complexities of supporting the multi-host reconfiguration,
the MGMT network reconfiguration will focus on support for AIO-SX
only.

The DNSMASQ service must start as soon as possible to translate
the FQDN to an IP address. For this reason dnsmasq will start
as soon as the management-ip is ready.

Test plan ( Debian only )
 - AIO-SX and AIO-DX virtualbox installation IPv4/IPv6
 - Standard virtualbox installation IPv6
 - DC virtualbox installation IPv4 ( AIO-SX/DX subclouds )
 - AIO-SX and AIO-DX installation IPv4/IPv6
 - AIO-DX plus installation IPv6
 - DC IPv6 and subcloud AIO-SX
 - AIO-DX host-swact
 - DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX
 - AIO-SX to AIO-DX migration
 - netstat -tupl ( no services are using the MGMT IP address )
 - Ran sanity/regression tests

Story: 2010722
Task: 48889
Depends-On: https://review.opendev.org/c/starlingx/config/+/886208

Change-Id: If118132410a5a3db4c3a9d0ba029f4d45521574d
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2023-10-26 12:01:08 -03:00
Kyale, Eliud efe4a7a370 IF_STATE_MASK fix for SM_FAILOVER_HEARTBEAT_ALIVE
The SM_FAILOVER_IF_STATE_MASK change from 0xF to 0x3F

mask was clearing the HEARTBEAT ALIVE flag.
SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16
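
The bit arithmetic behind the two mask values can be checked directly. This sketch uses only the numbers quoted above. The `state` word and the variable names are hypothetical. Whether covering bit 4 preserves or clears the flag depends on how the code applies the mask (`state & mask` versus `state & ~mask`), which the commit message does not show.

```python
SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4   # 16, as quoted above

for mask in (0xF, 0x3F):
    covers = bool(mask & SM_FAILOVER_HEARTBEAT_ALIVE)
    print(f"mask {mask:#04x} covers HEARTBEAT_ALIVE: {covers}")

# Hypothetical state word: heartbeat alive plus one low interface bit.
state = SM_FAILOVER_HEARTBEAT_ALIVE | 0x1

kept_by_0x0f = state & 0xF        # keeps only bits 0-3, so bit 4 is dropped
kept_by_0x3f = state & 0x3F       # keeps bits 0-5, including bit 4
cleared_by_0x3f = state & ~0x3F   # clears bits 0-5, including bit 4
```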

This change restores previous system behavior. Tester performs a
cable pull on the oam ports. The expected behavior is an alarm
being raised. Instead the standby controller ended up getting rebooted.

oam interface testing was simulated by bringing the ip link down for 1
second.

For example:

sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up

-----------------
Before change
-----------------
- Heartbeat loss on oam interface resulted in standby controller reboot

-----------------
After change:
-----------------

- Heartbeat loss on oam interface resulted in alarm raised
- Logs indicate the health score of controller-1 drops by 1 point

Test plan:

PASS - AIO-SX: iso install

PASS - AIO-DX: iso install
               drop oam interface on standby
               verify standby controller-1 is not rebooted
               by active controller-0
               restore oam interface

PASS - AIO-DX: system host-swact . swact back and forth

Closes-Bug: 2037579

Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2023-10-02 07:45:36 -04:00
Matheus Guilhermino fd75850f12 Deployment Optimizations: SM throttling disable
The SM throttling is a feature that limits the number of parallel
service-enabling processes at a time. The SM throttling mechanism is
always ON, during and after startup. Whenever SM-controlled services
transition to enabled status, it takes part in the process.

By default, the SM throttling allows a maximum of 2 parallel service
enabling processes at a time, but the throttling size is configurable
by means of the field ENABLING_THROTTLE from the CONFIGURATION table in
the SM-DB database. Hence, it is possible to disable the SM throttling
by increasing the throttling size to a reasonably large value, so as
to allow full parallel service enabling.
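
A rough way to see why the throttle size matters is to count how many sequential batches of enables it forces. The model below is a deliberate simplification (equal-length enables, no dependencies between services), and the function name is hypothetical.

```python
import math


def enabling_waves(num_services, throttle_size):
    """Number of sequential batches needed when at most `throttle_size`
    services may be enabling at once (simplified model)."""
    if throttle_size < 1:
        raise ValueError("throttle_size must be at least 1")
    return math.ceil(num_services / throttle_size)
```

For a hypothetical 40 services, a throttle size of 2 gives 20 batches, while a throttle size of 1000 gives a single batch.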

This commit improves system performance by disabling the SM throttling
for AIO-SX systems, while still keeping the SM throttling mechanism,
should it ever be needed as a fallback, for robustness reasons.

In order to evaluate the SM throttling feature, and its costs in terms
of performance, the throttling size was modified in such a way as to
disable the feature, from value 2 to 1000, and a series of tests were
conducted to evaluate stability and performance benefits.

Test Plan:
  - Fresh Install and bootstrap (PASS)
  - Lock/Unlock (PASS)
  - Restart SM service (PASS)

Story: 2010802
Task: 48312

Change-Id: Ie96115293049e9939bc43feb2ad11432dd318323
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
2023-09-14 14:31:20 -03:00
Luis Eduardo Bonatti e35510e1cc Fix /var are not updated according to the patch changes
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable. This commit changes the sm-patch.sql deploy path
to a place that ostree handles, /usr/share/sm/patches in this
case, and symlinks it to /var/lib/sm/patches/sm-patch.sql.

Test Plan:
PASS: ISO install symlink created
PASS: sm-patch.sql installed to /usr/share/sm/patches
PASS: PATCH apply and changes applied to
      /var/lib/sm/patches/sm-patch.sql on stx8

Closes-Bug: 2030890

Change-Id: I07047e5383e8ae9e57687cd1e852c2efc0eb755f
Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>
2023-08-10 14:52:02 +00:00
Steven Webster 4a96509146 Disable admin network failover behaviour
A requirement for a subcloud's admin network is that its
subnet information be able to be updated without host
lock / unlock.

Accordingly, the service domain interface and admin-ip
service in SM must be provisioned / deprovisioned at
runtime.

In an AIO-DX system this can cause issues in certain
circumstances as the disablement / enablement must be
done via puppet and can be affected by the order in which
a user performs each action as well as the timing of the
currently running manifests on each host.

This commit disables the failover behaviour for the admin
network, as link flapping and heartbeat losses are expected
as the service domain interface is provisioned/deprovisioned.

Also in this commit is the disablement of heartbeat messages
on service domain interface de-provision to prevent log
spamming, as well as a couple other minor issues that were
found while testing.

Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/889872

Test plan:

- No uncontrolled swacts while re-configuring admin subnets
  or reverting to the management subnet (deleting the admin
  address pool) dozens of times.

- Alarms still generated on interface down / heartbeat loss

- Switching back and forth between admin network / mgmt
  network via dcmanager.

Story: 2010319
Task: 47707

Change-Id: I761b5b20b6de198ef763b2d3480e6f7cd380f952
Signed-off-by: Steven Webster <steven.webster@windriver.com>
2023-08-01 12:08:56 -04:00
Lucas Borges 42fb99a393 Removing sysinv-conductor dependency from rabbitmq
The Sysinv components utilize ZeroMQ for communication
among each other. The service dependency between
sysinv-conductor and rabbit is removed in Service Manager,
improving the time it takes to swact by 20%.

Test Plan:

PASS: Bootstrap AIO-SX, AIO-DX and Standard.
PASS: Lock/unlock/swact in all environments
PASS: Bootstrap DC and perform lock/unlock/swact
PASS: Add subcloud in DC environment
PASS: Test DC orchestration (dcmanager dcorch)
PASS: Restart sysinv only
PASS: Restart sysinv, then dcorch and dcmanager
PASS: After restart sysinv, dcorch and dcmanager run manage/unmanage
PASS: After restart sysinv, run subcloud backup
PASS: After all restart of services verify the whole system is working

Refer to review
https://review.opendev.org/c/starlingx/config/+/859571
for sysinv/ZeroMQ change details

Closes-bug: 2022083
Signed-off-by: Lucas Borges <lucas.borges@windriver.com>
Change-Id: I4014c05c914fc946946b14519f28a85067b06b34
2023-06-02 16:37:36 -03:00
Zuul fc1936a70f Merge "Shorten rabbit failure recovery delay" 2023-05-09 18:59:27 +00:00
Bin Qian a85ffc695e Shorten rabbit failure recovery delay
In rare cases, when the system is running slowly with significant
scheduling delay, the rabbit disable action times out continually. As
a final resort, sm reboots the impacted controller for recovery after
the failure count reaches MAX_TRANSITION_FAILURES. As the rabbit
service disable timeout is set to 60 seconds, this results in a
significant delay before the reboot for recovery.

This change updates MAX_TRANSITION_FAILURES of rabbit service from
16 to 5 to reduce the delay of recovery of rabbit failure.
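
As a rough worked figure, the worst-case wait before the reboot scales with the failure limit times the disable timeout. The sketch assumes every attempt runs to the full 60 second timeout and that nothing else adds delay. The function and constant names are illustrative.

```python
RABBIT_DISABLE_TIMEOUT_S = 60   # from the commit message


def worst_case_delay_s(max_transition_failures,
                       timeout_s=RABBIT_DISABLE_TIMEOUT_S):
    # Assumes each failed attempt consumes the whole timeout.
    return max_transition_failures * timeout_s


before = worst_case_delay_s(16)  # 960 seconds, i.e. 16 minutes
after = worst_case_delay_s(5)    # 300 seconds, i.e. 5 minutes
```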

TCs passed:
    Install a DX system
    Observed service group recovery escalated to reboot after 5 forced
    rabbit disable failures.

Closes-bug: 2016168
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
2023-05-09 03:48:48 +00:00
Kyale, Eliud 9e2ff82411 Add failover state of peer to heartbeat msg
- add failover state to heartbeat message ( 4 bits )
- add logic to survived_state to use peer's
  failover state to determine whether to exit survived state
  and enter normal state
- throttle peer is normal events with a threshold of 10
  used to ensure the peer is normal and stable
- change fsm->send_event() log to debug from info log level
- a few logging improvements; debug send_event logs
- update copyright year 2023
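
One way to read the "threshold of 10" item is as a counter of consecutive peer-is-normal reports. The sketch below is an interpretation, not SM's code: the class name is hypothetical, and resetting the count on a non-normal report is an assumption.

```python
PEER_NORMAL_THRESHOLD = 10  # threshold quoted in the commit message


class PeerNormalCounter:
    """Illustrative only: require several consecutive 'peer is normal'
    heartbeats before treating the peer as stable."""

    def __init__(self, threshold=PEER_NORMAL_THRESHOLD):
        self.threshold = threshold
        self.count = 0

    def on_heartbeat(self, peer_is_normal):
        # Assumption: any non-normal report resets the count.
        self.count = self.count + 1 if peer_is_normal else 0
        return self.count >= self.threshold
```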

Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: crash the sm as indicated in bug
               and observe swact to standby
PASS - AIO-DX: manual swact
PASS - AIO-DX: power off active controller
PASS - AIO-SX: install and basic sanity check
PASS - AIO-SX: upgrade test to verify sm heartbeat
               messages changes still function when
               controllers are running different loads

Closes-Bug: 2012519

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
2023-04-14 08:14:07 -04:00
Zuul 002f5a79db Merge "Update rule of disable & standby dependency" 2023-04-04 15:43:16 +00:00
Bin Qian c81032a572 Update rule of disable & standby dependency
This change is to update the service disabling and going standby
dependency check.
The 2 specific rules are:
1. When "service a" has a disable action dependency on "service b"
   with a targeted "service b" state of disabled, the disable action
   of "service a" is considered "dependency met" only when "service b"
   is in the disabled state or the enabled-standby state.
2. When "service a" has a go-standby action dependency on "service b"
   with a targeted "service b" state of disabled, the go-standby action
   of "service a" is considered "dependency met" only when "service b"
   is in the disabled state or the enabled-standby state.
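
Both rules reduce to the same predicate on the state of "service b". A minimal sketch, with state strings mirroring the wording above: the `enabled-active` name and the fallback branch for other targets are assumptions, not taken from this commit.

```python
# State names mirror the wording of the commit message.
DISABLED = "disabled"
ENABLED_STANDBY = "enabled-standby"
ENABLED_ACTIVE = "enabled-active"   # assumed name for the active state


def dependency_met(target_state, dependent_state):
    """Check a disable or go-standby dependency of "service a" on
    "service b", given the targeted and current states of "service b"."""
    if target_state == DISABLED:
        # Both rules: met only once "service b" is disabled or
        # enabled-standby.
        return dependent_state in (DISABLED, ENABLED_STANDBY)
    # Other targets are outside the two rules; compare directly as a
    # placeholder.
    return dependent_state == target_state
```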

TCs:
   passed: Perform host-swact operations repeatedly, after adding a
           long delay to the disable action in the xxx-fs ocf-script;
           observed that all xxx-fs services are disabled before
           drbd-xxx services start disabling.

Closes-Bug: 2012570

Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: Ie9717d3b2b73dc7d623e1b980b3387c6c4e6d991
2023-03-27 19:39:09 +00:00
Fabiano Mercer 3d1d82b0a2 Keep platform-nfs-ip for upgrade process
The platform-nfs-ip service is not necessary for fresh installs
because it is just an alias for the controller IP.
But for old releases like StarlingX rel. 6 or 7 the
platform-nfs-ip uses a specific IP. If for some reason an error
occurs during the upgrade process, the upgrade will be aborted
and the nodes will downgrade to the old release again.
At that point the nodes will try to communicate with the
previous platform-nfs-ip configured in /etc/hosts.
But if the active controller is using the new Release
this IP doesn't exist anymore and the downgrade will fail.
For this reason the platform-nfs-ip service will be available
just for upgrade operations and will be deprovisioned for fresh
installs or at the end of the upgrade process
( upgrade-activate phase ).

Test plan
PASS Fresh install on AIO-SX
     Fresh install on AIO-DX
PASS Upgrade AIO-DX system from CENTOS Rel 7 to DEBIAN Rel 8
PASS Reboot controller-0 during upgrade of AIO-DX
     controller-1 was the active one with the new release ( Rel 8 )
     controller-0 using old release.
     reboot controller-0 and check if it could connect to
     controller-1 using old platform-nfs-ip.
PASS Upgrade-abort during AIO-DX upgrade
     controller-1 was the active controller and already upgraded
     controller-0 was upgraded but locked.
     Abort the upgrade and downgrade to old release ( Rel 7 )

Partial-Bug: #2012387

Signed-off-by: Fabiano Mercer <fabiano.correamercer@windriver.com>
Change-Id: I704e15fffc6e7efa7b1fea56164a21af02222dd6
2023-03-22 14:53:01 -03:00
Kyale, Eliud b65eb7b2f6 Add PTHREAD_PRIO_PROTECT to sm mutexes
- rename mutexes from generic name '_mutex'
- create common util functions for initializing and destroying mutexes
- add mutex initialize/finalize functions
  and run them in sm_main_process_initialize
- update copyright info to 2023

Test plan:

PASS - AIO-DX: iso install
PASS - AIO-DX: verify ha swact
PASS - AIO-DX: failover swact test

PASS - AIO-DX: run pi_stress (rt-tests)
               to confirm priority inheritance POSIX attribute is working
-------------------------------------------------------------------
sysadmin@localhost:~$ sudo pi_stress --uniprocessor --duration=10s
Starting PI Stress Test
Number of thread groups: 3
Duration of test run: 10 seconds
Number of inversions per group: unlimited
     Admin thread SCHED_FIFO priority 4
3 groups of 3 threads will be created
      High thread SCHED_FIFO priority 3
       Med thread SCHED_FIFO priority 2
       Low thread SCHED_FIFO priority 1
Current Inversions: 992034
Stopping test
Total inversion performed: 992038
Test Duration: 0 days, 0 hours, 0 minutes, 11 seconds
-------------------------------------------------------------------

PASS - AIO-DX: valgrind helgrind
                * test for inconsistent lock ordering
                * test race conditions
                  [ detected outside scope of jira ]
                * test for deadlocks

Task: 47503
Story: 2010609

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: Ic77f08ca7c3a687b1cc219ac9cba5711979206e8
2023-03-02 20:49:17 +00:00
Zuul 48ec1759ed Merge "Add admin network support to SM" 2023-02-16 21:02:00 +00:00
Steven Webster db1eea124d Add admin network support to SM
Add SM support for the DC admin network

This commit adds SM support for the DC admin network.

The admin network is intended to be used between a subcloud
and system controller. Because the (existing) management network
is so embedded in other parts of the StarlingX system, it makes
it prohibitively hard to re-configure this network after initial
installation.  The admin network is intended to be isolated from
the management network, allowing re-configuration of the network
parameters in the case that the physical network between subcloud
and system controller has been changed.

In the case of admin network usage, the management network still
exists but is a private network in the context of a subcloud.

This specific commit provides for admin-ip and admin-interface
services to be added to the SM database and be recognized in
processing similar to the management, cluster-host, oam, etc
networks.

Since there is a requirement for the admin IP subnet information
to be allowed to change at runtime, in-service updating of SM
information relating to the admin-ip service (floating IP), as
well as unicast heartbeating between peers is also added in this
commit.

Testing:

AIO-SX:
    - admin-ip service is enabled when the admin network is
      created.
    - admin-ip service is not enabled when the admin network is
      not created.
    - floating-ip is updated on the admin interface when admin
      addr-pool information is changed.
AIO-DX:
    - admin-ip service is enabled when the admin network is
      created.
    - admin-ip service is not enabled when the admin network
      is not created.
    - floating-ip is updated on the active-controller when the
      admin addr-pool information is changed.
    - When a peer admin interface is down, an alarm is raised.
    - When a peer admin IP is not correct (changed), an alarm
      is raised.
    - Swact between controllers.
    - Inactive controller admin interface goes down
        Result: A 400.005 major communication loss fault is generated
                for the inactive controller entity
    - Inactive controller admin interface comes back up
        Result: The fault is cleared
    - Inactive controller admin IP address is removed/changed
        Result: Two 400.005 major communication loss faults are
                generated for both controller entities
    - Inactive controller admin node IP address is re-applied
        Result: The faults are cleared
    - Active admin interface goes down
        Result: A 400.005 major communication loss fault is generated
                for the inactive controller entity.  A swact is not
                issued.
    - Active admin interface comes back up
        Result: The fault is cleared
    - Active admin node IP address is removed/changed
        Result: Two 400.005 major communication loss faults are
                generated for both controller entities.
                A swact is not issued.
    - Active admin floating IP address is removed/changed
        Result: A 400.001 critical admin-services / admin-ip alarm
                is raised.
                A swact occurs.
                The floating admin IP is applied to the newly active
                controller. Alarms are cleared.
    - After the above test, the newly active controller swacts back
      to the previously active controller.
        Result: No alarms.
                The floating IP is applied to the newly active
                controller.
    - The cable for the management interface on the active controller
      is pulled
        Result: A swact occurs
    - The cable for the OAM interface on the active controller
      is pulled
        Result: A swact occurs
    - The cable for the Admin interface on the active controller
      is pulled
        Result: A swact occurs. 400.005 alarms are raised.
    - The mgmt, cluster-host, oam interfaces are all brought down/up at
      the same time.  The admin interface is also brought down,
      but not brought back up.
        Result: A swact occurs, with multiple controller-services
                related to the mgmt interface being in degraded state.

Story: 2010319
Task: 47278

Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I65df52600f4d5c499dceed32739cab414d36847a
2023-02-14 15:14:28 +00:00
Luis Sampaio 5d4ba83910 Update debian package versions to use git commits
The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
PASS: build-pkgs -p sm-common
PASS: build-pkgs -p sm-db
PASS: build-pkgs -p sm
PASS: build-pkgs -p sm-api
PASS: build-pkgs -p sm-client
PASS: build-pkgs -p sm-tools
PASS: build-pkgs -p stx-ocf-scripts

Story: 2010550
Task: 47341
Signed-off-by: Luis Sampaio <luis.sampaio@windriver.com>
Change-Id: I54cde0fe252c3bcef669969a1b0675a2df8b3d69
2023-02-10 10:14:48 -08:00
Alyson Deives Pereira da832e0ad6 Remove disable dependence between ceph-manager and sysinv-conductor
When the system is unstable, using a lot of CPU, it takes
more time for the communication between the components to happen,
such as the communication between mgr-restful-plugin and ceph-mgr.
This communication failure may result in a failed audit by SM,
which then restarts the mgr-restful-plugin.

Change [1] resolved this issue by increasing the timeout and retries
of mgr-restful-plugin in SM database.

However, there is a disable dependence chain between mgr-restful,
ceph-manager, and sysinv-conductor which results in sysinv-conductor
being restarted if mgr-restful-plugin or ceph-manager is also disabled
by SM. This can impact platform-integ-apps apply or any other action
being executed by sysinv-conductor.

The ceph manager -> sysinv-conductor dependence is not necessary
anymore after the changes [2] and [3] were merged, thus this change
removes this dependence.

TEST PLAN:
PASS: AIO-SX: bootstrap, unlock and apply platform-integ-apps
PASS: Force ceph-manager to be restarted by SM, and verify that
      sysinv-conductor keeps running

Related-Bug: 2000080

[1] https://review.opendev.org/c/starlingx/ha/+/868118
[2] https://review.opendev.org/c/starlingx/utilities/+/856320
[3] https://review.opendev.org/c/starlingx/utilities/+/860570

Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: I949ccebd509b8099870b3dfda252a60b6b423715
2023-01-30 19:22:18 +00:00
Zuul 9ccc492135 Merge "Increase retries and timeouts on "audit-enabled" of mgr-restful-plugin in SM database" 2023-01-25 15:49:33 +00:00
Rafael Falcao d312729809 Remove guest-agent related queries from sm database
The guest-agent service is currently being activated
in setups where stx-openstack is applied, but it has not
been used since we went to containerized openstack.
Since this service is no longer being used, this change
removes the service and all related queries from
the create_sm_db file.

Test Plan:
PASS: Perform a fresh install on a duplex environment and
check that no error log related to guest-agent appears
in the sm log file and that no 400.02 alarm is raised in
fm alarm-list.

Closes-Bug: 2003117

Signed-off-by: Rafael Falcao <rafael.vieirafalcao@windriver.com>
Change-Id: I145bd8a45c12319facc4d1eff90b785a33a1d2c0
2023-01-17 19:46:49 -03:00
Erickson Silva de Oliveira b4fb57c610 Increase retries and timeouts on "audit-enabled" of
mgr-restful-plugin in SM database

When the system is unstable and under heavy CPU load, communication
between the components takes longer.

It is therefore necessary to increase the maximum retries and timeouts
of the "audit-enabled" action of mgr-restful-plugin to prevent
spurious audit failures.
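
As a rough illustration of why both knobs matter, here is a minimal
Python sketch of an audit that is retried with a per-attempt timeout.
It is not SM's implementation; audit_with_retries and its parameters
are hypothetical names, and the audit is modelled as an arbitrary
command whose exit status is 0 on success.

```python
import subprocess


def audit_with_retries(command, max_attempts, timeout_seconds):
    """Run `command` up to `max_attempts` times; True on the first success.

    An attempt fails if the command exits non-zero or exceeds the timeout.
    """
    for _ in range(max_attempts):
        try:
            result = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # Slow response under load: count it as a failed attempt.
            continue
        if result.returncode == 0:
            return True
    return False
```

On a loaded system a healthy service may simply answer late, so a
longer timeout avoids counting slow answers as failures, and more
attempts tolerate a transient failure before the audit is declared
failed.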

Test Plan:
PASS: mgr-restful-plugin restarted by SM (AIO-SX)

Closes-bug: 2000080

Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
Change-Id: I0f8462fef20196a3bb913fa7d374a86a2c6565f1
2023-01-05 18:17:54 +00:00
Alyson Deives Pereira 25ddaa31f9 Set max open file limit for sm service
The sm process starts with the default limit
on the maximum number of open files. If this default is
set to a high value, performance can degrade. For instance,
the sm_service_action_run function in sm_service_action.c contains a
for loop that closes file descriptors up to this limit.

To avoid performance degradation, this change sets the limit for
the maximum open files to 1024 using the systemd property LimitNOFILE.

TEST PLAN:
PASS: Confirm that 1024 is the max open file limit in
  /proc/<sm_pid>/limits.
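
The check in the test plan can be scripted. The following is a
minimal Python sketch, not part of this change; max_open_files is a
hypothetical helper, and the sample text only mimics the column
layout of /proc/<pid>/limits on Linux.

```python
def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Layout: "Max open files <soft> <hard> files"
            return int(fields[3]), int(fields[4])
    raise ValueError("no 'Max open files' line found")


sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max open files            1024                 1024                 files\n"
)
print(max_open_files(sample))  # (1024, 1024)

# Against a live process (Linux only), read the real file instead:
#   with open(f"/proc/{pid}/limits") as f:
#       print(max_open_files(f.read()))
```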

Story: 2010087
Task: 47102

Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: Iff848da7461a8b644057e9f58c69fa2d78499226
2023-01-03 17:51:01 -03:00
Zuul cf4015918c Merge "Update SM lsb script for quick start" 2022-12-05 20:00:06 +00:00
Bin Qian 88aeba251b Update SM lsb script for quick start
The pidof command returns a subprocess ID when the SM main process
terminates. This results in a false positive that SM is already
running, so the start action is skipped.

Update the SM lsb script to distinguish whether a subprocess ID
is returned, and attempt to kill it to speed up recovery of SM.
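
The idea can be sketched in Python as follows. This is only an
illustration: the real change is in the lsb shell script, and the
test it uses is not described in this message. The sketch assumes
the main PID is known from the PID file, and classify_sm_pids is a
hypothetical name.

```python
def classify_sm_pids(pidof_output, main_pid):
    """Decide whether pidof found the main process or only other PIDs.

    Returns (main_running, others): `main_running` is True when
    `main_pid` is among the PIDs; `others` lists every other PID,
    in pidof order.
    """
    pids = [int(token) for token in pidof_output.split()]
    main_running = main_pid in pids
    others = [pid for pid in pids if pid != main_pid]
    return main_running, others


# Main process gone, one leftover subprocess still matches the name.
print(classify_sm_pids("4321", 1200))       # (False, [4321])
# Main process alive with a child: nothing should be killed.
print(classify_sm_pids("1200 4321", 1200))  # (True, [4321])
```

Only when main_running is False would the remaining PIDs be treated
as leftovers to kill before starting SM again.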

Revert the change that extended startuptime to 15 seconds, restoring it to 5.

Test Cases:
    Kill the SM process and observe that it starts immediately after
    the subprocess is killed. SM recovers within 2 seconds
    (calculated from the last and first SM log entries).

Change-Id: Ida834e7dd31a493ee6193b4d8ee73ebd97513de2
Closes-Bug: 1998349
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2022-12-05 16:10:55 +00:00
Zuul 8cd67d8663 Merge "Increase SM rabbit enable timeout" 2022-12-01 22:30:08 +00:00
Victor Romano b3c3c5a71e Increase SM rabbit enable timeout
On DC systems with a standard central cloud, RabbitMQ sometimes
fails to start properly and is repeatedly restarted
by sm. Part of the solution is to increase the startup time from
60 to 90, so that sm does not restart the service before
it has finished initializing. This, together with [1],
fixes the problem.

Test Plan:
PASS - Run sm-restart service rabbit successfully and check that rabbit
  was running as expected. This test was used to recreate the bug.
PASS - Reboot the host successfully and check rabbit was running as
  expected.
PASS - Lock/unlock and check if rabbit was running as expected.

Regression test:
PASS - Install and bootstrap DC
PASS - Install and bootstrap DX

[1]
https://review.opendev.org/c/starlingx/config-files/+/865693

Closes-bug: 1997966

Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
Change-Id: I1425324c72ae41c66558709dead6f1ff92a3230a
2022-12-01 13:45:46 +00:00
Bin Qian 8b5ee400b5 Update pmon SM wait time to 15 seconds
SM's pmon.conf file sets startuptime = 5 seconds, but on Debian
the SM process frequently takes longer than that to
start and produce its PID file.
To make the SM recovery process smooth, increase this timeout
to 15 seconds.
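
For illustration only, a startup window of this kind can be sketched
in Python as a bounded wait for the PID file. This is not pmon code;
wait_for_pid_file and its parameters are hypothetical names.

```python
import os
import time


def wait_for_pid_file(path, startup_time, poll_interval=0.1):
    """Return True if `path` exists and is non-empty within `startup_time` seconds."""
    deadline = time.monotonic() + startup_time
    while True:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a window that is too short, a process that is merely slow to
write its PID file is indistinguishable from one that failed to
start.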

Closes-bug: 1998349

Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I64476e394e346c9b8cf5b5aca2ad04ba463b9728
2022-11-30 15:39:20 +00:00
Heitor Matsui 8fa9032706 Increase sysinv-inv service timeout
During the upgrade, the swact from controller-0 to controller-1
failed multiple times with sysinv-inv
service timeout messages, blocking the upgrade
from continuing.

This commit increases sysinv-inv service timeout.

Test Plan
PASS: run AIO-DX upgrade successfully
PASS: run AIO-SX upgrade successfully
PASS: install/bootstrap/unlock

Closes-bug: 1995713

Change-Id: Id612f27f20060715b6e771f9bb73a41472967d05
Signed-off-by: Heitor Matsui <HeitorVieira.Matsui@windriver.com>
2022-11-07 10:26:10 -03:00
Fabiano Mercer f411a3c27f Removing unnecessary IP for platform-nfs.
It is not necessary to allocate a specific IP address
for NFS.
NFS will use the floating management IP address, so there is
no reason to keep the platform-nfs-ip service.
With this change, one more IP address is available to
configure an additional worker node
(e.g. on a /29 subnet).
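
To see why a single address matters on such a small subnet, the
arithmetic can be checked with Python's ipaddress module. The
network 192.168.204.0/29 and the helper usable_hosts are only
illustrative assumptions; this message does not list which
addresses the platform reserves.

```python
import ipaddress


def usable_hosts(cidr):
    """Number of assignable host addresses in an IPv4 subnet."""
    return len(list(ipaddress.ip_network(cidr).hosts()))


net = ipaddress.ip_network("192.168.204.0/29")  # hypothetical subnet
print(net.num_addresses)                  # 8
print(usable_hosts("192.168.204.0/29"))   # 6
```

With only 6 assignable addresses, every address the platform no
longer reserves is one more that can go to a host.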

Story: 2010351
Task: 46502

Test plan (Debian only)
PASS Installed 2+2 system with subnet configuration (/29)
     (2 controllers, 2 workers)
PASS system host-lock/unlock/swact between controllers
PASS Installed 2+2 system using IPv6 subnet


Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/859903
Change-Id: I1e8c85341b8c8e337794d344585deff2904daeb5
2022-10-25 10:07:00 -03:00
Zuul 48d17d8059 Merge "Debian: Remove conf files from etc-pmon.d" 2022-09-30 19:10:32 +00:00
Zuul 31cfcabeaf Merge "debian: Remove package preset install for ha" 2022-09-29 16:34:51 +00:00
Charles Short e21e923161 debian: Remove package preset install for ha
Remove the installation of per-package preset files,
since they are now centrally managed by the ISO install,
for the following packages:

- sm-api
- sm-common
- sm-eru
- sm

Story: 2009968
Task: 46406

Test Plan

PASS Build package
PASS Build ISO
PASS Check for nonexistent preset file in /etc/systemd/system-preset

Depends-On: https://review.opendev.org/c/starlingx/integ/+/853653

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I537a013c77c7603e48cbf92837a4174282d70032
2022-09-27 08:16:38 +00:00
Leonardo Fagundes Luz Serrano f58f11ccba Debian: Remove conf files from etc-pmon.d
Removed conf files from /etc/pmon.d/
as they are being moved to another location.

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

The change is Debian-only, since CentOS support
will be dropped soon.
CentOS pmon conf files remain in /etc/pmon.d/

Test Plan:
PASS - deb doesn't install anything to /etc/pmon.d/
PASS - rpm files unchanged
PASS - AIOSX unlocked-enabled-available
PASS - Standard 2+2 unlocked-enabled-available

Story: 2010211
Task: 46304

Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ie2f8e73f8664746d0213e98bba17e56d98d93b4f
2022-09-26 13:40:41 +00:00
Davi Frossard 605555272b Fix: add /var/lib/sm directory into sm-common package
The sm-eru process fails on all non-controller nodes after
a fresh install, raising alarm 200.006.

Test-plan (centos):
[PASS] build, install and verify alarms

Story: 2010087
Task: 46409

Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I2d48679488fe0b72bdb245044771c1e9059f66ec
2022-09-26 13:31:38 +00:00
Leonardo Fagundes Luz Serrano 199c1a8ba8 Duplicate pmon.d conf files to another location
Created a duplicate install of /etc/pmon.d/*.conf files
to /usr/share/starlingx/pmon.d/

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

Test Plan:
PASS: duplicate conf on deb

Story: 2010211
Task: 46114

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ia3774ac59b1df40aa1726bb182390b4b58812141
2022-08-30 16:26:27 -03:00
Davi Frossard bd9e560d4b Remove sm-watchdog service since NFS is now stable
sm-watchdog was introduced as a workaround for NFS hangs. A cleaner
fix has since been provided, but sm-watchdog was never removed.

Test plan:
[centos] build, install and unlock.
[debian] build, install and unlock.

Story: 2010087
Task: 46007

Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I29fffff4e8982dc504f104f49c6586f7c74527fb
2022-08-19 19:57:43 +00:00
Yue Tao ff17c2554d sm-common: a temporary fix to build mtce prior to sm-common
sm-db build-depends on sm-common-libs, which run-depends on mtce. If
sm-db is built before mtce, the build fails with:

"sm-common-libs : Depends: mtce-pmon but it is not installable"

This temporary fix forces the build order mtce, sm-common,
sm-db to work around the failure.
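
The required order follows directly from the dependencies. As an
illustration (not the build system's logic), a topological sort in
Python reproduces it. The deps mapping is a simplification at the
package level used in this message, build_order is a hypothetical
helper, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def build_order(depends_on):
    """Return packages ordered so each one follows everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Each key depends on the packages in its set (simplified assumption).
deps = {
    "sm-db": {"sm-common"},
    "sm-common": {"mtce"},
}
print(build_order(deps))  # ['mtce', 'sm-common', 'sm-db']
```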

Once the issue has been resolved in the build system, this fix can be reverted.

Story: 2009101
Task: 44976

Test Plan:
Pass: build-pkgs -a without failure of sm-db.

Signed-off-by: Yue Tao <yue.tao@windriver.com>
Change-Id: I76c16dc87143db9b8a0484d74071afb6e8740745
2022-04-07 09:17:12 +08:00