During the first unlock after a fresh install, sysinv-inv starts
the WSGIService, which uses the FQDN "controller.internal".
For this reason, DNSMasq must be ready to resolve the FQDN to an IP
address.
Test done:
AIO-SX fresh install
AIO-DX fresh install
AIO-DX host-swact
Story: 2010722
Task: 50221
Depends-On: https://review.opendev.org/c/starlingx/config/+/920694
Change-Id: If255441f12da370bd48641d7c521aea5f3012af2
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
This commit updates the name of the services for the new
Rook Ceph.
drbd-rookmon -> drbd-rook
rookmon-fs -> rook-fs
rook-mon-exit -> rook-mon-exit (no change)
Test Plan:
PASS: AIO-DX -> Standby controller locked and ceph-rook as
storage-backend + controller-fs add ceph-float=<size> +
checking if everything is created correctly: lv, drbd and
SM services.
PASS: Perform host-swact after the above test and confirm
primary/secondary DRBD change on 'drbd-ceph'.
PASS: AIO-DX -> Standby controller locked + controllerfs-delete
ceph + checking if everything is deleted correctly: lv, drbd
and SM services
Story: 2011117
Task: 50097
Change-Id: Ib896ae271f4e649853af950aebe33111948e639e
Co-Authored-By: Robert Church <robert.church@windriver.com>
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
This commit sets a dependency between the ipsec-config service and the
management-ipv4/ipv6 services. The disable action may be performed on
the ipsec-config service after the management-ipv4/ipv6 dependent
services are in the disabled state.
This fix is needed because of the verification step in the monitor
function of the ipsec-config service. This function checks whether the
floating IP is present in the system ip tables and swanctl configuration
and verifies that the conditions are satisfied for the active and standby
controllers.
It is expected that the floating IP is present on the active controller,
where ipsec-config is in the enabled-active state, and not present
on the standby controller, where ipsec-config is in the disabled state.
The floating IP is added and removed by the management-ipv4/ipv6 services
in their start and stop actions. With the previous service dependency
configuration, this caused an error during the ipsec-config
audit-disabled action, and a 400.001 alarm was raised on the system.
Therefore, this commit fixes this service dependency relation between
ipsec-config and management-ip services.
Test Plan:
PASS: Full build, system install, bootstrap and unlock of a DX system
with unlocked enabled available state. No 400.001 alarms present
on system.
PASS: On a DX system with unlocked enabled available state, perform a
host-swact on controller-0. Observe that ipsec-config service
changes its state on controllers, from disabled to enabled-active
on active controller and from enabled-active to disabled on
standby controller. No errors are reported on daemon-ocf.log
related to ipsec-config or management-ip services. No 400.001
alarms present on system.
PASS: On a DX system with unlocked enabled available state, perform a
host-lock and host-unlock on controller-1. Observe that system
boots with ipsec-config service on disabled state. No errors are
reported on daemon-ocf.log related to ipsec-config or
management-ip services. No 400.001 alarms present on system.
Story: 2010940
Task: 50196
Change-Id: Idd2f487b4589e1f66d79ea5c4f13c36e67c302be
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
This change increases the action timeout for some SM services.
The reason for the increase is that, when IPsec for the mgmt network is
enabled, CPU usage goes up 3-4 times during system installation, causing
actions for some SM services to time out or fail, which in turn triggers
uncontrolled swacts.
New timeouts are set for the following services. They are set to 4
times the original values, based on performance measurements indicating
that CPU usage could go up 3-4 times during system installation. With
these new values, multiple installation tests were successful with no
uncontrolled swact seen. Follow-up tuning may decrease these numbers.
ceph-mon
sysinv-inv
horizon
rabbit
Test Plan:
PASS: DX deployment, verify deployment is successful, both controllers
are in unlocked | enabled | available states after deployment
completes.
PASS: Multi-node system deployment, verify deployment is successful,
all hosts are in unlocked | enabled | available states after
deployment completes.
PASS: Swact controllers multiple times, verify swact is successful and
system is stable.
PASS: Lock/unlock hosts multiple times with either controller-0 or
controller-1 as active controller, verify the host locked and
unlocked comes back normally, and system is stable.
Story: 2010940
Task: 50187
Change-Id: I0ecd9cc82415b5a232040b6707c1f945c4f16d08
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Add dcorch-engine-worker service into SM database for its management
in HA.
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/917792
Story: 2011106
Task: 50016
Change-Id: I6cdbf6a754af9339fc7db8aa453a4c49e8277613
Signed-off-by: lzhu1 <li.zhu@windriver.com>
This commit adds the ipsec-config service to sm-db. This service is
responsible for managing the swanctl configuration by creating symbolic
links between swanctl.conf and different conf files.
Test Plan:
PASS: Build a new debian iso containing the changes.
PASS: Bootstrap, install and unlock a DX system with unlocked enabled
available status and IPsec enabled. Observe that ipsec-config
service data is present on sm-db tables.
Story: 2010940
Task: 49998
Depends-On: https://review.opendev.org/c/starlingx/config/+/916841
Change-Id: Ia1544134b7d4d49897153c064b996a1f67b7599b
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
StarlingX stopped supporting CentOS builds after release 7.0.
This update strips CentOS from our code base. It also removes
references to the failed OpenSUSE feature.
Story: 2011110
Task: 49952
Change-Id: I1bed2fde10326ecb75b45376efea8480e0f23675
Signed-off-by: Scott Little <scott.little@windriver.com>
This change splits the IP service for each platform network into ipv4
and ipv6 to support dual-stack. Single-stack (when there is only ipv4
or ipv6) is still supported.
Test Plan:
[PASS] install, lock, unlock and swact for the following setups:
- AIO-SX (IPv4 and IPv6)
- AIO-DX (IPv4 and IPv6)
- Standard (IPv4 and IPv6)
- DC (SysCtrl=AIO-DX, subcloud=AIO-SX)
[PASS] Add dual-stack configuration and validate services operation
with lock, unlock and swact:
- AIO-SX (IPv4 and IPv6)
- AIO-DX (IPv4 and IPv6)
- Standard (IPv4 and IPv6)
- DC (SysCtrl=AIO-DX, subcloud=AIO-SX), using the admin network
Story: 2011027
Task: 49761
Change-Id: Ic6451cae04769409babd2d6507c3677d1cce5617
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
This commit changes the swact precheck logic. The new
logic unblocks the swact precheck of the host sync state
if the USM endpoint is not present. This change is needed for
releases that do not have the USM endpoint.
Test Plan:
PASS: swact back to controller 0 after legacy upgrade to 24.09
PASS: swact between controllers in 24.09
Task: 49826
Story: 2010676
Change-Id: I6824f00589057f4d48c8df04425431cb22361e8e
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
This fixes the USM endpoint URL not being built properly
with the .join() function.
The way .join() was used only appended the endpoint resource
to an empty string.
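The pitfall can be illustrated with a minimal Python sketch. The base URL and resource path are hypothetical placeholders, not the real USM endpoint values, and the "broken" line is only one way the described misuse could look.

```python
# Hypothetical values, for illustration only.
base_url = "http://controller.example:5000"
resource = "v1/deploy/state"

# Calling .join() on an empty string with only the resource returns the
# resource alone; the base URL never becomes part of the result.
broken = "".join([resource])

# Joining both parts with "/" as the separator yields the full URL.
fixed = "/".join([base_url.rstrip("/"), resource.lstrip("/")])

print(broken)
print(fixed)
```

The separator is the string that .join() is called on, so an empty separator with a single element simply echoes that element back.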
Test Plan:
PASS: Run the host-swact
Depends-on: https://review.opendev.org/c/starlingx/update/+/911003
Task: 49660
Story: 2010676
Change-Id: Icf47494843d7ea6c9fcd73e9256d9be352d8f76f
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
This commit ensures that the deployment state of both
controllers is in sync before a host
swact during a platform upgrade.
If the USM deploy has not started, this host swact
pre-check always passes.
During the pre-swact check, the SM calls
USM REST API endpoint to get the controller
sync status. If the controllers deployment state
is not in sync, the host swact is stopped.
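The decision described above can be sketched as a small Python function. The function and parameter names are hypothetical; the real check lives in SM and obtains the sync status from the USM REST API, which is not modelled here.

```python
def swact_precheck_passes(deploy_started: bool, controllers_in_sync: bool) -> bool:
    """Return True when the host swact may proceed.

    Hypothetical helper: the two flags stand in for the values SM would
    derive from the USM REST API response.
    """
    if not deploy_started:
        # No USM deploy in progress: the pre-check always passes.
        return True
    # Deploy in progress: the swact is allowed only when both
    # controllers report the same deployment state.
    return controllers_in_sync


for started, synced in [(False, False), (True, True), (True, False)]:
    print(started, synced, swact_precheck_passes(started, synced))
```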
Depends-on: https://review.opendev.org/c/starlingx/update/+/906005
Test Plan:
PASS: executed host swact when controllers are in sync
PASS: executed host swact when controllers are not in sync
Task: 49425
Story: 2010676
Change-Id: I8d262a731583f691fd0d85a33ddebcbb12f549e8
Signed-off-by: junfeng-li <junfeng.li@windriver.com>
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler to
block enable if present much like the existing goenabled check
blocks enable if not present in the same function.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
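A minimal Python sketch of the two file-based gates, assuming both are plain existence checks. The function name is hypothetical and the goenabled path is passed in as a parameter because it is not given in this message. The demo uses a temporary directory instead of /etc/mtc/tmp, and the AIO SX exception is not modelled.

```python
import os
import tempfile


def enable_allowed(goenabled_path: str, node_locked_path: str) -> bool:
    """Hypothetical gate: enable needs goenabled present and node_locked absent."""
    return os.path.exists(goenabled_path) and not os.path.exists(node_locked_path)


with tempfile.TemporaryDirectory() as tmp:
    goenabled = os.path.join(tmp, "goenabled")        # placeholder file name
    node_locked = os.path.join(tmp, ".node_locked")   # mirrors the gate file name

    open(goenabled, "w").close()
    unlocked_result = enable_allowed(goenabled, node_locked)

    open(node_locked, "w").close()                    # simulate a locked node
    locked_result = enable_allowed(goenabled, node_locked)

print(unlocked_result, locked_result)
```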
This update also cleans up some extra whitespace in the changed file.
Test Plan:
PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shutdown until the manually created Nv node locked
file is removed. At which point SM proceeds to activate
services on that controller again.
PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging
Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f)
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler to
block enable if present much like the existing goenabled check
blocks enable if not present in the same function.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
This update also cleans up some extra whitespace in the changed file.
Test Plan:
PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shutdown until the manually created Nv node locked
file is removed. At which point SM proceeds to activate
services on that controller again.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
haproxy uses dns resolution
add service dependency to sm database
to ensure that dnsmasq service is started before haproxy
and dnsmasq is disabled after haproxy is disabled
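The resulting ordering can be sketched with Python's standard graphlib module. This only illustrates the ordering constraint; SM keeps its dependencies in the sm database, and treating the disable order as the reverse of the enable order is an assumption made for this example.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services that must be enabled before it.
requires = {"haproxy": {"dnsmasq"}}

# static_order() yields dependencies before the services that need them.
enable_order = list(TopologicalSorter(requires).static_order())

# Assumption for this sketch: disabling happens in the opposite order.
disable_order = list(reversed(enable_order))

print("enable: ", enable_order)
print("disable:", disable_order)
```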
Test plan:
PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test
Closes-Bug: #2043506
Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Add support for aarch64 in sm_trap_thread_log.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
The includes path in Makefile is hardcoded with x86_64,
use dpkg-architecture to check the host arch and replace
the hardcoded name.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
This avoids waiting for an hbs cluster query before sending the SM alive
pulse. When an hbs cluster query or alive pulse is being sent, do not
queue the subsequent alive pulse, as the request currently being sent is
good enough to update the hbs agent.
Also move the call that retrieves the socket address out of the query
sending procedure and into initialization. This avoids getaddrinfo
indirectly calling malloc, which invokes malloc_atfork and can
potentially block.
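The "do not queue while a request is in flight" idea can be sketched as a toy Python model. The class and method names are hypothetical and do not correspond to SM functions; the real code deals with sockets and threads, which are left out here.

```python
class AlivePulseSender:
    """Toy model: at most one request to the hbs agent is in flight."""

    def __init__(self):
        self.in_flight = False
        self.sent = 0
        self.skipped = 0

    def request_pulse(self) -> bool:
        # A request already being sent is good enough to update the
        # agent, so the new pulse is dropped instead of queued.
        if self.in_flight:
            self.skipped += 1
            return False
        self.in_flight = True
        self.sent += 1
        return True

    def on_send_complete(self) -> None:
        self.in_flight = False


sender = AlivePulseSender()
results = [sender.request_pulse(), sender.request_pulse()]
sender.on_send_complete()
results.append(sender.request_pulse())
print(results, sender.sent, sender.skipped)
```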
TCs:
This improves behavior in extreme situations only; regression passed.
Closes-bug: 2025504
Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This commit resolves an issue seen when attempting to upgrade
/ patch from a previous StarlingX release to the current
StarlingX development load in which the distributed cloud
admin network feature is present.
1. Currently, the admin-services SM service group is created
if the system is detected to be a subcloud. This presents
a problem during upgrade, because the <N> system has no
concept of the admin-services group, while the upgraded
<N+1> system does. During upgrade when the system is
swacted to the <N+1> system, an alarm will be raised as
the <N> system has no admin-services group (designed to be
an N+M redundancy model).
2. A potential solution to the above problem is to only provision
the admin-services group if an operator has actually configured
an admin interface / network after the full upgrade has completed.
However, this presents the same sort of problem, as an
interface-network association is done on a host-by-host basis,
requiring a lock of each host to provision a new admin interface.
Since there is no mechanism to provision a new SM service group
at runtime, this leads to the same situation of one host being
aware of the admin-services group, while the other does not.
That is, a user might configure an admin interface-network on
host <X>, but the service group would not be present on host <Y>,
leading to the same alarm.
The solution here is to do away with the new admin-services service
group, and leverage the existing controller-services group for the
admin-ip service.
Test-Plan:
- Ensure no alarms when upgrading / patching a subcloud implementing
the admin network feature.
- Ensure a system utilizing the admin network can become online /
in-sync.
- Swact between controllers utilizing the admin-ip SM service to
ensure the floating IP is correctly assigned to the active
controller (using the controller-services service group).
Regression:
- Install a DC system using the management network. Create the
admin interface, network on the subcloud and ensure a user can
update the subcloud to use the admin network.
Story: 2010319
Task: 47278
Change-Id: Ic36e83622c6ab5d15fd537be69d3314cb675c724
Signed-off-by: Steven Webster <steven.webster@windriver.com>
The management network is used extensively for all internal
communication.
Since the original use of the network was a private network before
it was exposed for external communication in a distributed cloud
configuration, it was never designed to be reconfigured.
To support MGMT network reconfiguration the idea is to configure the
applications to use the hostname/FQDN instead of a static MGMT IP
address.
In this way the MGMT network can be changed and the services and
applications will still work since they are using the hostname/FQDN
and the DNS will be responsible to translate to the current MGMT
IP address.
The use of FQDN will be applied for all installation modes: AIO-SX,
AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the
complexities of supporting the multi-host reconfiguration,
the MGMT network reconfiguration will focus on support for AIO-SX
only.
The DNSMASQ service must start as soon as possible to translate
the FQDN to an IP address. For this reason, dnsmasq will start
as soon as the management-ip is ready.
Test plan ( Debian only )
- AIO-SX and AIO-DX virtualbox installation IPv4/IPv6
- Standard virtualbox installation IPv6
- DC virtualbox installation IPv4 ( AIO-SX/DX subclouds )
- AIO-SX and AIO-DX installation IPv4/IPv6
- AIO-DX plus installation IPv6
- DC IPv6 and subcloud AIO-SX
- AIO-DX host-swact
- DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX
- AIO-SX to AIO-DX migration
- netstat -tupl ( no services are using the MGMT IP address )
- Ran sanity/regression tests
Story: 2010722
Task: 48889
Depends-On: https://review.opendev.org/c/starlingx/config/+/886208
Change-Id: If118132410a5a3db4c3a9d0ba029f4d45521574d
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
The SM_FAILOVER_IF_STATE_MASK change from 0xF to 0x3F
mask was clearing the HEARTBEAT ALIVE flag.
SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16
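The bit arithmetic can be checked with a short Python sketch. It assumes the mask is used to clear the interface-state bits it covers; this message does not show how SM applies the mask, so the helper below is hypothetical.

```python
SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4  # 16, value quoted in this message


def clear_masked_bits(flags: int, mask: int) -> int:
    """Hypothetical helper: drop every bit that the mask covers."""
    return flags & ~mask


# Example flags: heartbeat-alive bit plus two low state bits.
flags = SM_FAILOVER_HEARTBEAT_ALIVE | 0x3

after_narrow = clear_masked_bits(flags, 0xF)   # bit 4 lies outside 0xF
after_wide = clear_masked_bits(flags, 0x3F)    # bit 4 lies inside 0x3F

print(hex(after_narrow), hex(after_wide))
```

Under that assumption, the wider 0x3F mask removes the heartbeat-alive bit together with the state bits, while 0xF leaves it intact.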
This change restores previous system behavior. Tester performs a
cable pull on the oam ports. The expected behavior is an alarm
being raised. Instead the standby controller ended up getting rebooted.
oam interface testing was simulated by bringing the ip link down for 1
second.
For example:
sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up
-----------------
Before change
-----------------
- Heartbeat loss on oam interface resulted in standby controller reboot
-----------------
After change:
-----------------
- Heartbeat loss on oam interface resulted in alarm raised
- Logs indicate the health score of controller-1 drops by 1 point
Test plan:
PASS - AIO-SX: iso install
PASS - AIO-DX: iso install
drop oam interface on standby
verify standby controller-1 is not rebooted
by active controller-0
restore oam interface
PASS - AIO-DX: system host-swact . swact back and forth
Closes-Bug: 2037579
Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
SM throttling is a feature that limits the number of service-enabling
processes that run in parallel. The SM throttling mechanism is
always ON, during and after startup. Whenever SM-controlled services
transition to enabled status, it takes part in the process.
By default, SM throttling allows a maximum of 2 parallel service
enabling processes at a time, but the throttling size is configurable
by means of the field ENABLING_THROTTLE in the CONFIGURATION table of
the SM-DB database. Hence, it is possible to effectively disable SM
throttling by increasing the throttling size to a sufficiently large
value, which allows services to be enabled in parallel at full capacity.
This commit improves system performance by disabling the SM throttling
for AIO-SX systems, while still keeping the SM throttling mechanism,
should it ever be needed as a fallback, for robustness reasons.
To evaluate the SM throttling feature and its cost in terms
of performance, the throttling size was changed from 2 to 1000,
effectively disabling the feature, and a series of tests was conducted
to evaluate stability and performance benefits.
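As a rough illustration of the effect, the following Python sketch counts how many enabling "rounds" are needed for a given throttle size. It is a simplified model that assumes every enable action takes the same time and ignores service dependencies; the service count of 10 is a made-up example value.

```python
import math


def enabling_rounds(service_count: int, throttle_size: int) -> int:
    """Rounds needed when at most throttle_size services enable at once."""
    return math.ceil(service_count / throttle_size)


services = 10  # hypothetical number of services waiting to be enabled

default_rounds = enabling_rounds(services, 2)       # default throttle size
unthrottled_rounds = enabling_rounds(services, 1000)  # effectively disabled

print(default_rounds, unthrottled_rounds)
```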
Test Plan:
- Fresh Install and bootstrap (PASS)
- Lock/Unlock (PASS)
- Restart SM service (PASS)
Story: 2010802
Task: 48312
Change-Id: Ie96115293049e9939bc43feb2ad11432dd318323
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable. This commit changes the sm-patch.sql deploy path
to a location that ostree handles, /usr/share/sm/patches in this
case, and symlinks it to /var/lib/sm/patches/sm-patch.sql.
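The layout can be reproduced in a scratch directory with a short Python sketch. The temporary root stands in for "/", and the file contents are placeholders; the actual packaging steps are not shown in this message.

```python
import os
import tempfile

root = tempfile.mkdtemp()  # stands in for "/" in this sketch

# Real file lives under the ostree-managed tree.
target = os.path.join(root, "usr/share/sm/patches/sm-patch.sql")
# The path under /var becomes a symlink to it.
link = os.path.join(root, "var/lib/sm/patches/sm-patch.sql")

os.makedirs(os.path.dirname(target))
os.makedirs(os.path.dirname(link))

with open(target, "w") as f:
    f.write("-- original content\n")
os.symlink(target, link)

# Simulate a patch replacing the file under /usr.
with open(target, "w") as f:
    f.write("-- patched content\n")

with open(link) as f:
    seen_through_link = f.read()
print(os.path.islink(link), seen_through_link.strip())
```

Because the /var path is only a link, an update delivered to the /usr location is visible through the old path without touching /var.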
Test Plan:
PASS: ISO install symlink created
PASS: sm-patch.sql installed to /usr/share/sm/patches
PASS: PATCH apply and changes applied to /var/lib/sm/
patches/sm-patch.sql on stx8
Closes-Bug: 2030890
Change-Id: I07047e5383e8ae9e57687cd1e852c2efc0eb755f
Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>
A requirement for a subcloud's admin network is that its
subnet information be able to be updated without host
lock / unlock.
Accordingly, the service domain interface and admin-ip
service in SM must be provisioned / deprovisioned at
runtime.
In an AIO-DX system this can cause issues in certain
circumstances, as the disablement / enablement must be
done via puppet and can be affected by the order in which a
user performs each action as well as the timing of the
currently running manifests on each host.
This commit disables the failover behaviour for the admin
network, as link flapping and heartbeat losses are expected
as the service domain interface is provisioned/deprovisioned.
Also in this commit is the disablement of heartbeat messages
on service domain interface de-provision to prevent log
spamming, as well as a couple other minor issues that were
found while testing.
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/889872
Test plan:
- No uncontrolled swacts while re-configuring admin subnets
or reverting to the management subnet (deleting the admin
address pool) dozens of times.
- Alarms still generated on interface down / heartbeat loss
- Switching back and forth between admin network / mgmt
network via dcmanager.
Story: 2010319
Task: 47707
Change-Id: I761b5b20b6de198ef763b2d3480e6f7cd380f952
Signed-off-by: Steven Webster <steven.webster@windriver.com>
This story updates the README file of a few of the most used StarlingX
repos.
Test Plan: N/A
Story: 2010814
Task: 48377
Change-Id: I34b14275e6b61e1be6d659701be568e208e380e4
Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>
The Sysinv components use ZeroMQ to communicate
with each other. The service dependency between
sysinv-conductor and rabbit is removed from Service Manager,
improving the time it takes to swact by 20%.
Test Plan:
PASS: Bootstrap AIO-SX, AIO-DX and Standard.
PASS: Lock/unlock/swact in all environments
PASS: Bootstrap DC and perform lock/unlock/swact
PASS: Add subcloud in DC environment
PASS: Test DC orchestration (dcmanager dcorch)
PASS: Restart sysinv only
PASS: Restart sysinv, then dcorch and dcmanager
PASS: After restart sysinv, dcorch and dcmanager run manage/unmanage
PASS: After restart sysinv, run subcloud backup
PASS: After all restart of services verify the whole system is working
Refer to review
https://review.opendev.org/c/starlingx/config/+/859571
for sysinv/ZeroMQ change details
Closes-bug: 2022083
Signed-off-by: Lucas Borges <lucas.borges@windriver.com>
Change-Id: I4014c05c914fc946946b14519f28a85067b06b34
In rare cases, when the system is running slowly with significant
scheduling delay, the rabbit disable action times out repeatedly. As a
final resort, sm reboots the impacted controller for recovery after the
failure count reaches MAX_TRANSITION_FAILURES. As the rabbit service
disable timeout is set to 60 seconds, this results in a significant
delay before the reboot for recovery.
This change updates MAX_TRANSITION_FAILURES of rabbit service from
16 to 5 to reduce the delay of recovery of rabbit failure.
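A back-of-the-envelope estimate of the delay, as a Python sketch. It assumes the delay is simply the number of failed attempts multiplied by the 60 second timeout and ignores any time spent between attempts.

```python
DISABLE_TIMEOUT_SECONDS = 60  # rabbit disable timeout quoted above


def worst_case_delay(max_transition_failures: int,
                     timeout_seconds: int = DISABLE_TIMEOUT_SECONDS) -> int:
    """Approximate seconds spent timing out before the reboot is triggered."""
    return max_transition_failures * timeout_seconds


old_delay = worst_case_delay(16)  # previous MAX_TRANSITION_FAILURES
new_delay = worst_case_delay(5)   # new MAX_TRANSITION_FAILURES

print(old_delay, new_delay)
```

Under this simplification the wait drops from about 16 minutes to about 5 minutes.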
TCs passed:
Install a DX system
Observed service group recovery escalated to reboot after 5 forced
rabbit disable failures.
Closes-bug: 2016168
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
Updating the rsa ssh host key based on:
https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/
Note: In the future, StarlingX should have a zuul job and
secret setup for all repos so we do not need to do this
for every repo.
Needed to rename the secret, because zuul fails if like-named
secrets have different values in different branches of the same
repo.
Partial-Bug: #2015246
Change-Id: Iedfe334611d14e7e6b5a3b2108501d0b2fdf1e13
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
- add failover state to heartbeat message ( 4 bits )
- add logic to survived_state to use peer's
failover state to determine whether to exit survived state
and enter normal state
- throttle peer is normal events with a threshold of 10
used to ensure the peer is normal and stable
- change fsm->send_event() log to debug from info log level
- a few logging improvements; debug send_event logs
- update copyright year 2023
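The throttling of "peer is normal" events can be sketched in Python as a simple counter. The class name is hypothetical, and requiring the 10 events to be consecutive (resetting on any non-normal report) is an assumption made for this sketch; the message only states the threshold.

```python
class PeerNormalThrottle:
    """Report 'stable' only after enough peer-normal heartbeats in a row."""

    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.count = 0

    def observe(self, peer_is_normal: bool) -> bool:
        if not peer_is_normal:
            # Assumption: any non-normal report restarts the count.
            self.count = 0
            return False
        self.count += 1
        return self.count >= self.threshold


throttle = PeerNormalThrottle()
history = [throttle.observe(True) for _ in range(10)]
print(history)
```

Only the tenth consecutive normal report returns True, which is the point at which a node in the survived state would move back to normal in this model.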
Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: crash the sm as indicated in bug
and observe swact to standby
PASS - AIO-DX: manual swact
PASS - AIO-DX: power off active controller
PASS - AIO-SX: install and basic sanity check
PASS - AIO-SX: upgrade test to verify sm heartbeat
messages changes still function when
controllers are running different loads
Closes-Bug: 2012519
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
This change is to update the service disabling and going standby
dependency check.
The 2 specific rules are
1. "service a" has a disable action dependency on "service b", with a
targeted "service b" state of disabled: the disable action of
"service a" is considered "dependency met" only when "service b"
is in the disabled state or the enabled-standby state.
2. "service a" has a go-standby action dependency on "service b", with a
targeted "service b" state of disabled: the go-standby action of
"service a" is considered "dependency met" only when "service b"
is in the disabled state or the enabled-standby state.
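The two rules can be expressed as a small Python predicate. The function name and the string state names are hypothetical stand-ins for SM's internal types, and the exact-match fallback for other combinations is an assumption, not something stated in this message.

```python
def dependency_met(action: str, target_state: str, dependent_state: str) -> bool:
    """Check whether "service b" satisfies the dependency of "service a"."""
    if action in ("disable", "go-standby") and target_state == "disabled":
        # Rules 1 and 2: disabled or enabled-standby both satisfy it.
        return dependent_state in ("disabled", "enabled-standby")
    # Assumption for all other cases: the state must match exactly.
    return dependent_state == target_state


for state in ("disabled", "enabled-standby", "disabling", "enabled-active"):
    print(state, dependency_met("disable", "disabled", state))
```

With this check, a service such as drbd-xxx would not start disabling while an xxx-fs service it depends on is still in a transitional state.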
TCs:
passed: Perform repeatedly host-swact operations, with adding long
delay in xxx-fs ocf-script in disable action, observed that
all xxx-fs services are disabled before drbd-xxx services
start disabling.
Closes-Bug: 2012570
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: Ie9717d3b2b73dc7d623e1b980b3387c6c4e6d991
The platform-nfs-ip service is not necessary for fresh installs
because it is just an alias for the controller IP.
But for old releases like StarlingX rel. 6 or 7, the
platform-nfs-ip uses a specific IP. If for some reason an error
occurs during the upgrade process, the upgrade will be aborted
and the nodes will downgrade to the old release again.
At this moment the nodes will try to communicate with the
previous platform-nfs-ip configured in /etc/hosts.
But if the active controller is using the new Release
this IP doesn't exist anymore and the downgrade will fail.
For this reason the platform-nfs-ip service will be available
just for upgrade operations and will be deprovisioned for fresh
installs or at the end of the upgrade process
( upgrade-activate phase ).
Test plan
PASS Fresh install on AIO-SX
Fresh install on AIO-DX
PASS Upgrade AIO-DX system from CENTOS Rel 7 to DEBIAN Rel 8
PASS Reboot controller-0 during upgrade of AIO-DX
controller-1 was the active one with the new release ( Rel 8 )
controller-0 using old release.
reboot controller-0 and check if it could connect to
controller-1 using old platform-nfs-ip.
PASS Upgrade-abort during AIO-DX upgrade
controller-1 was the active controller and already upgraded
controller-0 was upgraded but locked.
Abort the upgrade and downgrade to old release ( Rel 7 )
Partial-Bug: #2012387
Signed-off-by: Fabiano Mercer <fabiano.correamercer@windriver.com>
Change-Id: I704e15fffc6e7efa7b1fea56164a21af02222dd6
- rename mutexes from generic name '_mutex'
- create common util functions for initializing and destroying mutexes
- add mutex initialize/finalize functions
and run them in sm_main_process_initialize
- update copyright info to 2023
Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: verify ha swact
PASS - AIO-DX: failover swact test
PASS - AIO-DX: run pi_stress (rt-tests)
to confirm priority inheritance POSIX attribute is working
-------------------------------------------------------------------
sysadmin@localhost:~$ sudo pi_stress --uniprocessor --duration=10s
Starting PI Stress Test
Number of thread groups: 3
Duration of test run: 10 seconds
Number of inversions per group: unlimited
Admin thread SCHED_FIFO priority 4
3 groups of 3 threads will be created
High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_FIFO priority 1
Current Inversions: 992034
Stopping test
Total inversion performed: 992038
Test Duration: 0 days, 0 hours, 0 minutes, 11 seconds
-------------------------------------------------------------------
PASS - AIO-DX: valgrind helgrind
* test for inconsistent lock ordering
* test race conditions
[ detected outside scope of jira ]
* test for deadlocks
Task: 47503
Story: 2010609
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: Ic77f08ca7c3a687b1cc219ac9cba5711979206e8
A new version of bugbear, released on Feb 13, raises new
error codes. We are setting an upper limit for bugbear
to stay below that version.
bandit is also reporting a hashlib error. The bandit
job is non-voting, but the error is now being suppressed.
Story: 2010531
Task: 47313
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: Iffb790c67e658c7a40c697e364cb34f4c4f9ec6c
Add SM support for the DC admin network
This commit adds SM support for the DC admin network.
The admin network is intended to be used between a subcloud
and system controller. Because the (existing) management network
is so embedded in other parts of the StarlingX system, it makes
it prohibitively hard to re-configure this network after initial
installation. The admin network is intended to be isolated from
the management network, allowing re-configuration of the network
parameters in the case that the physical network between subcloud
and system controller has been changed.
In the case of admin network usage, the management network still
exists but is a private network in the context of a subcloud.
This specific commit provides for admin-ip and admin-interface
services to be added to the SM database and be recognized in
processing similar to the management, cluster-host, oam, etc
networks.
Since there is a requirement for the admin IP subnet information
to be allowed to change at runtime, in-service updating of SM
information relating to the admin-ip service (floating IP), as
well as unicast heartbeating between peers is also added in this
commit.
Testing:
AIO-SX:
- admin-ip service is enabled when the admin network is
created.
- admin-ip service is not enabled when the admin network is
not created.
- floating-ip is updated on the admin interface when admin
addr-pool information is changed.
AIO-DX:
- admin-ip service is enabled when the admin network is
created.
- admin-ip service is not enabled when the admin network
is not created.
- floating-ip is updated on the active-controller when the
admin addr-pool information is changed.
- When a peer admin interface is down, an alarm is raised.
- When a peer admin IP is not correct (changed), an alarm
is raised.
- Swact between controllers.
- Inactive controller admin interface goes down
Result: A 400.005 major communication loss fault is generated
for the inactive controller entity
- Inactive controller admin interface comes back up
Result: The fault is cleared
- Inactive controller admin IP address is removed/changed
Result: Two 400.005 major communication loss faults are
generated for both controller entities
- Inactive controller admin node IP address is re-applied
Result: The faults are cleared
- Active admin interface goes down
Result: A 400.005 major communication loss fault is generated
for the inactive controller entity. A swact is not
issued.
- Active admin interface comes back up
Result: The fault is cleared
- Active admin node IP address is removed/changed
Result: Two 400.005 major communication loss faults are
generated for both controller entities.
A swact is not issued.
- Active admin floating IP address is removed/changed
Result: A 400.001 critical admin-services / admin-ip alarm
is raised.
A swact occurs.
The floating admin IP is applied to the newly active
controller. Alarms are cleared.
- After the above test, the newly active controller swacts back
to the previously active controller.
Result: No alarms.
The floating IP is applied to the newly active
controller.
- The cable for the management interface on the active controller
is pulled
Result: A swact occurs
- The cable for the OAM interface on the active controller
is pulled
Result: A swact occurs
- The cable for the Admin interface on the active controller
is pulled
Result: A swact occurs. 400.005 alarms are raised.
- The mgmt, cluster-host, oam interfaces are all brought down/up at
the same time. The admin interface is also brought down,
but not brought back up.
Result: A swact occurs, with multiple controller-services
related to the mgmt interface being in degraded state.
Story: 2010319
Task: 47278
Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I65df52600f4d5c499dceed32739cab414d36847a