Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if its
file is not present.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
This update also cleans up some extra whitespace in the changed file.
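The gate described above can be sketched as follows. This is a minimal
illustration and not SM's actual handler: the function name
enable_allowed, the goenabled_file parameter and the simplex flag are
hypothetical, and only the node locked path comes from this message.
The simplex flag mirrors the test case in which SM ignores the node
locked flag file on AIO SX systems.

```python
import os

# Path named in this commit message; Maintenance manages the file.
NODE_LOCKED_FILE = "/etc/mtc/tmp/.node_locked"


def enable_allowed(goenabled_file, node_locked_file=NODE_LOCKED_FILE,
                   simplex=False):
    """Return True when the enable handler may proceed (illustrative)."""
    # Existing style of check: enable is blocked while the goenabled
    # file (hypothetical path supplied by the caller) is absent.
    if not os.path.exists(goenabled_file):
        return False
    # New check: enable is blocked while the node locked file is present.
    # Assumption: simplex (AIO SX) systems skip this check.
    if not simplex and os.path.exists(node_locked_file):
        return False
    return True
```

Both checks are plain file-existence tests, so removing the node locked
file is enough for a later call to allow enable again.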
Test Plan:
PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shut down until the manually created Nv node locked
file is removed, at which point SM proceeds to activate
services on that controller again.
PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging
Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f)
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the
existing goenabled check in the same function blocks enable if its
file is not present.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
This update also cleans up some extra whitespace in the changed file.
Test Plan:
PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shut down until the manually created Nv node locked
file is removed, at which point SM proceeds to activate
services on that controller again.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
haproxy uses DNS resolution.
Add a service dependency to the SM database
to ensure that the dnsmasq service is started before haproxy
and that dnsmasq is disabled only after haproxy is disabled.
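The ordering this dependency is meant to produce can be sketched with
a small dependency graph. This is only an illustration of the intended
enable and disable order; it does not reflect how SM stores or
evaluates dependencies in its database, and the function names are
hypothetical.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services that must be enabled first.
# Only the haproxy -> dnsmasq edge comes from this commit message.
ENABLE_DEPENDS_ON = {"haproxy": {"dnsmasq"}}


def enable_order(deps):
    """Services in an order that enables every dependency first."""
    return list(TopologicalSorter(deps).static_order())


def disable_order(deps):
    """Reverse of the enable order, so dependents are disabled first."""
    return list(reversed(enable_order(deps)))
```

With the single edge above, enable_order yields dnsmasq before haproxy
and disable_order yields haproxy before dnsmasq.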
Test plan:
PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test
Closes-Bug: #2043506
Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Add support for aarch64 in sm_trap_thread_log.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
The include path in the Makefile is hardcoded with x86_64.
Use dpkg-architecture to check the host arch and replace
the hardcoded name.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
This is to avoid waiting for an hbs cluster query before sending the SM
alive pulse. When an hbs cluster query or alive pulse is being sent, do
not queue the subsequent alive pulse, as the request currently being
sent is good enough to update the hbs agent.
Also move the function retrieving the sock address out of the query
sending procedure and into initialization. This keeps getaddrinfo off
the send path, where it indirectly calls malloc, which invokes
malloc_atfork and is potentially a blocking call.
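The two ideas above (resolve the address once at initialization, and
skip a new pulse while a request is already being sent) can be sketched
as below. The class and method names are hypothetical and the sketch
does not model SM's threads or the hbs protocol; it only shows the
shape of the change.

```python
import socket


class PulseSender:
    """Illustrative sender that resolves its peer once and never queues."""

    def __init__(self, host, port):
        # Resolve the socket address once, at initialization, so the
        # send path never has to call getaddrinfo.
        info = socket.getaddrinfo(host, port, type=socket.SOCK_DGRAM)
        self.addr = info[0][4]
        self.in_flight = False

    def request_pulse(self):
        """Start a send unless one is already in progress."""
        if self.in_flight:
            # The request already being sent is enough to update the
            # agent, so the new one is dropped rather than queued.
            return False
        self.in_flight = True
        return True

    def on_sent(self):
        """Mark the current send as finished."""
        self.in_flight = False
```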
TCs:
The improvement is expected in extreme situations only; regression passed.
Closes-bug: 2025504
Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This commit resolves an issue seen when attempting to upgrade
/ patch from a previous StarlingX release to the current
StarlingX development load in which the distributed cloud
admin network feature is present.
1. Currently, the admin-services SM service group is created
if the system is detected to be a subcloud. This presents
a problem during upgrade, because the <N> system has no
concept of the admin-services group, while the upgraded
<N+1> system does. During upgrade when the system is
swacted to the <N+1> system, an alarm will be raised as
the <N> system has no admin-services group (designed to be
an N+M redundancy model).
2. A potential solution to the above problem is to only provision
the admin-services group if an operator has actually configured
an admin interface / network after the full upgrade has completed.
However, this presents the same sort of problem, as an
interface-network association is done on a host-by-host basis,
requiring a lock of each host to provision a new admin interface.
Since there is no mechanism to provision a new SM service group
at runtime, this leads to the same situation of one host being
aware of the admin-services group, while the other does not.
That is, a user might configure an admin interface-network on
host <X>, but the service group would not be present on host <Y>,
leading to the same alarm.
The solution here is to do away with the new admin-services service
group, and leverage the existing controller-services group for the
admin-ip service.
Test-Plan:
- Ensure no alarms when upgrading / patching a subcloud implementing
the admin network feature.
- Ensure a system utilizing the admin network can become online /
in-sync.
- Swact between controllers utilizing the admin-ip SM service to
ensure the floating IP is correctly assigned to the active
controller (using the controller-services service group).
Regression:
- Install a DC system using the management network. Create the
admin interface, network on the subcloud and ensure a user can
update the subcloud to use the admin network.
Story: 2010319
Task: 47278
Change-Id: Ic36e83622c6ab5d15fd537be69d3314cb675c724
Signed-off-by: Steven Webster <steven.webster@windriver.com>
The management network is used extensively for all internal
communication.
Since the original use of the network was a private network before
it was exposed for external communication in a distributed cloud
configuration, it was never designed to be reconfigured.
To support MGMT network reconfiguration the idea is to configure the
applications to use the hostname/FQDN instead of a static MGMT IP
address.
In this way the MGMT network can be changed and the services and
applications will still work since they are using the hostname/FQDN
and the DNS will be responsible to translate to the current MGMT
IP address.
The use of FQDN will be applied for all installation modes: AIO-SX,
AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the
complexities of supporting the multi-host reconfiguration,
the MGMT network reconfiguration will focus on support for AIO-SX
only.
The DNSMASQ service must start as soon as possible to translate
the FQDN to an IP address. For this reason dnsmasq will start
as soon as the management-ip is ready.
Test plan ( Debian only )
- AIO-SX and AIO-DX virtualbox installation IPv4/IPv6
- Standard virtualbox installation IPv6
- DC virtualbox installation IPv4 ( AIO-SX/DX subclouds )
- AIO-SX and AIO-DX installation IPv4/IPv6
- AIO-DX plus installation IPv6
- DC IPv6 and subcloud AIO-SX
- AIO-DX host-swact
- DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX
- AIO-SX to AIO-DX migration
- netstat -tupl ( no services are using the MGMT IP address )
- Ran sanity/regression tests
Story: 2010722
Task: 48889
Depends-On: https://review.opendev.org/c/starlingx/config/+/886208
Change-Id: If118132410a5a3db4c3a9d0ba029f4d45521574d
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
The SM_FAILOVER_IF_STATE_MASK change from 0xF to 0x3F
mask was clearing the HEARTBEAT ALIVE flag.
SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16
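As a worked example of the bit arithmetic, assuming the mask is applied
to a flags word with a bitwise AND (the helper name and the sample
flags value are hypothetical): a 0xF mask covers bits 0 to 3 only, so
bit 4 is dropped, while 0x3F covers bits 0 to 5 and keeps it.

```python
SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4  # 16, as quoted above


def masked(flags, mask):
    """Apply a state mask with bitwise AND (assumed usage)."""
    return flags & mask


# Hypothetical flags word: heartbeat alive plus one low-order bit.
flags = SM_FAILOVER_HEARTBEAT_ALIVE | 0x1

narrow = masked(flags, 0xF)   # bit 4 is outside 0xF, so it is cleared
wide = masked(flags, 0x3F)    # bit 4 is inside 0x3F, so it survives
```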
This change restores previous system behavior. A tester performed a
cable pull on the oam ports. The expected behavior was an alarm
being raised; instead the standby controller ended up getting rebooted.
oam interface testing was simulated by bringing the ip link down for 1
second.
For example:
sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up
-----------------
Before change
-----------------
- Heartbeat loss on oam interface resulted in standby controller reboot
-----------------
After change:
-----------------
- Heartbeat loss on oam interface resulted in alarm raised
- Logs indicate the health score of controller-1 drops by 1 point
Test plan:
PASS - AIO-SX: iso install
PASS - AIO-DX: iso install
drop oam interface on standby
verify standby controller-1 is not rebooted
by active controller-0
restore oam interface
PASS - AIO-DX: system host-swact . swact back and forth
Closes-Bug: 2037579
Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
The SM throttling is a feature that limits the number of parallel
service-enabling processes at a time. The SM throttling mechanism is
always ON, during and after startup. Whenever SM controlled services
transition to enabled status, it takes part in the process.
By default, the SM throttling allows a maximum of 2 parallel service
enabling processes at a time, but the throttling size is configurable,
by means of the field ENABLING_THROTTLE from the CONFIGURATION table in
the SM-DB database. Hence, it is possible to disable the SM throttling
by increasing the throttling size to a large enough value,
so as to enable full capacity of parallel service enabling.
This commit improves system performance by disabling the SM throttling
for AIO-SX systems, while still keeping the SM throttling mechanism,
should it ever be needed as a fallback, for robustness reasons.
In order to evaluate the SM throttling feature, and its costs in terms
of performance, the throttling size was changed from 2 to 1000, which
effectively disables the feature, and a series of tests was conducted to
evaluate stability and performance benefits.
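A rough way to see the effect of the throttling size is to count how
many rounds of enabling are needed. This is a simplified model that
assumes every enable takes equally long and that services have no
dependencies on each other; the service count of 30 is hypothetical
and the function name is illustrative.

```python
import math


def enable_rounds(service_count, throttle_size):
    """Rounds needed when at most throttle_size enables run in parallel."""
    return math.ceil(service_count / throttle_size)


rounds_default = enable_rounds(30, 2)      # default throttling size of 2
rounds_disabled = enable_rounds(30, 1000)  # size used to disable throttling
```

Under these assumptions the default size needs 15 rounds while the
larger size lets every service start in a single round.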
Test Plan:
- Fresh Install and bootstrap (PASS)
- Lock/Unlock (PASS)
- Restart SM service (PASS)
Story: 2010802
Task: 48312
Change-Id: Ie96115293049e9939bc43feb2ad11432dd318323
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable. This commit changes the sm-patch.sql deploy path
to a place that ostree handles, /usr/share/sm/patches in this
case, and symlinks it to /var/lib/sm/patches/sm-patch.sql.
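The resulting layout can be sketched under a scratch root directory.
This assumes the link lives under var/lib/sm/patches and points at the
file installed under usr/share/sm/patches; the helper name and the
placeholder file contents are hypothetical, and the real packaging
does this at install time rather than in Python.

```python
import os
import tempfile


def deploy_patch_file(root):
    """Install sm-patch.sql under usr/share and link it from var/lib."""
    target_dir = os.path.join(root, "usr/share/sm/patches")
    link_dir = os.path.join(root, "var/lib/sm/patches")
    os.makedirs(target_dir)
    os.makedirs(link_dir)
    # The real file sits in the ostree-managed location.
    target = os.path.join(target_dir, "sm-patch.sql")
    with open(target, "w") as f:
        f.write("-- placeholder contents\n")
    # The /var path is only a symlink, so patching the target is enough.
    link = os.path.join(link_dir, "sm-patch.sql")
    os.symlink(target, link)
    return target, link


root = tempfile.mkdtemp()
target, link = deploy_patch_file(root)
```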
Test Plan:
PASS: ISO install symlink created
PASS: sm-patch.sql installed to /usr/share/sm/patches
PASS: PATCH apply and changes applied to /var/lib/sm/
patches/sm-patch.sql on stx8
Closes-Bug: 2030890
Change-Id: I07047e5383e8ae9e57687cd1e852c2efc0eb755f
Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>
A requirement for a subcloud's admin network is that its
subnet information be able to be updated without host
lock / unlock.
Accordingly, the service domain interface and admin-ip
service in SM must be provisioned / deprovisioned at
runtime.
In an AIO-DX system this can cause issues in certain
circumstances as the disablement / enablement must be
done via puppet and can be affected by the ordering a
user performs each action as well as the timing of the
currently running manifests on each host.
This commit disables the failover behaviour for the admin
network, as link flapping and heartbeat losses are expected
as the service domain interface is provisioned/deprovisioned.
Also in this commit is the disablement of heartbeat messages
on service domain interface de-provision to prevent log
spamming, as well as a couple other minor issues that were
found while testing.
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/889872
Test plan:
- No uncontrolled swacts while re-configuring admin subnets
or reverting to the management subnet (deleting the admin
address pool) dozens of times.
- Alarms still generated on interface down / heartbeat loss
- Switching back and forth between admin network / mgmt
network via dcmanager.
Story: 2010319
Task: 47707
Change-Id: I761b5b20b6de198ef763b2d3480e6f7cd380f952
Signed-off-by: Steven Webster <steven.webster@windriver.com>
The Sysinv components utilize ZeroMQ for communication
among each other. The service dependency between
sysinv-conductor and rabbit is removed in Service Manager,
improving the time it takes to swact by 20%.
Test Plan:
PASS: Bootstrap AIO-SX, AIO-DX and Standard.
PASS: Lock/unlock/swact in all environments
PASS: Bootstrap DC and perform lock/unlock/swact
PASS: Add subcloud in DC environment
PASS: Test DC orchestration (dcmanager dcorch)
PASS: Restart sysinv only
PASS: Restart sysinv, then dcorch and dcmanager
PASS: After restart sysinv, dcorch and dcmanager run manage/unmanage
PASS: After restart sysinv, run subcloud backup
PASS: After all restart of services verify the whole system is working
Refer to review
https://review.opendev.org/c/starlingx/config/+/859571
for sysinv/ZeroMQ change details
Closes-bug: 2022083
Signed-off-by: Lucas Borges <lucas.borges@windriver.com>
Change-Id: I4014c05c914fc946946b14519f28a85067b06b34
In rare cases, when the system is running slowly with significant
scheduling delay, the rabbit disable action times out continually. As a
final resort, sm reboots the impacted controller for recovery after the
failure count reaches MAX_TRANSITION_FAILURES. As the rabbit service
disable timeout is set to 60 seconds, this results in a significant
delay before the reboot for recovery.
This change updates MAX_TRANSITION_FAILURES of the rabbit service from
16 to 5 to reduce the delay in recovering from a rabbit failure.
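The size of the reduction can be estimated with simple arithmetic,
assuming each failed disable attempt consumes the full 60 second
timeout and ignoring any time spent between attempts. The function
name is illustrative.

```python
DISABLE_TIMEOUT_SECS = 60  # rabbit disable timeout stated above


def worst_case_delay(max_transition_failures,
                     timeout_secs=DISABLE_TIMEOUT_SECS):
    """Seconds spent timing out before recovery escalates to a reboot."""
    return max_transition_failures * timeout_secs


before = worst_case_delay(16)  # 16 * 60 = 960 seconds
after = worst_case_delay(5)    # 5 * 60 = 300 seconds
```

Under these assumptions the wait before the reboot drops from about
16 minutes to about 5 minutes.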
TCs passed:
Install a DX system
Observed service group recovery escalated to reboot after 5 forced
rabbit disable failures.
Closes-bug: 2016168
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
- add failover state to heartbeat message ( 4 bits )
- add logic to survived_state to use peer's
failover state to determine whether to exit survived state
and enter normal state
- throttle "peer is normal" events with a threshold of 10,
used to ensure the peer is normal and stable
- change fsm->send_event() log to debug from info log level
- a few logging improvements; debug send_event logs
- update copyright year 2023
Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: crash the sm as indicated in bug
and observe swact to standby
PASS - AIO-DX: manual swact
PASS - AIO-DX: power off active controller
PASS - AIO-SX: install and basic sanity check
PASS - AIO-SX: upgrade test to verify sm heartbeat
messages changes still function when
controllers are running different loads
Closes-Bug: 2012519
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
This change is to update the service disabling and going standby
dependency check.
The 2 specific rules are:
1. "service a" has a disable action dependency on "service b", with
a targeted "service b" state of disabled. The disable action of
"service a" is considered as "dependency met" only when "service b"
is in the disabled state or the enabled-standby state.
2. "service a" has a go-standby action dependency on "service b", with
a targeted "service b" state of disabled. The go-standby action of
"service a" is considered as "dependency met" only when "service b"
is in the disabled state or the enabled-standby state.
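Both rules reduce to the same test on the state of "service b", which
can be sketched as below. The function name is hypothetical, and the
state strings other than disabled and enabled-standby that appear in
the usage note are illustrative only.

```python
DISABLED = "disabled"
ENABLED_STANDBY = "enabled-standby"


def disabled_dependency_met(service_b_state):
    """Check a disable or go-standby dependency targeting 'disabled'.

    The dependency counts as met only when "service b" is already
    disabled or is in the enabled-standby state.
    """
    return service_b_state in (DISABLED, ENABLED_STANDBY)
```

For example, a "service b" that is still disabling or is
enabled-active would not satisfy the dependency, so "service a" has to
wait.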
TCs:
passed: Repeatedly perform host-swact operations with a long delay
added to the disable action of the xxx-fs ocf-script; observed
that all xxx-fs services are disabled before drbd-xxx services
start disabling.
Closes-Bug: 2012570
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: Ie9717d3b2b73dc7d623e1b980b3387c6c4e6d991
The platform-nfs-ip service is not necessary for fresh installs
because it is just an alias for the controller IP.
But for old releases like StarlingX rel. 6 or 7 the
platform-nfs-ip uses a specific IP. If for some reason an error
occurs during the upgrade process, the upgrade will be aborted
and the nodes will downgrade to the old release again.
At this moment the nodes will try to communicate with the
previous platform-nfs-ip configured in /etc/hosts.
But if the active controller is using the new release,
this IP doesn't exist anymore and the downgrade will fail.
For this reason the platform-nfs-ip service will be available
just for upgrade operations and will be deprovisioned for fresh
installs or at the end of the upgrade process
( upgrade-activate phase ).
Test plan
PASS Fresh install on AIO-SX
Fresh install on AIO-DX
PASS Upgrade AIO-DX system from CENTOS Rel 7 to DEBIAN Rel 8
PASS Reboot controller-0 during upgrade of AIO-DX
controller-1 was the active one with the new release ( Rel 8 )
controller-0 using old release.
reboot controller-0 and check if it could connect to
controller-1 using old platform-nfs-ip.
PASS Upgrade-abort during AIO-DX upgrade
controller-1 was the active controller and already upgraded
controller-0 was upgraded but locked.
Abort the upgrade and downgrade to old release ( Rel 7 )
Partial-Bug: #2012387
Signed-off-by: Fabiano Mercer <fabiano.correamercer@windriver.com>
Change-Id: I704e15fffc6e7efa7b1fea56164a21af02222dd6
- rename mutexes from generic name '_mutex'
- create common util functions for initializing and destroying mutexes
- add mutex initialize/finalize functions
and run them in sm_main_process_initialize
- update copyright info to 2023
Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: verify ha swact
PASS - AIO-DX: failover swact test
PASS - AIO-DX: run pi_stress (rt-tests)
to confirm priority inheritance POSIX attribute is working
-------------------------------------------------------------------
sysadmin@localhost:~$ sudo pi_stress --uniprocessor --duration=10s
Starting PI Stress Test
Number of thread groups: 3
Duration of test run: 10 seconds
Number of inversions per group: unlimited
Admin thread SCHED_FIFO priority 4
3 groups of 3 threads will be created
High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_FIFO priority 1
Current Inversions: 992034
Stopping test
Total inversion performed: 992038
Test Duration: 0 days, 0 hours, 0 minutes, 11 seconds
-------------------------------------------------------------------
PASS - AIO-DX: valgrind helgrind
* test for inconsistent lock ordering
* test race conditions
[ detected outside scope of jira ]
* test for deadlocks
Task: 47503
Story: 2010609
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: Ic77f08ca7c3a687b1cc219ac9cba5711979206e8
Add SM support for the DC admin network
This commit adds SM support for the DC admin network.
The admin network is intended to be used between a subcloud
and system controller. Because the (existing) management network
is so embedded in other parts of the StarlingX system, it makes
it prohibitively hard to re-configure this network after initial
installation. The admin network is intended to be isolated from
the management network, allowing re-configuration of the network
parameters in the case that the physical network between subcloud
and system controller has been changed.
In the case of admin network usage, the management network still
exists but is a private network in the context of a subcloud.
This specific commit provides for admin-ip and admin-interface
services to be added to the SM database and be recognized in
processing similar to the management, cluster-host, oam, etc
networks.
Since there is a requirement for the admin IP subnet information
to be allowed to change at runtime, in-service updating of SM
information relating to the admin-ip service (floating IP), as
well as unicast heartbeating between peers is also added in this
commit.
Testing:
AIO-SX:
- admin-ip service is enabled when the admin network is
created.
- admin-ip service is not enabled when the admin network is
not created.
- floating-ip is updated on the admin interface when admin
addr-pool information is changed.
AIO-DX:
- admin-ip service is enabled when the admin network is
created.
- admin-ip service is not enabled when the admin network
is not created.
- floating-ip is updated on the active-controller when the
admin addr-pool information is changed.
- When a peer admin interface is down, an alarm is raised.
- When a peer admin IP is not correct (changed), an alarm
is raised.
- Swact between controllers.
- Inactive controller admin interface goes down
Result: A 400.005 major communication loss fault is generated
for the inactive controller entity
- Inactive controller admin interface comes back up
Result: The fault is cleared
- Inactive controller admin IP address is removed/changed
Result: Two 400.005 major communication loss faults are
generated for both controller entities
- Inactive controller admin node IP address is re-applied
Result: The faults are cleared
- Active admin interface goes down
Result: A 400.005 major communication loss fault is generated
for the inactive controller entity. A swact is not
issued.
- Active admin interface comes back up
Result: The fault is cleared
- Active admin node IP address is removed/changed
Result: Two 400.005 major communication loss faults are
generated for both controller entities.
A swact is not issued.
- Active admin floating IP address is removed/changed
Result: A 400.001 critical admin-services / admin-ip alarm
is raised.
A swact occurs.
The floating admin IP is applied to the newly active
controller. Alarms are cleared.
- After the above test, the newly active controller swacts back
to the previously active controller.
Result: No alarms.
The floating IP is applied to the newly active
controller.
- The cable for the management interface on the active controller
is pulled
Result: A swact occurs
- The cable for the OAM interface on the active controller
is pulled
Result: A swact occurs
- The cable for the Admin interface on the active controller
is pulled
Result: A swact occurs. 400.005 alarms are raised.
- The mgmt, cluster-host, oam interfaces are all brought down/up at
the same time. The admin interface is also brought down,
but not brought back up.
Result: A swact occurs, with multiple controller-services
related to the mgmt interface being in degraded state.
Story: 2010319
Task: 47278
Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I65df52600f4d5c499dceed32739cab414d36847a
The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.
This ensures that any new code submissions under those
directories will increment the versions.
Test Plan:
PASS: build-pkgs -p sm-common
PASS: build-pkgs -p sm-db
PASS: build-pkgs -p sm
PASS: build-pkgs -p sm-api
PASS: build-pkgs -p sm-client
PASS: build-pkgs -p sm-tools
PASS: build-pkgs -p stx-ocf-scripts
Story: 2010550
Task: 47341
Signed-off-by: Luis Sampaio <luis.sampaio@windriver.com>
Change-Id: I54cde0fe252c3bcef669969a1b0675a2df8b3d69
When the system is unstable, using a lot of CPU, it takes
more time for the communication between the components to happen,
such as the communication between mgr-restful-plugin and ceph-mgr.
This communication failure may result in a failed audit by SM,
which then restarts the mgr-restful-plugin.
Change [1] resolved this issue by increasing the timeout and retries
of mgr-restful-plugin in SM database.
However, there is a disable dependence chain between mgr-restful,
ceph-manager, and sysinv-conductor which results in sysinv-conductor
being restarted if mgr-restful-plugin or ceph-manager is also disabled
by SM. This can impact platform-integ-apps apply or any other action
being executed by sysinv-conductor.
The ceph manager -> sysinv-conductor dependence is not necessary
anymore after the changes [2] and [3] were merged, thus this change
removes this dependence.
TEST PLAN:
PASS: AIO-SX: bootstrap, unlock and apply platform-integ-apps
PASS: Force ceph-manager to be restarted by SM, and verify that
sysinv-conductor keeps running
Related-Bug: 2000080
[1] https://review.opendev.org/c/starlingx/ha/+/868118
[2] https://review.opendev.org/c/starlingx/utilities/+/856320
[3] https://review.opendev.org/c/starlingx/utilities/+/860570
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: I949ccebd509b8099870b3dfda252a60b6b423715
The guest-agent service is currently being activated
in setups where stx-openstack is applied, but it has not
been used since we went to containerized openstack.
Since this service is no longer being used, this change
removes the service and all related queries that are in
the create_sm_db file.
Test Plan:
PASS: Perform a fresh install on a duplex environment and
check that no error log related to guest-agent is appearing
in the sm log file and no 400.02 alarm was raised on
fm alarm-list.
Closes-Bug: 2003117
Signed-off-by: Rafael Falcao <rafael.vieirafalcao@windriver.com>
Change-Id: I145bd8a45c12319facc4d1eff90b785a33a1d2c0
mgr-restful-plugin in SM database
When the system is unstable, using a lot of CPU, it takes
more time for the communication between the components to happen.
So it's necessary to increase the maximum of retries and timeouts
in the "audit-enable" of the mgr-restful-plugin to prevent errors
from happening.
Test Plan:
PASS: mgr-restful-plugin restarted by SM (AIO-SX)
Closes-bug: 2000080
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
Change-Id: I0f8462fef20196a3bb913fa7d374a86a2c6565f1
The sm process initializes with the default limit
for the maximum number of open files. If somehow this default value is
set to a high value, performance degradation can occur. For instance,
the sm_service_action_run method from sm_service_action.c contains a
for loop that closes file descriptors up to this limit.
To avoid performance degradation, this change sets the limit for
the maximum open files to 1024 using the systemd property LimitNOFILE.
TEST PLAN:
PASS: Confirm that 1024 is the max open file limit in
/proc/<sm_pid>/limits.
Story: 2010087
Task: 47102
Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: Iff848da7461a8b644057e9f58c69fa2d78499226
The pidof command returns a subprocess ID when the SM main process
terminates. This results in a false positive that SM is already
running, so the start action is skipped.
Make changes to the SM lsb script to distinguish if a subprocess ID
is returned, and attempt to kill it to speed up recovery of SM.
Revert the change to extend startuptime to 15 seconds back to 5.
Test Cases:
kill SM process, observe SM process starts immediately after the
subprocess is killed. SM is recovered within 2 seconds.
(calculated by last and first logging of SM)
Change-Id: Ida834e7dd31a493ee6193b4d8ee73ebd97513de2
Closes-Bug: 1998349
Signed-off-by: Bin Qian <bin.qian@windriver.com>
On DC with the central cloud as standard, it sometimes occurs
that RabbitMQ doesn't start properly and keeps being restarted
by sm. Part of the solution is increasing the startup time from
60 to 90, so that sm doesn't restart the service while
the service is still being initialized. This, alongside [1],
fixes the problem.
Test Plan:
PASS - Run sm-restart service rabbit successfully check rabbit was
running as expected. This test was used to recreate the bug.
PASS - Reboot the host successfully and check rabbit was running as
expected.
PASS - Lock/unlock and check if rabbit was running as expected.
Regression test:
PASS - Install and bootstrap DC
PASS - Install and bootstrap DX
[1]
https://review.opendev.org/c/starlingx/config-files/+/865693
Closes-bug: 1997966
Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
Change-Id: I1425324c72ae41c66558709dead6f1ff92a3230a
SM's pmon.conf file has the startuptime = 5 seconds but in Debian
it's frequently taking longer than that for the SM process to
start and produce its PID file.
In order to make SM recovery process smooth, increase this timeout
to 15 seconds.
Closes-bug: 1998349
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I64476e394e346c9b8cf5b5aca2ad04ba463b9728
When swacting from controller-0 to controller-1 during
the upgrade, swact failed multiple times with sysinv-inv
service timeout messages, therefore blocking the upgrade
from continuing.
This commit increases sysinv-inv service timeout.
Test Plan
PASS: run AIO-DX upgrade successfully
PASS: run AIO-SX upgrade successfully
PASS: install/bootstrap/unlock
Closes-bug: 1995713
Change-Id: Id612f27f20060715b6e771f9bb73a41472967d05
Signed-off-by: Heitor Matsui <HeitorVieira.Matsui@windriver.com>
It is not necessary to allocate a specific IP address
for NFS.
It will use the floating management IP address, so there is
no reason to use the platform-nfs-ip service anymore.
With this change, another IP address is available to
configure an additional Worker node.
( i.e: subnet /29 )
Story: 2010351
Task: 46502
Test plan ( Debian only )
PASS Installed 2+2 system with subnet configuration (/29)
(2controllers, 2 workers)
PASS system host-lock/unlock/swact between controllers
PASS Installed 2+2 system using IPv6 subnet
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/859903
Change-Id: I1e8c85341b8c8e337794d344585deff2904daeb5
Remove the installation of per-package preset installs
since they are centrally managed now by the ISO install
for the following packages:
- sm-api
- sm-common
- sm-eru
- sm
Story: 2009968
Task: 46406
Test Plan
PASS Build package
PASS Build ISO
PASS Check for non-existent preset file in /etc/systemd/system-preset
Depends-On: https://review.opendev.org/c/starlingx/integ/+/853653
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I537a013c77c7603e48cbf92837a4174282d70032
Removed conf files from /etc/pmon.d/
as they are being moved to another location.
This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.
The change is debian-only, since centos support
will be dropped soon.
Centos' pmon conf files remain in /etc/pmon.d/
Test Plan:
PASS - deb doesn't install anything to /etc/pmon.d/
PASS - rpm files unchanged
PASS - AIOSX unlocked-enabled-available
PASS - Standard 2+2 unlocked-enabled-available
Story: 2010211
Task: 46304
Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095
Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ie2f8e73f8664746d0213e98bba17e56d98d93b4f
Created a duplicate install of /etc/pmon.d/*.conf files
to /usr/share/starlingx/pmon.d/
This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.
Test Plan:
PASS: duplicate conf on deb
Story: 2010211
Task: 46114
Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ia3774ac59b1df40aa1726bb182390b4b58812141
sm-watchdog was introduced as a workaround for an NFS hang. Another
clean fix is already provided, but the sm-watchdog was not removed.
Test plan:
[centos] build, install and unlock.
[debian] build, install and unlock.
Story: 2010087
Task: 46007
Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I29fffff4e8982dc504f104f49c6586f7c74527fb
sm-db build-depends on sm-common-libs, which run-depends on mtce. If
sm-db builds prior to mtce, the failure is:
"sm-common-libs : Depends: mtce-pmon but it is not installable"
This is a temporary fix that makes the build order mtce, sm-common,
sm-db to bypass it.
Once it has been resolved in the build system, the fix can be reverted.
Story: 2009101
Task: 44976
Test Plan:
Pass: build-pkgs -a without failure of sm-db.
Signed-off-by: Yue Tao <yue.tao@windriver.com>
Change-Id: I76c16dc87143db9b8a0484d74071afb6e8740745