starlingx/ha - ha - OpenDev: Free Software Needs Free Tools

Commit Graph

Author	SHA1	Message	Date
Zuul	3a0fa03806	Merge "Sysinv-inv depends on DNSMasq"	2024-06-03 19:41:47 +00:00
Fabiano Correa Mercer	094ee57df8	Sysinv-inv depends on DNSMasq During the first unlock after a fresh install, the sysinv-inv starts the WSGIService that uses the FQDN: "controller.internal" For this reason it needs that DNSMasq is ready to resolve the IP address. Test done: AIO-SX fresh install AIO-DX fresh install AIO-DX host-swact Story: 2010722 Task: 50221 Depends-On: https://review.opendev.org/c/starlingx/config/+/920694 Change-Id: If255441f12da370bd48641d7c521aea5f3012af2 Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>	2024-06-03 10:52:27 -03:00
Zuul	bbb077583d	Merge "Update Rook Ceph service names in SM"	2024-05-31 15:42:45 +00:00
gcabral	8ba47d9196	Update Rook Ceph service names in SM This commit updates the name of the services for the new Rook Ceph. drbd-rookmon -> drbd-rook rookmon-fs -> rook-fs rook-mon-exit > rook-mon-exit (no change) Test Plan: PASS: AIO-DX -> Standby controller locked and ceph-rook as storage-backend + controller-fs add ceph-float=<size> + checking if everything is created correctly: lv, drbd and SM services. PASS: Perform host-swact after the above test and confirm primary/secondary DRBD change on 'drbd-ceph'. PASS: AIO-DX -> Standby controller locked + controllerfs-delete ceph + checking if everything is deleted correctly: lv, drbd and SM services Story: 2011117 Task: 50097 Change-Id: Ib896ae271f4e649853af950aebe33111948e639e Co-Authored-By: Robert Church <robert.church@windriver.com> Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>	2024-05-28 11:10:44 +00:00
Zuul	b08d386c89	Merge "Increase action timeout for some SM services"	2024-05-27 18:52:47 +00:00
Manoel Benedito Neto	e3382f7289	Fix ipsec-config service dependency This commit sets a dependency between ipsec-config service and management-ipv4/ipv6 services. The disable action may be performed on ipsec-config service after management-ipv4/ipv6 dependent services are on disabled state. This fix is needed due to the verification step present on monitor function for ipsec-config service. This function checks if the floating IP is present on system ip tables and swanctl configuration and verifies if the conditions are satisfied for active and standby controllers. It is expected that floating IP is present on the active controller system where ipsec-config is on enabled-active state and not present on standby controller system where ipsec-config is on disabled state. The floating IP is added and removed by management-ipv4/ipv6 service per their start and stop actions. In the previous service dependency configuration, this would cause an error during ipsec-config audit-disabled action and 400.001 alarm was present on system. Therefore, this commit fixes this service dependency relation between ipsec-config and management-ip services. Test Plan: PASS: Full build, system install, bootstrap and unlock of a DX system with unlocked enabled available state. No 400.001 alarms present on system. PASS: On a DX system with unlocked enabled available state, perform a host-swact on controller-0. Observe that ipsec-config service changes its state on controllers, from disabled to enabled-active on active controller and from enabled-active to disabled on standby controller. No errors are reported on daemon-ocf.log related to ipsec-config or management-ip services. No 400.001 alarms present on system. PASS: On a DX system with unlocked enabled available state, perform a host-lock and host-unlock on controller-1. Observe that system boots with ipsec-config service on disabled state. No errors are reported on daemon-ocf.log related to ipsec-config or management-ip services. No 400.001 alarms present on system. Story: 2010940 Task: 50196 Change-Id: Idd2f487b4589e1f66d79ea5c4f13c36e67c302be Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>	2024-05-27 11:41:46 -03:00
Andy Ning	9d35cc3248	Increase action timeout for some SM services This change increases the action timeout for some SM services. When IPsec for the mgmt network is enabled, CPU usage goes up 3-4 times during system installation, causing actions for some SM services to time out or fail, which in turn triggers uncontrolled swacts. New timeouts are set for the following services: ceph-mon, sysinv-inv, horizon, rabbit. They are set to 4 times the original values, based on the performance measurement that indicates CPU usage could go up 3-4 times during system installation. With these new values, multiple installation tests are successful without uncontrolled swacts seen. There will be follow-up tunings that potentially decrease these numbers. Test Plan: PASS: DX deployment, verify deployment is successful, both controllers are in unlocked | enabled | available states after deployment completes. PASS: Multi-node system deployment, verify deployment is successful, all hosts are in unlocked | enabled | available states after deployment completes. PASS: Swact controllers multiple times, verify swact is successful and system is stable. PASS: Lock/unlock hosts multiple times with either controller-0 or controller-1 as active controller, verify the host locked and unlocked comes back normally, and system is stable. Story: 2010940 Task: 50187 Change-Id: I0ecd9cc82415b5a232040b6707c1f945c4f16d08 Signed-off-by: Andy Ning <andy.ning@windriver.com>	2024-05-24 13:44:19 -04:00
Li Zhu	09d63cb790	SM management for dcorch-engine-worker service Add dcorch-engine-worker service into SM database for its management in HA. Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/917792 Story: 2011106 Task: 50016 Change-Id: I6cdbf6a754af9339fc7db8aa453a4c49e8277613 Signed-off-by: lzhu1 <li.zhu@windriver.com>	2024-05-15 20:50:53 +00:00
Manoel Benedito Neto	683fd05d50	Add and configure IPsec Config Service This commit adds ipsec-config service to sm-db. This service is responsible to manage swanctl configuration by creating symbolic links between swanctl.conf and different conf files. Test Plan: PASS: Build a new debian iso containing the changes. PASS: Bootstrap, install and unlock a DX system with unlocked enable available status and IPsec enabled. Observe that ipsec-config service data is present on sm-db tables. Story: 2010940 Task: 49998 Depends-On: https://review.opendev.org/c/starlingx/config/+/916841 Change-Id: Ia1544134b7d4d49897153c064b996a1f67b7599b Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>	2024-05-03 12:11:59 -04:00
Zuul	f7fb00619b	Merge "Remove CentOS/OpenSUSE build support"	2024-04-29 12:56:35 +00:00
Scott Little	cb7602eafa	Remove CentOS/OpenSUSE build support StarlingX stopped supporting CentOS builds after release 7.0. This update strips CentOS from our code base. It also removes references to the failed OpenSUSE feature. Story: 2011110 Task: 49952 Change-Id: I1bed2fde10326ecb75b45376efea8480e0f23675 Signed-off-by: Scott Little <scott.little@windriver.com>	2024-04-26 14:10:39 -04:00
Zuul	afbab889d1	Merge "Split IP services in IPv4 and IPv6 for dual-stack support"	2024-04-25 18:38:37 +00:00
Andre Kantek	64278ce1e6	Split IP services in IPv4 and IPv6 for dual-stack support This change splits the IP service for each platform network into ipv4 and ipv6 to support dual-stack. It still supports single-stack (when there is only ipv4 or ipv6). Test Plan: [PASS] install, lock, unlock and swact for the following setups: - AIO-SX (IPv4 and IPv6) - AIO-DX (IPv4 and IPv6) - Standard (IPv4 and IPv6) - DC (SysCtrl=AIO-DX, subcloud=AIO-SX) [PASS] Add dual-stack configuration and validate services operation with lock, unlock and swact: - AIO-SX (IPv4 and IPv6) - AIO-DX (IPv4 and IPv6) - Standard (IPv4 and IPv6) - DC (SysCtrl=AIO-DX, subcloud=AIO-SX), using the admin network Story: 2011027 Task: 49761 Change-Id: Ic6451cae04769409babd2d6507c3677d1cce5617 Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>	2024-04-16 14:15:04 -03:00
junfeng-li	54ceb83726	Fix swact back after legacy upgrade This commit changes the swact precheck logic. The new logic will unblock the swact precheck for host state sync state if the USM endpoint is not present. This change is needed for releases that do not have the USM endpoint present. Test Plan: PASS: swact back to controller 0 after legacy upgrade to 24.09 PASS: swact between controllers in 24.09 Task: 49826 Story: 2010676 Change-Id: I6824f00589057f4d48c8df04425431cb22361e8e Signed-off-by: junfeng-li <junfeng.li@windriver.com>	2024-04-05 18:06:07 +00:00
junfeng-li	fc8b75c6f9	Fix USM endpoint append error This fixes the USM endpoint URL not being appended properly with the .join() function. The way .join() was used only appended the endpoint resource to an empty string. Test Plan: PASS: Run the host-swact Depends-on: https://review.opendev.org/c/starlingx/update/+/911003 Task: 49660 Story: 2010676 Change-Id: Icf47494843d7ea6c9fcd73e9256d9be352d8f76f Signed-off-by: junfeng-li <junfeng.li@windriver.com>	2024-03-04 21:49:08 +00:00
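The `.join()` pitfall described in this fix can be illustrated with a minimal Python sketch. The base URL and resource path are hypothetical placeholders, not the actual USM endpoint, and the "buggy" form is only an assumption about what such a mistake commonly looks like; the commit message does not show the original code.

```python
# Hypothetical values for illustration only; not the real USM endpoint.
BASE_URL = "http://controller.example:1234"
RESOURCE = "/v1/deploy/sync"


def build_url_buggy(base, resource):
    # str.join uses the string it is called on as the separator.
    # Joining only the resource onto an empty string drops the base URL.
    return "".join([resource])


def build_url_fixed(base, resource):
    # Pass every piece to join so the base URL is actually included.
    return "".join([base, resource])


print(build_url_buggy(BASE_URL, RESOURCE))
print(build_url_fixed(BASE_URL, RESOURCE))
```

The buggy variant returns only the resource path, which matches the symptom described above: the endpoint resource appended to an empty string.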
Zuul	95b6310ac1	Merge "Deploy state sync on swact"	2024-02-29 18:51:50 +00:00
junfeng-li	23f48bd545	Deploy state sync on swact This commit ensures that both controllers' deployment states are in sync before host swact during platform upgrade. If the USM deploy is not started, this host swact pre-check always passes. During the pre-swact check, SM calls the USM REST API endpoint to get the controller sync status. If the controllers' deployment states are not in sync, the host swact is stopped. Depends-on: https://review.opendev.org/c/starlingx/update/+/906005 Test Plan: PASS: executed host swact when controllers are in sync PASS: executed host swact when controllers are not in sync Task: 49425 Story: 2010676 Change-Id: I8d262a731583f691fd0d85a33ddebcbb12f549e8 Signed-off-by: junfeng-li <junfeng.li@windriver.com>	2024-02-28 20:07:48 +00:00
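The pre-swact decision described in this commit can be sketched as a small function. This is a simplified illustration, not SM's implementation: in SM the sync status comes from a USM REST API call, whereas here both inputs are plain booleans and the function name is hypothetical.

```python
def swact_precheck_passes(deploy_started, controllers_in_sync):
    """Return True when the host swact may proceed.

    deploy_started: whether a USM deploy is in progress (assumed input).
    controllers_in_sync: sync status that SM would obtain from USM.
    """
    if not deploy_started:
        # No deploy in progress: the pre-check always passes.
        return True
    # During a deploy, the swact proceeds only when both controllers agree.
    return controllers_in_sync


print(swact_precheck_passes(deploy_started=False, controllers_in_sync=False))
print(swact_precheck_passes(deploy_started=True, controllers_in_sync=False))
```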
Eric MacDonald	91fa44188c	Add node locked gate to SM enable for DX systems Service Management (SM) sometimes selects and activates services on a locked controller following a dead office recovery. This update adds a node locked check to SM's enable handler to block enable if present much like the existing goenabled check blocks enable if not present in the same function. The enable gate file is /etc/mtc/tmp/.node_locked on the local host. Maintenance manages the presence or absence of this file based on the node's administrative state. This update also cleans up some extra whitespace in the changed file. Test Plan: PASS: Verify system build. PASS: Verify AIO SX install. PASS: Verify AIO DX install. PASS: Verify Standard DX system install with worker and storage. For Both 'AIO DX' and 'Standard DX with worker and storage': PASS: Verify SM does not activate on a locked DX controller. PASS: ... DOR case PASS: ... Uncontrolled Swact case PASS: Verify Standard DX behavior over DOR with one locked controller while the only unlocked controller does not recover. PASS: Verify behavior after above test case once the only unlocked controller does recover. PASS: Verify lock of the standby controller and its sm logs PASS: Verify manually creating the new Nv locked file on the active controller will cause SM to go disabled and shut down all services on that controller. ... If there is another unlocked controller then verify it takes over as an uncontrolled swact. ... If there is no unlocked standby controller then verify SM remains shutdown until the manually created Nv node locked file is removed. At which point SM proceeds to activate services on that controller again. PASS: Verify SM ignores the node locked flag file for AIO SX systems. PASS: Verify lock/unlock of AIO SX controller. PASS: Verify original reported issue is resolved for AIO DX systems. Regression: PASS: Verify controlled swact with unlocked enabled standby. PASS: Verify uncontrolled swact with unlocked enabled standby. PASS: Verify standby controller lock/unlock soak loop (10). PASS: Verify swact loop soak (10). PASS: Verify no crash or core dumps. PASS: Verify SM logging Closes-Bug: 2051578 Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2 (cherry picked from commit `23d0d8ab2f`)	2024-02-26 22:15:03 +00:00
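The enable gate described in this commit combines two flag-file checks with opposite polarity. A minimal sketch follows, assuming the gate reduces to three booleans; the function and parameter names are hypothetical, and the real check lives in SM's enable handler rather than in Python.

```python
NODE_LOCKED_FILE = "/etc/mtc/tmp/.node_locked"  # path given in the commit message


def enable_allowed(goenabled_present, node_locked_present, is_aio_sx):
    """Decide whether SM may enable services on this host (illustrative only)."""
    if not goenabled_present:
        # Existing gate: enable is blocked while the goenabled flag is absent.
        return False
    if node_locked_present and not is_aio_sx:
        # New gate: enable is blocked while the node locked flag is present.
        # Per the test plan, AIO SX systems ignore this flag.
        return False
    return True


print(enable_allowed(True, True, is_aio_sx=False))
print(enable_allowed(True, True, is_aio_sx=True))
```

On a real host the `node_locked_present` input would correspond to something like `os.path.exists(NODE_LOCKED_FILE)`; it is passed in here so the logic can be exercised without touching the filesystem.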
Zuul	338161f443	Merge "Revert "Add node locked gate to SM enable""	2024-02-23 14:26:53 +00:00
Eric MacDonald	1e62ab86f1	Revert "Add node locked gate to SM enable" This reverts commit `23d0d8ab2f`. Reason for revert: Breaks AIO SX Enable Change-Id: I662b8732e723f4ce5b748ef00a184ae5b8db523c	2024-02-23 14:06:10 +00:00
Zuul	031c2e223d	Merge "Add node locked gate to SM enable"	2024-02-16 16:11:22 +00:00
Zuul	9367d45672	Merge "Avoid potential blocking of heartbeat thread"	2024-02-14 21:35:21 +00:00
Eric MacDonald	23d0d8ab2f	Add node locked gate to SM enable Service Management (SM) sometimes selects and activates services on a locked controller following a dead office recovery. This update adds a node locked check to SM's enable handler to block enable if present much like the existing goenabled check blocks enable if not present in the same function. The enable gate file is /etc/mtc/tmp/.node_locked on the local host. Maintenance manages the presence or absence of this file based on the node's administrative state. This update also cleans up some extra whitespace in the changed file. Test Plan: PASS: Verify system build. PASS: Verify AIO DX install. PASS: Verify Standard DX system install with worker and storage. For Both 'AIO DX' and 'Standard DX with worker and storage': PASS: Verify SM does not activate on a locked controller. PASS: ... DOR case PASS: ... Uncontrolled Swact case PASS: Verify Standard DX behavior over DOR with one locked controller while the only unlocked controller does not recover. PASS: Verify behavior after above test case once the only unlocked controller does recover. PASS: Verify lock of the standby controller and its sm logs PASS: Verify manually creating the new Nv locked file on the active controller will cause SM to go disabled and shut down all services on that controller. ... If there is another unlocked controller then verify it takes over as an uncontrolled swact. ... If there is no unlocked standby controller then verify SM remains shutdown until the manually created Nv node locked file is removed. At which point SM proceeds to activate services on that controller again. Regression: PASS: Verify controlled swact with unlocked enabled standby. PASS: Verify uncontrolled swact with unlocked enabled standby. PASS: Verify standby controller lock/unlock soak loop (10). PASS: Verify swact loop soak (10). PASS: Verify no crash or core dumps. Closes-Bug: 2051578 Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-14 13:01:03 +00:00
Zuul	2fd5ebc6e6	Merge "sm-common: add support for arm64"	2024-01-17 16:02:51 +00:00
Zuul	56b60d15a5	Merge "sm: fix the hardcoded includes for arm64"	2024-01-17 15:52:16 +00:00
Kyale, Eliud	0db57d60be	Add service dependancy haproxy dnsmasq haproxy uses DNS resolution. Add a service dependency to the sm database to ensure that the dnsmasq service is started before haproxy and that dnsmasq is disabled after haproxy is disabled. Test plan: PASS - AIO-SX: iso install PASS - AIO-SX: reboot test PASS - AIO-DX: iso install PASS - AIO-DX: swact test Closes-Bug: #2043506 Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>	2024-01-16 09:32:53 -05:00
Jackie Huang	35d8d23563	sm-common: add support for arm64 Add support for aarch64 in sm_trap_thread_log. Test Plan: PASS: build-pkgs on x86-64 host PASS: build-image on x86-64 host PASS: build-pkgs on arm64 host PASS: build-image on arm64 host PASS: Deploy AIO-SX on x86-64 targets and check sm service PASS: Deploy AIO-SX on arm64 targets and check sm service PASS: Deploy AIO-DX on arm64 targets and check sm service PASS: Deploy std (2+2+2) on arm64 targets and check sm service Story: 2010739 Task: 48017 Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8 Signed-off-by: Jackie Huang <jackie.huang@windriver.com>	2023-11-28 16:26:23 +08:00
Jackie Huang	15a8ffeee0	sm: fix the hardcoded includes for arm64 The includes path in Makefile is hardcoded with x86_64, use dpkg-architecture to check the host arch and replace the hardcoded name. Test Plan: PASS: build-pkgs on x86-64 host PASS: build-image on x86-64 host PASS: build-pkgs on arm64 host PASS: build-image on arm64 host PASS: Deploy AIO-SX on x86-64 targets and check sm service PASS: Deploy AIO-SX on arm64 targets and check sm service PASS: Deploy AIO-DX on arm64 targets and check sm service PASS: Deploy std (2+2+2) on arm64 targets and check sm service Story: 2010739 Task: 48017 Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809 Signed-off-by: Jackie Huang <jackie.huang@windriver.com>	2023-11-28 16:26:23 +08:00
Bin Qian	d91b069daf	Avoid potential blocking of heartbeat thread This avoids waiting for the hbs cluster query when sending the SM alive pulse. When an hbs cluster query or alive pulse is being sent, do not queue the subsequent alive pulse, as the request currently being sent is good enough to update the hbs agent. Also move the function retrieving the sock address from inside the query sending procedure to initialization. This keeps getaddrinfo from indirectly calling malloc, which invokes malloc_atfork, potentially a blocking call. TCs: This could improve behavior in extreme situations only; passed regression. Closes-bug: 2025504 Change-Id: I520b42f0330b670e301279c2e42670d40361adc5 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2023-11-17 21:01:08 +00:00
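The "do not queue a subsequent alive pulse while one is being sent" idea can be sketched as a simple in-flight flag. The class and method names below are hypothetical and the sketch is single-threaded; SM's heartbeat code is C++ and handles real sockets and threads.

```python
class AlivePulseSender:
    """Coalesces alive-pulse requests while a send is already in flight."""

    def __init__(self):
        self.in_flight = False
        self.sent = 0
        self.dropped = 0

    def request_pulse(self):
        # If a request is already being sent, drop this one instead of
        # queueing it: the in-flight request already updates the agent.
        if self.in_flight:
            self.dropped += 1
            return False
        self.in_flight = True
        return True

    def send_complete(self):
        # Called when the in-flight request finishes.
        self.in_flight = False
        self.sent += 1


sender = AlivePulseSender()
sender.request_pulse()   # accepted
sender.request_pulse()   # dropped, nothing is queued behind the first
sender.send_complete()
print(sender.sent, sender.dropped)
```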
Zuul	4b800442ed	Merge "Use FQDN for MGMT network"	2023-11-02 20:22:20 +00:00
Steven Webster	6793f3840f	Use controller-services service group for admin-ip This commit resolves an issue seen when attempting to upgrade / patch from a previous StarlingX release to the current StarlingX development load in which the distributed cloud admin network feature is present. 1. Currently, the admin-services SM service group is created if the system is detected to be a subcloud. This presents a problem during upgrade, because the <N> system has no concept of the admin-services group, while the upgraded <N+1> system does. During upgrade when the system is swacted to the <N+1> system, an alarm will be raised as the <N> system has no admin-services group (designed to be an N+M redundancy model). 2. A potential solution to the above problem is to only provision the admin-services group if an operator has actually configured an admin interface / network after the full upgrade has completed. However, this presents the same sort of problem, as an interface-network association is done on a host-by-host basis, requiring a lock of each host to provision a new admin interface. Since there is no mechanism to provision a new SM service group at runtime, this leads to the same situation of one host being aware of the admin-services group, while the other is not. That is, a user might configure an admin interface-network on host <X>, but the service group would not be present on host <Y>, leading to the same alarm. The solution here is to do away with the new admin-services service group, and leverage the existing controller-services group for the admin-ip service. Test-Plan: - Ensure no alarms when upgrading / patching a subcloud implementing the admin network feature. - Ensure a system utilizing the admin network can become online / in-sync. - Swact between controllers utilizing the admin-ip SM service to ensure the floating IP is correctly assigned to the active controller (using the controller-services service group). Regression: - Install a DC system using the management network. Create the admin interface, network on the subcloud and ensure a user can update the subcloud to use the admin network. Story: 2010319 Task: 47278 Change-Id: Ic36e83622c6ab5d15fd537be69d3314cb675c724 Signed-off-by: Steven Webster <steven.webster@windriver.com>	2023-10-30 01:52:58 -04:00
Fabiano Correa Mercer	c5fb81828a	Use FQDN for MGMT network The management network is used extensively for all internal communication. Since the original use of the network was a private network before it was exposed for external communication in a distributed cloud configuration, it was never designed to be reconfigured. To support MGMT network reconfiguration the idea is to configure the applications to use the hostname/FQDN instead of a static MGMT IP address. In this way the MGMT network can be changed and the services and applications will still work since they are using the hostname/FQDN and the DNS will be responsible for translating it to the current MGMT IP address. The use of FQDN will be applied for all installation modes: AIO-SX, AIO-DX, Standard, AIO-PLUS and DC subclouds. But given the complexities of supporting the multi-host reconfiguration, the MGMT network reconfiguration will focus on support for AIO-SX only. The DNSMASQ service must start as soon as possible to translate the FQDN to an IP address; for this reason dnsmasq will start as soon as the management-ip is ready. Test plan ( Debian only ) - AIO-SX and AIO-DX virtualbox installation IPv4/IPv6 - Standard virtualbox installation IPv6 - DC virtualbox installation IPv4 ( AIO-SX/DX subclouds ) - AIO-SX and AIO-DX installation IPv4/IPv6 - AIO-DX plus installation IPv6 - DC IPv6 and subcloud AIO-SX - AIO-DX host-swact - DC IPv4 virtualbox with subcloud AIO-DX and AIO-DX - AIO-SX to AIO-DX migration - netstat -tupl ( no services are using the MGMT IP address ) - Ran sanity/regression tests Story: 2010722 Task: 48889 Depends-On: https://review.opendev.org/c/starlingx/config/+/886208 Change-Id: If118132410a5a3db4c3a9d0ba029f4d45521574d Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>	2023-10-26 12:01:08 -03:00
Kyale, Eliud	efe4a7a370	IF_STATE_MASK fix for SM_FAILOVER_HEARTBEAT_ALIVE The SM_FAILOVER_IF_STATE_MASK change from 0xF to 0x3F mask was clearing the HEARTBEAT ALIVE flag. SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16 This change restores previous system behavior. Tester performs a cable pull on the oam ports. The expected behavior is an alarm being raised. Instead the standby controller ended up getting rebooted. oam interface testing was simulated by bringing the ip link down for 1 second. For example: sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up ----------------- Before change ----------------- - Heartbeat loss on oam interface resulted in standby controller reboot ----------------- After change: ----------------- - Heartbeat loss on oam interface resulted in alarm raised - Logs indicate the health score of controller-1 drops by 1 point Test plan: PASS - AIO-SX: iso install PASS - AIO-DX: iso install drop oam interface on standby verify standby controller-1 is not rebooted by active controller-0 restore oam interface PASS - AIO-DX: system host-swact . swact back and forth Closes-Bug: 2037579 Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3 Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>	2023-10-02 07:45:36 -04:00
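How a wider state mask can wipe out an unrelated flag bit is easy to see with a few lines of bit arithmetic. The sketch below is an assumption about how such a mask is typically applied (clear the masked bits, then store the new state); only the constant `SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4` and the two mask values come from the commit message, and `set_if_state` is a hypothetical helper.

```python
SM_FAILOVER_HEARTBEAT_ALIVE = 0x1 << 4  # 16, as quoted in the commit message


def set_if_state(flags, new_state, mask):
    # Clear every bit covered by the mask, then store the new state bits.
    return (flags & ~mask) | (new_state & mask)


# Start with the heartbeat-alive flag set plus some interface state in bit 0.
flags = SM_FAILOVER_HEARTBEAT_ALIVE | 0x1

narrow = set_if_state(flags, 0x2, 0x0F)  # mask covers bits 0-3 only
wide = set_if_state(flags, 0x2, 0x3F)    # mask also covers bit 4

print(bool(narrow & SM_FAILOVER_HEARTBEAT_ALIVE))  # True: flag survives
print(bool(wide & SM_FAILOVER_HEARTBEAT_ALIVE))    # False: flag was cleared
```

With the 0x3F mask, bit 4 falls inside the masked range, so any state update that does not explicitly carry the flag forward clears it.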
Matheus Guilhermino	fd75850f12	Deployment Optimizations: SM throttling disable SM throttling is a feature that limits the number of service-enabling processes running in parallel at a time. The SM throttling mechanism is always ON, during and after startup, and applies whenever SM-controlled services transition to enabled status. By default, SM throttling allows a maximum of 2 parallel service-enabling processes at a time, but the throttling size is configurable by means of the field ENABLING_THROTTLE from the CONFIGURATION table in the SM-DB database. Hence, it is possible to disable SM throttling by increasing the throttling size to a sufficiently large value, so that services are enabled with full parallelism. This commit improves system performance by disabling SM throttling for AIO-SX systems, while still keeping the SM throttling mechanism, should it ever be needed as a fallback, for robustness reasons. In order to evaluate the SM throttling feature, and its costs in terms of performance, the throttling size was changed from 2 to 1000 to disable the feature, and a series of tests were conducted to evaluate stability and performance benefits. Test Plan: - Fresh Install and bootstrap (PASS) - Lock/Unlock (PASS) - Restart SM service (PASS) Story: 2010802 Task: 48312 Change-Id: Ie96115293049e9939bc43feb2ad11432dd318323 Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>	2023-09-14 14:31:20 -03:00
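The effect of the throttle size on total enable time can be estimated with a toy simulation. This is only a model under stated assumptions: services are independent, started in list order, and each takes a fixed, made-up duration. It does not read ENABLING_THROTTLE from SM-DB or reflect SM's real scheduler.

```python
import heapq


def total_enable_time(durations, throttle):
    """Time to enable all services with at most `throttle` running at once."""
    running = []  # min-heap of finish times for services currently enabling
    now = 0.0
    for duration in durations:
        if len(running) >= throttle:
            # All slots busy: wait until the earliest running service finishes.
            now = heapq.heappop(running)
        heapq.heappush(running, now + duration)
    return max(running) if running else 0.0


# Four hypothetical services that each take 3 seconds to enable.
durations = [3, 3, 3, 3]
print(total_enable_time(durations, throttle=2))     # 6: two batches of two
print(total_enable_time(durations, throttle=1000))  # 3: all in parallel
```

A throttle of 1000 behaves as "unlimited" for any realistic number of services, which is why raising the value effectively disables the feature while leaving the mechanism in place.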
Luis Eduardo Bonatti	e35510e1cc	Fix /var are not updated according to the patch changes Ostree doesn't manage the /var filesystem. Anything installed there during initial filesystem setup becomes unpatchable. This commit changes the sm-patch.sql deploy path to a place that ostree handles, /usr/share/sm/patches in this case, and symlinks it to /var/lib/sm/patches/sm-patch.sql. Test Plan: PASS: ISO install symlink created PASS: sm-patch.sql installed to /usr/share/sm/patches PASS: PATCH apply and changes applied to /var/lib/sm/patches/sm-patch.sql on stx8 Closes-Bug: 2030890 Change-Id: I07047e5383e8ae9e57687cd1e852c2efc0eb755f Signed-off-by: Luis Eduardo Bonatti <LuizEduardo.Bonatti@windriver.com>	2023-08-10 14:52:02 +00:00
Steven Webster	4a96509146	Disable admin network failover behaviour A requirement for a subcloud's admin network is that its subnet information be able to be updated without host lock / unlock. Accordingly, the service domain interface and admin-ip service in SM must be provisioned / deprovisioned at runtime. In an AIO-DX system this can cause issues in certain circumstances as the disablement / enablement must be done via puppet and can be affected by the ordering a user performs each action as well as the timing of the currently running manifests on each host. This commit disables the failover behaviour for the admin network, as link flapping and heartbeat losses are expected as the service domain interface is provisioned/deprovisioned. Also in this commit is the disablement of heartbeat messages on service domain interface de-provision to prevent log spamming, as well as a couple other minor issues that were found while testing. Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/889872 Test plan: - No uncontrolled swacts while re-configuring admin subnets or reverting to the management subnet (deleting the admin address pool) dozens of times. - Alarms still generated on interface down / heartbeat loss - Switching back and forth between admin network / mgmt network via dcmanager. Story: 2010319 Task: 47707 Change-Id: I761b5b20b6de198ef763b2d3480e6f7cd380f952 Signed-off-by: Steven Webster <steven.webster@windriver.com>	2023-08-01 12:08:56 -04:00
Roger Ferraz	ac8f60b120	starlingx/ha README improvement This story shall update the README file of a few most used StarlingX repos. Test Plan: N/A Story: 2010814 Task: 48377 Change-Id: I34b14275e6b61e1be6d659701be568e208e380e4 Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>	2023-07-19 12:28:24 -03:00
Lucas Borges	42fb99a393	Removing sysinv-conductor dependency from rabbitmq The Sysinv components utilize ZeroMQ for communication among each other. The service dependency between sysinv-conductor and rabbit is removed from Service Manager, improving the time it takes to swact by 20%. Test Plan: PASS: Bootstrap AIO-SX, AIO-DX and Standard. PASS: Lock/unlock/swact in all environments PASS: Bootstrap DC and perform lock/unlock/swact PASS: Add subcloud in DC environment PASS: Test DC orchestration (dcmanager dcorch) PASS: Restart sysinv only PASS: Restart sysinv, then dcorch and dcmanager PASS: After restart sysinv, dcorch and dcmanager run manage/unmanage PASS: After restart sysinv, run subcloud backup PASS: After all restart of services verify the whole system is working Refer to review https://review.opendev.org/c/starlingx/config/+/859571 for sysinv/ZeroMQ change details Closes-bug: 2022083 Signed-off-by: Lucas Borges <lucas.borges@windriver.com> Change-Id: I4014c05c914fc946946b14519f28a85067b06b34	2023-06-02 16:37:36 -03:00
Zuul	fc1936a70f	Merge "Shorten rabbit failure recovery delay"	2023-05-09 18:59:27 +00:00
Bin Qian	a85ffc695e	Shorten rabbit failure recovery delay In rare cases, when the system is running slowly with significant scheduling delay, the rabbit disable action times out continually. As a final resort, sm reboots the impacted controller for recovery after the failure count reaches MAX_TRANSITION_FAILURES. As the rabbit service disable timeout is set to 60 seconds, this results in a significant delay before the reboot for recovery. This change updates MAX_TRANSITION_FAILURES of the rabbit service from 16 to 5 to reduce the delay of recovery from rabbit failure. TCs passed: Install a DX system. Observed service group recovery escalated to reboot after 5 forced rabbit disable failures. Closes-bug: 2016168 Signed-off-by: Bin Qian <bin.qian@windriver.com> Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b	2023-05-09 03:48:48 +00:00
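A rough worked estimate shows why lowering the failure limit matters. The calculation assumes each failed disable attempt consumes the full 60-second timeout and ignores any extra delay between retries, so it is a lower bound for that case rather than a measured figure.

```python
DISABLE_TIMEOUT_SECONDS = 60  # rabbit disable timeout from the commit message


def delay_before_reboot(max_transition_failures, timeout=DISABLE_TIMEOUT_SECONDS):
    # Every attempt is assumed to run until it times out.
    return max_transition_failures * timeout


old_delay = delay_before_reboot(16)
new_delay = delay_before_reboot(5)
print(old_delay, new_delay)  # 960 300
```

Under these assumptions the time spent waiting before the recovery reboot drops from about 16 minutes to about 5 minutes.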
Davlet Panech	e601f7ce3e	Fix github mirroring for this repo Updating the rsa ssh host key based on: https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/ Note: In the future, StarlingX should have a zuul job and secret setup for all repos so we do not need to do this for every repo. Needed to rename the secret, because zuul fails if like-named secrets have different values in different branches of the same repo. Partial-Bug: #2015246 Change-Id: Iedfe334611d14e7e6b5a3b2108501d0b2fdf1e13 Signed-off-by: Davlet Panech <davlet.panech@windriver.com>	2023-04-28 12:38:51 -04:00
Kyale, Eliud	9e2ff82411	Add failover state of peer to heartbeat msg - add failover state to heartbeat message ( 4 bits ) - add logic to survived_state to use peer's failover state to determine whether to exit survived state and enter normal state - throttle peer is normal events with a threshold of 10 used to ensure the peer is normal and stable - change fsm->send_event() log to debug from info log level - a few logging improvements; debug send_event logs - update copyright year 2023 Test plan: PASS - AIO-DX: iso install PASS - AIO-DX: crash the sm as indicated in bug and observe swact to standby PASS - AIO-DX: manual swact PASS - AIO-DX: power off active controller PASS - AIO-SX: install and basic sanity check PASS - AIO-SX: upgrade test to verify sm heartbeat messages changes still function when controllers are running different loads Closes-Bug: 2012519 Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com> Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c	2023-04-14 08:14:07 -04:00
Zuul	002f5a79db	Merge "Update rule of disable & standby dependency"	2023-04-04 15:43:16 +00:00
Bin Qian	c81032a572	Update rule of disable & standby dependency This change updates the dependency check for service disabling and going standby. The 2 specific rules are 1. "service a" has a disable action dependency to "service b", with targeted "service b" state of disabled, disable action of "service a" is considered as "dependency met" only when "service b" is in disabled state, or enabled-standby state. 2. "service a" has a go-standby action dependency to "service b", with targeted "service b" state of disabled, go-standby action of "service a" is considered as "dependency met" only when "service b" is in disabled state, or enabled-standby state. TCs: passed: Perform repeatedly host-swact operations, with adding long delay in xxx-fs ocf-script in disable action, observed that all xxx-fs services are disabled before drbd-xxx services start disabling. Closes-Bug: 2012570 Signed-off-by: Bin Qian <bin.qian@windriver.com> Change-Id: Ie9717d3b2b73dc7d623e1b980b3387c6c4e6d991	2023-03-27 19:39:09 +00:00
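The two rules above can be expressed as a single predicate. This is a sketch, not SM's code: the function name and string state values are hypothetical, and the fallback for any other action or target state (an exact state match) is an assumption, since the commit message only specifies the disable and go-standby cases.

```python
def dependency_met(action, target_state, actual_state):
    """Check whether "service a" may run `action`, given "service b"'s state."""
    if action in ("disable", "go-standby") and target_state == "disabled":
        # Rules 1 and 2: met only when "service b" is disabled
        # or in the enabled-standby state.
        return actual_state in ("disabled", "enabled-standby")
    # Assumed fallback for cases the commit message does not describe.
    return actual_state == target_state


print(dependency_met("disable", "disabled", "enabled-standby"))  # True
print(dependency_met("disable", "disabled", "enabled-active"))   # False
```

In the swact test described above, this corresponds to a drbd-xxx service waiting until its xxx-fs service has actually reached one of the two accepted states before its own disable action starts.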
Fabiano Mercer	3d1d82b0a2	Keep platform-nfs-ip for upgrade process The platform-nfs-ip service is not necessary for fresh installs because it is just an alias for the controller IP. But for old releases like StarlingX rel. 6 or 7 the platform-nfs-ip uses a specific IP. If for some reason an error occurs during the upgrade process, the upgrade will be aborted and the nodes will downgrade to the old release again. At this moment the nodes will try to communicate with the previous platform-nfs-ip configured in /etc/hosts. But if the active controller is using the new release, this IP doesn't exist anymore and the downgrade will fail. For this reason the platform-nfs-ip service will be available just for upgrade operations and will be deprovisioned for fresh installs or at the end of the upgrade process ( upgrade-activate phase ). Test plan PASS Fresh install on AIO-SX Fresh install on AIO-DX PASS Upgrade AIO-DX system from CENTOS Rel 7 to DEBIAN Rel 8 PASS Reboot controller-0 during upgrade of AIO-DX controller-1 was the active one with the new release ( Rel 8 ) controller-0 using old release. reboot controller-0 and check if it could connect to controller-1 using old platform-nfs-ip. PASS Upgrade-abort during AIO-DX upgrade controller-1 was the active controller and already upgraded controller-0 was upgraded but locked. Abort the upgrade and downgrade to old release ( Rel 7 ) Partial-Bug: #2012387 Signed-off-by: Fabiano Mercer <fabiano.correamercer@windriver.com> Change-Id: I704e15fffc6e7efa7b1fea56164a21af02222dd6	2023-03-22 14:53:01 -03:00
Kyale, Eliud	b65eb7b2f6	Add PTHREAD_PRIO_PROTECT to sm mutexes - rename mutexes from generic name '_mutex' - create common util functions for initializing and destroying mutexes - add mutex initialize/finalize functions and run them in sm_main_process_initialize - update copyright info to 2023 Test plan: PASS - AIO-DX: iso install PASS - AIO-DX: verify ha swact PASS - AIO-DX: failover swact test PASS - AIO-DX: run pi_stress (rt-tests) to confirm priority inheritance POSIX attribute is working ------------------------------------------------------------------- sysadmin@localhost:~$ sudo pi_stress --uniprocessor --duration=10s Starting PI Stress Test Number of thread groups: 3 Duration of test run: 10 seconds Number of inversions per group: unlimited Admin thread SCHED_FIFO priority 4 3 groups of 3 threads will be created High thread SCHED_FIFO priority 3 Med thread SCHED_FIFO priority 2 Low thread SCHED_FIFO priority 1 Current Inversions: 992034 Stopping test Total inversion performed: 992038 Test Duration: 0 days, 0 hours, 0 minutes, 11 seconds ------------------------------------------------------------------- PASS - AIO-DX: valgrind helgrind * test for inconsistent lock ordering * test race conditions [ detected outside scope of jira ] * test for deadlocks Task: 47503 Story: 2010609 Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com> Change-Id: Ic77f08ca7c3a687b1cc219ac9cba5711979206e8	2023-03-02 20:49:17 +00:00
Zuul	48ec1759ed	Merge "Add admin network support to SM"	2023-02-16 21:02:00 +00:00
Zuul	32927b3139	Merge "Update debian package versions to use git commits"	2023-02-15 16:10:38 +00:00
Al Bailey	4a424d77cc	Fix zuul pep8 failures related to bugbear Feb 13 released a new version of bugbear that raises new error codes. We are setting an upper limit for bugbear to be from before that version was released. bandit is also reporting a hashlib error. The bandit job is non-voting, but the error is now being suppressed. Story: 2010531 Task: 47313 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: Iffb790c67e658c7a40c697e364cb34f4c4f9ec6c	2023-02-14 16:47:42 +00:00
Steven Webster	db1eea124d	Add admin network support to SM Add SM support for the DC admin network This commit adds SM support for the DC admin network. The admin network is intended to be used between a subcloud and system controller. Because the (existing) management network is so embedded in other parts of the StarlingX system, it is prohibitively hard to re-configure this network after initial installation. The admin network is intended to be isolated from the management network, allowing re-configuration of the network parameters in the case that the physical network between subcloud and system controller has been changed. In the case of admin network usage, the management network still exists but is a private network in the context of a subcloud. This specific commit provides for admin-ip and admin-interface services to be added to the SM database and be recognized in processing similar to the management, cluster-host, oam, etc networks. Since there is a requirement for the admin IP subnet information to be allowed to change at runtime, in-service updating of SM information relating to the admin-ip service (floating IP), as well as unicast heartbeating between peers is also added in this commit. Testing: AIO-SX: - admin-ip service is enabled when the admin network is created. - admin-ip service is not enabled when the admin network is not created. - floating-ip is updated on the admin interface when admin addr-pool information is changed. AIO-DX: - admin-ip service is enabled when the admin network is created. - admin-ip service is not enabled when the admin network is not created. - floating-ip is updated on the active-controller when the admin addr-pool information is changed. - When a peer admin interface is down, an alarm is raised. - When a peer admin IP is not correct (changed), an alarm is raised. - Swact between controllers. 
- Inactive controller admin interface goes down Result: A 400.005 major communication loss fault is generated for the inactive controller entity - Inactive controller admin interface comes back up Result: The fault is cleared - Inactive controller admin IP address is removed/changed Result: Two 400.005 major communication loss faults are generated for both controller entities - Inactive controller admin node IP address is re-applied Result: The faults are cleared - Active admin interface goes down Result: A 400.005 major communication loss fault is generated for the inactive controller entity. A swact is not issued. - Active admin interface comes back up Result: The fault is cleared - Active admin node IP address is removed/changed Result: Two 400.005 major communication loss faults are generated for both controller entities. A swact is not issued. - Active admin floating IP address is removed/changed Result: A 400.001 critical admin-services / admin-ip alarm is raised. A swact occurs. The floating admin IP is applied to the newly active controller. Alarms are cleared. - After the above test, the newly active controller swacts back to the previously active controller. Result: No alarms. The floating IP is applied to the newly active controller. - The cable for the management interface on the active controller is pulled Result: A swact occurs - The cable for the OAM interface on the active controller is pulled Result: A swact occurs - The cable for the Admin interface on the active controller is pulled Result: A swact occurs. 400.005 alarms are raised. - The mgmt, cluster-host, oam interfaces are all brought down/up at the same time. The admin interface is also brought down, but not brought back up. Result: A swact occurs, with multiple controller-services related to the mgmt interface being in degraded state. 
Story: 2010319 Task: 47278 Signed-off-by: Steven Webster <steven.webster@windriver.com> Change-Id: I65df52600f4d5c499dceed32739cab414d36847a	2023-02-14 15:14:28 +00:00
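
The two rules stated in commit c81032a572 ("Update rule of disable & standby dependency") can be sketched in a few lines. This is only an illustration of the rule as the commit message words it, not SM's actual implementation (SM is not written in Python). The names `ServiceState`, `Action`, and `dependency_met` are hypothetical, and the sketch deliberately refuses to answer for any case the commit message does not describe.

```python
from enum import Enum


class ServiceState(Enum):
    # State names taken from the commit messages; the enum itself is hypothetical.
    DISABLED = "disabled"
    ENABLED_STANDBY = "enabled-standby"
    ENABLED_ACTIVE = "enabled-active"


class Action(Enum):
    DISABLE = "disable"
    GO_STANDBY = "go-standby"


# Per the commit message, a targeted state of "disabled" is satisfied when
# "service b" is either disabled or enabled-standby.
_SATISFIES_DISABLED_TARGET = frozenset(
    {ServiceState.DISABLED, ServiceState.ENABLED_STANDBY}
)


def dependency_met(action, targeted_b_state, actual_b_state):
    """Return True if "service a" may perform `action` given "service b"'s state.

    Only the two rules from the commit message are modelled: a disable or
    go-standby dependency whose targeted "service b" state is disabled.
    """
    if targeted_b_state is not ServiceState.DISABLED:
        # The commit message does not describe other targeted states.
        raise ValueError("only a targeted state of 'disabled' is modelled here")
    if action not in (Action.DISABLE, Action.GO_STANDBY):
        raise ValueError("only disable and go-standby actions are modelled here")
    return actual_b_state in _SATISFIES_DISABLED_TARGET


if __name__ == "__main__":
    # Example from the test case: drbd-xxx ("service a") must wait for
    # xxx-fs ("service b") before it starts disabling.
    for state in ServiceState:
        print(state.value, dependency_met(Action.DISABLE, ServiceState.DISABLED, state))
```

In the test case from that commit, drbd-xxx plays "service a" and xxx-fs plays "service b": while xxx-fs is still enabled-active the check returns False, so drbd-xxx does not begin disabling.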


447 Commits All Branches Search