The sm component had the 1.0.0 version in its folder name. This
change removes that version and updates centos_pkg_dirs.
Story: 2006623
Task: 36827
Depends-On: https://review.opendev.org/#/c/685128/
Change-Id: I6725d1f961c2a82275da5fabbff8e89a8dd6f245
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
Ensure all services are disabled when the active controller fails
during a failover.
Partial-Bug: 1815969
Change-Id: Ieebcdc7b8a98be98c7d64c02c5934e523cc294e8
Signed-off-by: Bin Qian <bin.qian@windriver.com>
SM receives network interface state changes on controllers,
but it should only log state changes of the network interfaces
that SM uses.
Closes-Bug: 1823531
Change-Id: Iacdeeb8cfbb288b6b5572db606b97c18847950db
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Added a service domain for the cluster host interface and a service for
the cluster host IP in the SM database.
Removed references to the infrastructure interface.
Story: 2004273
Task: 29474
Change-Id: I6223047e9453eba83ea8b4ecf4db739d0f7d7665
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
sm_hw is initialized too late, which causes a few error log messages:
Failed to find thread information.
Failed to audit hardware state of interface (lo), error=FAILED
Change-Id: Ie7f813ff9a7900785e6d2af0ad5a75edc0cbf7c0
Partial-Bug: 1816764
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Add a new domain event, SM_SERVICE_DOMAIN_EVENT_CHANGING_LEADER,
to handle on-demand switching of the service scheduler leader.
Closes-Bug: 1812108
Change-Id: I6796d8efcb1ef0c7fa835ed34028c8e6a2b5dcae
Signed-off-by: Bin Qian <bin.qian@windriver.com>
The h/w subsystem is mistakenly initialized twice. As a result,
interface operational state change events are not passed to
the listener. When an interface's operational state changes,
e.g. a cable is pulled, the system cannot react to it.
Change-Id: I014d25befda536265c9c588a156ce411d01147cf
Closes-Bug: 1812019
Signed-off-by: Bin Qian <bin.qian@windriver.com>
When controller-1 reboots in an AIO-DX/DC setup, the mgmt/infra network will
temporarily go down. This is expected. However, SM could not detect
the interface coming back up when controller-1 rebooted after being
unlocked for the first time.
Add code to reverify the state of down interfaces when a heartbeat
message is received.
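The re-verification idea can be sketched roughly as follows. This is a minimal illustration in Python, not the SM implementation: the function name `reverify_down_interfaces`, the `"up"`/`"down"` strings, and the `probe` callback are all hypothetical.

```python
def reverify_down_interfaces(interfaces, probe):
    """Re-check every interface currently recorded as down.

    interfaces: dict mapping interface name -> "up" or "down" (hypothetical model)
    probe: callable returning the freshly observed state for a name (hypothetical)
    """
    for name, state in interfaces.items():
        if state == "down":
            # Only values are replaced, so iterating over items() is safe.
            interfaces[name] = probe(name)
    return interfaces


def on_heartbeat_received(interfaces, probe):
    # Assumption: any received heartbeat is a cheap trigger to re-check
    # interfaces that were previously seen as down.
    return reverify_down_interfaces(interfaces, probe)
```

Interfaces already recorded as up are left untouched, so the extra work per heartbeat is limited to the interfaces that may have recovered.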
Closes-Bug: 1809315
Change-Id: I02c9b6bf35539df2d36ad6b665b0a5ce8f2a1c75
Signed-off-by: Bin Qian <bin.qian@windriver.com>
If in_transition is false, combining it with '&&' always yields
false, so '&&' should be changed to '||'.
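The effect can be illustrated with a small sketch. It is written in Python, with `and`/`or` standing in for `&&`/`||`. The accumulation loop and both function names are assumptions made for illustration, not the actual SM code.

```python
def in_transition_with_and(flags):
    """Buggy pattern: starting from False, 'and' can never produce True."""
    in_transition = False
    for flag in flags:
        in_transition = in_transition and flag
    return in_transition


def in_transition_with_or(flags):
    """Fixed pattern: the result is True if any flag is True."""
    in_transition = False
    for flag in flags:
        in_transition = in_transition or flag
    return in_transition
```

With the first form the result is false regardless of the inputs, which is why the operator has to change.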
Change-Id: I8c18c052c94ebfdbcbcec215d64a8bceeda34f27
Closes-Bug: #1809412
Use cluster hbs info to determine which controller is to be the survivor when
communication is lost between 2 controllers, with the new rules:
1. If a controller is the only controller connected to storage-0,
it is chosen to be the survivor.
2. A controller that can reach more nodes is chosen to be the survivor.
3. A controller is chosen to be failed if it cannot reach any nodes.
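A minimal sketch of how the three rules above might be evaluated from one controller's point of view, written in Python for illustration. The `ClusterView` type, the function name, the assumption that the rules are applied in the listed order, and the "undecided" result for ties are not taken from the SM source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterView:
    """Hypothetical summary of what one controller can reach."""
    reaches_storage0: bool
    reachable_nodes: int


def decide_local_state(local, peer):
    """Return "survived", "failed", or "undecided" for the local controller."""
    # Rule 1: the only controller connected to storage-0 survives.
    if local.reaches_storage0 != peer.reaches_storage0:
        return "survived" if local.reaches_storage0 else "failed"
    # Rule 2: the controller that reaches more nodes survives.
    if local.reachable_nodes != peer.reachable_nodes:
        if local.reachable_nodes > peer.reachable_nodes:
            return "survived"
        return "failed"
    # Rule 3: a controller that cannot reach any node is failed.
    if local.reachable_nodes == 0:
        return "failed"
    # Assumption: the rules do not cover a tie, so nothing is decided here.
    return "undecided"
```

Each controller would evaluate the same function with the arguments swapped, so the two sides reach complementary decisions whenever rule 1 or rule 2 applies.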
Story: 2003577
Task: 27704
Change-Id: I79659e1a788b865536500fc125fd65ae2f34123d
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This change includes:
1. Adds code to receive cluster info updates from hbsAgent.
2. Adds support for on-demand (asynchronous) hbs cluster info queries.
Depends-On: I7d294d40e84469df6b6a6f6dd490cf3c4557b711
Story: 2003577
Task: 27816
Change-Id: Idb65abc58b4afe9649aba442f0798c24d9fffb10
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This change enables one-way communication via BMC (if configured)
through mtce when 2 controllers lose all communication with each other.
The algorithm is:
When all communication is lost, both the active and standby
controllers verify their interfaces (mgmt, infra, and oam).
If the active controller is healthy, it requests a BMC reset
through mtce against the standby controller.
If the standby controller is healthy, it activates itself and waits
a total of 45 seconds before requesting a BMC reset through mtce
against the active controller.
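The decision part of this algorithm can be sketched as a pure function. This is a rough illustration in Python. The function name, the action tuples, and the choice to do nothing when the local interfaces are unhealthy are assumptions, since the description does not say what an unhealthy controller does.

```python
STANDBY_RESET_DELAY_SECONDS = 45  # total wait stated in the description


def plan_actions(role, interfaces_healthy):
    """Return a list of (action, target, delay_seconds) tuples.

    role: "active" or "standby" (role of the local controller)
    interfaces_healthy: result of verifying mgmt, infra and oam locally
    """
    if not interfaces_healthy:
        # Assumption: an unhealthy controller requests nothing.
        return []
    if role == "active":
        return [("request_bmc_reset", "standby", 0)]
    if role == "standby":
        return [
            ("activate_self", None, 0),
            ("request_bmc_reset", "active", STANDBY_RESET_DELAY_SECONDS),
        ]
    raise ValueError("unknown role: %r" % (role,))
```

The delay on the standby side gives a healthy active controller time to issue its own reset request first, so that the two controllers do not reset each other.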
Changes also include:
1. Add a new initial failover state.
The initial state is the state before the node is enabled.
2. Remove the failover thread.
A worker thread action is used to perform time-consuming operations.
3. Remove the entire failover action table.
Story: 2003577
Task: 24901
Change-Id: I7d294d40e84469df6b6a6f6dd490cf3c4557b711
Signed-off-by: Bin Qian <bin.qian@windriver.com>
The sm-api node-set command listener should not be overridden.
Story: 2003577
Task: 26404
Change-Id: I9a20989bd679744f2598389c71f923aa65a66084
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Introduce a failover FSM to handle communication failure between
controllers.
The failover FSM has 4 states:
Normal: the system is running with full redundancy
Fail Pending: a communication failure has occurred
Failed: the controller is determined to have failed. Its peer will
assume service
Survived: the controller is determined to be the survivor. Its peer has
failed
The controllers are in one of the following possible state pairs:
normal/normal, fail-pending/fail-pending, failed/survived
A failed controller will not resume responsibility before the
system restores its full redundancy (normal/normal).
A survivor will not fail before the system restores its
full redundancy (normal/normal).
A future implementation may allow an administrator to force
a failed controller to become active, to manually recover
(with the possibility of losing data), should the survivor no
longer be capable of providing service.
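The four states and the allowed state pairs listed above can be captured in a short sketch. It is written in Python for illustration only. The enum, the `VALID_PAIRS` set, and `is_valid_pair` are hypothetical names and do not mirror the SM source.

```python
from enum import Enum


class FailoverState(Enum):
    NORMAL = "normal"
    FAIL_PENDING = "fail-pending"
    FAILED = "failed"
    SURVIVED = "survived"


# Unordered pairs of (controller state, peer state) named in the description.
# frozenset({X}) represents a pair in which both controllers are in state X.
VALID_PAIRS = {
    frozenset({FailoverState.NORMAL}),
    frozenset({FailoverState.FAIL_PENDING}),
    frozenset({FailoverState.FAILED, FailoverState.SURVIVED}),
}


def is_valid_pair(state_a, state_b):
    """Return True if the two controllers may be in these states together."""
    return frozenset({state_a, state_b}) in VALID_PAIRS
```

For example, `is_valid_pair(FailoverState.FAILED, FailoverState.SURVIVED)` is true, while two controllers both in `FAILED` are not an allowed pair.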
Story: 2003577
Task: 26404
Change-Id: I51635e9e60b6fb6bad89e06c9f08d3f28e21db82
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Individual services should not fail themselves and trigger a swact when the infra i/f goes down.
SM will collect the overall system health state to schedule the services.
Story: 2003577
Task: 24899
Change-Id: Ifa7453136f34768b99e2bcd741d1065e69ef452e
Signed-off-by: Bin Qian <bin.qian@windriver.com>
Add a condition to the logging so that it logs only when an active
controller failure triggers an uncontrolled swact.
The following changes are made:
1. Move get_controller_state to a new sm_failover_utils.c and rename it
to sm_get_controller_state.
2. Use the above function to ensure logging occurs only when the controller
scheduling state is changing (swact).
Closes-Bug: 1788697
Change-Id: I145b579c2d31e8c9e184894774d3a1c06c9149d7
Signed-off-by: Bin Qian <bin.qian@windriver.com>