Commit Graph

23 Commits

Author SHA1 Message Date
Erich Cordoba c8735e882a Remove version from sm folder
The sm component had the 1.0.0 version in its folder name. This
change removes that version and updates centos_pkg_dirs.

Story: 2006623
Task: 36827

Depends-On: https://review.opendev.org/#/c/685128/
Change-Id: I6725d1f961c2a82275da5fabbff8e89a8dd6f245
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-09-26 14:11:31 -05:00
Zuul cde3747183 Merge "Ensure all services disabled when node is failed" 2019-07-04 14:57:35 +00:00
Zuul 0d7a269bc7 Merge "fix bug when judge if state in transition" 2019-07-02 16:08:04 +00:00
Bin Qian f4be3908c3 Ensure all services disabled when node is failed
Ensure all services are disabled when the active controller fails
during a failover.

Partial-Bug: 1815969

Change-Id: Ieebcdc7b8a98be98c7d64c02c5934e523cc294e8
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-07-02 09:49:52 -04:00
fpxie 98d4cae416 fix bug when judge if state in transition
Change-Id: Ib1f57c588379d63b2f63ce866bed7338dda0bc46
Story: 2006064
Task: 34778
2019-06-28 08:57:44 +00:00
Bin Qian 4b9ace1ef3 Cleanup loggings
SM receives network interface state changes on controllers,
but it should only log state changes of the network interfaces
that are used by SM.

Closes-Bug: 1823531

Change-Id: Iacdeeb8cfbb288b6b5572db606b97c18847950db
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-06-06 11:40:11 -04:00
fpxie 4c42a8b99c fix log param
Change-Id: I61f574eb730e5bb67197a4fef13dfe67762afd98
2019-04-29 14:08:43 +08:00
Teresa Ho 5232bdd8fd SM monitoring for cluster-host-ip service
Added a service domain for the cluster host interface and a service for
the cluster host IP in the SM database.
Removed references to the infrastructure interface.

Story: 2004273
Task: 29474

Change-Id: I6223047e9453eba83ea8b4ecf4db739d0f7d7665
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2019-04-11 07:43:06 -04:00
Bin Qian f86e8160dd Initialize sm_hw earlier
sm_hw is initialized too late, which causes a few error log messages:

Failed to find thread information.
Failed to audit hardware state of interface (lo), error=FAILED


Change-Id: Ie7f813ff9a7900785e6d2af0ad5a75edc0cbf7c0
Partial-Bug: 1816764
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-02-22 16:26:33 +00:00
Zuul beda852ec8 Merge "Fixed host-swact failed" 2019-02-06 22:26:49 +00:00
Zuul eebb879358 Merge "Fix a major logic error" 2019-02-04 23:14:45 +00:00
Bin Qian 1066d26e9e Fixed host-swact failed
Add a new domain event SM_SERVICE_DOMAIN_EVENT_CHANGING_LEADER
to handle on-demand switching of the service scheduler leader.

Closes-Bug: 1812108

Change-Id: I6796d8efcb1ef0c7fa835ed34028c8e6a2b5dcae
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-02-01 09:07:57 -05:00
Bin Qian 0641b4a44e Fix h/w subsystem duplicated initialization
The h/w subsystem is mistakenly initialized twice. This prevents
interface operational state change events from being passed to
the listener. When an interface's operational state changes,
e.g. a cable is pulled, the system cannot react to it.

Change-Id: I014d25befda536265c9c588a156ce411d01147cf
Closes-Bug: 1812019
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-01-28 13:02:32 -05:00
Bin Qian 6c93e74230 Fix AIO-DX/DC no controller active issue
When controller-1 reboots in an AIO-DX/DC setup, the mgmt/infra network
temporarily goes down. This is expected. However, SM could not detect
the interface coming back up the first time controller-1 rebooted
after unlock.

Add code to reverify the state of down interfaces when a heartbeat
message is received.

Closes-Bug: 1809315
Change-Id: I02c9b6bf35539df2d36ad6b665b0a5ce8f2a1c75
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-01-07 14:33:40 -05:00
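The reverification step described in commit 6c93e74230 can be sketched as follows. This is a minimal Python illustration, not the SM implementation (which is not shown in this log): the function name, the state dictionary, and the `probe` callback are all hypothetical.

```python
def reverify_down_interfaces(states, probe):
    """Re-check every interface currently marked "down".

    states: dict mapping interface name -> "up" or "down" (hypothetical model)
    probe:  callable(name) -> bool, True if the interface is up now
            (hypothetical stand-in for a real hardware query)
    Returns the list of interfaces that were found to be up again.
    """
    recovered = []
    for name, state in states.items():
        # Only interfaces believed to be down are probed again.
        if state == "down" and probe(name):
            states[name] = "up"
            recovered.append(name)
    return recovered


def on_heartbeat(states, probe):
    # Assumption: a received heartbeat is the trigger for reverification.
    return reverify_down_interfaces(states, probe)
```

Calling `on_heartbeat` each time a heartbeat arrives means a missed "interface up" event is corrected at the next heartbeat instead of leaving the interface marked down indefinitely.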
Austin 8350ded5fc Fix a major logic error
If in_transition is false, the result of the '&&' expression is always
false, so '&&' should be changed to '||'.

Change-Id: I8c18c052c94ebfdbcbcec215d64a8bceeda34f27
Closes-Bug: #1809412
2018-12-21 16:52:00 +08:00
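A minimal sketch of the kind of error commit 8350ded5fc describes, written in Python, where `and`/`or` play the role of C's `&&`/`||`. The actual SM code is not shown in this log, so the loop, the function names, and the state names below are assumptions for illustration only.

```python
# Hypothetical set of states that count as "in transition".
TRANSITIONAL = {"enabling", "disabling"}


def in_transition_buggy(states):
    in_transition = False
    for state in states:
        # Bug: (False and x) is always False, so the flag can never become True.
        in_transition = in_transition and (state in TRANSITIONAL)
    return in_transition


def in_transition_fixed(states):
    in_transition = False
    for state in states:
        # Fix: (False or x) is x, so one transitional state sets the flag.
        in_transition = in_transition or (state in TRANSITIONAL)
    return in_transition
```

With the flag starting at false, the `and` version returns false for every input, while the `or` version reports a transition as soon as one state qualifies.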
Bin Qian ad8665a1b7 Use hbs cluster info to determine best survivor
Use cluster hbs info to determine which controller is the survivor when
communication is lost between the 2 controllers, with the new rules:

1. If a controller is the only controller connected to storage-0,
it is chosen to be the survivor.
2. A controller that can reach more nodes is chosen to be the survivor.
3. A controller is chosen to be failed if it cannot reach any nodes.

Story: 2003577
Task: 27704

Change-Id: I79659e1a788b865536500fc125fd65ae2f34123d
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-12-11 11:12:16 -05:00
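The three rules in commit ad8665a1b7 can be sketched as below. This is a hedged Python illustration rather than the SM code: the `ControllerView` type, its fields, and the function names are hypothetical, and applying the rules in the listed order is an assumption. The commit message does not say what happens on a tie, so the sketch returns `None` in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControllerView:
    # Hypothetical summary of what one controller sees in the hbs cluster info.
    name: str
    reaches_storage0: bool
    reachable_nodes: int


def is_failed(view: ControllerView) -> bool:
    # Rule 3: a controller that cannot reach any nodes is chosen to be failed.
    return view.reachable_nodes == 0


def choose_survivor(a: ControllerView, b: ControllerView) -> Optional[str]:
    # Rule 1: the only controller connected to storage-0 survives.
    if a.reaches_storage0 != b.reaches_storage0:
        return a.name if a.reaches_storage0 else b.name
    # Rule 2: the controller that can reach more nodes survives.
    if a.reachable_nodes != b.reachable_nodes:
        return a.name if a.reachable_nodes > b.reachable_nodes else b.name
    # Tie: not covered by the commit message, so leave it undecided.
    return None
```

For example, a controller that alone reaches storage-0 is chosen under rule 1 even if its peer can reach more nodes, which follows from checking the rules in the order they are listed.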
Bin Qian 28e293bda5 Retrieve hbs cluster info
This change:
1. adds code to receive cluster info updates from hbsAgent.
2. adds support for on-demand (asynchronous) hbs cluster info queries.

Depends-On: I7d294d40e84469df6b6a6f6dd490cf3c4557b711

Story: 2003577
Task: 27816

Change-Id: Idb65abc58b4afe9649aba442f0798c24d9fffb10
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-11-30 11:59:23 -05:00
Bin Qian 133da10b08 split-brain avoidance improvement
This change enables one-way communication via BMC (if configured)
through mtce when the 2 controllers lose all communication with each
other.
The algorithm is:
when all communication is lost,
both the active and standby controllers verify their interfaces (mgmt,
infra, and oam).
If the active controller is healthy, it requests a BMC reset
through mtce against the standby controller.
If the standby controller is healthy, it activates itself and waits
a total of 45 seconds before requesting a BMC reset through mtce
against the active controller.

Changes also include:
1. add a new initial failover state.
   The initial state is the state before the node is enabled.
2. remove the failover thread.
   Worker thread actions are used to perform time-consuming operations.
3. remove the entire failover action table.

Story: 2003577
Task:  24901
Change-Id: I7d294d40e84469df6b6a6f6dd490cf3c4557b711
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-11-08 20:18:43 +00:00
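The decision each controller makes in the algorithm of commit 133da10b08 can be sketched as a small pure function. This is an illustrative Python sketch under stated assumptions, not the SM code: the function name, the role strings, and the returned action names are hypothetical, and the commit message does not say what an unhealthy controller does, so the sketch returns no action for it.

```python
# From the commit message: the standby waits a total of 45 seconds.
STANDBY_WAIT_SECONDS = 45


def split_brain_action(role, interfaces_healthy):
    """Return (action, delay_seconds) for a controller that has lost all
    communication with its peer.

    role: "active" or "standby"
    interfaces_healthy: result of verifying the mgmt, infra, and oam interfaces
    """
    if role not in ("active", "standby"):
        raise ValueError("unknown role: %r" % (role,))
    if not interfaces_healthy:
        # Assumption: an unhealthy controller requests nothing.
        return ("no_action", 0)
    if role == "active":
        # A healthy active controller requests a BMC reset of the standby.
        return ("request_bmc_reset_of_peer", 0)
    # A healthy standby activates itself, then waits before requesting
    # a BMC reset of the active controller.
    return ("activate_then_request_bmc_reset_of_peer", STANDBY_WAIT_SECONDS)
```

In this sketch the healthy active controller acts immediately, while the healthy standby only requests a reset after its delay has elapsed.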
Bin Qian 1902da1ce9 Remove code incorrectly listen sm-api commands
The sm-api node-set command listener should not be overridden.

Story: 2003577
Task: 26404

Change-Id: I9a20989bd679744f2598389c71f923aa65a66084
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-10-09 13:56:18 -04:00
Bin Qian edc8a56472 Introduce failover FSM
Introduce a failover FSM to handle communication failure between
controllers.
The failover FSM has 4 states:
Normal: the system is running with full redundancy
Fail Pending: a communication failure has occurred
Failed: the controller is determined to have failed. Its peer will
        assume service
Survived: the controller is determined to be the survivor. Its peer has
        failed

The controllers are in one of the following possible state pairs:
normal/normal, fail-pending/fail-pending, failed/survived

A failed controller will not resume responsibility before the
system restores its full redundancy (normal/normal).

A survivor will not fail before the system restores its
full redundancy (normal/normal).

A future implementation may allow an administrator to force
a failed controller to become active, to manually recover
(with the possibility of losing data), should the survivor
no longer be capable of providing service.

Story: 2003577
Task: 26404

Change-Id: I51635e9e60b6fb6bad89e06c9f08d3f28e21db82
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-09-18 08:08:40 -04:00
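The four states and the allowed state pairs from commit edc8a56472 can be modeled as below. This is a small Python sketch for illustration only. The enum, the `VALID_PAIRS` set, and `is_valid_pair` are hypothetical names, and the real FSM in SM is not shown in this log.

```python
from enum import Enum


class FailoverState(Enum):
    NORMAL = "normal"
    FAIL_PENDING = "fail-pending"
    FAILED = "failed"
    SURVIVED = "survived"


# The state pairs listed in the commit message. A frozenset ignores order,
# so failed/survived and survived/failed are treated as the same pair.
VALID_PAIRS = {
    frozenset({FailoverState.NORMAL}),
    frozenset({FailoverState.FAIL_PENDING}),
    frozenset({FailoverState.FAILED, FailoverState.SURVIVED}),
}


def is_valid_pair(local, peer):
    # True if the two controllers are in one of the listed state pairs.
    return frozenset({local, peer}) in VALID_PAIRS
```

Because failed/failed and survived/survived are not listed, a check like `is_valid_pair` rejects them, along with mixed pairs such as normal/failed.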
Bin Qian 68b5ce3835 SM to monitor infra i/f and swact when needed
Individual services should not fail themselves and trigger a swact when the infra i/f goes down.
SM will collect the overall system health state to schedule the services.

Story: 2003577
Task: 24899

Change-Id: Ifa7453136f34768b99e2bcd741d1065e69ef452e
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-09-11 02:28:26 +00:00
Bin Qian 53a055cb3a remove incorrect logging when standby controller failed
Add a condition to the logging so that it logs only when an active controller
failure triggers an uncontrolled swact.
The following changes are made:
1. move get_controller_state to a new sm_failover_utils.c and rename it
   to sm_get_controller_state.
2. use the above function to ensure logging happens only when the controller
   scheduling state is changing (swact).

Closes-Bug: 1788697

Change-Id: I145b579c2d31e8c9e184894774d3a1c06c9149d7
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-08-24 14:20:08 -04:00
Dean Troyer 17c909ec83 StarlingX open source release updates
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2018-05-31 07:36:26 -07:00