6 Commits

Author SHA1 Message Date
Erich Cordoba 44f220a3b8 Remove version from sm-common folder
The sm-common component had the 1.0.0 version in the folder name; this
change removes that version and updates the centos_pkg_dirs.

Story: 2006623
Task: 36828

Change-Id: I0e998a3e2482bc06f3a91f9494a3e5d21faa28e7
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-09-26 12:00:43 -05:00
Teresa Ho 5232bdd8fd SM monitoring for cluster-host-ip service
Added a service domain for the cluster host interface and a service
for the cluster host IP in the SM database.
Removed references to the infrastructure interface.

Story: 2004273
Task: 29474

Change-Id: I6223047e9453eba83ea8b4ecf4db739d0f7d7665
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2019-04-11 07:43:06 -04:00
Bin Qian 1066d26e9e Fix host-swact failure
Add a new domain event, SM_SERVICE_DOMAIN_EVENT_CHANGING_LEADER,
to handle on-demand switching of the service scheduler leader.

Closes-Bug: 1812108

Change-Id: I6796d8efcb1ef0c7fa835ed34028c8e6a2b5dcae
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-02-01 09:07:57 -05:00
Bin Qian 133da10b08 split-brain avoidance improvement
This change enables one-way communication via BMC (if configured)
through mtce when the two controllers lose all communication with
each other.
The algorithm is:
When all communication is lost, both the active and the standby
controller verify their own interfaces (mgmt, infra, and oam).
If the active controller is healthy, it requests a BMC reset of
the standby controller through mtce.
If the standby controller is healthy, it activates itself and
waits a total of 45 seconds before requesting a BMC reset of the
active controller through mtce.
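
The decision described above can be sketched as follows. This is only
an illustration in Python, not the SM code: the names Plan and
plan_on_total_comm_loss are hypothetical, and the behaviour of an
unhealthy controller (no action) is an assumption, since the message
does not describe it.

```python
from dataclasses import dataclass

# Delay taken from the commit message: the standby waits a total of
# 45 seconds before asking mtce to reset the active controller.
STANDBY_RESET_DELAY_SECONDS = 45


@dataclass(frozen=True)
class Plan:
    """What a controller does after all communication is lost (hypothetical type)."""
    activate_self: bool           # standby promotes itself to active
    request_peer_bmc_reset: bool  # ask mtce to reset the peer through its BMC
    delay_seconds: int            # wait before the reset request is sent


def plan_on_total_comm_loss(role: str, interfaces_healthy: bool) -> Plan:
    """Return the plan for a controller whose role is 'active' or 'standby'.

    interfaces_healthy stands for the result of verifying the local
    mgmt, infra, and oam interfaces.
    """
    if role not in ("active", "standby"):
        raise ValueError(f"unknown role: {role!r}")
    if not interfaces_healthy:
        # Assumption: an unhealthy controller takes no action in this sketch.
        return Plan(activate_self=False, request_peer_bmc_reset=False, delay_seconds=0)
    if role == "active":
        # A healthy active controller requests the reset of the standby.
        return Plan(activate_self=False, request_peer_bmc_reset=True, delay_seconds=0)
    # A healthy standby activates itself, then waits before the reset request.
    return Plan(
        activate_self=True,
        request_peer_bmc_reset=True,
        delay_seconds=STANDBY_RESET_DELAY_SECONDS,
    )
```

Because the active controller requests its reset without the 45 second
wait, a healthy active controller acts first when both sides are
healthy, which is what keeps the two from resetting each other.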

Changes also include:
1. Add a new initial failover state.
   The initial state is the state before the node is enabled.
2. Remove the failover thread.
   A worker thread action now performs time-consuming operations.
3. Remove the entire failover action table.

Story: 2003577
Task:  24901
Change-Id: I7d294d40e84469df6b6a6f6dd490cf3c4557b711
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-11-08 20:18:43 +00:00
Bin Qian edc8a56472 Introduce failover FSM
Introduce failover FSM to handle communication failure between
controllers.
Failover FSM has 4 states:
Normal: the system is running with full redundancy
Fail Pending: a communication failure has occurred
Failed: the controller is determined to have failed. Its peer will
        assume service
Survived: the controller is determined to be the survivor. Its peer
        has failed

The controllers are in one of the below possible state pairs:
normal/normal, fail-pending/fail-pending, failed/survived

A failed controller will not resume responsibility before the
system restores its full redundancy (normal/normal)

A survivor will not fail before the system restores its
full redundancy (normal/normal)
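
The states and constraints above can be sketched as a small
transition table. This is an illustration in Python, not the SM
implementation: the names FailoverState, ALLOWED, transition, and
is_valid_pair are hypothetical, and the Fail Pending to Normal
transition (communication recovers before a decision is made) is an
assumption that the message does not state.

```python
from enum import Enum


class FailoverState(Enum):
    NORMAL = "normal"
    FAIL_PENDING = "fail-pending"
    FAILED = "failed"
    SURVIVED = "survived"


S = FailoverState

# (local, peer) state pairs listed in the commit message, in both orders.
VALID_PAIRS = {
    (S.NORMAL, S.NORMAL),
    (S.FAIL_PENDING, S.FAIL_PENDING),
    (S.FAILED, S.SURVIVED),
    (S.SURVIVED, S.FAILED),
}

# Allowed transitions for one controller. FAILED and SURVIVED can only
# return to NORMAL, matching the two constraints in the message.
ALLOWED = {
    S.NORMAL: {S.FAIL_PENDING},
    S.FAIL_PENDING: {S.NORMAL, S.FAILED, S.SURVIVED},  # NORMAL is an assumption
    S.FAILED: {S.NORMAL},
    S.SURVIVED: {S.NORMAL},
}


def is_valid_pair(local: FailoverState, peer: FailoverState) -> bool:
    """True if the two controllers are in one of the listed state pairs."""
    return (local, peer) in VALID_PAIRS


def transition(current: FailoverState, target: FailoverState) -> FailoverState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

For example, transition(S.SURVIVED, S.FAILED) raises ValueError,
because a survivor may not fail until the pair is back to
normal/normal.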

A future implementation may allow an administrator to force
a failed controller to become active, to recover manually
(with the possibility of losing data), should the survivor
no longer be capable of providing service.

Story: 2003577
Task: 26404

Change-Id: I51635e9e60b6fb6bad89e06c9f08d3f28e21db82
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2018-09-18 08:08:40 -04:00
Dean Troyer 17c909ec83 StarlingX open source release updates
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2018-05-31 07:36:26 -07:00