starlingx/ha - OpenDev

Commit Graph

Author	SHA1	Message	Date
Erich Cordoba	44f220a3b8	Remove version from sm-common folder. The sm-common component had the 1.0.0 version in the folder name; this change removes that version and updates centos_pkg_dirs. Story: 2006623 Task: 36828 Change-Id: I0e998a3e2482bc06f3a91f9494a3e5d21faa28e7 Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>	2019-09-26 12:00:43 -05:00
Teresa Ho	5232bdd8fd	SM monitoring for cluster-host-ip service. Added a service domain for the cluster host interface and a service for the cluster host IP in the SM database. Removed references to the infrastructure interface. Story: 2004273 Task: 29474 Change-Id: I6223047e9453eba83ea8b4ecf4db739d0f7d7665 Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2019-04-11 07:43:06 -04:00
Bin Qian	1066d26e9e	Fixed host-swact failure. Adds a new domain event, SM_SERVICE_DOMAIN_EVENT_CHANGING_LEADER, to handle on-demand switching of the service scheduler leader. Closes-Bug: 1812108 Change-Id: I6796d8efcb1ef0c7fa835ed34028c8e6a2b5dcae Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-02-01 09:07:57 -05:00
Bin Qian	133da10b08	split-brain avoidance improvement. This change enables one-way communication via BMC (if configured) through mtce when the 2 controllers lose all communication with each other. The algorithm is: when all communication is lost, both the active and standby controllers verify their interfaces (mgmt, infra, and oam). If the active controller is healthy, it requests a BMC reset through mtce against the standby controller. If the standby controller is healthy, it activates itself and waits a total of 45 seconds before requesting a BMC reset through mtce against the active controller. Changes also include: 1. Adding a new initial failover state; the initial state is the state before the node is enabled. 2. Removing the failover thread; worker thread actions now perform time-consuming operations. 3. Removing the entire failover action table. Story: 2003577 Task: 24901 Change-Id: I7d294d40e84469df6b6a6f6dd490cf3c4557b711 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-11-08 20:18:43 +00:00
Bin Qian	edc8a56472	Introduce failover FSM. Introduce a failover FSM to handle communication failure between controllers. The failover FSM has 4 states: Normal: the system is running with full redundancy; Fail Pending: a communication failure has occurred; Failed: the controller is determined to have failed, and its peer will assume service; Survived: the controller is determined to be the survivor, and its peer has failed. The controllers are in one of the following state pairs: normal/normal, fail-pending/fail-pending, failed/survived. A failed controller will not resume responsibility before the system restores full redundancy (normal/normal). A survivor will not fail before the system restores full redundancy (normal/normal). A future implementation may allow an administrator to force a failed controller to become active, to recover manually (with the possibility of losing data), should the survivor no longer be capable of providing service. Story: 2003577 Task: 26404 Change-Id: I51635e9e60b6fb6bad89e06c9f08d3f28e21db82 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-09-18 08:08:40 -04:00
Dean Troyer	17c909ec83	StarlingX open source release updates Signed-off-by: Dean Troyer <dtroyer@gmail.com>	2018-05-31 07:36:26 -07:00
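
The split-brain handling described in commit 133da10b08 can be pictured with a small decision function. This is only an illustrative Python sketch, not code from the repository: the names `Action` and `plan_on_total_comm_loss` are hypothetical, and the behaviour of a controller whose interfaces are unhealthy is an assumption, since the commit message does not state it. Only the 45-second wait and the active/standby rules come from the commit message.

```python
from dataclasses import dataclass

# From the commit message: the standby waits a total of 45 seconds
# before requesting a BMC reset of the active controller.
STANDBY_WAIT_SECONDS = 45


@dataclass(frozen=True)
class Action:
    """Hypothetical summary of what a controller does after total comm loss."""
    activate_self: bool   # take over as the active controller
    wait_seconds: int     # delay before requesting the peer reset
    reset_peer: bool      # request a BMC reset of the peer through mtce


def plan_on_total_comm_loss(role: str, interfaces_healthy: bool) -> Action:
    """Decide the reaction of one controller when all communication is lost.

    `role` is "active" or "standby"; `interfaces_healthy` stands for the
    result of verifying the mgmt, infra, and oam interfaces.
    """
    if role not in ("active", "standby"):
        raise ValueError(f"unknown role: {role!r}")
    if not interfaces_healthy:
        # Assumption: an unhealthy controller takes no action here.
        return Action(activate_self=False, wait_seconds=0, reset_peer=False)
    if role == "active":
        # A healthy active controller requests a reset of the standby.
        return Action(activate_self=False, wait_seconds=0, reset_peer=True)
    # A healthy standby activates itself, waits, then requests a reset
    # of the active controller.
    return Action(activate_self=True,
                  wait_seconds=STANDBY_WAIT_SECONDS,
                  reset_peer=True)
```

Under these assumptions the delay on the standby side gives a healthy active controller the chance to issue its reset request first, so the two controllers do not reset each other simultaneously.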

6 Commits
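
The failover FSM described in commit edc8a56472 can be sketched as a transition table. This is a minimal Python illustration rather than the project's implementation: the four state names and the three allowed state pairs come from the commit message, while the `Event` names, the `next_state` and `is_valid_pair` helpers, and the Fail Pending to Normal transition on recovery are assumptions made for the example.

```python
from enum import Enum


class FailoverState(Enum):
    NORMAL = "normal"
    FAIL_PENDING = "fail-pending"
    FAILED = "failed"
    SURVIVED = "survived"


class Event(Enum):
    """Hypothetical event names; the commit message does not list events."""
    COMM_LOST = "comm-lost"
    DECLARED_FAILED = "declared-failed"
    DECLARED_SURVIVOR = "declared-survivor"
    REDUNDANCY_RESTORED = "redundancy-restored"


S, E = FailoverState, Event

# Failed and Survived can only leave their state once full redundancy is
# restored, matching the two invariants stated in the commit message.
TRANSITIONS = {
    (S.NORMAL, E.COMM_LOST): S.FAIL_PENDING,
    (S.FAIL_PENDING, E.DECLARED_FAILED): S.FAILED,
    (S.FAIL_PENDING, E.DECLARED_SURVIVOR): S.SURVIVED,
    (S.FAIL_PENDING, E.REDUNDANCY_RESTORED): S.NORMAL,  # assumption
    (S.FAILED, E.REDUNDANCY_RESTORED): S.NORMAL,
    (S.SURVIVED, E.REDUNDANCY_RESTORED): S.NORMAL,
}

# State pairs the two controllers may be in, per the commit message.
VALID_PAIRS = {
    frozenset({S.NORMAL}),               # normal/normal
    frozenset({S.FAIL_PENDING}),         # fail-pending/fail-pending
    frozenset({S.FAILED, S.SURVIVED}),   # failed/survived
}


def next_state(state: FailoverState, event: Event) -> FailoverState:
    """Return the new state, or the unchanged state if the event is ignored."""
    return TRANSITIONS.get((state, event), state)


def is_valid_pair(a: FailoverState, b: FailoverState) -> bool:
    """Check whether two controllers' states form one of the allowed pairs."""
    return frozenset((a, b)) in VALID_PAIRS
```

In this sketch a failed controller ignores every event except restored redundancy, so it cannot resume responsibility early, and a survivor likewise cannot be declared failed until the pair is back to normal/normal.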