starlingx/ha - ha - OpenDev: Free Software Needs Free Tools

Commit Graph

Author	SHA1	Message	Date
Erich Cordoba	c8735e882a	Remove version from sm folder The sm component had the 1.0.0 version in the folder name, this change removes that version and updates the centos_pkg_dirs. Story: 2006623 Task: 36827 Depends-On: https://review.opendev.org/#/c/685128/ Change-Id: I6725d1f961c2a82275da5fabbff8e89a8dd6f245 Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>	2019-09-26 14:11:31 -05:00
Zuul	cde3747183	Merge "Ensure all services disabled when node is failed"	2019-07-04 14:57:35 +00:00
Zuul	0d7a269bc7	Merge "fix bug when judge if state in transition"	2019-07-02 16:08:04 +00:00
Bin Qian	f4be3908c3	Ensure all services disabled when node is failed Ensure all services are disabled when the active controller is failed during a failover. Partial-Bug: 1815969 Change-Id: Ieebcdc7b8a98be98c7d64c02c5934e523cc294e8 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-07-02 09:49:52 -04:00
fpxie	98d4cae416	fix bug when judge if state in transition Change-Id: Ib1f57c588379d63b2f63ce866bed7338dda0bc46 Story: 2006064 Task: 34778	2019-06-28 08:57:44 +00:00
Bin Qian	4b9ace1ef3	Cleanup loggings SM receives network interface state changes on controllers, but it should only log state changes of the network interfaces that are used by SM. Closes-Bug: 1823531 Change-Id: Iacdeeb8cfbb288b6b5572db606b97c18847950db Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-06-06 11:40:11 -04:00
fpxie	4c42a8b99c	fix log param Change-Id: I61f574eb730e5bb67197a4fef13dfe67762afd98	2019-04-29 14:08:43 +08:00
Teresa Ho	5232bdd8fd	SM monitoring for cluster-host-ip service Added service domain for cluster host interface and service for cluster host IP in the SM database. Removed references of infrastructure interface. Story: 2004273 Task: 29474 Change-Id: I6223047e9453eba83ea8b4ecf4db739d0f7d7665 Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2019-04-11 07:43:06 -04:00
Bin Qian	f86e8160dd	Initialize sm_hw earlier The sm_hw is initialized too late, which causes a few error log messages: Failed to find thread information. Failed to audit hardware state of interface (lo), error=FAILED Change-Id: Ie7f813ff9a7900785e6d2af0ad5a75edc0cbf7c0 Partial-Bug: 1816764 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-02-22 16:26:33 +00:00
Zuul	beda852ec8	Merge "Fixed host-swact failed"	2019-02-06 22:26:49 +00:00
Zuul	eebb879358	Merge "Fix a major logic error"	2019-02-04 23:14:45 +00:00
Bin Qian	1066d26e9e	Fixed host-swact failed Adding new domain event SM_SERVICE_DOMAIN_EVENT_CHANGING_LEADER to handle an on demand switching of service scheduler leader. Closes-Bug: 1812108 Change-Id: I6796d8efcb1ef0c7fa835ed34028c8e6a2b5dcae Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-02-01 09:07:57 -05:00
Bin Qian	0641b4a44e	Fix h/w subsystem duplicated initialization The h/w subsystem is mistakenly initialized twice, which prevents interface operational state change events from being passed to the listener. When an interface's operational state changes, e.g. a cable is pulled, the system cannot react to it. Change-Id: I014d25befda536265c9c588a156ce411d01147cf Closes-Bug: 1812019 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-01-28 13:02:32 -05:00
Bin Qian	6c93e74230	Fix AIO-DX/DC no controller active issue When controller-1 reboots in an AIO-DX/DC setup, the mgmt/infra network temporarily goes down. This is expected. However, SM could not detect the interface coming back up when controller-1 rebooted after its first unlock. Add code to reverify the state of down interfaces when a heartbeat message is received. Closes-Bug: 1809315 Change-Id: I02c9b6bf35539df2d36ad6b665b0a5ce8f2a1c75 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-01-07 14:33:40 -05:00
Austin	8350ded5fc	Fix a major logic error If in_transition is false, the result of in_transition is always false with '&&', so '&&' should be changed to '||'. Change-Id: I8c18c052c94ebfdbcbcec215d64a8bceeda34f27 Closes-Bug: #1809412	2018-12-21 16:52:00 +08:00
Bin Qian	ad8665a1b7	Use hbs cluster info to determine best survivor Uses cluster hbs info to determine which controller is to be the survivor when communication is lost between 2 controllers, with the new rules: 1. If a controller is the only controller to connect to storage-0, it is chosen to be the survivor. 2. A controller that can reach more nodes is chosen to be the survivor. 3. A controller is chosen to be failed if it cannot reach any nodes. Story: 2003577 Task: 27704 Change-Id: I79659e1a788b865536500fc125fd65ae2f34123d Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-12-11 11:12:16 -05:00
Bin Qian	28e293bda5	Retrieve hbs cluster info This change includes: 1. adds code to receive cluster info update from hbsAgent. 2. support of ondemand hbs cluster info query (asynchronous). Depends-On: I7d294d40e84469df6b6a6f6dd490cf3c4557b711 Story: 2003577 Task: 27816 Change-Id: Idb65abc58b4afe9649aba442f0798c24d9fffb10 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-11-30 11:59:23 -05:00
Bin Qian	133da10b08	split-brain avoidance improvement This change enables one-way communication via BMC (if configured) through mtce when the 2 controllers have lost all communication with each other. The algorithm is: when all communication is lost, both the active and standby controllers verify their interfaces (mgmt, infra, and oam). If the active controller is healthy, it will request a bmc reset through mtce against the standby controller. If the standby controller is healthy, it will activate itself and wait a total of 45 seconds before requesting a bmc reset through mtce against the active controller. Changes also include: 1. adding a new initial failover state; the initial state is the state before the node is enabled 2. removing the failover thread and using a worker thread action to perform time-consuming operations 3. removing the entire failover action table Story: 2003577 Task: 24901 Change-Id: I7d294d40e84469df6b6a6f6dd490cf3c4557b711 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-11-08 20:18:43 +00:00
Bin Qian	1902da1ce9	Remove code incorrectly listen sm-api commands The sm-api node-set command listener should not be overridden Story: 2003577 Task: 26404 Change-Id: I9a20989bd679744f2598389c71f923aa65a66084 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-10-09 13:56:18 -04:00
Bin Qian	edc8a56472	Introduce failover FSM Introduce failover FSM to handle communication failure between controllers. Failover FSM has 4 states: Normal: the system is running with full redundancy. Fail Pending: a communication failure has occurred. Failed: the controller is determined to have failed; its peer will assume service. Survived: the controller is determined to be the survivor; its peer has failed. The controllers are in one of the following possible state pairs: normal/normal, fail-pending/fail-pending, failed/survived. A failed controller will not resume responsibility before the system restores its full redundancy (normal/normal). A survivor will not fail before the system restores its full redundancy (normal/normal). A future implementation may allow an administrator to force a failed controller to become active, to manually recover (with the possibility of losing data), should the survivor no longer be capable of providing service. Story: 2003577 Task: 26404 Change-Id: I51635e9e60b6fb6bad89e06c9f08d3f28e21db82 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-09-18 08:08:40 -04:00
Bin Qian	68b5ce3835	SM to monitor infra i/f and swact when needed Individual services should not fail themselves and trigger a swact when the infra i/f goes down. SM will collect the overall system health state to schedule the services. Story: 2003577 Task: 24899 Change-Id: Ifa7453136f34768b99e2bcd741d1065e69ef452e Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-09-11 02:28:26 +00:00
Bin Qian	53a055cb3a	remove incorrect logging when standby controller failed Add a condition so that logging occurs only when an active controller failure triggers an uncontrolled swact. The following changes are made: 1. move get_controller_state to a new sm_failover_utils.c and rename it to sm_get_controller_state. 2. use the above function to ensure logging occurs only when the controller scheduling state is changing (swact). Closes-Bug: 1788697 Change-Id: I145b579c2d31e8c9e184894774d3a1c06c9149d7 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2018-08-24 14:20:08 -04:00
Dean Troyer	17c909ec83	StarlingX open source release updates Signed-off-by: Dean Troyer <dtroyer@gmail.com>	2018-05-31 07:36:26 -07:00

23 Commits
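
The three survivor-selection rules listed in commit ad8665a1b7 can be read as a small decision function. The sketch below is an illustration only, not the SM implementation (which is not shown on this page). The names `ControllerView` and `choose_survivor` are hypothetical, the rules are applied in the order the commit message lists them, and returning `None` when the rules do not single out a survivor is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ControllerView:
    """Hypothetical summary of what one controller can reach."""
    name: str
    reaches_storage0: bool  # can this controller connect to storage-0?
    reachable_nodes: int    # how many nodes this controller can reach


def choose_survivor(a: ControllerView, b: ControllerView) -> Optional[str]:
    """Return the name of the surviving controller, or None if undecided."""
    # Rule 1: the only controller connected to storage-0 survives.
    if a.reaches_storage0 != b.reaches_storage0:
        return a.name if a.reaches_storage0 else b.name
    # Rule 2: the controller that can reach more nodes survives.
    if a.reachable_nodes != b.reachable_nodes:
        return a.name if a.reachable_nodes > b.reachable_nodes else b.name
    # Rule 3: a controller that reaches no nodes is failed. Here both
    # reach the same number; if that number is 0 both are failed, and
    # otherwise the commit message gives no tie-break (assumption: None).
    return None


c0 = ControllerView("controller-0", reaches_storage0=True, reachable_nodes=1)
c1 = ControllerView("controller-1", reaches_storage0=False, reachable_nodes=3)
print(choose_survivor(c0, c1))
```

In this reading, storage-0 connectivity outranks the node count, so `controller-0` is selected in the usage example even though `controller-1` reaches more nodes.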