metal

Commit Graph

Author	SHA1	Message	Date
Eric MacDonald	7d8be4bc1f	Add auto-versioning to starlingx/metal mtce packages This update makes use of the PKG_GITREVCOUNT variable to auto-version the mtce packages in this repo. Change-Id: Ifb4da4570e0261bbdcf0d7af79b8add7cfc133ac Story: 2006166 Task: 39822 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-05-21 15:18:43 -04:00
Kristine Bujold	bee31d98c8	Remove wrs-guest-heartbeat SDK Module With the StarlingX move to supporting pure upstream OpenStack, the majority of the SDK Modules are related to functionality no longer supported. The remaining SDK Modules will be moved to StarlingX documentation. Story: 2005275 Task: 30565 Change-Id: Ifc560a6865d045ab3bd93923811aeb5f8ac7f030 Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>	2019-04-17 13:38:18 -04:00
Eric MacDonald	7e8be89143	Make Mtce default to Simplex system type if label is missing This update refactors the daemon_system_type function so that it returns a SIMPLEX system type if it is unable to find and parse the system_mode/system_type from platform.conf. This is needed for Ansible Bootstrap Deployment, where mtcAgent and mtcClient need to run and function as they would in a simplex system before the system type is added to the platform.conf file. Change-Id: Ib0130f3559ee3aa8d8d8203ea59d4896a571944f Story: 2004695 Task: 28714 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-02-04 14:15:40 +00:00
Eric MacDonald	ff8ef3ea8a	Change Mtce token endpoint lookup to be 'platform'. The maintenance token request's response parser looks for the nova compute endpoint, a day-one implementation from when mtce actually managed nova. That has long since changed but this endpoint lookup remained. In the new containerized environment the nova compute endpoint is not always present, and when it is not, mtce fails to get its token. Since mtce needs the token to communicate with sysinv, this update changes the endpoint lookup type to 'platform' to match that of sysinv. Change-Id: I389b64d345e47f7d7bc062671da7c7cc51ac398f Story: 2004695 Task: 29213 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-01-30 12:55:55 -05:00
Eric MacDonald	3a5c578355	Mtce: Add Thresholded Maintenance Enable Recovery support This update stops trying to recover hosts that have failed the Enable sequence after a thresholded number of back-to-back tries. When a host reaches a particular failure mode's max failure threshold, maintenance puts it into an 'unlocked-disabled-failed' state and leaves it that way, with no further recovery action, until it is manually locked and unlocked. The thresholded Enable failure causes are: Configuration Failure (threshold: 2, retry interval: 30 secs); In-Test GoEnabled Failure (threshold: 2, retry interval: 30 secs); Start Host Services Failure (threshold: 2, retry interval: 30 secs); Heartbeat Soak Failure (threshold: 2, retry interval: 10 minutes). This update refactors the old auto recovery for AIO SX into this more generic framework. Story: 2003576 Task: 24905 Test Plan: PASS: Verify AIO DX System Install PASS: Verify AIO SX DOR PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state PASS: Verify AIO SX Main Config Failure handling PASS: Verify AIO SX Main Config Timeout handling PASS: Verify AIO SX Main GoEnabled Failure Handling PASS: Verify AIO SX Main Host Services Failure handling PASS: Verify AIO SX Main Host Services Timeout handling PASS: Verify AIO SX Subf Config Failure handling PASS: Verify AIO SX Subf Config Timeout handling PASS: Verify AIO SX Subf GoEnabled Failure Handling PASS: Verify AIO SX Subf Host Services Failure handling PASS: Verify AIO DX System Install PASS: Verify AIO DX DOR PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested PASS: Verify AIO DX Main First Unlock Failure handling PASS: Verify AIO DX Main Config Failure handling (inactive ctrl) PASS: Verify AIO DX Main one time Config Failure handling PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling. PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry. PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller. PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state. PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process) PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled) PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling. PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade) PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail) PASS: Verify Normal System Install PASS: Verify Compute Enable Configuration Failure handling (wc71-75) PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1) PASS: Verify Compute Enable Start Host Services Failure handling PASS: Verify Compute Enable Heartbeat Soak Failure handling PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling PASS: Verify Inactive Controller Configuration Failure handling PASS: Verify Inactive Controller GoEnabled Failure handling PASS: Verify Inactive Controller Host Services Failure handling PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup PASS: Verify auto recovery threshold number is configurable PASS: Verify auto recovery retry interval is configurable PASS: Verify auto recovery host state and status message Regression: PASS: Verify Swact behavior, over and back PASS: Verify 5 node DOR PASS: Verify 3 host MNFA behavior PASS: Verify in-service heartbeat failure handling PASS: Verify no segfaults during UT Corner Cases: PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever.
- sleep in config script PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart. Change-Id: I7098f16243caef27c5295971ef3c9de5be975755 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-12-12 08:11:36 -05:00
Eric MacDonald	dc531dc815	Fix mtce guest build failure A recent update to stx-metal/mtce-common removed a daemon_config structure member that the stx-nfv/mtce-guest git depends on. This was not detected during UT of the mtce-common change because of a missing build dependency that should force a rebuild of the mtce guest. Delivering the code fix to unblock the community. Will deliver the build dependency change shortly. Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20 Closes-Bug: 1804579 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-11-22 10:26:18 -05:00
Jim Gauld	6a5e10492c	Decouple Guest-server/agent from stx-metal This decouples the build and packaging of guest-server, guest-agent from mtce, by splitting guest component into stx-nfv repo. This leaves existing C++ code, scripts, and resource files untouched, so there is no functional change. Code refactoring is beyond the scope of this update. Makefiles were modified to include devel headers directories /usr/include/mtce-common and /usr/include/mtce-daemon. This ensures there is no contamination with other system headers. The cgts-mtce-common package is renamed and split into: - repo stx-metal: mtce-common, mtce-common-dev - repo stx-metal: mtce - repo stx-nfv: mtce-guest - repo stx-ha: updates package dependencies to mtce-pmon for service-mgmt, sm, and sm-api mtce-common: - contains common and daemon shared source utility code mtce-common-dev: - based on mtce-common, contains devel package required to build mtce-guest and mtce - contains common library archives and headers mtce: - contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon, maintenance, mtclog, pmon, public, rmon mtce-guest: - contains guest component guest-server, guest-agent Story: 2002829 Task: 22748 Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f Signed-off-by: Jim Gauld <james.gauld@windriver.com>	2018-09-18 17:15:08 -04:00
Eric MacDonald	316032b904	Mtce: Improve non-blocking http request dispatch Maintenance is seen to intermittently fail Swact requests early after initial system provisioning, without logging an error reason, only to always succeed later on. The issue is difficult to reproduce so this update adds extra logging to this code path and implements a speculative fix. The event_base_loop call's non-zero return code is never logged. The libevent documentation states that this API will return 1 while the target has not yet provided any data. The theory is that, because the call is local, it normally returns with data even on the first dispatch. However, during early system configuration, when the system is busy, that first dispatch does not complete immediately like it normally does later on. The speculation is that it instead returns a 1, meaning retry, but the existing code path treats that as a failure. This update modifies the code to return a PASS if the command dispatch returns a 1, while the error case of -1 gets enhanced logging and continues to be treated as a failure. Test Plan: PASS: Swact 5 times PASS: Lock/Unlock Host PASS: Large System DOR Related Bug: https://bugs.launchpad.net/starlingx/+bug/1791381 Change-Id: I19b22e07d3224b2e9dd3f3569ecbe9aed7d9402f Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-09-10 19:02:42 +00:00
Eric MacDonald	74c5f89ab4	Mtce: Make Heartbeat Failure Action Configurable The current maintenance heartbeat failure action handling is to Fail and Gracefully Recover the host. This means that maintenance will ensure that a heartbeat failed host is rebooted/reset before it is recovered but will avoid rebooting it a second time if its recovered uptime indicates that it has already rebooted. This update expands that single action handling behavior to support three new actions. In doing so it adds a new configuration service parameter called heartbeat_failure_action. The customer can configure this new parameter with any one of the following 4 actions in order of decreasing impact. fail - Host is failed and gracefully recovered. - Current Network specific alarms continue to be raised/cleared. Note: Prior to this update this was standard system behavior. degrade - Host is only degraded while it is failing heartbeat. - Current Network specific alarms continue to be raised/cleared. - heartbeat degrade reason is cleared as are the alarms when heartbeat responses resume. alarm - The only indication of a heartbeat failure is by alarm. - Same set of alarms as in above action cases - Only in this case no degrade, no failure, no reboot/reset none - Heartbeat is disabled ; no multicast heartbeat message is sent. - All existing heartbeat alarms are cleared. - The heartbeat soak as part of the enable sequence is bypassed. The selected action is a system wide setting. The selected setting also applies to Multi-Node Failure Avoidance. The default action is the legacy action Fail. This update also 1. Removes redundant inservice failure alarm for MNFA case in support of degrade only action. Keeping it would make that alarm handling case unnecessarily complicated. 2. No longer used 'hbs calibration' code is removed (cleanup). 3. Small amount of heartbeat logging cleanup.
Test Plan: PASS: fail: Verify MNFA and recovery PASS: fail: Verify Single Host heartbeat failure and recovery PASS: fail: Verify Single Host heartbeat failure and recovery (from none) PASS: degrade: Verify MNFA and recovery PASS: degrade: Verify Single Host heartbeat failure and recovery PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm) PASS: alarm: Verify MNFA and recovery PASS: alarm: Verify Single Host heartbeat failure and recovery PASS: alarm: Verify Single Host heartbeat failure and recovery (from degrade) PASS: none: Verify heartbeat disable, fail ignore and no recovery PASS: none: Verify Single Host heartbeat ignore and no recovery PASS: none: Verify Single Host heartbeat ignore and no recovery (from fail) PASS: Verify action change behavior from none to alarm with active MNFA PASS: Verify action change behavior from alarm to degrade with active MNFA PASS: Verify action change behavior from degrade to none with active MNFA PASS: Verify action change behavior from none to fail with active MNFA PASS: Verify action change behavior from fail to none with active MNFA PASS: Verify action change behavior from degrade to fail then MNFA timeout PASS: Verify all heartbeat action change customer logs PASS: Verify heartbeat stats clear over action change PASS: Verify LO DOR (several large labs - compute and storage systems) PASS: Verify recovery from failure of active controller PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail) PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail) Depends-On: https://review.openstack.org/601264 Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-09-10 13:03:30 -04:00
Eric MacDonald	82e851d651	Mtce: Make Multi-Node Failure Avoidance Configurable The maintenance system implements a high availability (HA) feature designed to detect the simultaneous heartbeat failure of a group of hosts and avoid failing all those hosts until heartbeat resumes or after a set period of time. This feature is called Multi-Node Failure Avoidance, aka MNFA, and currently has the hosts threshold set to 3 and timeout set to 100 secs. This update implements enhancements to that existing feature by making the 'number-of-hosts threshold' and 'timeout period' customer configurable service parameters. The new service parameters are listed under platform:maintenance and are displayed with the following command > system service-parameter-list mnfa_threshold: This new label and value is added to the puppet managed /etc/mtc.ini and represents the number of hosts that are required to fail heartbeat as a group, within the heartbeat failure window (heartbeat_failure_threshold), after which maintenance activates MNFA Mode. This update changes the default number of failing hosts from 3 to 2 while allowing a configurable range from 2 to 100. mnfa_timeout: This new label and value is added to the puppet managed /etc/mtc.ini. While MNFA mode is active, it will remain active until the number of failing hosts drops below the mnfa_threshold or this timer expires. The MNFA mode deactivates on the first occurrence of either case. Upon deactivation the remaining failed hosts are no longer treated as a failure group but instead are all Gracefully Recovered individually. A value of zero imposes no timeout, making the deactivation criteria solely host based. This update changes the default 100 second timer to 0 (no timeout) while permitting a valid timeout range from 100 to 86400 secs (1 day).
Test Plan: PASS - Verify duplex and 4 compute DOR PASS - Verify default MNFA - 1 inactive controller and 4 computes PASS - Verify default MNFA - 4 computes PASS - Verify default MNFA - 1 active controller and 3 computes and failed host PASS - Verify Single host heartbeat failure handling - fail host PASS - Verify Multi Node failure below mnfa_threshold - fail hosts PASS - Verify MNFA handling with timeout of zero and threshold of 3 PASS - Verify MNFA timeout handling with timeout set at 100 sec PASS - Verify MNFA service parameter listing, default value and mtc.ini PASS - Verify MNFA service parameter change and inservice apply PASS - Verify MNFA timeout service parameter change from value to 0 PASS - Verify MNFA timeout service parameter change from to inrange value PASS - Verify MNFA service parameter out of range change handling PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active) DocImpact Story: 2003576 Task: 24903 Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-08-31 15:35:08 -04:00
Alex Kozyrev	00520ac78c	Moving PMON script for NTP from MTCE to Puppet Introducing the PTP service requires the NTP service to be disabled. Process monitoring of the NTP daemon must be turned off as well. There is no way to start/stop process monitoring from MTCE. Puppet can check NTP status at startup and enable/disable monitoring, so the NTP-related PMON script needs to move from MTCE to Puppet. This is the first step: removing NTP references from MTCE. Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01 Story: 2002935 Task: 24520 Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>	2018-08-09 16:04:57 -04:00
Kam Nasim	5e725a7a0a	Multi-Region: Support shared LDAP service Decouple NSLCD from the open-ldap SM service and manage it by PMOND instead. This is needed because in the Shared LDAP case, we deprovision the open-ldap service on the Secondary Region which renders NSLCD unmanaged. Additionally, we allow the Secondary Region or Sub Clouds to bind anonymously, but still need to support LDAP read operations in these regions such as ldapfinger or lsldap. For this purpose, the ldapscripts runtime library has been modified to allow anonymous binds during LDAP search operations. Change-Id: Ic01a8097e8124348d493c9e0c82fda94700e28e2 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-06-28 15:49:45 -04:00
Dean Troyer	18922761a6	StarlingX open source release updates Signed-off-by: Dean Troyer <dtroyer@gmail.com>	2018-05-31 07:36:43 -07:00
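
Note on commit 3a5c578355 (Thresholded Maintenance Enable Recovery): the behavior described in that message can be sketched roughly as below. This is a minimal illustration in Python, not the mtce C++ implementation. The class, method, and dictionary key names are hypothetical, and the choice to stop recovery as soon as the failure count reaches the threshold is an assumption drawn from the wording "after a thresholded number of back-to-back tries". The threshold and retry interval values are the ones listed in the commit message.

```python
from dataclasses import dataclass, field

# (threshold, retry interval in seconds) per Enable failure cause,
# taken from the commit message. The key names are hypothetical.
ENABLE_FAILURE_LIMITS = {
    "config": (2, 30),
    "goenabled": (2, 30),
    "host_services": (2, 30),
    "heartbeat_soak": (2, 600),  # 10 minutes
}


@dataclass
class HostRecovery:
    """Hypothetical per-host tracker of back-to-back Enable failures."""

    counts: dict = field(default_factory=dict)
    auto_recovery_disabled: bool = False

    def enable_failed(self, cause):
        """Record one failure.

        Returns the retry delay in seconds, or None once auto recovery
        has stopped (assumed: count >= threshold).
        """
        if self.auto_recovery_disabled:
            return None
        threshold, interval = ENABLE_FAILURE_LIMITS[cause]
        self.counts[cause] = self.counts.get(cause, 0) + 1
        if self.counts[cause] >= threshold:
            # Host stays failed until an operator locks and unlocks it.
            self.auto_recovery_disabled = True
            return None
        return interval

    def enable_succeeded(self):
        """A successful Enable breaks the back-to-back failure streak."""
        self.counts.clear()

    def lock_unlock(self):
        """A manual lock/unlock re-arms auto recovery."""
        self.counts.clear()
        self.auto_recovery_disabled = False
```

With these assumptions, a first configuration failure yields a 30 second retry, a second consecutive one stops recovery, and `lock_unlock()` re-arms it.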

13 Commits
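
Note on commit 82e851d651 (configurable Multi-Node Failure Avoidance): the parameter ranges and the activate/deactivate rules in that message can be sketched as follows. This is a hedged Python illustration only. The function names are hypothetical and the real logic lives in the mtce daemons. The ranges, the defaults (threshold 2, timeout 0), and the meaning of a zero timeout are taken from the commit message.

```python
# Valid ranges from the commit message; 0 is a special "no timeout" value.
MNFA_THRESHOLD_RANGE = range(2, 101)     # 2..100 hosts
MNFA_TIMEOUT_RANGE = range(100, 86401)   # 100..86400 secs (1 day)


def validate_mnfa(threshold, timeout):
    """Return True if both service parameter values are acceptable."""
    ok_threshold = threshold in MNFA_THRESHOLD_RANGE
    ok_timeout = timeout == 0 or timeout in MNFA_TIMEOUT_RANGE
    return ok_threshold and ok_timeout


def mnfa_should_activate(failing_hosts, threshold=2):
    """MNFA activates once enough hosts fail heartbeat as a group."""
    return failing_hosts >= threshold


def mnfa_should_deactivate(failing_hosts, elapsed_secs, threshold=2, timeout=0):
    """MNFA ends on the first of: too few failing hosts, or timer expiry.

    A timeout of 0 disables the timer, so only the host count matters.
    """
    if failing_hosts < threshold:
        return True
    return timeout != 0 and elapsed_secs >= timeout
```

For example, `mnfa_should_deactivate(3, 10_000)` stays False under the default zero timeout, while the same call with `timeout=100` returns True.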