metal

Commit Graph

Author	SHA1	Message	Date
Eric MacDonald	14bb67789e	Add pxeboot network mtcAlive messaging to Maintenance The introduction of the new pxeboot network requires maintenance to verify and report on messaging failures over that network. Towards that, this update introduces periodic mtcAlive messaging between the mtcAgent and mtcClient. Test Plan: PASS: Verify install and provision each system type with a mix of networking modes ; ethernet, bond and vlan - AIO SX, AIO DX, AIO DX plus - Standard System 2+1 - Storage System 2+1+1 PASS: Verify feature with physical on management interface PASS: Verify feature with vlan on management interface PASS: Verify feature with bonded management interface PASS: Verify feature with bonded vlans on management interface PASS: Verify in bonded cases handling with 2, 1 or no slaves found PASS: Verify mgmt-combined or separate cluster-host network PASS: Verify mtcClient pxeboot interface address learning - for worker and storage nodes ; dhcp leases file - for controller nodes before unlock ; dhcp leases file - for controller nodes after unlock ; static from ifcfg - from controller within 10 seconds of process restart PASS: Verify mtcAgent pxeboot interface address learning from dnsmasq.hosts file PASS: Verify pxeboot mtcAlive initiation, handling, loss detection and recovery PASS: Verify success and failure handling of all new pxeboot ip address learning functions ; - dhcp - all system node installs. - dnsmasq.hosts - active controller for all hosts. - interfaces.d - controller's mtcClient pxeboot address. - pxeboot req mtcAlive - mtcAgent mtcAlive request message. PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot' command handling for ethernet, vlan and bond configs. PASS: Verify mtcAlive sequence number monitoring, out-of-sequence detection, handling and logging. PASS: Verify pxeboot rx socket binding and non-blocking attribute PASS: Verify mtcAgent handling stress soaking of sustained incoming 500+ msgs/sec ; batch handling and logging.
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging, failure recovery handling and logging. PASS: Verify pxeboot receiver is not setup on the oam interface on controller-0 first install until after initial config complete. Regression: PASS: Verify mtcAgent/mtcClient online and offline state management PASS: Verify mtcAgent/mtcClient command handling - over management network - over cluster-host network PASS: Verify mtcClient interface chain log for all iface types - bond : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9 - vlan : vlan123 -> enp0s8 - ethernet: enp0s8 PASS: Verify mtcAgent/mtcClient handling and logging including debug logging for standard operations - node install and unlock - node lock and unlock - node reinstall, reboot, reset PASS: Verify graceful recovery handling of heartbeat loss failure. - node reboot - management interface down PASS: Verify systemcontroller and subcloud install with dc-libvirt PASS: Verify no log flooding, coredumps, memory leaks Story: 2010940 Task: 49541 Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-03-28 15:28:27 +00:00
Eric MacDonald	191c0aa6a8	Add a wait time between http request retries Maintenance interfaces with sysinv, sm and the vim using http requests. Request timeout's have an implicit delay between retries. However, command failures or outright connection failures don't. This has only become obvious in mtce's communication with the vim where there appears to be a process startup timing change that leads to the 'vim' not being ready to handle commands before mtcAgent startup starts sending them after a platform services group startup by sm. This update adds a 10 second http retry wait as a configuration option to mtc.conf. The mtcAgent loads this value at startup and uses it in a new HTTP__RETRY_WAIT state of http request work FSM. The number of retries remains unchanged. This update is only forcing a minimum wait time between retries, regardless of cause. Failure path testing was done using Fault Insertion Testing (FIT). Test Plan: PASS: Verify the reported issue is resolved by this update. PASS: Verify http retry config value load on process startup. PASS: Verify updated value is used over a process -sighup. PASS: Verify default value if new mtc.conf config value is not found. PASS: Verify http connection failure http retry handling. PASS: Verify http request timeout failure retry handling. PASS: Verify http request operation failure retry handling. Regression: PASS: Build and install ISO - Standard and AIO DX. PASS: Verify http failures do not fail a lock operation. PASS: Verify host unlock fails if its http done queue shows failures. PASS: Verify host swact. PASS: Verify handling of random and persistent http errors involving the need for retries. Closes-Bug: 2047958 Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-07 20:33:01 +00:00
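The retry wait described in commit 191c0aa6a8 can be pictured as one extra state in a small request state machine. The following is a minimal Python sketch of that idea, not the mtcAgent code (which is C++). The state names, the `run_request` function and its parameters are hypothetical; only the 10 second wait and the unchanged retry count come from the commit message.

```python
import time
from enum import Enum, auto


class State(Enum):
    SEND = auto()
    RETRY_WAIT = auto()   # stands in for the new retry wait state
    DONE = auto()
    FAIL = auto()


def run_request(send, max_retries=3, retry_wait_secs=10, sleep=time.sleep):
    """Drive one request; 'send' returns True on success, False on any failure.

    Every failure, whatever its cause, passes through RETRY_WAIT so that a
    minimum wait is enforced between retries.
    """
    state = State.SEND
    retries = 0
    while state not in (State.DONE, State.FAIL):
        if state is State.SEND:
            if send():
                state = State.DONE
            elif retries < max_retries:
                retries += 1
                state = State.RETRY_WAIT
            else:
                state = State.FAIL
        elif state is State.RETRY_WAIT:
            sleep(retry_wait_secs)   # minimum wait before the next attempt
            state = State.SEND
    return state, retries
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a caller can substitute a function that only records the requested wait.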
Eric MacDonald	79d8644b1e	Add bmc reset delay in the reset progression command handler This update solves two issues involving bmc reset. Issue #1: A race condition can occur if the mtcAgent finds an unlocked-disabled or heartbeat failing node early in its startup sequence, say over a swact or an SM service restart and needs to issue a one-time-reset. If at that point it has not yet established access to the BMC then the one-time-reset request is skipped. Issue #2: When the issue #1 race condition does not occur before BMC access is established the mtcAgent will issue its one-time reset to a node. If this occurs as a result of a crashdump then this one-time reset can interrupt the collection of the vmcore crashdump file. This update solves both of these issues by introducing a bmc reset delay following the detection and in the handling of a failed node that 'may' need to be reset to recover from being network isolated. The delay prevents the crashdump from being interrupted and removes the race condition by giving maintenance more time to establish bmc access required to send the reset command. To handle significantly long bmc reset delay values this update cancels the posted 'in waiting' reset if the target recovers online before the delay expires. It is recommended to use a bmc reset delay that is longer than a typical node reboot time. This is so that in the typical case, where there is no crashdump happening, we don't reset the node late in its almost done recovery. The number of seconds till the pending reset countdown is logged periodically. It can take upwards of 2-3 minutes for a crashdump to complete. To avoid the double reboot, in the typical case, the bmc reset delay is set to 5 minutes which is longer than a typical boot time. This means that if the node recovers online before the delay expires then great, the reset wasn't needed and is cancelled.
However, if the node is truly isolated or the shutdown sequence hangs then although the recovery is delayed a bit to accommodate for the crashdump case, the node is still recovered after the bmc reset delay period. This could lead to a double reboot if the node recovery-to-online time is longer than the bmc reset delay. This update implements this change by adding a new 'reset send wait' phase to the existing reset progression command handler. Some consistency driven logging improvements were also implemented. Test Plan: PASS: Verify failed node crashdump is not interrupted by bmc reset. PASS: Verify bmc is accessible after the bmc reset delay. PASS: Verify handling of a node recovery case where the node does not come back before bmc_reset_delay timeout. PASS: Verify posted reset is cancelled if the node goes online before the bmc reset delay and uptime shows less than 5 mins. PASS: Verify reset is not cancelled if node comes back online without reboot before bmc reset delay and still seeing mtcAlive on one or more links. Handles the cluster-host only heartbeat loss case. The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and without reboot ACKs, with and without bmc PASS: Verify reset delay defaults to 5 minutes PASS: Verify reset delay change over a manual change and sighup PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500 PASS: Verify host-reset when host is already rebooting PASS: Verify host-reboot when host is already rebooting PASS: Verify timing of retries and bmc reset timeout PASS: Verify posted reset throttled log countdown Failure Mode Cases: PASS: Verify recovery handling of failed powered off node PASS: Verify recovery handling of failed node that never comes online PASS: Verify recovery handling when bmc is never accessible PASS: Verify recovery handling cluster-host network heartbeat loss PASS: Verify recovery handling management network heartbeat loss PASS: Verify recovery handling both heartbeat loss PASS: Verify mtcAgent restart handling finding unlocked disabled host Regression: PASS: Verify build and DX system install PASS: Verify lock/unlock (soak 10 loops) PASS: Verify host-reboot PASS: Verify host-reset PASS: Verify host-reinstall PASS: Verify reboot graceful recovery (force and no force) PASS: Verify transient heartbeat failure handling PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks PASS: Verify SM peer reset handling when standby controller is rebooted PASS: Verify logging and issue debug ability Closes-Bug: 2042567 Closes-Bug: 2042571 Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2023-11-02 20:58:00 +00:00
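The posted-reset behavior described in commit 79d8644b1e (wait out a delay, cancel if the node comes back online after a reboot, otherwise send the reset) can be sketched as a small decision function. This is an illustrative Python sketch under stated assumptions, not the mtcAgent implementation: the `PendingReset` class, its field names and the string results are hypothetical. The 5 minute default and the "uptime shows less than 5 mins" cancel condition are taken from the commit message and its test plan.

```python
from dataclasses import dataclass


@dataclass
class PendingReset:
    posted_at: float                    # time the reset was posted, in seconds
    delay_secs: float = 300.0           # 5 minute default from the commit message
    reboot_uptime_secs: float = 300.0   # assumed: uptime below this means "rebooted"

    def seconds_left(self, now: float) -> float:
        """Countdown value that could be logged periodically."""
        return max(0.0, self.posted_at + self.delay_secs - now)

    def decide(self, now: float, online: bool, uptime_secs: float) -> str:
        """Return 'reset', 'cancel' or 'wait' for the posted reset."""
        if self.seconds_left(now) <= 0:
            return "reset"              # delay expired, send the bmc reset
        if online and uptime_secs < self.reboot_uptime_secs:
            return "cancel"             # node rebooted and recovered on its own
        return "wait"                   # keep counting down
```

A node that is online with a long uptime did not reboot, so the sketch keeps the reset posted, which mirrors the cluster-host only heartbeat loss case in the test plan.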
Eric MacDonald	1196056612	Disable Redfish BMC audit and improve reinstall failure handling The Mtce Reinstall Handler can collide with the BMC Redfish audit resulting in reinstall failure. BMC handler's 2 minute connection audit can collide with other BMC commands. The reinstall handler, with 4 bmc command operations is particularly susceptible. Two additional bmc communication improvements are implemented: 1. Add 'retry' handling to all BMC requests in the Maintenance Reinstall Handler FSM to handle transient command failures. Note: There are already retries to all but the power status query and the netboot requests in that handler and retries in other administrative commands that involve bmc requests. 2. Switch BMC power control command management from 'static' to 'learned' lists. Some BMCs don't support both graceful and immediate power commands; Graceful Restart and Force Restart. To remove the possibility of using an unsupported BMC command, this update switches from static to learned power command lists with a log produced if a server is missing command support. Power commands escalate from graceful to immediate in the presence of retries.
Test Cases: PASS: Verify bmc handler redfish audit is disabled PASS: Verify reinstall soak using redfish PASS: Verify reinstall netboot and power status retry handling PASS: Verify all power control commands using redfish PASS: Verify graceful operations are used if available PASS: Verify immediate operations are used for retries Regression: PASS: Verify bmc ping audit success and failure handling PASS: Verify Reset Handling soak (redfish and ipmi) PASS: Verify Power-Off/On Handling soak (redfish and ipmi) PASS: Verify Reinstall Handling soak (redfish and ipmi) PASS: Verify Standard System Install (redfish and ipmi) PASS: Verify AIO DX System Install (redfish and ipmi) PASS: Verify this update as a patch Change-Id: Idb484512ccb1b16e2d0ea9aff4ab7965347b1322 Closes-Bug: 1880578 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-16 15:15:22 +00:00
Eric MacDonald	2fc05673d1	Add SysRq crash dump support for pmon quorum health messaging loss The hostwd process supports failure handling for two pmon quorum failure modes. 1. persistent pmon quorum process failure 2. persistent absence of pmon's quorum health report This update adds a new configuration option and associated implementation required to force a crash dump action for failure mode 2 above. This means that if the Process Monitor itself gets stalled or stops running for 3 (default config) minutes then the hostwd will trigger a SysRq to force a crash dump. Test Plan: PASS: Verify kdump for pmon quorum health report message loss PASS: Verify no kdump when kdump_on_stall is disabled PASS: Verify handling when kdump service is not active PASS: Verify sighup config change detection and handling Regression: PASS: Verify softdog timeout handling and logs PASS: Verify quorum threshold config change and handling PASS: Verify handling with reboot/reset recovery methods disabled PASS: Verify enable reboot_on_err config change handling PASS: Verify reboot/reset actions are ignored while host is locked PASS: Verify pmon failure recovery handling before threshold reached Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1 Partial-Bug: 1894889 Depends-On: https://review.opendev.org/#/c/750806 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-13 12:38:16 -05:00
Eric MacDonald	3a6fec50c1	Reduce Maintenance Host Watchdog timeout for controllers This update makes changes to the maintenance host watchdog and reduces the timeout from 5 to 3 minutes for controllers. This update also decouples the pmon quorum monitoring feature handling from the host watchdog timeout. Both were driven off the same select timer which prevented watchdog timeout value to be independently changed without affecting quorum monitoring. A new config label 'kernwd_update_period_stall_detect' is added and value loaded for hosts that need more rigid process stall detection. This new lower timeout value label is loaded and applied to hosts that run the system controller function. A few logging improvements were made. Test Plan: PASS: Verify pmon quorum failure handling while unlocked. Was and remains at 3 misses, 60 seconds each. PASS: Verify watchdog TO at 12 seconds on controllers. Was 300 secs. PASS: Verify kernel watchdog is not enabled when loaded kernwd_update_period is less than 5 seconds. Was 60 secs. PASS: Verify process logging ; startup, failure, transient PASS: Verify all config values loaded by hostwd process Regression: PASS: Verify watchdog TO at 300 seconds on non-controllers PASS: Verify handling of failed quorum process while locked PASS: Verify handling of failed quorum process while unlocked PASS: Verify handling of transient quorum messaging loss while unlocked PASS: Verify hostwd process patching ; locked and unlocked cases PASS: Verify AIO DX System Install PASS: Verify Standard System Install Note: There is no kernel WD TO log. The log is output to the console. Change-Id: Iad726436e28dfa48a06743aa166318969eb6915d Closes-Bug: #1894889 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-13 07:52:59 -05:00
Hang Li	f48eae8f35	fix spelling error Fixing spelling mistakes in notes helps us understand. Change-Id: Ic9050bd5f0141153f74d357f7405032d6aa1e1f1 Closes-Bug: #1852689	2019-11-15 14:11:52 +08:00
Eric MacDonald	df9343b0cc	Add redfish power/reset/reinstall bmc support to maintenance This update delivers redfish support for Power-On/Off, Reset and Netboot Reinstall handling to maintenance. Test Plan: (Testing Continues) PASS: Verify Redfish Power-Off action handling PASS: Verify Redfish Power-On action handling PASS: Verify Redfish Reset action handling PASS: Verify compute Redfish Reinstall action handling from controller-0 PASS: Verify compute Redfish Reinstall action handling from controller-1 PASS: Verify Redfish Power-Off Action failure handling PASS: Verify Redfish Power-On action failure handling PASS: Verify Redfish Reset action failure handling PASS: Verify Redfish Re-Install action failure handling PASS: verify Reset progression cycle does not leak memory. PASS: Verify bmc_handler failure handling does not leak memory. PASS: Verify Inservice BMC access (ping) failure and recovery handling. PASS: Verify BMC access failure alarm handling PASS: Verify BMC provisioning and deprovisioning soak (redfish - wolfpass) PASS: Verify BMC provisioning and deprovisioning does not leak memory. PASS: Verify BMC provisioning handling with bad ip and/or bad username PASS: Verify BMC reprovisioning to same protocol PASS: Verify BMC reprovisioning from ipmi host to redfish host PASS: Verify BMC reprovisioning from redfish host to ipmi host PASS: Verify mixed protocol support in same lab PASS: Verify mixed server support in same lab PASS: Verify Large System Install with BMCs provisioned (wp8-12) PASS: Verify bmc access method (learn,ipmi,redfish) learned from mtc.init PASS: Verify Swact with BMCs provisioned. PASS: Verify no segfaults. PASS: Verify AIO System Install in lab that supports redfish (WC3-6, WP8-12, Dell 720 3-7) PASS: Verify AIO Simplex Install with Redfish Support (SM1, SM3) PASS: Verify AIO Duplex Install with Redfish Support (SM 5-6, Dell 720 1-2 Useability: PASS: Verify handling of reprovisioning BMC between hosts that support different protocols. 
PASS: Verify handling of reprovisioning ip address to host that leads to a different protocol select. PASS: Verify manual relearn handling to recover from errors that result from the above case. PASS: Verify host BMC deprovisioning handling and cleanup. PASS: Verify sensor monitoring. PASS: Verify fault insertion for both protocols and action handling. PASS: Verify protocol select handover. PASS: Verify hwmond sticks with a selected protocol once a sensor model has been created using that protocol. PASS: Verify handling of missing bmc_access_method configuration select. PASS: Verify inservice bmc_access_method service parameter modification handling. Regression: PASS: Verify redfish BMC info query logging. PASS: Verify sensor monitoring and alarming still works. PASS: Verify all power/reset/netboot commands for IPMI PASS: Verify reprovisioning soak of Wolfpass servers PASS: Verify reprovisioning soak of SM servers Depends-on: https://review.opendev.org/#/c/679178/ Change-Id: I984057e04d7426e37d675cf4d334a4e35419f2e8 Story: 2005861 Task: 35826 Task: 36606 Task: 36467 Task: 36456 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-09-26 15:59:35 -04:00
Eric MacDonald	a0ab8947ab	Remove references to ceilometer in maintenance Maintenance no longer has any plan to interface with ceilometer so this update removes all such references. In addition it removes 3 obsoleted files that also make reference to ceilometer. Change-Id: Iae0738946ff241acde44720024d25f8c38f65433 Story:2004764 Task:30666 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-04-30 14:28:12 -04:00
Teresa Ho	8e51a1660a	Refactor infrastructure network in mtce code Updated to read the host cluster-host parameter in /etc/hosts file. Replaced references of infra network with cluster-host network Story: 2004273 Task: 29473 Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2019-04-18 09:32:41 -04:00
Eric MacDonald	f55ef546a7	Remove Resource Monitor ; aka rmon, from the load All rmon resource monitoring has been moved to collectd. This update removes rmon from mtce and the load. Story: 2002823 Task: 30045 Test Plan: PASS: Build and install a standard system. PASS: Inspect mtce rpm list PASS: Inspect logs PASS: Check pmon.d Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-03-19 16:12:38 -04:00
Alex Kozyrev	506ef3fd7f	MTCE: reading BMC passwords from Barbican secret storage. Use Openstack Barbican API to retrieve BMC passwords stored by SysInv. See SysInv commit for details on how to write password to Barbican. MTCE is going to find corresponding secret by host uuid and retrieve secret payload associated with it. mtcSecretApi_get is used to find secret reference, based on a hostname. mtcSecretApi_read is used to read a password using the reference found in the previous step. Also, did a little cleanup and removed old unused token handling code. Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100 Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082 Story: 2003108 Task: 27700 Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>	2019-02-14 09:04:46 -05:00
Eric MacDonald	7941ee5bbb	Add new Link Monitor (lmond) daemon to Mtce This update introduces a new Link Monitor daemon to the Mtce flock of daemons and disables rmon's interface monitoring. This new daemon parses the platform.conf file and using the interface names assigned to each monitored network (mgmt, infra and oam) queries the kernel for their physical, bonded and vlan interface names and then registers to listen for netlink events. All link/interface state change (netlink) events that correspond to any of the interfaces or links associated with the monitored networks are tracked by this new daemon. This new daemon then also implements an http listener for localhost initiated GET requests targeted to /mtce/lmond on port 2122 and responds with a json link_info string that contains a summary of monitored networks, links and their current Up/Down status. lmond behavioral summary: 1. learn interface/port model, 2. load initial link status for learned links, 3. listen for link status change events 4. provide link status info to http GET Query requests. Another update to stx-integ implements the collectd interface plugin that periodically issues the Link Status GET requests for the purpose of alarming port and interface Down conditions, clearing alarms on Up state changes, and storing sample data that represents the percentage of active links for each monitored network.
Test Plan: PASS: Verify lmond process startup PASS: Verify lmond logging and log rotation PASS: Verify lmond process monitoring by pmon PASS: Verify lmond interface learning on process startup PASS: Verify lmond port learning on process startup PASS: Verify lmond handling of vlan and bond interface types PASS: Verify lmond http link info GET Query handling PASS: Verify lmond has no memory leak during normal and eventful operation Change-Id: I58915644e60f31e3a12c3b451399c4f76ec2ea37 Story: 2002823 Task: 28635 Depends-On: Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-02-01 14:57:40 -05:00
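Commit 7941ee5bbb says a collectd plugin polls lmond and stores the percentage of active links per monitored network. The Python sketch below shows how such a percentage might be derived from a link summary. The JSON layout used here (`networks`, `name`, `links`, `state`) is purely an assumption for illustration, since the commit message does not give the link_info schema. Only the URL path and port come from the commit message.

```python
import json

# Path and port as stated in the commit message; the request itself is not
# issued in this sketch.
LMOND_URL = "http://localhost:2122/mtce/lmond"


def percent_active(link_info_json: str) -> dict:
    """Map each network name to the percentage of its links that are Up.

    Assumes a hypothetical layout:
    {"networks": [{"name": ..., "links": [{"name": ..., "state": "Up"}]}]}
    """
    info = json.loads(link_info_json)
    result = {}
    for net in info["networks"]:
        links = net["links"]
        up = sum(1 for link in links if link["state"] == "Up")
        # A network with no learned links is reported as 0 percent active.
        result[net["name"]] = 100.0 * up / len(links) if links else 0.0
    return result
```

A real plugin would fetch the string with an HTTP GET against `LMOND_URL` and would need to follow whatever schema lmond actually emits.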
Eric MacDonald	3a5c578355	Mtce: Add Thresholded Maintenance Enable Recovery support This update stops trying to recover hosts that have failed the Enable sequence after a thresholded number of back-to-back tries. A host that has reached a particular failure modes' max failure threshold then maintenance puts it into a 'unlocked-disabled-failed' state and left that way with no further recovery action until it is manually locked and unlocked. The thresholded Enable failure causes are Configuration Failure ....... threshold:2 retry interval:30 secs In-Test GoEnabled Failure ... threshold:2 retry interval:30 sec Start Host Services Failure . threshold:2 retry interval:30 sec Heartbeat Soak Failure ...... threshold:2 retry interval:10 minute This update refactors the old auto recovery for AIO SX into this more generic framework. Story: 2003576 Task: 24905 Test Plan: PASS: Verify AIO DX System Install PASS: Verify AIO SX DOR PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state PASS: Verify AIO SX Main Config Failure handling PASS: Verify AIO SX Main Config Timeout handling PASS: Verify AIO SX Main GoEnabled Failure Handling PASS; Verify AIO SX Main Host Services Failure handling PASS; Verify AIO SX Main Host Services Timeout handling PASS; Verify AIO SX Subf Config Failure handling PASS: Verify AIO SX Subf Config Timeout handling PASS: Verify AIO SX Subf GoEnabled Failure Handling PASS: Verify AIO SX Subf Host Services Failure handling PASS: Verify AIO DX System Install PASS: Verify AIO DX DOR PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested PASS: Verify AIO DX Main First Unlock Failure handling PASS: Verify AIO DX Main Config Failure handling (inactive ctrl) PASS: Verify AIO DX Main one time Config Failure handling PASS: Verify AIO DX Main one time GoEnabled Failure handling. 
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling. PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry. PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller. PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state. PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process) PASS: Verify AIo DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled) PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling. PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade) PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail) PASS: Verify Normal System Install PASS: Verify Compute Enable Configuration Failure handling (wc71-75) PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1) PASS: Verify Compute Enable Start Host Services Failure handling PASS: Verify Compute Enable Heartbeat Soak Failure handling PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling PASS: Verify Inactive Controller Configuration Failure handling PASS; Verify Inactive Controller GoEnabled Failure handling PASS; Verify Inactive Controller Host Services Failure handling PASS; Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup PASS: Verify auto recovery threshold number is configurable PASS: Verify auto recovery retry interval is configurable PASS: Verify auto recovery host state and status message Regression: PASS: Verify Swact behavior, over and back PASS: Verify 5 node DOR PASS: Verify 3 host MNFA behavior PASS: verify in-service heartbeat failure handling PASS: verify no segfaults during UT Corner Cases: PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. 
- sleep in config script PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart. Change-Id: I7098f16243caef27c5295971ef3c9de5be975755 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-12-12 08:11:36 -05:00
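The thresholded enable recovery in commit 3a5c578355 amounts to a per-cause failure counter with a retry interval and a latch that only a manual lock and unlock clears. The following Python sketch illustrates that bookkeeping. The class, method and key names are hypothetical, and treating the threshold as "stop once this many back-to-back failures have occurred" is an assumption. The thresholds and retry intervals are the ones listed in the commit message.

```python
# cause -> (threshold, retry interval in seconds), values from the commit message
THRESHOLDS = {
    "config": (2, 30),
    "goenabled": (2, 30),
    "host_services": (2, 30),
    "heartbeat_soak": (2, 600),
}


class AutoRecovery:
    def __init__(self):
        self.counts = {cause: 0 for cause in THRESHOLDS}
        self.disabled = False   # True models 'unlocked-disabled-failed'

    def on_enable_failure(self, cause):
        """Return seconds to wait before retrying, or None if recovery stops."""
        threshold, interval = THRESHOLDS[cause]
        self.counts[cause] += 1
        if self.counts[cause] >= threshold:
            self.disabled = True
            return None
        return interval

    def on_enable_success(self):
        # Only back-to-back failures count, so a success clears the counters.
        self.counts = {cause: 0 for cause in THRESHOLDS}

    def on_lock_unlock(self):
        # A manual lock and unlock re-arms auto recovery.
        self.counts = {cause: 0 for cause in THRESHOLDS}
        self.disabled = False
```

Keeping the table as data matches the test plan items that verify the threshold number and retry interval are configurable.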
Eric MacDonald	dc531dc815	Fix mtce guest build failure A recent update to stx-metal/mtce-common removed a daemon_config structure member that the stx-nfv/mtce-guest git depends on. This was not detected during UT of the mtc-common change because of a missing build dependency that should force a rebuild of the mtce guest. Delivering the code fix to unblock the community. Will deliver the build dependency change shortly. Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20 Closes-Bug: 1804579 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-11-22 10:26:18 -05:00
Eric MacDonald	0b922227ac	Implement Active-Active Heartbeat as HA Improvement This update introduces mtce changes to support Active-Active Heartbeating. The purpose of Active-Active Heartbeating is to help avoid Split-Brain. Active-Active heartbeating has each controller maintain a 5 second heartbeat response history cache of each network for all monitored hosts as well as the on-going health of storage-0 if provisioned and enabled. This is referred to as the 'heartbeat cluster history'. Each controller then includes its cluster history in each heartbeat pulse request message. The hbsClient, now modified to handle heartbeat from both controllers, saves each controller's heartbeat cluster history in a local cache and criss-crosses the data in its pulse responses. So when the hbsClient receives a pulse request from controller-0 it saves its reported history and then replaces that history information in its response to controller-0 with what it saved from controller-1's last pulse request ; i.e. its view of the system. Controller-0, receiving a host's pulse response, saves its peer's heartbeat cluster history so that it has a summary of heartbeat cluster history for the last 5 seconds for each monitored network of every monitored host in the system from both controllers' perspectives. Same for controller-1 with controller-0's history. The hbsAgent is then further enhanced to support a query request for this information. So now SM, when it needs to make a decision to avoid Split-Brain or otherwise, can query either controller for its heartbeat cluster history and get the last 5 second summary view of heartbeat (network) responsiveness from both controllers' perspectives to help decide which controller to make active. This involved removing the hbsAgent process from SM control and monitor and adding a new hbsAgent LSB init script for process launch, service file to run the init script and pmon config file for hbsAgent process monitoring.
With hbsAgent now running on both controllers, changes to maintenance were required to send inventory to hbsAgent on both controllers, listen for hbsAgent event messages over the management interface and inform both hbsAgents which controller is active. The hbsAgent running on the inactive controller - does not send heartbeat events to maintenance - does not raise or clear alarms or produce customer logs Test Plan: Feature: PASS: Verify hbsAgent runs on both controllers PASS: Verify hbsAgent as pmon monitored process (not SM) PASS: Verify system install and cluster collection in all system types (10+) PASS: Verify active controller hbsAgent detects and handles heartbeat loss PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss PASS: Verify heartbeat cluster history collection functions properly. PASS: Verify storage-0 state tracking in cluster info. PASS: Verify storage-0 not responding handling PASS: Verify heartbeat response is sent back to only the requesting controller. PASS: Verify heartbeat history is correct from each controller PASS: Verify MNFA from active controller after install to controller-0 PASS: Verify MNFA from active controller after swact to controller-1 PASS: Verify MNFA for 80%+ of the hosts in the storage system PASS: Verify SM cluster query operation and content from both controllers PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms Logging: PASS: Verify cluster info logs. PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value PASS: Verify design logging from both controllers in heartbeat loss case PASS: Verify design logging from both controllers in MNFA case PASS: Verify clog logs cluster info vault status and updates for controllers PASS: Verify clog1 logs full cluster state change for all hosts PASS: Verify clog2 logs cluster info save/append logs for controllers PASS: Verify clog3 memory dumps a cluster history PASS: Verify USR2 forces heartbeat and cluster info log dump PASS: Verify hourly heartbeat and cluster info log dump PASS: Verify loss events force heartbeat and cluster info log dump Regression: PASS: Verify Large System DOR PASS: Verify pmond regression test that now includes hbsAgent PASS: Verify Lock/Unlock of inactive controller (x3) PASS: Verify Swact behavior (x10) PASS: Verify compute Lock/Unlock PASS: Verify storage-0 Lock/Unlock PASS: Verify compute Host Failure and Graceful Recovery PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable PASS: Verify Delete Host PASS: Verify Patching hbsAgent and hbsClient PASS: Verify event driven cluster push Story: 2003576 Task: 24907 Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-11-20 19:57:18 +00:00
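The criss-cross exchange described in commit 0b922227ac is easy to misread in prose, so here is a minimal Python sketch of the client side only. The class and method names are hypothetical and the history payload is an opaque value here. What the sketch keeps from the commit message is the rule itself: save the requesting controller's history, then answer with the history last saved from the other controller.

```python
class HbsClientSketch:
    """Toy model of the hbsClient cache; not the real hbsClient."""

    CONTROLLERS = ("controller-0", "controller-1")

    def __init__(self):
        self.saved = {}   # controller name -> last history it reported

    def on_pulse_request(self, controller, history):
        """Save the sender's history and return the peer's last saved history.

        Returns None until the peer controller has sent at least one request.
        """
        self.saved[controller] = history
        peer = next(c for c in self.CONTROLLERS if c != controller)
        return self.saved.get(peer)
```

With this rule each controller ends up holding its own view plus its peer's view of the system, which is the pair of perspectives SM is described as querying.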
Eric MacDonald	8a223f395d	Mtce: Add heartbeat cluster information for SM query This is part one of a two part HA Improvements feature that introduces the collection of heartbeat health at the system level. The full feature is intended to provide service management (SM) with the last 2 seconds of maintenance's heartbeat health view that is reflective of each controller's connectivity to each host including its peer controller. The heartbeat cluster summary information is additional information for SM to draw on when needing to make a choice of which controller is healthier, if/when to switch over and to ultimately avoid split brain scenarios in a two controller system. Feature Behavior: A common heartbeat cluster data structure is introduced and published to the sysroot for SM. The heartbeat service populates and maintains a local copy of this structure with data that reflects the responsiveness for each monitored network of all the monitored hosts for the last 20 heartbeat periods. Mtce sends the current cluster summary to SM upon request.
General flow of cluster feature wrt hbsAgent: hbs_cluster_init: general data init hbs_cluster_nums: set controller and network numbers forever: select: hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent hbs_sm_handler -> hbs_cluster_send: - send cluster to SM heartbeating: hbs_cluster_append: add controller cluster to pulse request hbs_cluster_update: get controller cluster data from pulse responses hbs_cluster_save: save other controller cluster view in cluster vault hbs_cluster_log: log cluster state changes (clog) Test Plan: PASS: Verify compute system install PASS: Verify storage system install PASS: Verify cluster data ; all members of structure PASS: Verify storage-0 state management PASS: Verify add of second controller PASS: Verify add of storage-0 node PASS: Verify behavior over Swact PASS: Verify lock/unlock of second controller ; overall behavior PASS: Verify lock/unlock of storage-0 ; overall behavior PASS: Verify lock/unlock of storage-1 ; overall behavior PASS: Verify lock/unlock of compute nodes ; overall behavior PASS: Verify heartbeat failure and recovery of compute node PASS: Verify heartbeat failure and recovery of storage-0 PASS: Verify heartbeat failure and recovery of controller PASS: Verify delete of controller node PASS: Verify delete of storage-0 PASS: Verify delete of compute node PASS: Verify cluster when controller-1 active / controller-0 disabled PASS: Verify MNFA and recovery handling PASS: Verify handling in presence of multiple failure conditions PASS: Verify hbsAgent memory leak soak test with continuous SM query. PASS: Verify active controller-1 infra network failure behavior. PASS: Verify inactive controller-1 infra network failure behavior. Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c Story: 2003576 Task: 24907 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-10-05 22:47:17 +00:00
Jim Gauld	6a5e10492c	Decouple Guest-server/agent from stx-metal This decouples the build and packaging of guest-server, guest-agent from mtce, by splitting guest component into stx-nfv repo. This leaves existing C++ code, scripts, and resource files untouched, so there is no functional change. Code refactoring is beyond the scope of this update. Makefiles were modified to include devel headers directories /usr/include/mtce-common and /usr/include/mtce-daemon. This ensures there is no contamination with other system headers. The cgts-mtce-common package is renamed and split into: - repo stx-metal: mtce-common, mtce-common-dev - repo stx-metal: mtce - repo stx-nfv: mtce-guest - repo stx-ha: updates package dependencies to mtce-pmon for service-mgmt, sm, and sm-api mtce-common: - contains common and daemon shared source utility code mtce-common-dev: - based on mtce-common, contains devel package required to build mtce-guest and mtce - contains common library archives and headers mtce: - contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon, maintenance, mtclog, pmon, public, rmon mtce-guest: - contains guest component guest-server, guest-agent Story: 2002829 Task: 22748 Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f Signed-off-by: Jim Gauld <james.gauld@windriver.com>	2018-09-18 17:15:08 -04:00
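
Illustrative note on commit 8a223f395d: its message describes a heartbeat cluster structure that tracks responsiveness per monitored network for every monitored host over the last 20 heartbeat periods, with a summary sent to SM on request. The following is a minimal Python sketch of that idea only, not the C++ code from the commit. The class name, method names, summary format and the "mgmt"/"infra" network labels are all assumptions made for illustration; only the 20-period window and the add/del host steps come from the commit message.

```python
from collections import deque

# Window length taken from the commit message ("last 20 heartbeat periods").
HISTORY_PERIODS = 20


class ClusterHistory:
    """Hypothetical rolling view of heartbeat responsiveness per host and network."""

    def __init__(self, networks):
        self.networks = list(networks)
        self.hosts = {}  # host name -> {network name -> deque of bools}

    def add_host(self, host):
        # Loosely mirrors the "add host" step in the commit's general flow.
        self.hosts[host] = {
            net: deque(maxlen=HISTORY_PERIODS) for net in self.networks
        }

    def del_host(self, host):
        # Loosely mirrors the "del host" step; unknown hosts are ignored.
        self.hosts.pop(host, None)

    def record(self, host, network, responded):
        # Append one heartbeat period; the deque drops the oldest entry
        # once HISTORY_PERIODS entries are held.
        self.hosts[host][network].append(bool(responded))

    def summary(self):
        # Assumed summary format: number of responsive periods in the window.
        return {
            host: {net: sum(hist) for net, hist in nets.items()}
            for host, nets in self.hosts.items()
        }
```

A caller would record one entry per host and network each heartbeat period and build the summary only when a query arrives, so the per-period cost stays at a single append.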

18 Commits