metal

Commit Graph

Author	SHA1	Message	Date
Eric MacDonald	649e94c8da	Add pxeboot mtcAlive messaging alarm handling This update adds alarm handling to the recently introduced pxeboot network mtcAlive messaging, see depends on review below. A new 200.003 maintenance alarm is introduced with the second depends on update below. This new alarm is MINOR but also Management Affecting because the pxeboot network is required for node installation. This update enhances the new pxeboot_mtcAlive_monitor FSM for the purpose of detecting pxeboot mtcAlive message loss, alarming and then clearing the alarm once pxeboot mtcAlive messaging resumes. The new alarm assertion and clear are debounced: - alarm is asserted if message loss persists to the accumulation of 12 missed messages or after 2 minutes of complete message loss. - alarm is cleared after decrementing the message missed counter to zero or 1 minute of loss-less messaging. Upgrades are supported with the addition of a features list to the mtcClient ready event. All new mtcClients that support pxeboot network messaging now publish pxeboot mtcAlive support through this new features list. This is rendered in the logs like this: <hostname> mtcClient ready ; with pxeboot mtcAlive support The mtcAgent does not expect/monitor pxeboot mtcAlive messages from hosts that don't publish the feature support. Test Plan: PASS: Verify mtcAlive period is 5 seconds. PASS: Verify pxeboot mtcAlive monitor period is 10 seconds. PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every mtcAlive monitor miss. PASS: Verify pxeboot mtcAlive alarm is not raised while a node is locked. Alarm attributes: PASS: Verify severity is minor. PASS: Verify alarm is cleared while node is locked. PASS: Verify alarm can be suppressed while unlocked. PASS: Verify asserted alarm is management affecting. PASS: Verify alarm-show output format including cause and repair action text. Process Restart Handling: PASS: Verify alarm is maintained over a mtcAgent process restart. 
PASS: Verify pxeboot monitoring resumes with or without asserted alarm immediately following a mtcAgent process restart. PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging immediately following mtcClient process restart for locked or unlocked nodes. Alarm Debounce Handling: PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss. PASS: Verify alarm clear after 1 minutes of mtcAlive recovery. PASS: Verify assertion and recovery debounce logging. PASS: Verify alarm management miss and loss controls handle all boundary conditions exercised by a 12 hr soak with randomized period between message loss and recovery. Host Action Handling: PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable. PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery. PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On. PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset. PASS: Verify mtcAlive alarm is not raised over a Host Reinstall. PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling. PASS: Verify pxeboot alarm handling for node that does not send pxeboot mtcAlive after unlock. Stuck Alarm Avoidance Handling: PASS: Verify typical alarm assertion and clear handling. PASS: Verify alarm is maintained or cleared over node reboot if the messaging issue persists or resolves over the reboot recovery. PASS: Verify mtcAlive alarm is maintained over a Swact and cleared if the messaging is ok on the newly active controller. PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled Swact due to active controller reboot. PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot messaging recovers over that reboot. Upgrades Case: PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients that actually support pxeboot network mtcAlive monitoring. PASS: Verify mtcClient new features list, parsing which enables pxeboot mtcAlive monitoring for that node. 
PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled towards nodes whose mtcClient does not publish pxeboot mtcAlive messaging feature support. PROG: Verify AIO DX upgrade from 22.12 to current master branch. Focus on pxeboot messaging over the upgrade process. Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654 Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660 Story: 2010940 Task: 49542 Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-04-09 14:13:23 +00:00
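The debounce rules described in commit 649e94c8da can be sketched as a small state machine. This is only an illustration in Python, not the mtcAgent implementation. The class name, the field names and the one-count-per-period increment and decrement are assumptions. The 10 second monitor period, the 12 miss threshold and the 2 minute / 1 minute windows come from the commit message.

```python
from dataclasses import dataclass

MONITOR_PERIOD_SECS = 10   # pxeboot mtcAlive monitor period (from the test plan)
MISS_THRESHOLD = 12        # accumulated misses that assert the alarm
LOSS_ASSERT_SECS = 120     # complete message loss that asserts the alarm
GOOD_CLEAR_SECS = 60       # loss-less messaging that clears the alarm


@dataclass
class PxebootAlarmDebounce:
    """Hypothetical debounce tracker; one tick() call per monitor period."""
    missed: int = 0        # accumulated miss counter
    loss_secs: int = 0     # length of the current unbroken loss
    good_secs: int = 0     # length of the current unbroken good run
    alarmed: bool = False

    def tick(self, received: bool) -> bool:
        """Process one monitor period and return the resulting alarm state."""
        if received:
            # Assumption: each good period pays back one earlier miss.
            self.missed = max(0, self.missed - 1)
            self.loss_secs = 0
            self.good_secs += MONITOR_PERIOD_SECS
            if self.alarmed and (self.missed == 0
                                 or self.good_secs >= GOOD_CLEAR_SECS):
                self.alarmed = False
                self.missed = 0
        else:
            self.missed = min(MISS_THRESHOLD, self.missed + 1)
            self.good_secs = 0
            self.loss_secs += MONITOR_PERIOD_SECS
            if not self.alarmed and (self.missed >= MISS_THRESHOLD
                                     or self.loss_secs >= LOSS_ASSERT_SECS):
                self.alarmed = True
        return self.alarmed
```

With a 10 second period, 12 consecutive misses and 2 minutes of complete loss coincide. The accumulated counter matters for intermittent loss, where misses outnumber good periods without the loss ever being continuous.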
Eric MacDonald	14bb67789e	Add pxeboot network mtcAlive messaging to Maintenance The introduction of the new pxeboot network requires that maintenance verify and report on messaging failures over that network. Towards that, this update introduces periodic mtcAlive messaging between the mtcAgent and mtcClient. Test Plan: PASS: Verify install and provision each system type with a mix of networking modes ; ethernet, bond and vlan - AIO SX, AIO DX, AIO DX plus - Standard System 2+1 - Storage System 2+1+1 PASS: Verify feature with physical on management interface PASS: Verify feature with vlan on management interface PASS: Verify feature with bonded management interface PASS: Verify feature with bonded vlans on management interface PASS: Verify in bonded cases handling with 2, 1 or no slaves found PASS: Verify mgmt-combined or separate cluster-host network PASS: Verify mtcClient pxeboot interface address learning - for worker and storage nodes ; dhcp leases file - for controller nodes before unlock ; dhcp leases file - for controller nodes after unlock ; static from ifcfg - from controller within 10 seconds of process restart PASS: Verify mtcAgent pxeboot interface address learning from dnsmasq.hosts file PASS: Verify pxeboot mtcAlive initiation, handling, loss detection and recovery PASS: Verify success and failure handling of all new pxeboot ip address learning functions ; - dhcp - all system node installs. - dnsmasq.hosts - active controller for all hosts. - interfaces.d - controller's mtcClient pxeboot address. - pxeboot req mtcAlive - mtcAgent mtcAlive request message. PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot' command handling for ethernet, vlan and bond configs. PASS: Verify mtcAlive sequence number monitoring, out-of-sequence detection, handling and logging. PASS: Verify pxeboot rx socket binding and non-blocking attribute PASS: Verify mtcAgent handling stress soaking of sustained incoming 500+ msgs/sec ; batch handling and logging. 
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging, failure recovery handling and logging. PASS: Verify pxeboot receiver is not setup on the oam interface on controller-0 first install until after initial config complete. Regression: PASS: Verify mtcAgent/mtcClient online and offline state management PASS: Verify mtcAgent/mtcClient command handling - over management network - over cluster-host network PASS: Verify mtcClient interface chain log for all iface types - bond : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9 - vlan : vlan123 -> enp0s8 - ethernet: enp0s8 PASS: Verify mtcAgent/mtcClient handling and logging including debug logging for standard operations - node install and unlock - node lock and unlock - node reinstall, reboot, reset PASS: Verify graceful recovery handling of heartbeat loss failure. - node reboot - management interface down PASS: Verify systemcontroller and subcloud install with dc-libvirt PASS: Verify no log flooding, coredumps, memory leaks Story: 2010940 Task: 49541 Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-03-28 15:28:27 +00:00
Eric MacDonald	3c94b0e552	Avoid creating non-volatile node locked file while in simplex mode It is possible to lock controller-0 on a DX system before controller-1 has been configured/enabled. Due to the following recent updates this can lead to SM disabling all controller services on that now locked controller-0 thereby preventing any subsequent controller-0 unlock attempts. https://review.opendev.org/c/starlingx/metal/+/907620 https://review.opendev.org/c/starlingx/ha/+/910227 This update modifies the mtce node locked flag file management so that the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only created on a locked host after controller-1 is installed, provisioned and configured. This prevents SM from shutting down if the administrator locks controller-0 before controller-1 is configured. Test Plan: PASS: Verify AIO DX Install. PASS: Verify Standard System Install. PASS: Verify Swact back and forth. PASS: Verify lock/unlock of controller-0 prior to controller-1 config PASS: Verify the non-volatile node locked flag file is not created while the /etc/platform/simplex file exists on the active controller. PASS: Verify lock and delete of controller-1 puts the system back into simplex mode where the non-volatile node locked flag file is once again not created if controller-0 is then unlocked. PASS: Verify an existing non-volatile node locked flag file is removed if present on a node that is locked without new persist option. PASS: Verify original reported issue is resolved for DX systems. Closes-Bug: 2051578 Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-03-09 12:45:54 +00:00
Eric MacDonald	d9982a3b7e	Mtce: Create non-volatile backup of node locked flag file The existing /var/run/.node_locked flag file is volatile. Meaning it is lost over a host reboot which has DOR implications. Service Management (SM) sometimes selects and activates services on a locked controller following a DOR (Dead Office Recovery). This update is part one of a two-part update that solves both of the above problems. Part two is a change to SM in the ha git. This update can be merged without part two. This update maintains the existing volatile node locked file because it is looked at by other system services. So to minimize the change and therefore patchback impact, a new non-volatile 'backup' of the existing node locked flag file is created. This update incorporates modifications to the mtcAgent and mtcClient, introducing a new backup file and ensuring their synchronized management to guarantee their simultaneous presence or absence. Note: A design choice was made to not use a symlink of one to the other rather than add support to manage symlinks in the code. This approach was chosen for its simplicity and reliability in directly managing both files. At some point in the future volatile file could be deprecated contingent upon identifying and updating all services that directly reference it. This update also removes some dead code that was adjacent to my update. Test Plan: This test plan covers the maintenance management of both files to ensure they always align and the expected behavior exists. PASS: Verify AIO DX Install. PASS: Verify Storage System Install. PASS: Verify Swact back and forth. PASS: Verify mtcClient and mtcAgent logging. PASS: Verify node lock/unlock soak. Non-volatile (Nv) node locked management test cases: PASS: Verify Nv node locked file is present when a node is locked. Confirmed on all node types. PASS: Verify any system node install comes up locked with both node locked flag files present. 
PASS: Verify mtcClient logs when a node is locked and unlocked. PASS: Verify Nv node locked file present/absent state mirrors the already existing /var/run/.node_locked flag file. PASS: Verify node locked file is present on controller-0 during ansible run following initial install and removed as part of the self-unlock. PASS: Verify the Nv node locked file is removed over the unlock along with the administrative state change prior to the unlock reboot. PASS: Verify both node locked files are always present or absent together. PASS: Verify node locked file management while the management interface is down. File is still managed over cluster network. PASS: Verify node locked file management while the cluster interface is down. File is still managed over management network. PASS: Verify behavior if the new unlocked message is received by a mtcClient process that does not support it ; unknown command log. PASS: Verify a node locked state is auto corrected while not in a locked/unlocked action change state. ... Manually remove either file on locked node and verify they are both recreated within 5 seconds. ... Manually create either node locked file on unlocked worker or storage node and verify the created files are removed within 5 seconds. Note: doing this to the new backup file on the active controller will cause SM to shutdown as expected. PASS: Verify Nv node locked file is auto created on a node that spontaneously rebooted while it was unlocked. During the reboot the node was administratively locked. The node should come online with both node locked files present. Partial-Bug: 2051578 Change-Id: I0c279b92491e526682d43d78c66f8736934221de Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-14 00:54:11 +00:00
Eric MacDonald	191c0aa6a8	Add a wait time between http request retries Maintenance interfaces with sysinv, sm and the vim using http requests. Request timeout's have an implicit delay between retries. However, command failures or outright connection failures don't. This has only become obvious in mtce's communication with the vim where there appears to be a process startup timing change that leads to the 'vim' not being ready to handle commands before mtcAgent startup starts sending them after a platform services group startup by sm. This update adds a 10 second http retry wait as a configuration option to mtc.conf. The mtcAgent loads this value at startup and uses it in a new HTTP__RETRY_WAIT state of http request work FSM. The number of retries remains unchanged. This update is only forcing a minimum wait time between retries, regardless of cause. Failure path testing was done using Fault Insertion Testing (FIT). Test Plan: PASS: Verify the reported issue is resolved by this update. PASS: Verify http retry config value load on process startup. PASS: Verify updated value is used over a process -sighup. PASS: Verify default value if new mtc.conf config value is not found. PASS: Verify http connection failure http retry handling. PASS: Verify http request timeout failure retry handling. PASS: Verify http request operation failure retry handling. Regression: PASS: Build and install ISO - Standard and AIO DX. PASS: Verify http failures do not fail a lock operation. PASS: Verify host unlock fails if its http done queue shows failures. PASS: Verify host swact. PASS: Verify handling of random and persistent http errors involving the need for retries. Closes-Bug: 2047958 Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2024-02-07 20:33:01 +00:00
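The retry behavior described in commit 191c0aa6a8, a minimum wait between retries regardless of failure cause with the retry count left unchanged, can be sketched as follows. This is a Python illustration under stated assumptions, not the mtcAgent's HTTP work FSM. The function name, the `send` callable and the `max_retries` parameter are hypothetical. Only the 10 second wait comes from the commit message.

```python
import time
from typing import Callable

HTTP_RETRY_WAIT_SECS = 10  # retry wait stated in the commit message


def request_with_retry(send: Callable[[], bool],
                       max_retries: int,
                       retry_wait_secs: float = HTTP_RETRY_WAIT_SECS,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds or the retries are exhausted.

    send() returns True on success and False on a command failure; it may
    also raise ConnectionError or TimeoutError. Every failure cause gets
    the same minimum wait before the next attempt.
    """
    for attempt in range(max_retries + 1):
        try:
            if send():
                return True
        except (ConnectionError, TimeoutError):
            pass  # treated like any other failed attempt
        if attempt < max_retries:
            sleep(retry_wait_secs)  # stand-in for the retry wait state
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.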
Eric Macdonald	50dc29f6c0	Improve maintenance power/reset control command retry handling This update improves on and drives consistency into the maintenance power on/off and reset handling in terms of retries and use of graceful and immediate commands. This update maintains the 10 retries for both power-on and power-off commands and increases the number of retries for the reset command from 5 to 10 to line up with the power operation commands. This update also ensures that the first 5 retries are done with the graceful action command while the last 5 are with the immediate. This update also removed a power on handling case that could have led to a stuck state. This case was virtually impossible to hit based on the required sequence of intermittent command failures but that scenario handling was fixed up anyway. Issues have been seen with the power-off handling on some servers. Suspect that those servers need more time to power-off. So, this update introduces a 30 second delay following a power-off command before issuing the power status query to give the server some time to power-off before retrying the power-off command. 
Test Plan: Both IPMI and Redfish PASS: Verify power on/off and reset handling support up to 10 retries PASS: Verify graceful command is used for the first power on/off or reset try and the first 5 retries PASS: Verify immediate command is used for the final 5 retries PASS: Verify reset handling with/without retries (none/mid/max) PASS: Verify power-on handling with/without retries (none/mid/max) PASS: Verify power-off handling with/without retries (none/mid/max) PASS: Verify power status command failure handling for power on/off NOTE: FIT (fault insertion testing) was used to create retry scenarios PASS: Verify power-off inter retry delay feature PASS: Verify 30 second power-off to power query delay PASS: Verify redfish power/reset commands used are logged by default PASS: Verify power-off/on and reset logging Regression: PASS: verify power-on/off and reset handling without retries PASS: Verify power-off handling when power is already off PASS: Verify power-on handling when power is already on Closes-Bug: 2031945 Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com> Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36	2024-01-25 22:42:26 +00:00
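The graceful-then-immediate retry split from commit 50dc29f6c0 reduces to a lookup on the retry index. A minimal Python sketch, assuming retry 0 denotes the initial try. The function name and the string labels are hypothetical. The counts come from the commit message and test plan.

```python
MAX_RETRIES = 10       # retries for power-on, power-off and reset
GRACEFUL_RETRIES = 5   # retries that still use the graceful command


def command_kind(retry: int) -> str:
    """Return which command flavor to use for a given retry index.

    retry 0 is the initial try; 1..MAX_RETRIES are the retries. The
    initial try and the first 5 retries are graceful, the final 5
    retries are immediate.
    """
    if not 0 <= retry <= MAX_RETRIES:
        raise ValueError(f"retry index out of range: {retry}")
    return "graceful" if retry <= GRACEFUL_RETRIES else "immediate"
```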
Zuul	125601c2f9	Merge "Failure case handling of LUKS service"	2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra	1210ed450a	Failure case handling of LUKS service luks-fs-mgr service creates and unseals the LUKS volume used to store keys/secrets. This change handles the failure case if this essential service is inactive. It introduces an alarm LUKS_ALARM_ID which is raised if service is inactive which implies that there is an issue in creating or unsealing the LUKS volume. Test Plan: PASS: build-pkgs -c -p mtce-common PASS: build-pkgs -c -p mtce PASS: build-image PASS: AIO-SX bootstrap with luks volume status active PASS: AIO-DX bootstrap with volume status active PASS: Standard setup with 2 controllers and 1 compute node with luks volume status active. There should not be any alarm and node status should be unlocked/enabled/available. PASS: AIO-DX node enable failure on the controller where luks volume is inactive. Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' PASS: AIO-SX node enable failure on the controller-0. Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' PASS: Standard- node enable failure on the node (controller-0, controller-1, storage-0, compute-1). Node availability should be failed. A critical alarm with id 200.016 should be displayed with 'fm alarm-list' for the failed host. PASS: AIO-DX In service volume inactive should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: AIO-SX In service volume inactive status should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: Standard ( 2 controller, 1 storage, 1 compute) In service volume inactive status should be detected and a critical alarm should be raised with ID 200.016. Node availability should be changed to degraded. PASS: AIO-DX In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. 
Node availability should be changed to available. PASS: AIO-SX In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. Node availability should be changed to available. PASS: Standard ( 2 controller, 1 storage, 1 compute) In service: If volume becomes active and a LUKS alarm is active, alarm should be cleared. Node availability should be changed to available. PASS: AIO-SX, AIO-DX, Standard- If intest fails and node availability is 'failed'. After fixing the volume issue, a lock/unlock should make the node available. Story: 2010872 Task: 49108 Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>	2023-12-06 00:34:02 -05:00
Teresa Ho	36814db843	Increase timeout for runtime manifest In management network reconfiguration for AIO-SX, the runtime manifest executed during host unlock could take more than five minutes to complete. This commit is to extend the timeout period from five minutes to eight minutes. Test Plan: PASS: AIO-SX subcloud mgmt network reconfiguration Story: 2010722 Task: 49133 Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2023-11-23 16:51:22 -05:00
Eric MacDonald	79d8644b1e	Add bmc reset delay in the reset progression command handler This update solves two issues involving bmc reset. Issue #1: A race condition can occur if the mtcAgent finds an unlocked-disabled or heartbeat failing node early in its startup sequence, say over a swact or an SM service restart and needs to issue a one-time-reset. If at that point it has not yet established access to the BMC then the one-time-reset request is skipped. Issue #2: When issue #1 race condition does not occur before BMC access is established the mtcAgent will issue its one-time reset to a node. If this occurs as a result of a crashdump then this one-time reset can interrupt the collection of the vmcore crashdump file. This update solves both of these issues by introducing a bmc reset delay following the detection and in the handling of a failed node that 'may' need to be reset to recover from being network isolated. The delay prevents the crashdump from being interrupted and removes the race condition by giving maintenance more time to establish bmc access required to send the reset command. To handle significantly long bmc reset delay values this update cancels the posted 'in waiting' reset if the target recovers online before the delay expires. It is recommended to use a bmc reset delay that is longer than a typical node reboot time. This is so that in the typical case, where there is no crashdump happening, we don't reset the node late in its almost done recovery. The number of seconds till the pending reset countdown is logged periodically. It can take upwards of 2-3 minutes for a crashdump to complete. To avoid the double reboot, in the typical case, the bmc reset delay is set to 5 minutes which is longer than a typical boot time. This means that if the node recovers online before the delay expires then great, the reset wasn't needed and is cancelled. 
However, if the node is truly isolated or the shutdown sequence hangs then although the recovery is delayed a bit to accommodate for the crashdump case, the node is still recovered after the bmc reset delay period. This could lead to a double reboot if the node recovery-to-online time is longer than the bmc reset delay. This update implements this change by adding a new 'reset send wait' phase to the existing reset progression command handler. Some consistency driven logging improvements were also implemented. Test Plan: PASS: Verify failed node crashdump is not interrupted by bmc reset. PASS: Verify bmc is accessible after the bmc reset delay. PASS: Verify handling of a node recovery case where the node does not come back before bmc_reset_delay timeout. PASS: Verify posted reset is cancelled if the node goes online before the bmc reset delay and uptime shows less than 5 mins. PASS: Verify reset is not cancelled if node comes back online without reboot before bmc reset delay and still seeing mtcAlive on one or more links. Handles the cluster-host only heartbeat loss case. The node is still rebooted with the bmc reset delay as backup. 
PASS: Verify reset progression command handling, with and without reboot ACKs, with and without bmc PASS: Verify reset delay defaults to 5 minutes PASS: Verify reset delay change over a manual change and sighup PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500 PASS: Verify host-reset when host is already rebooting PASS: Verify host-reboot when host is already rebooting PASS: Verify timing of retries and bmc reset timeout PASS: Verify posted reset throttled log countdown Failure Mode Cases: PASS: Verify recovery handling of failed powered off node PASS: Verify recovery handling of failed node that never comes online PASS: Verify recovery handling when bmc is never accessible PASS: Verify recovery handling cluster-host network heartbeat loss PASS: Verify recovery handling management network heartbeat loss PASS: Verify recovery handling both heartbeat loss PASS: Verify mtcAgent restart handling finding unlocked disabled host Regression: PASS: Verify build and DX system install PASS: Verify lock/unlock (soak 10 loops) PASS: Verify host-reboot PASS: Verify host-reset PASS: Verify host-reinstall PASS: Verify reboot graceful recovery (force and no force) PASS: Verify transient heartbeat failure handling PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks PASS: Verify SM peer reset handling when standby controller is rebooted PASS: Verify logging and issue debug ability Closes-Bug: 2042567 Closes-Bug: 2042571 Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2023-11-02 20:58:00 +00:00
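The posted-reset handling in commit 79d8644b1e (wait out the bmc reset delay, cancel if the node comes back online after a fresh reboot, otherwise send the reset) can be sketched as a single decision function. This is a Python illustration, not the reset progression handler itself. The function name, its parameters and the use of a separate recent-boot threshold are assumptions. The 5 minute default delay and the "uptime shows less than 5 mins" check come from the commit message and test plan.

```python
from typing import Optional

BMC_RESET_DELAY_SECS = 300  # default bmc reset delay (5 minutes)
RECENT_BOOT_SECS = 300      # uptime below this suggests the node rebooted


def reset_decision(elapsed_secs: int,
                   online: bool,
                   uptime_secs: Optional[int],
                   delay_secs: int = BMC_RESET_DELAY_SECS) -> str:
    """Decide what to do with a posted 'in waiting' bmc reset.

    Returns "cancel" when the node is online with a recent boot,
    "reset" once the delay has expired, and "wait" otherwise.
    """
    if online and uptime_secs is not None and uptime_secs < RECENT_BOOT_SECS:
        return "cancel"  # node recovered through a reboot; reset not needed
    if elapsed_secs >= delay_secs:
        return "reset"   # delay expired without a recovery
    return "wait"        # includes online-without-reboot: reset stays posted
```

A node that reports online with a long uptime has not rebooted, so the reset stays posted as the backup recovery, matching the cluster-host-only heartbeat loss case in the test plan.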
Al Bailey	37c5910a62	Update mtce debian package ver based on git Update debian package versions to use git commits for: - mtce (old 9, new 30) - mtce-common (old 1, new 9) - mtce-compute (old 3, new 4) - mtce-control (old 7, new 10) - mtce-storage (old 3, new 4) The Debian packaging has been changed to reflect all the git commits under the directory, and not just the commits to the metadata folder. This ensures that any new code submissions under those directories will increment the versions. Test Plan: PASS: build-pkgs -p mtce PASS: build-pkgs -p mtce-common PASS: build-pkgs -p mtce-compute PASS: build-pkgs -p mtce-control PASS: build-pkgs -p mtce-storage Story: 2010550 Task: 47401 Task: 47402 Task: 47403 Task: 47404 Task: 47405 Signed-off-by: Al Bailey <al.bailey@windriver.com> Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587	2023-03-02 14:50:35 +00:00
Eric MacDonald	67c4f1b148	Avoid logging in fork_sysreq_reboot failsafe thread Continuing to log in the fork_sysreq_reboot failsafe thread is seen to cause mtcAgent and mtcClient log file corruption with binary data. As an avoidance measure this update changes the offending information logs to normally disabled debug logs. Test Plan: PASS: Verify build, install and provision system with debian iso - AIO SX (hw), Standard 2+1 (vbox) PASS: Verify mtcAgent and mtcClient log files do not get binary data (corruption) injected over a self reboot. PASS: Verify lock and unlock of AIO SX host PASS: Verify lock and unlock of system node from active controller PASS: Verify host reboot command PASS: Verify critical process failure reboot handling Closes-Bug: 2001719 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: Ib49ee427d2a6363ce21ec7488b1f739986828219	2023-01-10 11:38:12 -05:00
Eric MacDonald	da398e0c5f	Debian: Make Mtce offline handler more resilient to slow shutdowns The current offline handler assumes the node is offline after 'offline_search_count' reaches 'offline_threshold' count regardless of whether mtcAlive messages were received during the search window. The offline algorithm requires that no mtcAlive messages be seen for the full offline_threshold count. During a slow shutdown the mtcClient runs for longer than it should and as a result can lead to maintenance seeing the node as recovered before it should. This update manages the offline search counter to ensure that it only reached the count threshold after seeing no mtcAlive messages for the full search count. Any mtcAlive message seen during the count triggers a count reset. This update also 1. Adjusts the reset retry cadence from 7 to 12 secs to prevent unnecessary reboot thrash during the current shutdown. 2. Clears the hbsClient ready event at the start of the subfunction handler so the heartbeat soak is only started after seeing heartbeat client ready events that follow the main config. Test Plan: PASS: Debian and CentOS Build and DX install PASS: Verify search count management PASS: Verify issue does not occur over lock/unlock soak (100+) - where the same test without update did show issue. PASS: Monitor alive logs for behavioral correctness PASS: Verify recovery reset occurs after expected extended time. Closes-Bug: 1993656 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e	2022-10-24 15:57:43 +00:00
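The offline search counter management from commit da398e0c5f, where any mtcAlive message seen during the search resets the count, can be sketched like this. It is a Python illustration. The class and method names are hypothetical and the real handler is part of the mtcAgent. Only the reset-on-mtcAlive rule and the threshold comparison come from the commit message.

```python
class OfflineSearch:
    """Hypothetical tracker for the offline search window."""

    def __init__(self, offline_threshold: int) -> None:
        self.offline_threshold = offline_threshold
        self.offline_search_count = 0

    def tick(self, mtcalive_seen: bool) -> bool:
        """Process one search period; return True once the node is offline.

        The node is declared offline only after offline_threshold
        consecutive periods with no mtcAlive message.
        """
        if mtcalive_seen:
            self.offline_search_count = 0  # any mtcAlive restarts the search
        else:
            self.offline_search_count += 1
        return self.offline_search_count >= self.offline_threshold
```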
Eric MacDonald	3f4c2cbb45	Mtce: Add ActionInfo extension support for reset operations. StarlingX Maintenance supports host power and reset control through both IPMI and Redfish Platform Management protocols when the host's BMC (Board Management Controller) is provisioned. The power and reset action commands for Redfish are learned through HTTP payload annotations at the Systems level; "/redfish/v1/Systems. The existing maintenance implementation only supports the "ResetType@Redfish.AllowableValues" payload property annotation at the #ComputerSystem.Reset Actions property level. However, the Redfish schema also supports an 'ActionInfo' extension at /redfish/v1/Systems/1/ResetActionInfo. This update adds support for the 'ActionInfo' extension for Reset and power control command learning. For more information refer to the section 6.3 ActionInfo 1.3.0 of the Redfish Data Model Specification link in the launchpad report. Test Plan: PASS: Verify CentOS build and patch install. PASS: Verify Debian build and ISO install. PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0 PASS: Verify reset/power control cmd load from newly added second level query from ActionInfo service. Failure Handling: Significant failure path testing with this update PASS: Verify Redfish protocol is periodically retried from start when bm_type=redfish fails to connect. PASS: Verify BMC access protocol defaults to IPMI when bm_type=dynamic but failed connect using redfish. Connection failures in the above cases include - redfish bmc root query fails - redfish bmc info query fails - redfish bmc load power/reset control actions fails - missing second level Parameters label list - missing second level AllowableValues label list PASS: Verify sensor monitoring is relearned to ipmi from failed and retried with bm_type=redfish after switch to bm_type=dynamic or bm_type=ipmi by sysinv update command. 
Regression: PASS: Verify with CentOS redfishtool 1.1.0 PASS: Verify switch back and forth between ipmi and redfish using update bm_type=ipmi and bm_type=redfish commands PASS: Verify switch from ipmi to redfish using bm_type=dynamic for hosts that support redfish PASS: Verify redfish protocol is preferred in bm_type=dynamic mode PASS: Verify IPMI sensor monitoring when bm_type=ipmi PASS: Verify IPMI sensor monitoring when bm_type=dynamic and redfish connect fails. PASS: Verify redfish sensor event assert/clear handling with alarm and degrade condition for both IPMI and redfish. PASS: Verify reset/power command learn by single level query. PASS: Verify mtcAgent.log logging Closes-Bug: 1992286 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35	2022-10-13 17:40:05 +00:00
Zuul	ad1c87669f	Merge "Debian: Redfishtool requests fail when IPV4 address has square brackets"	2022-10-11 15:41:59 +00:00
Eric MacDonald	db0b4ccadd	Debian: Redfishtool requests fail when IPV4 address has square brackets Redfishtool was introduced in CentOS for maintenance power control and sensor monitoring. Both IPV4 and IPV6 addressing is supported. The initial integration exposed an issue where square brackets were required around the BMC IP address for IPV6 addressing. At the time it was simpler to add the brackets for IPV4 as well. redfishtool -S Always -T 30 -r [${BM_IP}] root However, the python3 version of redfishtool, introduced in Debian, rejects requests with square braces around IPV4 addresses. redfishtool -v -S Always -T 20 -r [${BM_IP}] root # Main: Error: rc=5 This update introduces a utility to the mtce msgClass module used to distinguish between IPV4 and IPV6 addresses. The redfish request create utility is updated to use this new utility when creating the redfishtool request without adding the square brackets in Debian for BMC's provisioned with IPV4 addressing. Update testing revealed that the Debian based python3 version of redfishtool takes a few seconds longer compared to the python2 in CentOS. This exposes a timer race condition during sensor monitoring. The BMC pthread is currently given 60 seconds to complete its requests. However, unlike sensor monitoring using ipmi which uses one request, redfish requires two requests. Unfortunately, requests using the Debian python3 version of redfishtool sometimes take longer than 30 seconds. If the cumulation of both requests take longer than the current max timeout of 60 seconds then that is treated as a pthread timeout error condition causing the hardware monitor to enter an error state for that host which requires it to go through a full reconnection algorithm. Given this additional issue this update also increases the BMC thread timeout from 60 to 100 seconds to avoid needless reconnections when using the mildly slower Debian python3 redfishtool. 
Test Plan: PASS: Verify Build Debian and CentOS iso images PASS: Verify Patch CentOS change PASS: Verify Install Debian image For both Debian and CentOS PASS: Verify redfish sensor monitoring over IPV4 PASS: Verify ipmi sensor monitoring over IPV4 PASS: Verify redfish sensor monitoring over IPV6 PASS: Verify ipmi sensor monitoring over IPV6 PASS: Verify no redfish connection failure/recovery errors; 3 hr soak Regression: PASS: Verify sensor model relearn by command and reprovision PASS: Verify system host-modify <id> bm_type change handling PASS: Verify redfish critical sensor assert/clear handling PASS: Verify ipmi critical sensor assert/clear handling PASS: Verify mtcAgent and hwmond logging Closes-Bug: 1991819 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: I3b69cc4f19c580687cd91b4f2033f6019be87e5e	2022-10-06 22:21:38 +00:00
Girish Subramanya	86681b7598	Alarm Hostname controller function has in-service failure reported When compute services remain healthy: - listing alarms shall not refer to the obsoleted alarm below - 200.012 alarm hostname controller function has an in-service failure This update deletes the definition of the obsoleted alarm; 200.012 is removed from the events.yaml file. Any references to this alarm definition are also updated. A bug also needs to be raised to track the doc change. Test Plan: Verify on a Standard configuration that no alarms are listed for hostname controller in-service failure. Code (removal) changes were exercised with the fix prior to ansible bootstrap and host-unlock to verify there are no unexpected alarms. Regression: There is no need to test the alarm referred to here as it is obsolete. Closes-Bug: 1991531 Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com> Change-Id: I255af68155c5392ea42244b931516f742fa838c3	2022-10-05 10:30:01 -04:00
Eric MacDonald	aaf9d08028	Mtce: Fix bmc password fetch error handling The mtcAgent process sometimes segfaults while trying to fetch the bmc password from a failing barbican process. With that issue fixed the mtcAgent sends the bmc access credentials to the hardware monitor (hwmond) process which then segfaults for a similar reason. In cases where the process does not segfault but also does not get a bmc password, the mtcAgent will flood its log file. This update 1. Prevents the segfault case by properly managing acquired json-c object releases. There was one in the mtcAgent and another in the hardware monitor (hwmond). The json_object_put object release api should only be called against objects that were created with very specific apis. See new comments in the code. 2. Avoids the log flooding error case by performing a password size check rather than assuming the password is valid following the secret payload receive stage. 3. Simplifies the secret fsm and error and retry handling. 4. Deletes useless creation and release of a few unused json objects in the common jsonUtil and hwmonJson modules. Note: This update temporarily disables sensor and sensorgroup suppression support for the debian hardware monitor while a suppression type fix in sysinv is being investigated. 
Test Plan: PASS: Verify success path bmc password secret fetch PASS: Verify secret reference get error handling PASS: Verify secret password read error handling PASS: Verify 24 hr provision/deprov success path soak PASS: Verify 24 hr provision/deprov error path soak PASS: Verify no memory leak over success and failure path soaking PASS: Verify failure handling stress soak ; reduced retry delay PASS: Verify blocking secret fetch success and error handling PASS: Verify non-blocking secret fetch success and error handling PASS: Verify secret fetch is set non-blocking PASS: Verify success and failure path logging PASS: Verify all of jsonUtil module manages object release properly PASS: Verify hardware monitor sensor model creation, monitoring, alarming and relearning. This test requires suppress disable in order to create sensor groups in debian. PASS: Verify both ipmi and redfish and switch between them with just bm_type change. PASS: Verify all above tests in CentOS PASS: Verify over 4000 provision/deprovision cycles across both failure and success path handling with no process failures Closes-Bug: 1975520 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com> Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff	2022-06-01 15:21:05 +00:00
Tracey Bogue	0551c665cb	Add Debian packaging for mtce packages Some of the code used TRUE instead of true which did not compile for Debian. These instances were changed to true. Some #define constants generated narrowing errors because their values are negative in a 32 bit integer. These values were explicitly cast to int in the case statements causing the errors. Story: 2009101 Task: 43426 Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com> Change-Id: Iffc4305660779010969e0c506d4ef46e1ebc2c71	2021-10-29 09:17:00 -05:00
Eric MacDonald	48978d804d	Improved maintenance handling of spontaneous active controller reboot Performing a forced reboot of the active controller sometimes results in a second reboot of that controller. The second reboot was caused by its reported uptime in the first mtcAlive message, following the reboot, being greater than 10 minutes. Maintenance has a long standing graceful recovery threshold of 10 minutes. This means that if a host loses heartbeat and enters Graceful Recovery, and the uptime value extracted from the first mtcAlive message following the recovery of that host exceeds 10 minutes, then maintenance interprets that the host did not reboot. If a host goes absent for longer than this threshold then for reasons not limited to security, maintenance declares the host as 'failed' and force re-enables it through a reboot. With the introduction of containers and addition of new features over the last few releases, boot times on some servers are approaching the 10 minute threshold and in this case exceeded the threshold. The primary fix in this update is to increase this long standing threshold to 15 minutes to account for evolution of the product. During the debug of this issue a few other undesirable behaviors related to Graceful Recovery were observed, with the following additional changes implemented. - Remove hbsAgent process restart in ha service management failover failure recovery handling. This change is in the ha git with a loose dependency placed on this update. Reason: https://review.opendev.org/c/starlingx/ha/+/788299 - Prevent the hbsAgent from sending heartbeat clear events to maintenance in response to a heartbeat stop command. Reason: Maintenance receiving these clear events while in Graceful Recovery causes it to pop out of graceful recovery only to re-enter as a retry and therefore needlessly consumes one (of a max of 5) retry count. - Prevent successful Graceful Recovery until all heartbeat monitored networks recover. 
Reason: If heartbeat of one network, say cluster, recovers but another (management) does not, then it's possible the max Graceful Recovery Retries could be reached quite quickly, while one network recovered but the other may not have, causing maintenance to fail the host and force a full enable with reboot. - Extend the wait for the hbsClient ready event in the graceful recovery handler timeout from 1 minute to worker config timeout. Reason: To give the worker config time to complete before force starting the recovery handler's heartbeat soak. - Add Graceful Recovery Wait state recovery over process restart. Reason: Avoid double reboot of Gracefully Recovering host over SM service bounce. - Add requirement for a valid out-of-band mtce flags value before declaring configuration error in the subfunction enable handler. Reason: rebooting the active controller can sometimes result in a falsely reported configuration error due to the subfunction enable handler interpreting a zero value as a configuration error. - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs. Reason: To assist log analysis and issue debug Test Plan: PASS: Verify handling active controller reboot cases: AIO DC, AIO DX, Standard, and Storage PASS: Verify Graceful Recovery Wait behavior cases: with and without timeout, with and without bmc cases: uptime > 15 mins and 10 < uptime < 15 mins PASS: Verify Graceful Recovery continuation over mtcAgent restart cases: peer controller, compute, MNFA 4 computes PASS: Verify AIO DX and DC active controller reboot to standby takeover that up for less than 15 minutes. Regression: PASS: Verify MNFA feature ; 4 computes in 8 node Storage system PASS: Verify cluster network only heartbeat loss handling cases: worker and standby controller in all systems. 
PASS: Verify Dead Office Recovery (DOR) cases: AIO DC, AIO DX, Standard, Storage PASS: Verify system installations cases: AIO SX/DC/DX and 8 node Storage system PASS: Verify heartbeat and graceful recovery of both 'standby controller' and worker nodes in AIO Plus. PASS: Verify logging and no coredumps over all of testing PASS: Verify no missing or stuck alarms over all of testing Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef Closes-Bug: 1922584 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2021-04-30 15:35:53 +00:00
Eric MacDonald	7539d36c3f	Prevent mtcClient from sending to uninitialized socket in AIO SX The mtcClient will perform a socket reinit if it detects a socket failure. The mtcClient also avoids setting up its controller-1 cluster network socket for the AIO SX system type ; because there is no controller-1 provisioned. Most AIO SX systems have the management/cluster networks set to the 'loopback' interface. However, when an AIO SX system is setup with its management and cluster networks on physical interfaces, with or without vlan, the mtcAlive send message utility will try to send to the uninitialized controller-1 cluster socket. This leads to a socket error that triggers a socket reinitialization loop which causes log flooding. This update adds a check to the mtcAlive send utility to avoid sending mtcAlive to controller-1 for AIO SX system type where there is no controller-1 provisioned; no send,no error,no flood. Since this update needed to add a system type check, this update also implemented a system type definition rename from CPE to AIO. Other related definitions and comments were also changed to make the code base more understandable and maintainable Test Plan: PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode) PASS: Verify AIO SX Install with mgmnt/clstr on 'lo' PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr PASS: Verify AIO SX locked-disabled-online state PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit) PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit) Regression: PASS: Verify AIO SX Lock and Unlock (lazy reboot) PASS: Verify AIO DX and DC install with pv regression and sanity PASS: Verify Standard system install with pv regression and sanity Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762 Closes-Bug: 1897334 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2021-04-21 10:20:10 -04:00
Eric MacDonald	5c83453fdf	Fix Graceful Recovery handling while in Graceful Recovery handling The current Graceful Recovery handler is not properly handling back-to-back Multi Node Failure Avoidance (MNFA) events. There are two phases to MNFA. Phase 1: waiting for the number of failed nodes to fall below mnfa_threshold as each affected node's heartbeat is recovered. Phase 2: a Graceful Recovery Wait period, which is an 11 second heartbeat soak to verify that a stable heartbeat is regained before declaring the MNFA event complete. The Graceful Recovery Wait status has been seen to be left uncleared (stuck) on one or more of the affected nodes if phase 2 of MNFA is interrupted by another MNFA event ; aka MNFA Nesting. Although this stuck status is not service affecting it does leave one or more nodes' host.task field, as observed under host-show, with "Graceful Recovery Wait" rather than empty. This update makes Multi Node Failure Avoidance (MNFA) handling changes to ensure that, upon MNFA exit, the recovery handler is properly restarted if MNFA Nesting occurs. Two additional Graceful Recovery phase issues were identified and fixed by this update. 1. Cut Graceful recovery handling in half - Found and removed a redundant 11 second heartbeat soak at the very end of the recovery handler. - This cuts the graceful recovery handling time down from 22 to 11 seconds thereby cutting potential for nesting in half. 2. Increased supported Graceful Recovery nesting from 3 to 5 - Found that some links bounce more than others so a nesting count of 3 can lead to an occasional single node failure. - This adds a bit more resiliency to MNFA handling of cases that exhibit more link messaging bounce. 
Test Plan: Verified 60+ MNFA occurrences across 4 different system types including AIO plus, Standard and Storage PASS: Verify Single Node Graceful Recovery Handling PASS: Verify Multi Node Graceful Recovery Handling PASS: Verify Single Node Graceful Recovery Nesting Handling PASS: Verify Multi Node Graceful Recovery Nesting Handling PASS: Verify MNFA of up to 5 nests can be gracefully recovered PASS: Verify MNFA of 6 nests lead to full enable of affected nodes PASS: Verify update as a patch PASS: Verify mtcAgent logging Regression: PASS: Verify standard system install PASS: Verify product verification maintenance regression (4 runs) PASS: Verify MNFA threshold increase and below threshold behavior PASS: Verify MNFA with reduced timeout behavior for ... nested case that does not timeout ... case that does not timeout ... case that does timeout Closes Bug: 1892877 Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2021-03-17 14:25:19 -04:00
Eric MacDonald	4f5bf78f55	Improve mtcAgent interrupted thread cleanup A BMC command send will be rejected if its thread is not in the IDLE state going into the call. This issue is seen to occur over a reprovisioning action while the bmc access alarmable condition exists. Maintenance will do retries. So the only visible side effect of this issue is a failure to provision to 'redfish' over a provisioning switch to 'dynamic' (learn mode). Instead ipmi is selected. The non-return to idle can occur when the bmc handler FSM is interrupted by a reprovisioning request while a bmc command is in flight. This update enhances the thread management module by introducing a thread consumption utility that is called by the bmc command send utility. If the send finds that its thread is not in the IDLE state it will either kill the thread if it is running or free a completed but-not-consumed thread result. Note: Maintenance only supports the execution of a single thread per host per process at one time. Test Plan: PASS: Verify BMC provisioning change from ipmi to dynamic while the ipmi provisioning was failing prior to re-provisioning. Verify the previous error is cleaned up and the reprovisioning request succeeds as expected. PASS: Verify thread 'execution timeout kill' cleanup handling. PASS: Verify thread 'complete but not consumed' cleanup handling. PASS: Verify logging during regression soaks Regression: PASS: Verify bmc protocol reprovisioning script soak PASS: Verify sensor monitoring following BMC reprovisioning PASS: Verify product verification mtce regression test suite Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178 Closes-Bug: 1864906 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2021-03-15 10:51:16 -04:00
Eric MacDonald	9ab726b0eb	Add support for peer controller reset via mtcClient This update adds the ability for SM to passively request the mtcClient to BMC reset its peer controller as a means to recover a severely loaded active controller. To do this the mtcAgent is modified to keep the controllers' mtcClients updated with the BMC info of their peers. The mtcClient is modified to audit for the SM signal and then, when asserted, issue a BMC reset of its peer controller using an ipmitool system call. The ability to command the peer mtcClient to 'sync' prior to the BMC reset is implemented but configured disabled for now. Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a Partial-Bug: #1895350 Co-Authored-By: Bin.Qian@windriver.com Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2021-01-14 16:44:14 -05:00
Eric MacDonald	8c81914ea5	Add SM process heartbeat and status to the hbs cluster This update is the mtc hbsAgent side of a new SM -> hbsAgent heartbeat algorithm for the purpose of detecting peer SM process stalls. This update adds an 'SM Heartbeat status' bit to the cluster view it injects into its multicast heartbeat requests. Its peer is able to read this on-going hbsAgent/SM heartbeat status through the cluster. The status bit reads 'ok' while the hbsAgent sees the SM heartbeat as steady. The status bit reads 'not ok' while the SM heartbeat is lost for longer than 800 msecs. Change-Id: I0f2079b0fafd7bce0b97ee26d29899659d66f81d Partial-Fix: #1895350 Co-Authored-By: Bin.Qian@windriver.com Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-12-10 11:13:13 -05:00
Eric MacDonald	1350502720	Make Mtce Power-Off FSM verify power-off If a host's BMC server accepts a power-off command without error but does not actually power-off the host, the power-off FSM reports success yet the host power is still on. This update adds a verification component to the power-off FSM. Once the power-off command is issued and succeeds at the command level, the power-off FSM will now query power status and retry the power-off command until the server is verified to be powered-off or the retry max (10) is reached and the power-off command is failed. Test Plan: PASS: Verify 200+ Mtce Power Off/On cycles (ipmi & redfish) PASS: Verify 100+ Mtce Reinstalls with FIT (ipmi & redfish) Change-Id: Iddd120d89d1152fc0b26915df123f586c38b909b Closes-Bug: 1865087 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-22 13:38:33 +00:00
Eric MacDonald	1196056612	Disable Redfish BMC audit and improve reinstall failure handling The Mtce Reinstall Handler can collide with the BMC Redfish audit resulting in reinstall failure. The BMC handler's 2 minute connection audit can collide with other BMC commands. The reinstall handler, with 4 bmc command operations, is particularly susceptible. Two additional bmc communication improvements are implemented: 1. Add 'retry' handling to all BMC requests in the Maintenance Reinstall Handler FSM to handle transient command failures. Note: There are already retries to all but the power status query and the netboot requests in that handler and retries in other administrative commands that involve bmc requests. 2. Switch BMC power control command management from 'static' to 'learned' lists. Some BMCs don't support both graceful and immediate power commands; Graceful Restart and Force Restart. To remove the possibility of using an unsupported BMC command, this update switches from static to learned power command lists with a log produced if a server is missing command support. Power commands escalate from graceful to immediate in the presence of retries. 
Test Cases: PASS: Verify bmc handler redfish audit is disabled PASS: Verify reinstall soak using redfish PASS: Verify reinstall netboot and power status retry handling PASS: Verify all power control commands using redfish PASS: Verify graceful operations are used if available PASS: Verify immediate operations are used for retries Regression: PASS: Verify bmc ping audit success and failure handling PASS: Verify Reset Handling soak (redfish and ipmi) PASS: Verify Power-Off/On Handling soak (redfish and ipmi) PASS: Verify Reinstall Handling soak (redfish and ipmi) PASS: Verify Standard System Install (redfish and ipmi) PASS: Verify AIO DX System Install (redfish and ipmi) PASS: Verify this update as a patch Change-Id: Idb484512ccb1b16e2d0ea9aff4ab7965347b1322 Closes-Bug: 1880578 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-16 15:15:22 +00:00
Eric MacDonald	2fc05673d1	Add SysRq crash dump support for pmon quorum health messaging loss The hostwd process supports failure handling for two pmon quorum failure modes. 1. persistent pmon quorum process failure 2. persistent absence of pmon's quorum health report This update adds a new configuration option and associated implementation required to force a crash dump action for failure mode 2 above. This means that if the Process Monitor itself gets stalled or stops running for 3 (default config) minutes then the hostwd will trigger a SysRq to force a crash dump. Test Plan: PASS: Verify kdump for pmon quorum health report message loss PASS: Verify no kdump when kdump_on_stall is disabled PASS: Verify handling when kdump service is not active PASS: Verify sighup config change detection and handling Regression: PASS: Verify softdog timeout handling and logs PASS: Verify quorum threshold config change and handling PASS: Verify handling with reboot/reset recovery methods disabled PASS: Verify enable reboot_on_err config change handling PASS: Verify reboot/reset actions are ignored while host is locked PASS: Verify pmon failure recovery handling before threshold reached Change-Id: Id926447574e02013f83c0170784e2a8f9a46bac1 Partial-Bug: 1894889 Depends-On: https://review.opendev.org/#/c/750806 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-13 12:38:16 -05:00
Eric MacDonald	3a6fec50c1	Reduce Maintenance Host Watchdog timeout for controllers This update makes changes to the maintenance host watchdog and reduces the timeout from 5 to 3 minutes for controllers. This update also decouples the pmon quorum monitoring feature handling from the host watchdog timeout. Both were driven off the same select timer, which prevented the watchdog timeout value from being changed independently without affecting quorum monitoring. A new config label 'kernwd_update_period_stall_detect' is added and its value loaded for hosts that need more rigid process stall detection. This new lower timeout value label is loaded and applied to hosts that run the system controller function. A few logging improvements were made. Test Plan: PASS: Verify pmon quorum failure handling while unlocked. Was and remains at 3 misses, 60 seconds each. PASS: Verify watchdog TO at 12 seconds on controllers. Was 300 secs. PASS: Verify kernel watchdog is not enabled when loaded kernwd_update_period is less than 5 seconds. Was 60 secs. PASS: Verify process logging ; startup, failure, transient PASS: Verify all config values loaded by hostwd process Regression: PASS: Verify watchdog TO at 300 seconds on non-controllers PASS: Verify handling of failed quorum process while locked PASS: Verify handling of failed quorum process while unlocked PASS: Verify handling of transient quorum messaging loss while unlocked PASS: Verify hostwd process patching ; locked and unlocked cases PASS: Verify AIO DX System Install PASS: Verify Standard System Install Note: There is no kernel WD TO log. The log is output to the console. Change-Id: Iad726436e28dfa48a06743aa166318969eb6915d Closes-Bug: #1894889 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-11-13 07:52:59 -05:00
Eric MacDonald	126cdfa369	Make daemon_get_file_str return first line in specified file The current implementation will return only the first group of characters up to the first space from the first line of the specified file. This function was intended to return the entire first line. Change-Id: Ic34361c32aeff564f4645070279cdb53d5b87626 Closes-Bug: 1896669 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-09-22 18:19:24 -04:00
Eric MacDonald	55d5f43edb	Fix heartbeat messaging when interface is set to 'lo' Maintenance heartbeat service should not be multicast messaging over an 'lo' interface, which in IPv6 leads to socket failures, log flooding and the inability to detect and report pmond process failure. To fix that this update - configures pulse messaging to unicast for monitored networks configured as 'lo'. - prevents heartbeating over the cluster network if it and the management network are both configured on the 'lo' interface. - improves logging to avoid flooding in the presence of socket setup or access errors. - stops logging netlink events (interface state changes) on unmonitored network interfaces. - maintains heartbeat disabled state until the management network is up. - modifies hbsAgent socket failure handling and its pmon conf file so that a persistent socket failure during startup is alarmed as an hbsAgent process failure. Test Plan: PASS: Verify logging over system install and socket errors PASS: Verify unicast messaging when cluster is set to 'lo' PASS: Verify no cluster network heartbeat when it and mgmnt are set to 'lo'. Regression: PASS: Verify heartbeat messaging and cluster info PASS: Verify pmond process failure alarm management PASS: Verify heartbeat failure detection and graceful recovery PASS: Verify AIO SX IPv6 system install and run PASS: Verify AIO DX IPv6 system install and run PASS: Verify Standard IPv6 system install and run PASS: Verify Storage system IPv6 install and run PASS: Verify Storage system IPv4 install and run PASS: Verify MNFA handling in IPv6 storage system Change-Id: I5a2a0b2dee0c690617c4e0b0e2ab8b1172b2dc49 Closes-Bug: 1884585 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-06-26 14:16:41 +00:00
Eric MacDonald	e379fdfe18	Prevent pmond process recovery when system is not running The maintenance process monitor (pmon) should only recover failed processes when the system state is 'running' or 'degraded'. The current implementation allowed process recovery for other non-inservice states, including an unknown state if systemd returns no data on the state query. This update tightens up the system state check by adding retries to the state query utility and restricting accepted states to 'running' and 'degraded'. This change then prevents pmon from inadvertently killing and recovering the mtcClient, which indirectly kills off the mtcClient's fail-safe sysreq reboot child thread, if the pmon state query returns anything other than running or degraded during a shut down. Change-Id: I605ae8be06f8f8351a51afce98a4f8bae54a40fd Closes-Bug: 1883519 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-06-15 11:09:47 -04:00
Eric MacDonald	7d8be4bc1f	Add auto-versioning to starlingx/metal mtce packages This update makes use of the PKG_GITREVCOUNT variable to auto-version the mtce packages in this repo. Change-Id: Ifb4da4570e0261bbdcf0d7af79b8add7cfc133ac Story: 2006166 Task: 39822 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-05-21 15:18:43 -04:00
Zuul	897eb75270	Merge "Fix mtce-common build error with gcc-8.2.1"	2020-04-28 12:36:08 +00:00
Dongqi Chen	7423edce9b	Fix mtce-common build error with gcc-8.2.1 gcc-8.2.1 reports a "Werror=format-truncation" error because the string may be truncated; adding a return value check avoids the error. Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Dongqi Chen <chen.dq@neusoft.com> Change-Id: I8fa08077e47ee3777a50f018af77b3e8fc6191f9 Story: 2007506 Task: 39278	2020-04-03 14:49:09 +08:00
Eric MacDonald	0826882308	Add mtcAgent socket initialization failure retry handling. The main maintenance process (mtcAgent) exits on a process start-up socket initialization failure. SM restarts the failed process within seconds and will swact if the second restart also fails. From startup to swact can be as quick as 4 seconds. This is too short to handle a collision with a manifest. This update adds a number of socket initialization retries to extend the time the process has to resolve socket initialization failures by giving the collided manifest time to complete between retries. The number of retries and inter retry wait time is calibrated to ensure that a persistently failing mtcAgent process exits in under 40 seconds. This is to ensure that SM is able to detect and swact away from a persistently failing maintenance process while also giving the process a few tries to resolve on its own. Test Plan: PASS: Verify socket init failure thresholded retry handling with no, persistent and recovered failure conditions. PASS: Verify swact if socket init failure is persistent PASS: Verify no swact if socket failure recovers after first exit PASS: Verify no swact if socket failure recovers over init retry PASS: Verify an hour long soak of continuous socket open/close retry Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444 Closes-Bug: 1869192 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-04-01 19:24:22 +00:00
Eric MacDonald	da7b2e94f1	Modify Mtce Reinstall FSM to first power-off BMC provisioned hosts This update only applies to servers that support and are provisioned for Board Management Control (BMC). The BMC of some servers silently rejects the 'set next boot device' command while the server is executing BIOS. The current reinstall algorithm, when the BMC is provisioned, starts by detecting the power state of the target server. If the power is off it will 'first power it on' and then proceed to 'set next boot device' to pxe followed by a reset. For the initial power off state case, the timing of these operations is such that the server is in BIOS when the 'set next boot device' command is issued. This update modifies the host reinstall algorithm to first power-off a server followed by setting the next boot device while the server is confirmed to be powered off, then powered on. This ensures the server gets and handles the set next boot device command operation properly. This update also fixes a race condition between the bmc_handler and power_handler by moving the final power state update in the power handler to the power done phase. Test Plan: Verify all new reinstall failure path handling via fault insertion testing Verify reinstall of powered off host Verify reinstall of powered on host Verify reinstall of Wildcat server with ipmi Verify reinstall of Supermicro server with ipmi and redfish Verify reinstall of Ironpass server with ipmi Verify reinstall of WolfPass server with redfish and ipmi Verify reinstall of Dell server with ipmi Over 30 reinstalls were performed across all server types, with initial power on and off using both ipmi and redfish (where supported). Change-Id: Iefb17e9aa76c45f2ceadf83f23b1231ae82f000f Closes-Bug: 1862065 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-02-12 15:44:26 +00:00
Eric MacDonald	9bf231a286	Fix BMC access loss handling Recent refactoring of the BMC handler FSM introduced a code change that prevents the BMC Access alarm from being raised after initial BMC accessibility was established and is then lost. This update ensures BMC access alarm management is working properly. This update also implements ping failure debounce so that a single ping failure does not trigger full reconnection handling. Instead that now requires 3 ping failures in a row. This has the effect of adding a minute to ping failure action handling before the usual 2 minute BMC access failure alarm is raised. ping failure logging is reduced/improved. Test Plan: for both hwmond and mtcAgent PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type) PASS: Verify BMC ping failure debounce handling, recovery and logging PASS: Verify BMC ping persistent failure handling PASS: Verify BMC ping periodic miss handling PASS: Verify BMC ping and access failure recovery timing PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery Regression: PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12) PASS: Verify BMC power-off request handling with BMC ping failing & recovering PASS: Verify BMC power-on request handling with BMC ping failing & recovering PASS: Verify BMC reset request handling with BMC ping failing & recovering PASS: Verify BMC sensor group read failure handling & recovery PASS: Verify sensor monitoring after ping failure handling & recovery Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe Closes-Bug: 1858110 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2020-01-03 09:34:37 -05:00
Eric MacDonald	c4b8171ddd	Refactor BMC provisioning in Maintenance The current mechanism used to preserve the learned bmc protocol in the filesystem on the active controller is problematic over swact. This update removes the file storage method in favor of preserving the learned protocol in the system inventory database as a key/value pair at the host level in already existing mtce_info database field. The specified or learned bmc access protocol is then shared with the hardware monitor through inter-daemon maintenance messaging. This update refactors bmc provisioning to accommodate bmc protocol selection at the host rather than system level. Towards that this update removes system level bmc_access_method selection in favor of host level selection through bm_type. A bm_type of 'bmc' specifies that the bmc access protocol for that host be learned. This has the effect of making it the same as what is delivered today but without support for changing it at the system level. A system inventory update will be delivered shortly that enables bmc access protocol selection at the host level. That update allows the customer to specify the bmc access protocol at the host level to be either dynamic (aka learned) or to only use 'redfish' or 'ipmi'. That system inventory update delivers that information to maintenance through bm_type via bmc provisioning. Until that update is delivered bm_type always comes in as 'bmc' which gets interpreted as 'dynamic' to maintain existing configuration. The following additional issues were also fixed in this update. 1. The nodeTimers module defaults the 'ring' member of timers that are not running to false but should be true. 2. Added a pingUtil_restart function to facilitate quicker sensor monitoring following provisioning changes and bmc access failures. 3. Enhanced the hardware monitor sensor grouping filter to accommodate non-standard Redfish readout labelling so that more sensors fall into the existing canned groups ; leads to more monitored sensors. 4. Added a 'http security mode' to hardware monitor messaging. This defaults to https as that is all that is supported by the Redfish implementation today. This field can be used to specify non-secure 'http' mode in the future when that gets implemented. 5. Ensure the hardware monitor performs a bmc password re-fetch on every provisioning change. Test Plan: PASS: Verify bmc access protocol store/fetched from the database (mtce_info) PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart PASS: Verify bmc provisioning of ipmi and redfish servers PASS: Verify learned bmc protocol persists over process restart and swact PASS: Verify process startup with protocol already learned Hardware Monitor: PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted PASS: Verify sensor model delete and relearn over ip address change PASS: Verify sensor model delete and relearn over bm_type change PASS: Verify sensor model not relearned over username change PASS: Verify bm pw is re-fetched over any (re)provisioning change PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops) PASS: Verify protocol change handling, file cleanup, model recreation PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish PASS: Verify sensor model creation waits for server power to be on PASS: Verify sensor relearn by provisioning change during model creation. (soak) Regression: PASS: Verify host power off and on. PASS: Verify BMC access alarm handling (assert and clear) PASS: Verify mtcAgent and hwmond logs add value PASS: Verify no core dumps / seg faults. PASS: Verify no mtcAgent and hwmond memory leak. PASS: Verify delete of BMC provisioned host PASS: Verify sensor monitoring, alarming, degrade and then clear cycle PASS: Verify static analysis report of changed modules. PASS: Verify host level bm_type=bmc functions as would dynamic selection PASS: Verify batch provisioning and deprovisioning (7 nodes) PASS: Verify batch provisioning to different protocol (5 nodes) PASS: Verify handling of flaky Redfish responses PEND: Verify System Install Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382 Closes-Bug: #1853471 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-12-09 09:39:49 -05:00
Eric MacDonald	66e8fbd747	Add urlencoding to ip address for redfish requests This change applies to both IPv4 and IPv6 because the specification permits it. Test Plan: PASS: Verify for both IPv4 and IPv6 addressing PASS: Verify patched change for IPv4 and IPv6 cases. Change-Id: I99dcb31c51dd287eed8eb3a038a1814763a4c600 Closes-Bug: #1852481 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-11-15 12:15:49 -05:00
Hang Li	f48eae8f35	fix spelling error Fixing spelling mistakes in notes helps us understand. Change-Id: Ic9050bd5f0141153f74d357f7405032d6aa1e1f1 Closes-Bug: #1852689	2019-11-15 14:11:52 +08:00
Zuul	7661fe5680	Merge "Removing unused flag disable_worker_services"	2019-11-04 13:52:12 +00:00
Eric MacDonald	15c036f321	Separate hardware monitor power and thermal sensor data The redfish thermal sensor data output clobbers the power sensor data. This update directs the thermal and power sensor readouts into two separate files so they are preserved for off box analysis and continued support for sensor_data product verification testing. Removed unused procedure that did not support two sensor data output files. Test Plan: PASS: Verify system install PASS: Verify power and sensor monitoring. PASS: Verify power fault insertion testing PASS: Verify thermal fault insertion testing Change-Id: Ie7717728944e93dd6fcc38a2c971189764276929 Story: 2005861 Task: 37203 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-10-17 20:53:14 -04:00
Zuul	069daf1e22	Merge "Add mtcAgent support for sm_node_unhealthy condition"	2019-10-16 19:15:03 +00:00
Zuul	6e024a648f	Merge "Modify the strlen judgement to avoid memory leak."	2019-10-15 21:21:22 +00:00
Eric MacDonald	675f49d556	Add mtcAgent support for sm_node_unhealthy condition When heartbeat over both networks fails, mtcAgent provides a 5 second grace period for heartbeat to recover before failing the node. However, when heartbeat fails over only one of the networks (management or cluster) the mtcAgent does not honour that 5 second grace period ; a bug. When it comes to peer controller heartbeat failure handling, SM needs that 5 second grace period to handle swact before mtcAgent declares the peer controller as failed, resets the node and updates the database. This update implements a change that forces a 2 second wait time between each fast enable and fixes the fast enable threshold count to be the intended 3 retries. This ensures that at least 5 seconds, actually 6 in the case of single network heartbeat loss, pass before declaring the node as failed. In addition to that, a special condition is added to detect and stop work if the active controller is sm_node_unhealthy. We don't want mtcAgent to make any database updates while in this failure mode. This gives SM the time to handle the failure according to the system's controllers' high availability handling feature. Test Plan: PASS: Verify mtcAgent behavior on set and clear of SM node unhealthy state. PASS: Verify SM has at least 5 seconds to shut down mtcAgent when heartbeat to peer controller fails for one or both networks. PASS: Test real case scenario with link pull. PASS: Verify logging in presence of real failure condition. Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9 Closes-Bug: 1847657 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-10-15 15:24:34 -04:00
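The timing claim in 675f49d556 (3 fast enable retries with a 2 second wait each, giving 6 seconds against a 5 second grace period) can be checked with a few lines. The constant and function names are hypothetical, and the sketch assumes one full wait is taken per retry, which is one reading of the commit message.

```python
# Values taken from the commit message; names are hypothetical.
FAST_ENABLE_RETRIES = 3      # intended fast enable threshold count
FAST_ENABLE_WAIT_SECS = 2    # forced wait between each fast enable
SM_GRACE_PERIOD_SECS = 5     # time SM needs to handle swact


def seconds_before_failure(retries=FAST_ENABLE_RETRIES,
                           wait_secs=FAST_ENABLE_WAIT_SECS):
    """Minimum time spent in fast enable retries before the node is failed.

    Assumes one full wait is taken per retry.
    """
    return retries * wait_secs


if __name__ == "__main__":
    elapsed = seconds_before_failure()
    print(elapsed, elapsed >= SM_GRACE_PERIOD_SECS)
```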
Zuul	f2cba8f89b	Merge "Maintenance Redfish support useability enhancements."	2019-10-10 18:38:21 +00:00
Zuul	ed22f11172	Merge "Add alarm retry support to maintenance alarm handling daemon"	2019-10-07 15:24:14 +00:00
Eric MacDonald	f2fedc0446	Add alarm retry support to maintenance alarm handling daemon The maintenance alarm handling daemon (mtcalarmd) should not drop alarm requests simply because the FM process is not running. Instead it should retry in that case and in other FM error cases that will likely succeed in time if they are retried. Some error cases, however, do need to be dropped, such as those that are unlikely to succeed with retries. Reviewed FM return codes with the FM designer, which led to a list of errors that should drop and others that should retry. This update implements that handling with a posting and servicing of a first-in / first-out alarm queue. Typical retry case is the NOCONNECT error code which occurs when FM is not running. Alarm ordering and first try timestamp is maintained. Retries and logs are throttled to avoid flooding. Test Plan: PASS: Verify success path alarm handling End-to-End. PASS: Verify retry handling while FM is not running. PASS: Verify handling of all FM error codes (fit tool). PASS: Verify alarm handling under stress (inject-alarm script) soak. PASS: Verify no memory leak over stress soak. PASS: Verify logging (success, retry, failure) PASS: Verify alarm posted date is maintained over retry success. Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86 Closes-Bug: 1841653 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-10-07 09:07:49 -04:00
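The first-in / first-out retry queue described in f2fedc0446 could be sketched like this. Everything here is illustrative rather than mtcalarmd code: `AlarmRequest`, `AlarmQueue` and the `send` callable are hypothetical, only NOCONNECT is named in the commit message as a retry case (the full retry/drop lists are not given), and throttling of retries and logs is left out.

```python
from collections import deque
from dataclasses import dataclass

# Only NOCONNECT is named in the commit message; the full list of
# retryable FM error codes is not given there.
RETRYABLE_ERRORS = {"NOCONNECT"}


@dataclass
class AlarmRequest:
    alarm_id: str
    first_try_timestamp: float  # preserved across retries


class AlarmQueue:
    def __init__(self, send):
        # 'send' is a hypothetical callable that returns "OK" or an
        # FM error code string for a given request.
        self._send = send
        self._queue = deque()

    def post(self, request):
        self._queue.append(request)

    def pending(self):
        return len(self._queue)

    def service(self):
        """Send from the head of the queue, stopping at the first
        retryable failure so that alarm ordering is maintained."""
        while self._queue:
            result = self._send(self._queue[0])
            if result in RETRYABLE_ERRORS:
                return  # head stays queued for the next service pass
            # Delivered, or failed with an error not worth retrying.
            self._queue.popleft()


if __name__ == "__main__":
    fm_running = {"up": False}
    delivered = []

    def fake_send(request):
        if not fm_running["up"]:
            return "NOCONNECT"
        delivered.append(request)
        return "OK"

    queue = AlarmQueue(fake_send)
    queue.post(AlarmRequest("200.004", 100.0))
    queue.post(AlarmRequest("200.005", 101.0))
    queue.service()            # FM down: nothing delivered, nothing dropped
    fm_running["up"] = True
    queue.service()            # FM up: both delivered in posting order
    print([r.alarm_id for r in delivered], queue.pending())
```

Stopping at the head on a retryable error, instead of skipping ahead, is what keeps later alarms from overtaking earlier ones.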
Eric MacDonald	4c541f50d4	Maintenance Redfish support useability enhancements. This update is a result of changes made during a suite of end-to-end provisioning, reprovisioning and deprovisioning customer experience testing of the maintenance RedFish support feature. 1. Force reconnection and password fetch on provisioning changes 2. Force reconnection and password fetch on persistent connection failures 3. Fix redfish protocol learning (string compare) in hardware monitor 4. Improve logging for some typical error paths. Test Plan: PASS: Verify handling of reprovisioning BMC between hosts that support different protocols. PASS: Verify handling of reprovisioning ip address to host that leads to a different protocol select. PASS: Verify manual relearn handling to recover from errors that result from the above case. PASS: Verify host BMC deprovisioning handling and cleanup. PASS: Verify sensor monitoring. PASS: Verify hwmond sticks with a selected protocol once a sensor model has been created using that protocol. PASS: Verify handling of BMC reprovision - ip address change only PASS: Verify handling of BMC reprovision - username change only FAIL: Verify handling of BMC reprovision - password change only https://bugs.launchpad.net/starlingx/+bug/1846418 Change-Id: I4bf52a5dc3c97d7794ff623c881dff7886234e79 Closes-Bug: #1846212 Story: 2005861 Task: 36606 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-10-03 11:57:58 -04:00

160 Commits