Commit Graph

19 Commits

Author SHA1 Message Date
Enzo Candotti 23143abbca Update crashDumpMgr to source config from envfile
This commit updates the crashDumpMgr service in order to:
- Clean up the current service naming and packaging to follow the
  standard Linux naming convention:
    - Repackage /etc/init.d/crashDumpMgr as
      /usr/sbin/crash-dump-manager
    - Rename crashDumpMgr.service to crash-dump-manager.service
- Add an EnvironmentFile to the crash-dump-manager service file to
  source configuration from /etc/default/crash-dump-manager.
- Update ExecStart of the crash-dump-manager service to use parameters
  from the EnvironmentFile.
- Update crash-dump-manager service dependencies to run after
  config.service.
- Update the logrotate configuration to support the maximum-files
  retention policy. The “rotate 1” option was removed to let
  crash-dump-manager manage the pruning of old files itself.
- Modify the crash-dump-manager script so that the max_files parameter
  can be lowered. If more files currently exist than the new max_files
  value allows, the oldest files are deleted the next time a crash
  dump file needs to be stored, bringing the count back within the
  new limit (see the sketch below).
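
A minimal sketch of the pieces described above, assuming hypothetical
variable and flag names derived from the parameters listed in the test
plan (the packaged unit file and script differ in detail):

    # Illustrative only -- not the packaged script or unit file.
    # /etc/default/crash-dump-manager (EnvironmentFile) might define
    # MAX_FILES, MAX_SIZE, MAX_USED and MIN_AVAILABLE, and the unit
    # would pass them along, e.g.:
    #   EnvironmentFile=/etc/default/crash-dump-manager
    #   ExecStart=/usr/sbin/crash-dump-manager --max-files ${MAX_FILES} ...
    #
    # Pruning sketch: before storing a new bundle, delete the oldest
    # bundles until the count honours a (possibly lowered) max_files.
    crash_dir=/var/log/crash            # directory is an assumption
    . /etc/default/crash-dump-manager   # provides MAX_FILES
    while [ "$(ls -1 "${crash_dir}" | wc -l)" -ge "${MAX_FILES}" ]; do
        oldest=$(ls -1tr "${crash_dir}" | head -n 1)
        rm -f "${crash_dir}/${oldest}"
    done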

Test Plan:

PASS: Build ISO and perform a fresh install. Verify the new
crash-dump-manager service is enabled and working as expected.
PASS: Add and apply new crashdump service parameters and force a kernel
panic. Verify that after the reboot, the max_files, max_used,
min_available and max_size values are updated according to the service
parameter values.
PASS: Verify that the crashdump files are rotated as expected.

Story: 2010893
Task: 48910

Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-10-06 23:06:54 +00:00
Zuul 829c7b1db6 Merge "Add /var/crash dump management to maintenance." 2020-10-18 04:34:31 +00:00
Eric MacDonald 85f605a762 Add /var/crash dump management to maintenance.
The Linux kernel can be configured to perform a
crash dump and reboot in response to specific,
typically serious, events.

A crash dump event produces a crash dump report
bundle (directory) of files that represent the
state of the kernel at the time of the event,
which is useful for post-event root cause analysis.

The kernel directs new crash dump bundles to
/var/crash/<dated vmcore bundle>. Crash dump
bundles are quite large and, if too many occur,
can fill up the target filesystem.

This update adds crash dump bundle management
to maintenance with a new crashDumpMgr
service script and installs a crash dump
log rotation configuration file to
compress/preserve the first crash bundle and
compress/rotate all subsequent bundles.
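
A minimal sketch, assuming a hypothetical config name, glob and option
set (the packaged logrotate configuration differs):

    # Illustrative only: install a minimal logrotate entry for the
    # tarred crash bundles.
    printf '%s\n' \
        '/var/log/crash/vmcore*.tar {' \
        '    rotate 1' \
        '    compress' \
        '    missingok' \
        '    notifempty' \
        '}' > /etc/logrotate.d/crashdump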

With repeated crash dumps and the help of
background log rotation, this update produces
the following compressed crash dump bundles:

controller-1:~$ ls -lrth /var/log/crash
total 238M
-rw-r--r-- 1 root 77M <date> vmcore_first.tar.1.gz
-rw-r--r-- 1 root 75M <date> vmcore.tar.1.gz

Change-Id: I2741e610c6c417d7fc14dfada283a1edacd9327f
Partial-Fix: 1898602
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-10-17 13:11:46 +00:00
Don Penney b3cafc802c Setup mtce logfile config
Move mtce logfile configuration from central syslog-ng config to the
mtce package.

Change-Id: Ia9da3ce48cd73c275b3a3f6ecfaeabf0fff8c24b
Story: 2008251
Task: 41100
Depends-On: https://review.opendev.org/757947
Signed-off-by: Don Penney <don.penney@windriver.com>
2020-10-16 10:43:01 -04:00
Eric MacDonald 0ec31b805a Add maintenance BMC info collect script
This update adds a collect_bmc script to the collect tool.

This new script collects the following data into the
var/extra/bmc.info file of the collect tarball
(a sketch follows the list):

- host date
- bmc sel date
- bmc sel config
- bmc sel logs
- bmc sensor list
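
A minimal sketch of that collection, assuming a hypothetical
destination path (the real collect_bmc differs in detail):

    # Append host and BMC data to the bmc.info file of the collect
    # bundle. Remote BMC options (-H/-U/-P) are omitted for brevity.
    info=/scratch/collect/var/extra/bmc.info   # path is an assumption
    {
        echo "host date:"      ; date
        echo "bmc sel date:"   ; ipmitool sel time get
        echo "bmc sel config:" ; ipmitool sel info
        echo "bmc sel logs:"   ; ipmitool sel elist
        echo "bmc sensors:"    ; ipmitool sensor list
    } >> "${info}"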

Change-Id: I925bc19aba0eb72888f2fcdc31922ff763d88cd1
Closes-Bug: 1898788
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-10-15 15:41:51 -04:00
Eric MacDonald a40175ec84 Restrict access privilege of mtce config files and daemons
This update modifies the maintenance daemons and config
files to restrict their access privilege to root.
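
For illustration only, tightening one config file might look like the
following (the actual file list and modes come from the packaging
changes in this commit; the path shown is a hypothetical example):

    chown root:root /etc/mtc.ini   # hypothetical config file path
    chmod 600 /etc/mtc.ini         # readable/writable by root only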

A storage system was installed successfully and the permissions
of the affected files were verified.

Change-Id: I9c8e2e36f897c31d54ea5ade884a004c12251493
Closes-Bug: 1887403
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-07-14 19:49:57 -04:00
Sharath Kumar K b725a0974b De-branding in starlingx/metal: Titanium Cloud -> StarlingX
1. Rename Titanium Cloud to StarlingX for .spec files
2. Rename Titanium Cloud to StarlingX for .service file

Test:
After the de-brand change, bootimage.iso was built in the flock layer
and installed on the dev machine to validate the changes.

Please note: the de-brand changes are being done in batches; this is batch 1.

Story: 2006387
Task: 36207

Change-Id: Ifa4dc5c7aa3189815e00b796fc833852e88c8fe3
Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>
2020-04-03 07:58:25 +02:00
Eric MacDonald 804ec52227 Add redfish support detection to maintenance
This update

1. Refactors some of the common maintenance ipmi
   definitions and utilities into a more generic
   'bmcUtil' module to reduce code duplication and
   improve code reuse with the introduction of a second
   bmc communication protocol: redfish.

2. Creates a new 'redFishUtil' module similar to the existing
   'ipmiUtil' module but in support of common redfish
   utilities and definitions that can be used by both
   maintenance and the hardware monitor.

3. Moves the existing 'mtcIpmiUtil' module to a more common
   'mtcBmcUtil' and renames the 'ipmi_command_send/recv' to
   the more generic 'bmc_command_send/recv' which are enhanced
   to support both ipmi and redfish bmc communication methods.

4. Renames the bmc info collection and connection monitor
   'bm_handler' to 'bmc_handler' and adds the support necessary
   to learn if a host's bmc supports redfish.

5. Renames the existing 'mtcThread_ipmitool' to a more common
   'mtcThread_bmc', adds redfishtool support for the now common
   set of bmc thread commands, and adds a new redfishtool bmc
   query, aka 'redfish root query', used to detect whether a
   host's bmc supports redfish.

   Note: This aspect is the primary feature of this update,
         namely the ability to detect and log whether a host's
         bmc supports redfish (see the example probe below).
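
   An equivalent manual probe of the standard Redfish service root
   might look like this (the address is a placeholder, and curl is
   used here only for illustration; mtce performs the query through
   redfishtool):

      # A valid JSON service-root response implies the BMC supports
      # redfish; a connect failure or 404 implies ipmi-only.
      bmc_ip=203.0.113.10   # placeholder BMC address
      curl -sk "https://${bmc_ip}/redfish/v1/"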

Test Plan:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reinstall (netboot) command handling.
PASS: Verify logging when redfish is not supported.
PASS: Verify logging when redfish is supported.
PASS: Verify ipmitool is used regardless of redfish support.
PASS: Verify mtce thread error handling for both protocols.

Change-Id: I72e63958f61d10f5c0d4a93a49a7f39bdd53a76f
Story: 2005861
Task: 35825
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-08-19 14:03:37 +00:00
Zuul fe77c236e3 Merge "Refactor infrastructure network in mtce code" 2019-04-23 21:12:41 +00:00
Teresa Ho 8e51a1660a Refactor infrastructure network in mtce code
Updated the code to read the host's cluster-host parameter from the
/etc/hosts file.
Replaced references to the infra network with the cluster-host network.

Story: 2004273
Task: 29473

Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2019-04-18 09:32:41 -04:00
Kristine Bujold bee31d98c8 Remove wrs-guest-heartbeat SDK Module
With the StarlingX move to supporting pure upstream OpenStack, the
majority of the SDK Modules are related to functionality no longer
supported. The remaining SDK Modules will be moved to StarlingX
documentation.

Story: 2005275
Task: 30565

Change-Id: Ifc560a6865d045ab3bd93923811aeb5f8ac7f030
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2019-04-17 13:38:18 -04:00
Eric MacDonald f10b9a5170 Add mtce dependency on ipmitool
ipmitool was recently found to be missing from the load after
an rpm cleanup that seemed to remove all dependencies on it.

Maintenance and its Hardware Monitor use the ipmitool
for power / reset control as well as sensor monitoring.

This update adds a dependency on ipmitool in the maintenance
mtcAgent and hwmon rpm build recipe so that it will always
be included in the load with maintenance.
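
The dependency can be sanity-checked on an installed system roughly as
follows (the package names queried here are assumptions about the
build output):

    # Confirm the maintenance packages declare ipmitool as a requirement.
    rpm -q --requires mtce mtce-hwmon | grep -i ipmitool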

Closes-Bug: 1821958

Test Plan:
PASS: Verify ipmitool in load
PASS: Verify mtce and hwmon rpm dependency on ipmitool
PASS: Verify system install

Change-Id: I958a2365f6df7bdbf942bc57c1aa17ee2ae6a73d
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-28 15:36:12 -04:00
Eric MacDonald f55ef546a7 Remove Resource Monitor ; aka rmon, from the load
All rmon resource monitoring has been moved to collectd.

This update removes rmon from mtce and the load.

Story: 2002823
Task: 30045

Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d

Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-19 16:12:38 -04:00
Eric MacDonald 7941ee5bbb Add new Link Monitor (lmond) daemon to Mtce
This update introduces a new Link Monitor daemon to the Mtce
flock of daemons and disables rmon's interface monitoring.

This new daemon parses the platform.conf file and, using the
interface names assigned to each monitored network (mgmt,
infra and oam), queries the kernel for their physical,
bonded and vlan interface names and then registers to listen
for netlink events.

All link/interface state change (netlink) events that correspond
to any of the interfaces or links associated with the monitored
networks are tracked by this new daemon.

This new daemon also implements an http listener for
localhost-initiated GET requests targeted at /mtce/lmond
on port 2122 and responds with a json link_info string that
contains a summary of monitored networks, links and their
current Up/Down status (see the example query after the
summary below).

lmond behavioral summary:
  1. learn interface/port model,
  2. load initial link status for learned links,
  3. listen for link status change events,
  4. provide link status info to http GET query requests.
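
For example, the current link status can be fetched locally as shown
below (the JSON field layout beyond the summary described above is
not spelled out here):

    # Query lmond's localhost HTTP listener for the link_info summary.
    curl -s http://localhost:2122/mtce/lmond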

Another update to stx-integ implements the collectd interface
plugin that periodically issues the Link Status GET requests
for the purpose of alarming port and interface Down conditions,
clearing alarms on Up state changes, and storing sample data
that represents the percentage of active links for each monitored
network.

Test Plan:

PASS: Verify lmond process startup
PASS: Verify lmond logging and log rotation
PASS: Verify lmond process monitoring by pmon
PASS: Verify lmond interface learning on process startup
PASS: Verify lmond port learning on process startup
PASS: Verify lmond handling of vlan and bond interface types
PASS: Verify lmond http link info GET Query handling
PASS: Verify lmond has no memory leak during normal and eventful operation

Change-Id: I58915644e60f31e3a12c3b451399c4f76ec2ea37
Story: 2002823
Task: 28635
Depends-On:
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-01 14:57:40 -05:00
Eric MacDonald f7031cf5fb Add NTP server monitoring as a collectd plugin
This update disables rmon NTP monitoring, which is now done
as a collectd plugin with the following depends update.

Story: 2002823
Task: 22859

Depends-On: https://review.openstack.org/#/c/628685/
Change-Id: I736703542c8a6ba3dd9e9db2d6fb7ccbdc906643
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-11 09:15:58 -05:00
Tao Liu 9661e49411 Change compute node to worker node personality
This update replaces compute references to worker in mtce,
kickstarts, installer and bsp files.

Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration

Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked and enabled with no alarms

Story: 2004022
Task: 27013

Depends-On: https://review.openstack.org/#/c/624452/

Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
2018-12-13 13:08:48 -05:00
Eric MacDonald 0b922227ac Implement Active-Active Heartbeat as HA Improvement
This update introduces mtce changes to support Active-Active Heartbeating.

The purpose of Active-Active Heartbeating is to help avoid Split-Brain.

Active-Active heartbeating has each controller maintain a 5-second
heartbeat response history cache of each network for all monitored
hosts, as well as the ongoing health of storage-0 if provisioned and
enabled.

This is referred to as the 'heartbeat cluster history'.

Each controller then includes its cluster history in each heartbeat
pulse request message.

The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.

So when the hbsClient receives a pulse request from controller-0, it
saves the reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request, i.e. controller-1's view of the system.

Controller-0, on receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system, from both controllers'
perspectives. The same applies to controller-1 with controller-0's
history.

The hbsAgent is then further enhanced to support a query request
for this information.

So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5-second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.

This involved removing the hbsAgent process from SM control and
monitoring, and adding a new hbsAgent LSB init script for process
launch, a service file to run the init script and a pmon config file
for hbsAgent process monitoring.

With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.

The hbsAgent running on the inactive controller:
 - does not send heartbeat events to maintenance
 - does not raise or clear alarms or produce customer logs

Test Plan:

Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms

Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog  logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump

Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push

Story: 2003576
Task: 24907

Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-20 19:57:18 +00:00
Eric MacDonald 8a223f395d Mtce: Add heartbeat cluster information for SM query
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.

The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view,
reflecting each controller's connectivity to each host,
including its peer controller.

The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two-controller system.

Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.

General flow of cluster feature wrt hbsAgent:

  hbs_cluster_init: general data init
  hbs_cluster_nums: set controller and network numbers
  forever:

    select:
      hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
      hbs_sm_handler -> hbs_cluster_send: - send cluster to SM

    heartbeating:
      hbs_cluster_append: add controller cluster to pulse request
      hbs_cluster_update: get controller cluster data from pulse responses
      hbs_cluster_save: save other controller cluster view in cluster vault
      hbs_cluster_log: log cluster state changes (clog)

Test Plan:

  PASS: Verify compute system install
  PASS: Verify storage system install
  PASS: Verify cluster data ; all members of structure
  PASS: Verify storage-0 state management
  PASS: Verify add of second controller
  PASS: Verify add of storage-0 node
  PASS: Verify behavior over Swact
  PASS: Verify lock/unlock of second controller ; overall behavior
  PASS: Verify lock/unlock of storage-0 ; overall behavior
  PASS: Verify lock/unlock of storage-1 ; overall behavior
  PASS: Verify lock/unlock of compute nodes ; overall behavior
  PASS: Verify heartbeat failure and recovery of compute node
  PASS: Verify heartbeat failure and recovery of storage-0
  PASS: Verify heartbeat failure and recovery of controller
  PASS: Verify delete of controller node
  PASS: Verify delete of storage-0
  PASS: Verify delete of compute node
  PASS: Verify cluster when controller-1 active / controller-0 disabled
  PASS: Verify MNFA and recovery handling
  PASS: Verify handling in presence of multiple failure conditions
  PASS: Verify hbsAgent memory leak soak test with continuous SM query.
  PASS: Verify active controller-1 infra network failure behavior.
  PASS: Verify inactive controller-1 infra network failure behavior.

Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-10-05 22:47:17 +00:00
Jim Gauld 6a5e10492c Decouple Guest-server/agent from stx-metal
This decouples the build and packaging of guest-server and guest-agent
from mtce by splitting the guest component into the stx-nfv repo.

This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.

Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.

The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
  service-mgmt, sm, and sm-api

mtce-common:
- contains common and daemon shared source utility code

mtce-common-dev:
- based on mtce-common, contains devel package required to build
  mtce-guest and mtce
- contains common library archives and headers

mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
  maintenance, mtclog, pmon, public, rmon

mtce-guest:
- contains guest component guest-server, guest-agent

Story: 2002829
Task: 22748

Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2018-09-18 17:15:08 -04:00