Commit Graph

21 Commits

Author SHA1 Message Date
Eric MacDonald 7d8be4bc1f Add auto-versioning to starlingx/metal mtce packages
This update makes use of the PKG_GITREVCOUNT variable
to auto-version the mtce packages in this repo.

Change-Id: Ifb4da4570e0261bbdcf0d7af79b8add7cfc133ac
Story: 2006166
Task: 39822
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-05-21 15:18:43 -04:00
Eric MacDonald 0826882308 Add mtcAgent socket initialization failure retry handling.
The main maintenance process (mtcAgent) exits on a process start-up
socket initialization failure. SM restarts the failed process within
seconds and will swact if the second restart also fails. The time from
startup to swact can be as short as 4 seconds, which is too short to
handle a collision with a manifest.

This update adds a number of socket initialization retries to extend
the time the process has to resolve socket initialization failures by
giving the collided manifest time to complete between retries.

The number of retries and the inter-retry wait time are calibrated to
ensure that a persistently failing mtcAgent process exits in under 40 seconds.

This is to ensure that SM is able to detect and swact away from a
persistently failing maintenance process while also giving the process
a few tries to resolve on its own.
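
A minimal sketch of the thresholded retry approach described above;
the retry count, wait time and the init stub are assumptions for
illustration, not the actual mtcAgent values:

  // Illustrative retry sketch only ; values and the stub are assumed.
  #include <unistd.h>
  #include <cstdio>

  // Hypothetical stand-in for the real socket initialization routine.
  static bool mtc_socket_init(void) { return false; /* simulate failure */ }

  int main(void)
  {
      const int MAX_RETRIES    = 6;  /* assumed values chosen so that a   */
      const int RETRY_WAIT_SEC = 5;  /* persistent failure exits in < 40s */

      bool ok = false;
      for (int attempt = 1; attempt <= MAX_RETRIES + 1 && !ok; ++attempt)
      {
          ok = mtc_socket_init();
          if (!ok && attempt <= MAX_RETRIES)
          {
              fprintf(stderr, "socket init failed ; retry %d of %d in %d secs\n",
                      attempt, MAX_RETRIES, RETRY_WAIT_SEC);
              sleep(RETRY_WAIT_SEC); /* give a collided manifest time to finish */
          }
      }
      if (!ok)
          return 1; /* persistent failure ; exit so SM can detect and swact */

      /* ... continue normal mtcAgent startup ... */
      return 0;
  }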

Test Plan:

PASS: Verify socket init failure thresholded retry handling
      with no, persistent and recovered failure conditions.
PASS: Verify swact if socket init failure is persistent
PASS: Verify no swact if socket failure recovers after first exit
PASS: Verify no swact if socket failure recovers over init retry
PASS: Verify an hour long soak of continuous socket open/close retry

Change-Id: I3cb085145308f0e920324e22111f40bdeb12b444
Closes-Bug: 1869192
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-04-01 19:24:22 +00:00
Eric MacDonald da7b2e94f1 Modify Mtce Reinstall FSM to first power-off BMC provisioned hosts
This update only applies to servers that support and are provisioned
for a Board Management Controller (BMC).

The BMC of some servers silently rejects the 'set next boot device'
command while the server is executing BIOS.

The current reinstall algorithm when the BMC is provisioned starts by
detecting the power state of the target server. If the power is off
it will 'first power it on' and then proceed to 'set next boot device'
to pxe followed by a reset. For the initial power off state case, the
timing of these operations is such that the server is in BIOS when the
'set next boot device' command is issued.

This update modifies the host reinstall algorithm to first power off
the server, then set the next boot device while the server is confirmed
to be powered off, and finally power it back on. This ensures the server
receives and handles the 'set next boot device' command properly.
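
The new ordering can be sketched roughly as follows; the bmc_* helpers
are hypothetical stand-ins, not the actual mtce BMC command interface:

  // Ordering sketch only ; helpers are stubbed for illustration.
  #include <cstdio>

  static bool bmc_power_off(void)         { printf("power off\n");      return true; }
  static bool bmc_power_is_off(void)      { return true; /* stubbed confirmation */ }
  static bool bmc_set_next_boot_pxe(void) { printf("next boot: pxe\n"); return true; }
  static bool bmc_power_on(void)          { printf("power on\n");       return true; }

  static bool reinstall_start(void)
  {
      if (!bmc_power_off())
          return false;

      /* confirm the power-off before issuing 'set next boot device' ;
       * a server that is executing BIOS may silently reject it */
      while (!bmc_power_is_off())
          ;   /* the real handler polls with a timeout */

      if (!bmc_set_next_boot_pxe())
          return false;

      return bmc_power_on();   /* server now boots from pxe for reinstall */
  }

  int main() { return reinstall_start() ? 0 : 1; }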

This update also fixes a race condition between the bmc_handler and
power_handler by moving the final power state update in the power
handler to the power done phase.

Test Plan:

Verify all new reinstall failure path handling via fault insertion testing
Verify reinstall of powered off host
Verify reinstall of powered on host
Verify reinstall of Wildcat server with ipmi
Verify reinstall of Supermicro server with ipmi and redfish
Verify reinstall of Ironpass server with ipmi
Verify reinstall of WolfPass server with redfish and ipmi
Verify reinstall of Dell server with ipmi

Over 30 reinstalls were performed across all server types, with initial
power on and off using both ipmi and redfish (where supported).

Change-Id: Iefb17e9aa76c45f2ceadf83f23b1231ae82f000f
Closes-Bug: 1862065
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-02-12 15:44:26 +00:00
Eric MacDonald 9bf231a286 Fix BMC access loss handling
Recent refactoring of the BMC handler FSM introduced a code change that
prevents the BMC Access alarm from being raised when BMC accessibility
is lost after having initially been established.

This update ensures BMC access alarm management is working properly.

This update also implements ping failure debounce so that a single ping
failure does not trigger full reconnection handling. Instead, that now
requires 3 ping failures in a row. This has the effect of adding a minute
to ping failure action handling before the usual 2 minute BMC access failure
alarm is raised. Ping failure logging is also reduced and improved.
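
A minimal sketch of the debounce idea; the counter structure below is
an illustration of the 3-in-a-row rule above, not the actual mtce ping
utility code:

  #include <cstdio>

  struct ping_debounce
  {
      int misses    = 0;
      int threshold = 3;      /* consecutive failures before action */

      /* returns true when full reconnection handling should start */
      bool update(bool ping_ok)
      {
          if (ping_ok)
          {
              misses = 0;     /* any success clears the debounce count */
              return false;
          }
          return (++misses >= threshold);
      }
  };

  int main()
  {
      ping_debounce d;
      bool samples[] = { true, false, true, false, false, false };
      for (bool ok : samples)
          printf("ping %s -> reconnect:%s\n", ok ? "ok  " : "miss",
                 d.update(ok) ? "yes" : "no");
      return 0;
  }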

Test Plan: for both hwmond and mtcAgent

PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
PASS: Verify BMC ping failure debounce handling, recovery and logging
PASS: Verify BMC ping persistent failure handling
PASS: Verify BMC ping periodic miss handling
PASS: Verify BMC ping and access failure recovery timing
PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

Regression:

PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
PASS: Verify BMC power-off request handling with BMC ping failing & recovering
PASS: Verify BMC power-on request handling with BMC ping failing & recovering
PASS: Verify BMC reset request handling with BMC ping failing & recovering
PASS: Verify BMC sensor group read failure handling & recovery
PASS: Verify sensor monitoring after ping failure handling & recovery

Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
Closes-Bug: 1858110
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-01-03 09:34:37 -05:00
Eric MacDonald a42301c19b Make successful pmon-restart clear failed restarts count
The pmon-restart service, through a call to respawn_process,
increments that process's restart counter but does not clear
that counter after a successful restart.

So each pmon-restart mistakenly contributes to that process's
failure count, pre-loading the restart counter by one for every
pmon-restart of that process.

The effect is best described by example.
Say a process is pmon-restarted 4 times during one day, which
increments that process's restart counter to 4. Assuming its
conf file specifies a threshold of 3, it has already exceeded
its threshold. Then, even if that process experiences a real
failure days later, pmon will immediately take the severity
action because the failure threshold has already been exceeded.

This update ensures a process's restart counter is cleared
after a successful pmon-restart operation, in the process PID
registration phase of recovery.
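
Roughly, the fix amounts to clearing the counter at PID registration
time; a sketch with hypothetical field names, not pmond's real process
record:

  struct process_config
  {
      int restarts_cnt = 0;   /* failure/restart counter (thresholded)   */
      int threshold    = 3;   /* from the process conf file              */
      int pid          = 0;
  };

  /* called in the pid registration phase once the restarted process is
   * confirmed up ; a successful pmon-restart must not be counted as a
   * failure, so the counter is cleared here */
  void register_restarted_process(process_config& p, int new_pid)
  {
      p.pid = new_pid;
      p.restarts_cnt = 0;   /* the fix: don't pre-load the failure threshold */
  }

  int main()
  {
      process_config hbsClient;
      hbsClient.restarts_cnt = 2;               /* prior pmon-restarts */
      register_restarted_process(hbsClient, 12345);
      return hbsClient.restarts_cnt;            /* 0 : counter cleared */
  }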

Test Plan:

PASS: Verify pmon-restart continues to work.
PASS: Verify proper thresholding of failed process following
      many pmon-restart operations.
PEND: Verify pmon-restart and process failure automated test script
      against this update. 5 loops, all processes.

Change-Id: Ib01446f2e053846cd30cb0ca0e06d7c987cdf581
Closes-Bug: 1853330
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-11-21 14:58:28 +00:00
Don Penney e356019f95 Update host watchdog CONFIG_MASK
The CONFIG_MASK in hostw.h includes CONFIG_START_DELAY, which
correlates to an option that is not actually used in host watchdog. As
a result, the recently added check that verifies all options in the
CONFIG_MASK are set fails, and the host watchdog fails to launch.

This update removes the unused CONFIG_START_DELAY bit from
CONFIG_MASK.
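
For illustration, a config-mask completeness check of the kind
described above might look like this; the bit names and values are
made up for the example, not the hostw.h defines:

  #include <cstdio>

  #define CONFIG_AGENT_PORT   0x1
  #define CONFIG_CLIENT_PORT  0x2
  #define CONFIG_QUORUM_WAIT  0x4
  /* a CONFIG_START_DELAY bit is excluded from the mask because the host
   * watchdog never parses that option, so its bit would never be set */
  #define CONFIG_MASK (CONFIG_AGENT_PORT | CONFIG_CLIENT_PORT | CONFIG_QUORUM_WAIT)

  int main()
  {
      unsigned int parsed_mask = CONFIG_AGENT_PORT | CONFIG_CLIENT_PORT |
                                 CONFIG_QUORUM_WAIT;  /* bits set as options parse */

      if ((parsed_mask & CONFIG_MASK) != CONFIG_MASK)
      {
          fprintf(stderr, "configuration incomplete ; refusing to launch\n");
          return 1;
      }
      printf("all required configuration options present\n");
      return 0;
  }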

Change-Id: I330e15520bc0f01a6cbfd4f83a1953c1c737da2b
Partial-Bug: 1835370
Signed-off-by: Don Penney <don.penney@windriver.com>
2019-10-30 16:40:56 -04:00
Eric MacDonald aefc81ec91 Fix hardware monitor degrade event handling
Long hostname support introduced a bug that causes the mtcAgent
to reject hardware monitor degrade requests because the originating
service (daemon) is not recognized.

This update fixes the parsed parameters in mtcAgent and adds
a sensor parameter to the degrade API so that the sensor name
accompanying the degrade event can be logged in mtcAgent.

Test Plan: for hwmond degrade handling

PASS: verify degrade assert and sensor name in mtcAgent degrade assert log
PASS: Verify degrade clear handling and log

Change-Id: I5c11cc5f679f21e6aadd4d5be25e6c08a241e80b
Closes-Bug: 1838020
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-07-26 09:02:08 -04:00
Don Penney fd962863ce Set restricted permissions for mtce logfiles
This update sets the umask for the mtclog daemon
to restrict permissions on logfiles it creates.
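
A minimal umask sketch; the 027 value is an assumption for
illustration, the commit does not state which mask mtclog actually
uses:

  #include <sys/stat.h>
  #include <cstdio>

  int main()
  {
      umask(027);   /* files created after this get no 'other' access
                     * and no group write (e.g. 0640 for 0666 requests) */
      FILE* fp = fopen("/tmp/mtclog_example.log", "w");
      if (fp) fclose(fp);
      return 0;
  }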

Change-Id: I712ecd46e4c550b946dd1df39557a8e0a87dad3d
Partial-Bug: 1836632
Signed-off-by: Don Penney <don.penney@windriver.com>
2019-07-17 18:19:52 -04:00
Eric MacDonald fb89f62a9e Fix hbsAgent mgmnt interface name corruption
The hbsAgent socket initialization code declares a local string
variable, loads the management interface name into it, and sets
the static management interface config pointer to that variable.

When that local (stack) variable falls out of scope, its memory is
recycled and overwritten by its new owner, so the data the management
interface config pointer points to changes. This leads to the reported
error condition and log flooding.

This update allocates dedicated memory for the interface name instead.
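
A condensed illustration of the bug and the fix, with simplified
stand-in names rather than the actual hbsAgent code:

  #include <cstring>
  #include <cstdio>

  static const char* mgmnt_iface_ptr = nullptr;   /* static config pointer */

  void socket_init_buggy(const char* iface_from_conf)
  {
      char iface[64];                              /* local (stack) storage */
      snprintf(iface, sizeof(iface), "%s", iface_from_conf);
      mgmnt_iface_ptr = iface;    /* BUG: points at stack memory that is
                                   * recycled when this function returns,
                                   * corrupting the interface name later */
  }

  void socket_init_fixed(const char* iface_from_conf)
  {
      mgmnt_iface_ptr = strdup(iface_from_conf);   /* allocate dedicated
                                                    * memory that outlives
                                                    * the init function */
  }

  int main()
  {
      socket_init_fixed("ens801f0");               /* example interface name */
      printf("mgmnt iface: %s\n", mgmnt_iface_ptr);
      return 0;
  }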

Change-Id: I858bcc69455f1d915f2873c47a75dd1139cf8fcb
Closes-Bug: 1829608
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-22 08:45:14 -04:00
Eric MacDonald 5c043f7ca9 Make Mtce ignore heartbeat events from in-active controller.
There is the potential for a race condition that can lead to
mtce incorrectly failing hosts due to heartbeat failure event
messages sourced from the inactive controller.

During a split brain recovery scenario there was a swact
which left the hbsAgent on the new standby controller thinking
it was still on the active controller.

This specific split brain failure mode was one where the active,
and then (after swact) standby, controller was failing heartbeat
to its peer and other nodes in the system even though the new
active controller saw heartbeat working fine.

The problem was that the inactive controller detected and sent
a heartbeat loss message to mtce before mtce was able to update
the inactive controller's heartbeat activity status, which would
otherwise have gated the loss event.

This update adds an additional layer of protection by intentionally
ignoring heartbeat events from the inactive controller that might
slip through due to this activity state change race condition.
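
A sketch of the gating check; the event source comparison below is a
hypothetical simplification of the mtcAgent heartbeat event handler:

  #include <string>
  #include <cstdio>

  static std::string active_controller = "controller-0";

  bool accept_heartbeat_event(const std::string& source_controller)
  {
      /* additional protection layer: drop loss events that slip through
       * from the inactive controller during an activity-state change */
      if (source_controller != active_controller)
      {
          printf("ignoring heartbeat event from inactive %s\n",
                 source_controller.c_str());
          return false;
      }
      return true;
  }

  int main()
  {
      accept_heartbeat_event("controller-1");   /* ignored */
      accept_heartbeat_event("controller-0");   /* handled */
      return 0;
  }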

Also fixed a log flooding issue in the hbsAgent on large systems.

Change-Id: I825a801166b3e80cbf67945c7f587851f4e0d90b
Closes-Bug: 1813976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-09 14:42:01 +00:00
Eric MacDonald b13750d5f0 Make Mtce system mode scan case in-sensitive
Mtce is looking for 'Standard' in system_type and will
default to AIO if the 'S' in 'standard' is not uppercase.

This update makes the mtce system_type driver handle
system_type and system_mode readings case-insensitively.

Tested on-system with case variations for both type and mode
in platform.conf.
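
A case-insensitive match of the kind described above might look like
this sketch; the helper name and the AIO fallback comment are
illustrative, not the exact mtce code:

  #include <algorithm>
  #include <cctype>
  #include <string>
  #include <cstdio>

  static std::string tolower_copy(std::string s)
  {
      std::transform(s.begin(), s.end(), s.begin(),
                     [](unsigned char c){ return std::tolower(c); });
      return s;
  }

  bool is_standard_system(const std::string& system_type)
  {
      /* 'Standard', 'standard', 'STANDARD', ... all match ;
       * anything else falls back to the AIO default */
      return tolower_copy(system_type).find("standard") != std::string::npos;
  }

  int main()
  {
      printf("standard:   %d\n", is_standard_system("standard"));
      printf("Standard:   %d\n", is_standard_system("Standard"));
      printf("All-in-one: %d\n", is_standard_system("All-in-one"));
      return 0;
  }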

Closes-Bug: 1827904
Change-Id: I5e33097e1b13e5b5d385929dd13e7912ae89ead8
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-06 19:14:14 +00:00
Kristine Bujold bee31d98c8 Remove wrs-guest-heartbeat SDK Module
With the StarlingX move to supporting pure upstream OpenStack, the
majority of the SDK Modules are related to functionality no longer
supported. The remaining SDK Modules will be moved to StarlingX
documentation.

Story: 2005275
Task: 30565

Change-Id: Ifc560a6865d045ab3bd93923811aeb5f8ac7f030
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2019-04-17 13:38:18 -04:00
Eric MacDonald f10b9a5170 Add mtce dependency on ipmitool
ipmitool was recently found to be missing from the load after
an rpm cleanup that appeared to remove all dependencies on it.

Maintenance and its Hardware Monitor use ipmitool
for power/reset control as well as sensor monitoring.

This update adds a dependency on ipmitool in the maintenance
mtcAgent and hwmon rpm build recipe so that it will always
be included in the load with maintenance.

Closes-Bug: 1821958

Test Plan:
PASS: Verify ipmitool in load
PASS: Verify mtce and hwmon rpm dependency on ipmitool
PASS: Verify system install

Change-Id: I958a2365f6df7bdbf942bc57c1aa17ee2ae6a73d
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-28 15:36:12 -04:00
Eric MacDonald 9d38f56f7f pmond: don't error log first active pulse miss
Change-Id: I31ef5e290993e8d6b492d0d9b58709b854c4dffa
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-15 10:20:59 -05:00
Eric MacDonald 7941ee5bbb Add new Link Monitor (lmond) daemon to Mtce
This update introduces a new Link Monitor daemon to the Mtce
flock of daemons and disables rmon's interface monitoring.

This new daemon parses the platform.conf file and, using the
interface names assigned to each monitored network (mgmt,
infra and oam), queries the kernel for their physical,
bonded and vlan interface names, then registers to listen
for netlink events.

All link/interface state change (netlink) events that correspond
to any of the interfaces or links associated with the monitored
networks are tracked by this new daemon.
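
For reference, registering for kernel link (netlink) events looks
roughly like the sketch below; lmond's real implementation also
resolves bond/vlan members and serves HTTP queries, which are omitted
here:

  #include <cstdio>
  #include <cstring>
  #include <sys/socket.h>
  #include <linux/netlink.h>
  #include <linux/rtnetlink.h>
  #include <unistd.h>

  int main()
  {
      int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
      if (fd < 0) { perror("socket"); return 1; }

      struct sockaddr_nl addr;
      memset(&addr, 0, sizeof(addr));
      addr.nl_family = AF_NETLINK;
      addr.nl_groups = RTMGRP_LINK;     /* subscribe to link up/down events */

      if (bind(fd, (struct sockaddr*)&addr, sizeof(addr)) < 0)
      { perror("bind"); close(fd); return 1; }

      char buf[8192];
      ssize_t len = recv(fd, buf, sizeof(buf), 0);   /* blocks until a link event */
      printf("received %zd bytes of RTM_NEWLINK/RTM_DELLINK data\n", len);
      close(fd);
      return 0;
  }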

This new daemon then also implements an http listener for
localhost initiated GET requests targeted to /mtce/lmond
on port 2122 and responds with a json link_info string that
contains a summary of monitored networks, links and their
current Up/Down status.

lmond behavioral summary:
  1. learn interface/port model,
  2. load initial link status for learned links,
  3. listen for link status change events,
  4. provide link status info to http GET Query requests.

Another update, to stx-integ, implements the collectd interface
plugin that periodically issues these Link Status GET requests
for the purpose of alarming port and interface Down conditions,
clearing alarms on Up state changes, and storing sample data
that represents the percentage of active links for each monitored
network.

Test Plan:

PASS: Verify lmond process startup
PASS: Verify lmond logging and log rotation
PASS: Verify lmond process monitoring by pmon
PASS: Verify lmond interface learning on process startup
PASS: Verify lmond port learning on process startup
PASS: Verify lmond handling of vlan and bond interface types
PASS: Verify lmond http link info GET Query handling
PASS: Verify lmond has no memory leak during normal and eventful operation

Change-Id: I58915644e60f31e3a12c3b451399c4f76ec2ea37
Story: 2002823
Task: 28635
Depends-On:
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-01 14:57:40 -05:00
Eric MacDonald f7031cf5fb Add NTP server monitoring as a collectd plugin
This update disables rmon NTP monitoring, which is now done
as a collectd plugin delivered by the Depends-On update below.

Story: 2002823
Task: 22859

Depends-On: https://review.openstack.org/#/c/628685/
Change-Id: I736703542c8a6ba3dd9e9db2d6fb7ccbdc906643
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-11 09:15:58 -05:00
Eric MacDonald 4e132af308 Mtce: fix hbsClient active monitoring over config reload
The maintenance process monitor (pmond) is failing the hbsClient
process over config or process reload operations.

The issue relates to the hbsClient's subfunction being
'last-config' while pmon does not properly gate the active
monitoring FSM from starting until the passive monitoring
phase is complete and in the MANAGE state.
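
A sketch of the gating condition; the phase enum and function are
simplified stand-ins for pmond's FSM, not its actual states:

  enum passive_phase { START, CONFIG, MONITOR_WAIT, MANAGE };

  struct monitored_process
  {
      passive_phase phase            = START;
      bool          active_monitored = true;   /* e.g. hbsClient */
  };

  /* active monitoring may only start once passive monitoring has
   * completed and the process has reached the MANAGE state */
  bool active_monitor_allowed(const monitored_process& p)
  {
      return p.active_monitored && (p.phase == MANAGE);
  }

  int main()
  {
      monitored_process hbsClient;
      hbsClient.phase = CONFIG;
      bool gated = !active_monitor_allowed(hbsClient);  /* true: still gated */
      return gated ? 0 : 1;
  }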

Test Plan

PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure

PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.

Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-12 13:53:18 -05:00
Eric MacDonald 9d7a4bf92c Implement Active-Active Heartbeat as HA Improvement Fix
A few small issues were found during integration testing with SM.

This update delivers those integration tested fixes.

1. Send cluster event to SM only after the first 10 heartbeat
   pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
   declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full.
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.

These fixes were both UT and IT tested in multiple labs.

Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-10 09:57:34 -05:00
Eric MacDonald 0b922227ac Implement Active-Active Heartbeat as HA Improvement
This update introduces mtce changes to support Active-Active Heartbeating.

The purpose of Active-Active Heartbeating is to help avoid Split-Brain.

Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache for each network of every monitored
host, as well as the ongoing health of storage-0 if provisioned and
enabled.

This is referred to as the 'heartbeat cluster history'.

Each controller then includes its cluster history in each heartbeat
pulse request message.

The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.

So when the hbsClient receives a pulse request from controller-0 it
saves the reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request, i.e. controller-1's view of the system.

Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. The same applies to controller-1 with controller-0's
history.
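
The data shape can be sketched roughly as follows; field names and
sizes are assumptions drawn from the description above, not the actual
hbs cluster types:

  #include <array>
  #include <map>
  #include <string>

  constexpr int HISTORY_SECS    = 5;   /* last 5 seconds of pulse results */
  constexpr int NUM_NETWORKS    = 2;   /* e.g. mgmt and infra             */
  constexpr int NUM_CONTROLLERS = 2;

  /* one controller's view of one host: per-network ring of recent
   * pulse response results */
  struct host_history
  {
      std::array<std::array<bool, HISTORY_SECS>, NUM_NETWORKS> responded {};
  };

  /* cluster history: for every monitored host, the view from each
   * controller ; each controller includes its own view in pulse
   * requests and the hbsClient criss-crosses the peer's view back */
  using cluster_history =
      std::map<std::string, std::array<host_history, NUM_CONTROLLERS>>;

  int main()
  {
      cluster_history cluster;
      cluster["compute-0"][0].responded[0] = { true, true, true, true, true };
      cluster["compute-0"][1].responded[0] = { true, true, false, true, true };
      return 0;
  }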

The hbsAgent is then further enhanced to support a query request
for this information.

So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.

This involved removing the hbsAgent process from SM control and
monitoring, and adding a new hbsAgent LSB init script for process
launch, a service file to run the init script, and a pmon config
file for hbsAgent process monitoring.

With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.

The hbsAgent running on the inactive controller:
 - does not send heartbeat events to maintenance
 - does not raise or clear alarms or produce customer logs

Test Plan:

Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms

Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog  logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump

Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push

Story: 2003576
Task: 24907

Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-20 19:57:18 +00:00
Eric MacDonald 66ba248389 Mtce: Increase swact receive retry delay
Maintenance is seen to intermittently fail Swact requests when
it fails to get a response from SM 500 msecs after having issued
the request successfully.

A recent instrumentation update verified that the http request was
being issued properly even in the failure cases.

It seems the 500 msec timeout might not be long enough to account
for SM's scheduling and handling.

This update increases the receive retry delay from 50 msec to 1 second.

Change-Id: I29d6ba03094843a2af9d8720dd074572d76a31a4
Related-Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-10-02 19:04:19 +00:00
Jim Gauld 6a5e10492c Decouple Guest-server/agent from stx-metal
This decouples the build and packaging of guest-server and guest-agent
from mtce, by splitting the guest component into the stx-nfv repo.

This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.

Makefiles were modified to include the devel header directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.

The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
  service-mgmt, sm, and sm-api

mtce-common:
- contains common and daemon shared source utility code

mtce-common-dev:
- based on mtce-common, contains devel package required to build
  mtce-guest and mtce
- contains common library archives and headers

mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
  maintenance, mtclog, pmon, public, rmon

mtce-guest:
- contains guest component guest-server, guest-agent

Story: 2002829
Task: 22748

Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2018-09-18 17:15:08 -04:00