9 Commits

Author SHA1 Message Date
Eric MacDonald 9bf231a286 Fix BMC access loss handling
Recent refactoring of the BMC handler FSM introduced a code change that
prevents the BMC Access alarm from being raised after initial BMC
accessibility was established and is then lost.

This update ensures BMC access alarm management is working properly.

This update also implements ping failure debounce so that a single ping
failure does not trigger full reconnection handling. That now requires
3 ping failures in a row. This has the effect of adding a minute to ping
failure action handling before the usual 2 minute BMC access failure
alarm is raised. Ping failure logging is also reduced and improved.
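
The debounce described above can be illustrated with a minimal sketch.
The PingDebounce type and its members are hypothetical names chosen for
illustration and are not taken from the maintenance code; only the
threshold of 3 consecutive failures comes from this commit message.

```cpp
#include <cassert>

// Hypothetical helper (not from mtce): counts consecutive ping failures
// and reports when the configured threshold has been reached.
struct PingDebounce
{
    int threshold;    // consecutive failures required before acting
    int failures = 0; // current run of consecutive failures

    explicit PingDebounce(int t) : threshold(t) { assert(t > 0); }

    // Record one ping result. Returns true once the run of consecutive
    // failures has reached the threshold; any success resets the run.
    bool record(bool ping_ok)
    {
        if (ping_ok)
        {
            failures = 0;
            return false;
        }
        ++failures;
        return failures >= threshold;
    }
};
```

With a threshold of 3, the first two failed pings are absorbed and only
the third in a row would start reconnection handling; a single
successful ping in between starts the count over.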

Test Plan: for both hwmond and mtcAgent

PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
PASS: Verify BMC ping failure debounce handling, recovery and logging
PASS: Verify BMC ping persistent failure handling
PASS: Verify BMC ping periodic miss handling
PASS: Verify BMC ping and access failure recovery timing
PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

Regression:

PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
PASS: Verify BMC power-off request handling with BMC ping failing & recovering
PASS: Verify BMC power-on request handling with BMC ping failing & recovering
PASS: Verify BMC reset request handling with BMC ping failing & recovering
PASS: Verify BMC sensor group read failure handling & recovery
PASS: Verify sensor monitoring after ping failure handling & recovery

Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
Closes-Bug: 1858110
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-01-03 09:34:37 -05:00
Eric MacDonald c4b8171ddd Refactor BMC provisioning in Maintenance
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.

This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in the already existing mtce_info database field.

The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.

This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. To that end, this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it at the system level.

A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered,
bm_type always comes in as 'bmc', which gets interpreted as 'dynamic'
to maintain existing configuration.
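
The bm_type interpretation described above can be sketched as a small
mapping function. The enum and function names are hypothetical and not
taken from the maintenance code. Treating any unrecognized value as
'dynamic' is an assumption of this sketch; the commit message only
states that 'bmc' is interpreted as 'dynamic'.

```cpp
#include <string>

// Hypothetical selection type (not from mtce).
enum class BmcProtocolSelect { Dynamic, Redfish, Ipmi };

// Map a provisioned bm_type string to a protocol selection.
// 'redfish' and 'ipmi' force that protocol; 'bmc' and 'dynamic' mean
// the protocol is learned. Any other value also falls back to Dynamic,
// which is an assumption made for this sketch.
BmcProtocolSelect select_from_bm_type(const std::string& bm_type)
{
    if (bm_type == "redfish")
        return BmcProtocolSelect::Redfish;
    if (bm_type == "ipmi")
        return BmcProtocolSelect::Ipmi;
    return BmcProtocolSelect::Dynamic;
}
```

Under this mapping a host provisioned with bm_type 'bmc' behaves the
same as one provisioned with 'dynamic', which matches the stated goal
of maintaining existing configuration.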

The following additional issues were also fixed in this update.

1. The nodeTimers module defaults the 'ring' member of timers that are
   not running to false but should be true.

2. Added a pingUtil_restart function to facilitate quicker sensor
   monitoring following provisioning changes and bmc access failures.

3. Enhanced the hardware monitor sensor grouping filter to accommodate
   non-standard Redfish readout labelling so that more sensors fall
   into the existing canned groups, which leads to more monitored sensors.

4. Added a 'http security mode' to hardware monitor messaging. This
   defaults to https as that is all that is supported by the Redfish
   implementation today. This field can be used to specify non-secure
   'http' mode in the future when that gets implemented.

5. Ensure the hardware monitor performs a bmc password re-fetch on every
   provisioning change.

Test Plan:

PASS: Verify bmc access protocol stored/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned

Hardware Monitor:

PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change
PASS: Verify sensor model not relearned over username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)

Regression:

PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses

PEND: Verify System Install

Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-12-09 09:39:49 -05:00
Eric MacDonald 4c541f50d4 Maintenance Redfish support usability enhancements.
This update is a result of changes made during a suite of
end-to-end provisioning, reprovisioning and deprovisioning
customer experience testing of the maintenance Redfish support
feature.

1. Force reconnection and password fetch on provisioning changes
2. Force reconnection and password fetch on persistent connection failures
3. Fix redfish protocol learning (string compare) in hardware monitor
4. Improve logging for some typical error paths.

Test Plan:

PASS: Verify handling of reprovisioning BMC between hosts that support
             different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
             different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
             the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
             has been created using that protocol.
PASS: Verify handling of BMC reprovision - ip address change only
PASS: Verify handling of BMC reprovision - username change only
FAIL: Verify handling of BMC reprovision - password change only
             https://bugs.launchpad.net/starlingx/+bug/1846418

Change-Id: I4bf52a5dc3c97d7794ff623c881dff7886234e79
Closes-Bug: #1846212
Story: 2005861
Task: 36606
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-03 11:57:58 -04:00
Eric MacDonald 0d63a16d8d Improve BMC password first fetch handling in hwmon
Trying to get the BMC password through barbican before
the ping succeeds leads to an early bmc access lost
failure that
 1. produces a misleading bmc access lost failure log ;
    bmc access had not even been established yet.
 2. imposes a retry wait that delays re-establishing
    bmc access and therefore overall sensor monitoring.

This update also

 1. adds hostname to some of the secretUtil API
    interfaces so that logs are reported against the
    correct host rather than always the current
    controller hostname.

 2. changes some success path logging to dlogs to
    reduce log noise.

 3. simplifies a ping ok log.

Change-Id: Ib3b7de212294d6dc350ee17d363f4009b3b0dcb0
Story: 2005861
Task: 36595
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-17 18:57:08 +00:00
zhipengl 67d4ba105f Redfish support for Sensor Monitoring in hwmond
Add redfish hwmon thread function and related parse function
for Power and Thermal sensor data.
Removed some unused old functions.
Rename common function or variable with bmc prefix

Test done for this patch on simplex bare metal setup.
system host-sensor-list
system host-sensor-show
system host-sensorgroup-list
system host-sensorgroup-show
system host-sensorgroup-relearn

Story: 2005861
Task: 35815

Depends-on: https://review.opendev.org/#/c/671340
Change-Id: If8a35581d44df15749a049eda945f23d2323fd35
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2019-09-12 01:56:42 +08:00
Eric MacDonald 804ec52227 Add redfish support detection to maintenance
This update

1. Refactors some of the common maintenance ipmi
   definitions and utilities into a more generic
   'bmcUtil' module to reduce code duplication and
   improve code reuse with the introduction of a second
   bmc communication protocol ; redfish.

2. Creates a new 'redFishUtil' module similar to the existing
   'ipmiUtil' module but in support of common redfish
   utilities and definitions that can be used by both
   maintenance and the hardware monitor.

3. Moves the existing 'mtcIpmiUtil' module to a more common
   'mtcBmcUtil' and renames the 'ipmi_command_send/recv' to
   the more generic 'bmc_command_send/recv' which are enhanced
   to support both ipmi and redfish bmc communication methods.

4. Renames the bmc info collection and connection monitor ;
   'bm_handler' to 'bmc_handler' and adds support necessary
   to learn if a host's bmc supports redfish.

5. Renames the existing 'mtcThread_ipmitool' to a more common
   'mtcThread_bmc' and adds redfishtool support for the now common
   set of bmc thread commands, along with the new
   redfishtool bmc query, aka 'redfish root query', used to
   detect if a host's bmc supports redfish.

   Note: This aspect is the primary feature of this update.

         Namely the ability to detect and print a log indicating
         if a host's bmc supports redfish.

Test Plan:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reinstall (netboot) command handling.
PASS: Verify logging when redfish is not supported.
PASS: Verify logging when redfish is supported.
PASS: Verify ipmitool is used regardless of redfish support.
PASS: Verify mtce thread error handling for both protocols.

Change-Id: I72e63958f61d10f5c0d4a93a49a7f39bdd53a76f
Story: 2005861
Task: 35825
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-08-19 14:03:37 +00:00
Alex Kozyrev aeb2c1f20a Fix for MTCE race condition in BMC secret handling
There is an intermittent issue in getting the BMC password in MTCE.
The process of obtaining a secret from Barbican stops after
a secret reference is received. No attempt is made to retrieve
the actual payload. This happens when the secret
reference reply is received right after BMC queries are
initiated. It was fine before when we had a one-stage
process of getting a password from keyring. We cannot
allow it now because of the two-stage Barbican process.

Change-Id: I381f69ab6a1a54118b22dd31feefcd93698120ad
Closes-bug: 1818284
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-04-05 11:10:13 -04:00
Alex Kozyrev 506ef3fd7f MTCE: reading BMC passwords from Barbican secret storage.
Use the OpenStack Barbican API to retrieve BMC passwords stored by SysInv.
See SysInv commit for details on how to write password to Barbican.
MTCE finds the corresponding secret by host uuid and retrieves the
secret payload associated with it. mtcSecretApi_get is used to find
the secret reference, based on a hostname. mtcSecretApi_read is used to
read a password using the reference found in the previous step.
Also, did a little cleanup and removed old unused token handling code.
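
The two-stage lookup described above (find the secret reference, then
read the payload) can be sketched as a small state tracker. All names
below are hypothetical and do not reflect the actual signatures of
mtcSecretApi_get or mtcSecretApi_read, which are not shown in this
commit message.

```cpp
#include <string>

// Hypothetical stages of a two-stage secret fetch (not from mtce).
enum class SecretStage { WaitReference, WaitPayload, Ready };

// Tracks progress of one host's password fetch.
struct SecretFetch
{
    SecretStage stage = SecretStage::WaitReference;
    std::string reference; // stage 1 result: where the secret lives
    std::string payload;   // stage 2 result: the password itself

    // Stage 1 reply: remember the reference and move on to stage 2.
    void on_reference(const std::string& ref)
    {
        reference = ref;
        stage = SecretStage::WaitPayload;
    }

    // Stage 2 reply: only accepted once a reference has been received.
    void on_payload(const std::string& data)
    {
        if (stage == SecretStage::WaitPayload)
        {
            payload = data;
            stage = SecretStage::Ready;
        }
    }

    // The password is usable only after both stages have completed.
    bool ready() const { return stage == SecretStage::Ready; }
};
```

A caller following this sketch would hold off on anything that needs
the password until ready() returns true, rather than treating receipt
of the reference as completion.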

Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-02-14 09:04:46 -05:00
Jim Gauld 6a5e10492c Decouple Guest-server/agent from stx-metal
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.

This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.

Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.

The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
  service-mgmt, sm, and sm-api

mtce-common:
- contains common and daemon shared source utility code

mtce-common-dev:
- based on mtce-common, contains devel package required to build
  mtce-guest and mtce
- contains common library archives and headers

mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
  maintenance, mtclog, pmon, public, rmon

mtce-guest:
- contains guest component guest-server, guest-agent

Story: 2002829
Task: 22748

Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2018-09-18 17:15:08 -04:00