Commit Graph

5 Commits

Author SHA1 Message Date
Eric MacDonald aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process which
then segfaults for a reason similar

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Eric MacDonald c4b8171ddd Refactor BMC provisioning in Maintenance
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.

This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in already existing mtce_info database field.

The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.

This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. Towards that this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it as the system level.

A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered
bm_type always comes in as 'bmc' which get interpreted as 'dynamic'
to maintain existing configuration.

The following additional issues were also fixed in this update.

1. The nodeTimers module defaults the 'ring' member of timers that are
   not running to false but should be true.

2. Added a pingUtil_restart function to facilitate quicker sensor
   monitoring following provisioning changes and bmc access failures.

3. Enhanced the hardware monitor sensor grouping filter to accommodate
   non-standard Redfish readout labelling so that more sensors fall
   into the existing canned groups ; leads to more monitored sensors.

4. Added a 'http security mode' to hardware monitor messaging. This
   defaults to https as that is all that is supported by the Redfish
   implementation today. This field can be used to specify non-secure
   'http' mode in the future when that gets implemented.

5. Ensure the hardware monitor performs a bmc password re-fetch on every
   provisioning change.

Test Plan:

PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned

Hardware Monitor:

PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change change
PASS: Verify sensor model not relearned username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)

Regression:

PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses

PEND: Verify System Install

Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-12-09 09:39:49 -05:00
Eric MacDonald 0d63a16d8d Improve BMC password first fetch handling in hwmon
Trying to get the BMC password through barbican before
the ping succeeds leads to an early bmc access lost
failure that
 1. produces a misleading bmc access lost failure log ;
    bmc access had not even been established yet.
 2. imposes as retry wait that delays re-establishing
    bmc access and therefore overall sensor monitoring.

This update also

  1. adds hostname to some of the secretUtil  API
     interfaces so that logs ar reported against the
     correct host rather than always the current
     controller hostname.

   2. Changes some success path logging to dlogs to
      reduce log noise.

   3. simplifies a ping ok log

Change-Id: Ib3b7de212294d6dc350ee17d363f4009b3b0dcb0
Story: 2005861
Task: 36595
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-17 18:57:08 +00:00
Alex Kozyrev aeb2c1f20a Fix for MTCE race condition in BMC secret handling
There is intermittent issue in getting BMC password in MTCE.
The process of obtaining a secret from Barbican stops after
a secret reference is received. No attempts to retrieve the
actual payload is atempted. This happens when the secret
reference reply is received right after BMC queries are
initiated. It was fine before when we had an one-stage
process of getting a password from keyring. We cannot
allow it now because of a two-stage Barbican process.

Change-Id: I381f69ab6a1a54118b22dd31feefcd93698120ad
Closes-bug: 1818284
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-04-05 11:10:13 -04:00
Alex Kozyrev 506ef3fd7f MTCE: reading BMC passwords from Barbican secret storage.
Use Openstack Barbican API to retrieve BMC passwords stored by SysInv.
See SysInv commit for details on how to write password to Barbican.
MTCE is going to find corresponding secret by host uuid and retrieve
secret payload associated with it. mtcSecretApi_get is used to find
secret reference, based on a hostname. mtcSecretApi_read is used to
read a password using the reference found on a prevoius step.
Also, did a little cleanup and removed old unused token handling code.

Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-02-14 09:04:46 -05:00