The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.
With that issue fixed, the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process, which
then segfaults for a similar reason.
In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.
This update:
1. Prevents the segfault case by properly managing acquired
json-c object releases. There was one in the mtcAgent and
another in the hardware monitor (hwmond).
The json_object_put object release api should only be called
against objects that were created with very specific apis.
See new comments in the code.
2. Avoids the log flooding error case by performing a password size
   check, rather than assuming the password is valid, following the
   secret payload receive stage.
3. Simplifies the secret fsm and error and retry handling.
4. Deletes useless creation and release of a few unused json
objects in the common jsonUtil and hwmonJson modules.
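The ownership rule behind fix 1 can be sketched with a toy refcount in
place of json-c; the names (jobj_new, jobj_borrow, jobj_put) are
illustrative, not json-c's. The rule: only "put" objects a creator api
handed you (json_tokener_parse, json_object_new_*); never "put" a
borrowed reference such as one from json_object_object_get_ex().

```cpp
#include <cassert>

// Toy model of json-c style reference counting. Illustrative only;
// this is not the json-c implementation.
struct JObj {
    int refcount;
};

// Creator api: the caller owns the result and must "put" it exactly once.
inline JObj* jobj_new() { return new JObj{1}; }

// Accessor api: returns a borrowed reference; the parent still owns it,
// so the caller must NOT "put" it.
inline JObj* jobj_borrow(JObj* parent) { return parent; }

// Release one owned reference; frees (and returns true) at zero.
inline bool jobj_put(JObj* o) {
    if (--o->refcount == 0) { delete o; return true; }
    return false;
}
```

In this model, "putting" a borrowed reference as well as its owner drives
the count past zero and frees the object twice: the double-release class
of bug that segfaulted mtcAgent and hwmond.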
Note: This update temporarily disables sensor and sensorgroup
suppression support for the debian hardware monitor while
a suppression type fix in sysinv is being investigated.
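The size check in item 2 can be sketched as follows; the function name
and the size bound are assumptions, not mtce's actual values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical bound on a plausible bmc password payload.
const std::size_t BMC_PASSWORD_MAX_SIZE = 64;

// Validate the received secret payload before using it, instead of
// assuming it is a valid password. An empty or oversized payload is
// treated as a fetch failure to be retried, rather than a value whose
// every failed use floods the log file.
inline bool bmc_password_usable(const std::string& payload) {
    return !payload.empty() && payload.size() <= BMC_PASSWORD_MAX_SIZE;
}
```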
Test Plan:
PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify the entire jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
alarming and relearning. This test requires suppress
disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish, and switching between them with
      just a bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
failure and success path handling with no process
failures
Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.
Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.
This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for AIO SX system type where
there is no controller-1 provisioned; no send, no error, no flood.
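The added guard can be sketched as below; the enum and function names
are illustrative, not mtce's actual identifiers.

```cpp
#include <cassert>

enum system_type { SYSTEM_TYPE_AIO_SX, SYSTEM_TYPE_AIO_DX, SYSTEM_TYPE_STANDARD };
enum controller  { CONTROLLER_0 = 0, CONTROLLER_1 = 1 };

// Returns true when an mtcAlive send should be attempted. On AIO SX
// there is no controller-1, so its uninitialized cluster socket is
// never touched: no send, no error, no reinit loop, no log flood.
inline bool should_send_mtcAlive(system_type type, controller target) {
    if (type == SYSTEM_TYPE_AIO_SX && target == CONTROLLER_1)
        return false;
    return true; // real code would proceed to the socket send
}
```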
Since this update needed to add a system type check, it also
renames the system type definition from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable.
Test Plan:
PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)
Regression:
PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity
Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update is the mtc hbsAgent side of a new
SM -> hbsAgent heartbeat algorithm for the
purpose of detecting peer SM process stalls.
This update adds an 'SM Heartbeat status' bit to
the cluster view the hbsAgent injects into its
multicast heartbeat requests.
Its peer is able to read this on-going hbsAgent/SM
heartbeat status through the cluster.
The status bit reads 'ok' while the hbsAgent sees
the SM heartbeat as steady.
The status bit reads 'not ok' while the SM heartbeat
is lost for longer than 800 msecs.
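One plausible shape for the hbsAgent-side tracking, assuming a
millisecond timestamp interface; the class and member names are
illustrative, not mtce's.

```cpp
#include <cassert>

struct sm_heartbeat_monitor {
    static const long LOSS_THRESHOLD_MS = 800;
    long last_beat_ms = 0;

    // Called each time an SM heartbeat is received.
    void beat(long now_ms) { last_beat_ms = now_ms; }

    // The 'SM Heartbeat status' bit injected into the cluster view:
    // 'ok' until the heartbeat has been lost for more than 800 msecs.
    bool status_ok(long now_ms) const {
        return (now_ms - last_beat_ms) <= LOSS_THRESHOLD_MS;
    }
};
```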
Change-Id: I0f2079b0fafd7bce0b97ee26d29899659d66f81d
Partial-Fix: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Maintenance's success path messaging does not depend on cluster
network messaging. However, there are a number of failure mode
cases that do depend on cluster network messaging to properly
diagnose and offer higher availability handling for some
failure cases.
For instance, when the management interface goes down, without cluster
network messaging remote hosts can be isolated. Being able to command-
reboot a host over cluster-host network offers higher availability.
Maintenance is designed to use the cluster network, if provisioned, as a
backup path for mtcAlive, node locked, reboot and several other commands
and acknowledgements.
Unfortunately, it was recently observed that maintenance is using
the 'nfs-controller' label to resolve cluster network addressing
which resolves to management network IPs. As a result all messages
intended to be going over the cluster-host network are instead just
redundant management network messages.
During debug of this issue several additional cluster network
messaging related issues were observed and fixed.
This update implements the following fixes:
1. since there is no floating address for the cluster network, the
mtcClient was modified to send messages to both controllers;
only the active controller will be listening and acting.
2. fixes the port number mtce listens on for cluster-host network
messages.
3. fixes the port number mtce sends cluster-host network messages to.
4. mtcAlive messages are also sent on provisioned cluster network.
5. locked state notifications and acks sent on provisioned cluster network.
6. reboot request and acks sent on provisioned cluster network.
7. fixes command acknowledgement messaging.
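Fix 1 can be sketched as follows; the hostnames and function names are
hypothetical, standing in for the real mtcClient send path.

```cpp
#include <cassert>
#include <string>
#include <vector>

// With no floating cluster address, cluster-host messages are addressed
// to both controllers; only the active controller listens and acts.
inline std::vector<std::string> cluster_host_targets() {
    return { "controller-0-cluster-host", "controller-1-cluster-host" };
}

// Hypothetical send wrapper: one message is emitted per target;
// returns the number of sends attempted.
inline int send_to_cluster(const std::string& /*msg*/) {
    int sends = 0;
    for (const std::string& target : cluster_host_targets()) {
        (void)target; // real code would sendto() this target's address
        ++sends;
    }
    return sends;
}
```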
This update also:
1. envelopes the mtcAlive gate control to allow debug tracing of all gate
state changes.
2. moves the graceful recovery handler's heartbeat failure state clear
to the end of the recovery handler, just before heartbeat start.
3. adds sm unhealthy support to fail and automatically recover the
inactive controller from an SM UNHEALTHY state.
----------
Test Plan:
----------
Functional:
PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization
System:
PASS: Verify compute system install
PASS: Verify AIO system install
Feature:
PASS: Verify sm node unhealthy handling (active:ignore, inactive:recover)
Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Updated to read the host cluster-host parameter in the /etc/hosts
file.
Replaced references to the infra network with the cluster-host network.
Story: 2004273
Task: 29473
Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
All rmon resource monitoring has been moved to collectd.
This update removes rmon from mtce and the load.
Story: 2002823
Task: 30045
Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d
Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Use Openstack Barbican API to retrieve BMC passwords stored by SysInv.
See SysInv commit for details on how to write password to Barbican.
MTCE finds the corresponding secret by host uuid and retrieves
the secret payload associated with it. mtcSecretApi_get is used
to find the secret reference, based on a hostname. mtcSecretApi_read
is used to read a password using the reference found in the
previous step.
This update also did a little cleanup and removed old unused token
handling code.
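The two-step fetch described above can be sketched as below, with
in-memory maps standing in for the Barbican REST calls; the reference
and payload values are illustrative stubs, not the real mtcSecretApi
implementations.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-ins for Barbican: host -> secret href, href -> password payload.
static std::map<std::string, std::string> g_secret_ref;
static std::map<std::string, std::string> g_payload;

// Step 1 (mtcSecretApi_get): find the secret reference for a host.
inline std::string secret_get_ref(const std::string& host) {
    auto it = g_secret_ref.find(host);
    return (it == g_secret_ref.end()) ? std::string() : it->second;
}

// Step 2 (mtcSecretApi_read): read the payload behind that reference.
inline std::string secret_read(const std::string& ref) {
    auto it = g_payload.find(ref);
    return (it == g_payload.end()) ? std::string() : it->second;
}
```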
Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view that
is reflective of each controller's connectivity to each host
including its peer controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.
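One plausible, compilable shape for the per-network history, assuming
ring-buffer bookkeeping over the 20 heartbeat periods; the field names
and sizes are assumptions, not the actual structure shared with SM.

```cpp
#include <cassert>
#include <cstdint>

const int HBS_HISTORY_PERIODS = 20; // last 20 heartbeat periods

struct network_history {
    uint16_t hosts_monitored;                       // hosts in this network
    uint16_t hosts_responding[HBS_HISTORY_PERIODS]; // one count per period
    int      oldest;                                // ring-buffer index
};

// Record the newest period's response count, overwriting the oldest.
inline void history_push(network_history& h, uint16_t responding) {
    h.hosts_responding[h.oldest] = responding;
    h.oldest = (h.oldest + 1) % HBS_HISTORY_PERIODS;
}
```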
General flow of cluster feature wrt hbsAgent:
hbs_cluster_init: general data init
hbs_cluster_nums: set controller and network numbers
forever:
select:
hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
hbs_sm_handler -> hbs_cluster_send: - send cluster to SM
heartbeating:
hbs_cluster_append: add controller cluster to pulse request
hbs_cluster_update: get controller cluster data from pulse responses
hbs_cluster_save: save other controller cluster view in cluster vault
hbs_cluster_log: log cluster state changes (clog)
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>