The introduction of the new pxeboot network requires that maintenance
verify and report on messaging failures over that network.
Towards that end, this update introduces periodic mtcAlive messaging
between the mtcAgent and mtcClient.
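For illustration, a minimal sketch of the periodic send side, assuming
a UDP datagram socket ; the port, address and message format shown are
placeholders, not the actual mtce implementation:

    // Hypothetical sketch only ; not the mtce source.
    // Periodically send an mtcAlive datagram over the pxeboot network.
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    #define MTCALIVE_PERIOD_SECS 5 /* assumed period */

    int main ()
    {
        int sock = socket ( AF_INET, SOCK_DGRAM, 0 );
        struct sockaddr_in agent ;
        memset ( &agent, 0, sizeof(agent) );
        agent.sin_family = AF_INET ;
        agent.sin_port = htons(2101) ; /* assumed port */
        inet_pton ( AF_INET, "169.254.202.2", &agent.sin_addr ); /* assumed pxeboot ip */
        unsigned int seq = 0 ;
        for ( ;; )
        {
            char msg [128] ;
            snprintf ( msg, sizeof(msg), "{\"mtcAlive\":%u}", ++seq );
            sendto ( sock, msg, strlen(msg), 0,
                     (struct sockaddr*)&agent, sizeof(agent) );
            sleep ( MTCALIVE_PERIOD_SECS );
        }
    }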
Test Plan:
PASS: Verify install and provision each system type with a mix
of networking modes ; ethernet, bond and vlan
- AIO SX, AIO DX, AIO DX plus
- Standard System 2+1
- Storage System 2+1+1
PASS: Verify feature with physical on management interface
PASS: Verify feature with vlan on management interface
PASS: Verify feature with bonded management interface
PASS: Verify feature with bonded vlans on management interface
PASS: Verify handling in bonded cases with 2, 1 or no slaves found
PASS: Verify mgmt-combined or separate cluster-host network
PASS: Verify mtcClient pxeboot interface address learning
- for worker and storage nodes ; dhcp leases file
- for controller nodes before unlock ; dhcp leases file
- for controller nodes after unlock ; static from ifcfg
- from controller within 10 seconds of process restart
PASS: Verify mtcAgent pxeboot interface address learning from
dnsmasq.hosts file
PASS: Verify pxeboot mtcAlive initiation, handling, loss detection
and recovery
PASS: Verify success and failure handling of all new pxeboot ip
address learning functions ;
- dhcp - all system node installs.
- dnsmasq.hosts - active controller for all hosts.
- interfaces.d - controller's mtcClient pxeboot address.
- pxeboot req mtcAlive - mtcAgent mtcAlive request message.
PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot'
command handling for ethernet, vlan and bond configs.
PASS: Verify mtcAlive sequence number monitoring, out-of-sequence
detection, handling and logging ; see the sketch after this list.
PASS: Verify pxeboot rx socket binding and non-blocking attribute
PASS: Verify mtcAgent stress soak handling of sustained incoming
500+ msgs/sec ; batch handling and logging.
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging,
failure recovery handling and logging.
PASS: Verify pxeboot receiver is not setup on the oam interface on
controller-0 first install until after initial config
complete.
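As a reference for the out-of-sequence case above, a hedged sketch of
per-host sequence tracking ; the container and logging are
illustrative, not the mtcAgent's actual bookkeeping:

    // Hypothetical sketch ; not the mtce source.
    // Track the last mtcAlive sequence number seen per host and log
    // any jump other than +1 as out-of-sequence.
    #include <cstdio>
    #include <map>
    #include <string>

    static std::map<std::string, unsigned int> last_seq ;

    void check_sequence ( const std::string & host, unsigned int seq )
    {
        auto it = last_seq.find ( host );
        if (( it != last_seq.end() ) && ( seq != it->second + 1 ))
        {
            printf ( "%s mtcAlive out-of-sequence ; expected %u got %u\n",
                     host.c_str(), it->second + 1, seq );
        }
        last_seq [ host ] = seq ;
    }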
Regression:
PASS: Verify mtcAgent/mtcClient online and offline state management
PASS: Verify mtcAgent/mtcClient command handling
- over management network
- over cluster-host network
PASS: Verify mtcClient interface chain log for all iface types
- bond : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9
- vlan : vlan123 -> enp0s8
- ethernet: enp0s8
PASS: Verify mtcAgent/mtcClient handling and logging including debug
logging for standard operations
- node install and unlock
- node lock and unlock
- node reinstall, reboot, reset
PASS: Verify graceful recovery handling of heartbeat loss failure.
- node reboot
- management interface down
PASS: Verify systemcontroller and subcloud install with dc-libvirt
PASS: Verify no log flooding, coredumps, memory leaks
Story: 2010940
Task: 49541
Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Maintenance interfaces with sysinv, sm and the vim using http requests.
Request timeouts have an implicit delay between retries. However,
command failures or outright connection failures don't.
This has only become obvious in mtce's communication with the vim,
where a process startup timing change leaves the 'vim' not yet ready
to handle commands when the mtcAgent starts sending them following a
platform services group startup by sm.
This update adds a 10 second http retry wait as a configuration option
to mtc.conf. The mtcAgent loads this value at startup and uses it
in a new HTTP__RETRY_WAIT state of the http request work FSM.
The number of retries remains unchanged. This update is only forcing
a minimum wait time between retries, regardless of cause.
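A minimal sketch of how such a wait state can slot into a retry FSM ;
only HTTP__RETRY_WAIT is named by this update, the other identifiers
are placeholders, and the real value is loaded from mtc.conf:

    // Hypothetical FSM fragment ; not the actual mtcAgent code.
    #include <ctime>

    enum http_fsm_state { HTTP__IDLE, HTTP__REQUEST, HTTP__RETRY_WAIT };

    struct http_request
    {
        http_fsm_state state      = HTTP__IDLE ;
        int            retries    = 0 ;
        time_t         retry_time = 0 ;
    };

    // in the real flow this is read from mtc.conf at process startup
    static int http_retry_wait_secs = 10 ; /* 10 second default */

    void http_fsm ( http_request & req, bool failed )
    {
        switch ( req.state )
        {
            case HTTP__REQUEST:
                if ( failed )
                {
                    // enter the wait state regardless of failure cause
                    req.retry_time = time(nullptr) + http_retry_wait_secs ;
                    req.state = HTTP__RETRY_WAIT ;
                }
                break ;
            case HTTP__RETRY_WAIT:
                // hold the retry until the minimum wait has elapsed
                if ( time(nullptr) >= req.retry_time )
                {
                    req.retries++ ;
                    req.state = HTTP__REQUEST ;
                }
                break ;
            default:
                break ;
        }
    }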
Failure path testing was done using Fault Insertion Testing (FIT).
Test Plan:
PASS: Verify the reported issue is resolved by this update.
PASS: Verify http retry config value load on process startup.
PASS: Verify updated value is used over a process -sighup.
PASS: Verify default value if new mtc.conf config value is not found.
PASS: Verify http connection failure http retry handling.
PASS: Verify http request timeout failure retry handling.
PASS: Verify http request operation failure retry handling.
Regression:
PASS: Build and install ISO - Standard and AIO DX.
PASS: Verify http failures do not fail a lock operation.
PASS: Verify host unlock fails if its http done queue shows failures.
PASS: Verify host swact.
PASS: Verify handling of random and persistent http errors involving
the need for retries.
Closes-Bug: 2047958
Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The luks-fs-mgr service creates and unseals the LUKS volume used to
store keys/secrets. This change handles the failure case where this
essential service is inactive. It introduces an alarm, LUKS_ALARM_ID,
which is raised if the service is inactive, implying an issue in
creating or unsealing the LUKS volume.
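A hedged sketch of the audit logic described above ; the alarm
raise/clear calls are placeholders for the real fm plumbing, and only
the 200.016 alarm id comes from the test plan below:

    // Hypothetical sketch ; alarm calls are placeholders, not the fm API.
    #include <cstdio>
    #include <cstdlib>

    #define LUKS_ALARM_ID "200.016" /* alarm id per the test plan */

    void raise_alarm ( const char * id ) { printf ( "raise %s\n", id ); }
    void clear_alarm ( const char * id ) { printf ( "clear %s\n", id ); }

    bool luks_service_active ( void )
    {
        // 'systemctl is-active' exits 0 only when the unit is active
        return ( system ( "systemctl is-active --quiet luks-fs-mgr" ) == 0 );
    }

    void luks_audit ( void )
    {
        if ( ! luks_service_active() )
            raise_alarm ( LUKS_ALARM_ID ); /* create/unseal issue */
        else
            clear_alarm ( LUKS_ALARM_ID );
    }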
Test Plan:
PASS" build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
volume status active. There should not be any alarm and node
status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
is inactive. Node availability should be failed. A critical
alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on controller-0. Node availability
should be failed. A critical alarm with id 200.016 should be
displayed with 'fm alarm-list'
PASS: Standard - node enable failure on the node (controller-0,
controller-1, storage-0, compute-1). Node availability
should be failed. A critical alarm with id 200.016 should be
displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In-service volume inactive status should be detected
and a critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: AIO-SX In-service volume inactive status should be detected
and a critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: Standard (2 controllers, 1 storage, 1 compute) In-service
volume inactive status should be detected and a
critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: AIO-DX In-service: If volume becomes active and a LUKS alarm
is active, alarm should be cleared. Node availability should
be changed to available.
PASS: AIO-SX In-service: If volume becomes active and a LUKS alarm is
active, alarm should be cleared. Node availability should be
changed to available.
PASS: Standard (2 controllers, 1 storage, 1 compute) In-service:
If volume becomes active and a LUKS alarm is active, alarm
should be cleared. Node availability should be changed to
available.
PASS: AIO-SX, AIO-DX, Standard - If the in-test stage fails and node
availability is 'failed', then after fixing the volume issue a
lock/unlock should make the node available.
Story: 2010872
Task: 49108
Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
This update solves two issues involving bmc reset.
Issue #1: A race condition can occur if the mtcAgent finds an
unlocked-disabled or heartbeat failing node early in
its startup sequence, say over a swact or an SM service
restart and needs to issue a one-time-reset. If at that
point it has not yet established access to the BMC then
the one-time-reset request is skipped.
Issue #2: When the issue #1 race condition does not occur before BMC
access is established the mtcAgent will issue its one-time
reset to a node. If this occurs as a result of a crashdump
then this one-time reset can interrupt the collection of
the vmcore crashdump file.
This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.
The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.
To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.
It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
nearly complete recovery. The number of seconds remaining in the
pending reset countdown is logged periodically.
It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes, which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.
However, if the node is truly isolated or the shutdown sequence
hangs then although the recovery is delayed a bit to accommodate for
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node
recovery-to-online time is longer than the bmc reset delay.
This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.
Some consistency-driven logging improvements were also implemented.
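A hedged sketch of the posted-reset flow described above ; the 300
second default mirrors the description, the structure and names are
illustrative:

    // Hypothetical sketch ; not the mtcAgent's actual node FSM.
    #include <ctime>
    #include <cstdio>

    #define BMC_RESET_DELAY_SECS 300 /* 5 minute default */

    struct node
    {
        bool   reset_posted = false ;
        time_t reset_time   = 0 ;
        bool   online       = false ;
        bool   bmc_access   = false ;
    };

    void failed_node_handler ( node & n )
    {
        if ( ! n.reset_posted )
        {
            // post the reset rather than sending it immediately so an
            // in-progress crashdump is not interrupted
            n.reset_posted = true ;
            n.reset_time   = time(nullptr) + BMC_RESET_DELAY_SECS ;
        }
        else if ( n.online )
        {
            // recovered before the delay expired ; cancel the reset
            n.reset_posted = false ;
            printf ( "posted reset cancelled ; node recovered\n" );
        }
        else if (( time(nullptr) >= n.reset_time ) && ( n.bmc_access ))
        {
            // 'reset send wait' expired ; issue the bmc reset
            n.reset_posted = false ;
            printf ( "issuing bmc reset\n" );
        }
    }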
Test Plan:
PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
reboot before bmc reset delay and still seeing mtcAlive on one
or more links. Handles the cluster-host only heartbeat loss case.
The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown
Failure Mode Cases:
PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host
Regression:
PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability
Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Mtce polls/queries the remote host for mtcAlive messages
for 42 x 100 ms intervals over unlock or host failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.
However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where 7 or 8
second wait time is required to properly declare offline.
To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
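For reference, the window arithmetic behind the threshold change ;
the option names here are illustrative:

    // Offline window = threshold x poll interval
    #define OFFLINE_PERIOD_MSECS   100  /* poll interval           */
    #define OFFLINE_THRESHOLD_OLD   42  /*  42 x 100 ms ~=  5 secs */
    #define OFFLINE_THRESHOLD_NEW   90  /*  90 x 100 ms ~= 10 secs */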
Test Plan:
PASS: Verify the unchallenged failed-to-offline period is ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
failure handling cases.
Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update adds the ability for SM to passively
request the mtcClient to BMC reset its peer controller
as a means to recover a severely loaded active controller.
To do this the mtcAgent is modified to keep the controllers'
mtcClients updated with the BMC info of its peer.
The mtcClient is modified to audit for the SM signal
and then when asserted issue a BMC reset of its peer
controller using ipmitool system call.
The ability to command the peer mtcClient to 'sync'
prior to the BMC reset is implemented but configured
disabled for now.
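A hedged sketch of the mtcClient side of this flow ; the SM signal
file and command plumbing shown are assumptions, not the actual
interface, though the ipmitool invocation itself is standard:

    // Hypothetical sketch ; signal path and plumbing are assumed.
    #include <cstdio>
    #include <cstdlib>
    #include <unistd.h>

    #define SM_PEER_RESET_REQUEST "/var/run/sm/peer_reset_request" /* assumed */

    void sm_peer_reset_audit ( const char * bmc_ip,
                               const char * bmc_user,
                               const char * bmc_pw_file )
    {
        if ( access ( SM_PEER_RESET_REQUEST, F_OK ) == 0 )
        {
            char cmd [256] ;
            // reset the peer controller through its BMC
            snprintf ( cmd, sizeof(cmd),
                       "ipmitool -I lanplus -H %s -U %s -f %s chassis power reset",
                       bmc_ip, bmc_user, bmc_pw_file );
            system ( cmd );
        }
    }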
Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
Partial-Bug: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update is the mtc hbsAgent side of a new
SM -> hbsAgent heartbeat algorithm for the
purpose of detecting peer SM process stalls.
This update adds an 'SM Heartbeat status' bit to
the cluster view it injects into its multicast
heartbeat requests.
Its peer is able to read this on-going hbsAgent/SM
heartbeat status through the cluster.
The status bit reads 'ok' while the hbsAgent sees
the SM heartbeat as steady.
The status bit reads 'not ok' while the SM heartbeat
is lost for longer than 800 msecs.
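A hedged sketch of the status bit logic ; the bit position and names
are assumptions, only the 800 msec loss window comes from the
description above:

    // Hypothetical sketch ; not the hbsAgent's actual cluster encoding.
    #define SM_HEARTBEAT_LOSS_MSECS 800
    #define SM_HEARTBEAT_OK      0x0001 /* assumed bit position */

    struct cluster_view { unsigned short flags = 0 ; };

    // called each time a multicast heartbeat request is built
    void set_sm_heartbeat_status ( cluster_view & view,
                                   long last_sm_beat_ms, long now_ms )
    {
        if (( now_ms - last_sm_beat_ms ) > SM_HEARTBEAT_LOSS_MSECS )
            view.flags &= ~SM_HEARTBEAT_OK ; /* 'not ok' ; SM stalled */
        else
            view.flags |=  SM_HEARTBEAT_OK ; /* 'ok' ; steady beat    */
    }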
Change-Id: I0f2079b0fafd7bce0b97ee26d29899659d66f81d
Partial-Fix: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update lowers the hbsClient process RT priority
from 99 to 45 and enhances its event driven select-based
receive messaging interface with non-blocking sockets.
Update also removes a few stale define and ifdef blocks.
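The two mechanics named above, sketched with standard calls ; the
SCHED_RR policy choice is an assumption:

    // Hedged sketch of the two changes ; policy choice assumed.
    #include <sched.h>
    #include <fcntl.h>

    void lower_rt_priority ( void )
    {
        struct sched_param param = {} ;
        param.sched_priority = 45 ; /* was 99 */
        sched_setscheduler ( 0, SCHED_RR, &param );
    }

    void set_nonblocking ( int sock )
    {
        // select() drives the receive loop ; non-blocking reads let
        // each ready socket be drained without stalling the others
        int flags = fcntl ( sock, F_GETFL, 0 );
        fcntl ( sock, F_SETFL, flags | O_NONBLOCK );
    }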
Change-Id: Ib0e60d67f055a03b537aef5679e3eea1ba5bf0b3
Closes-Bug: 1898582
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update delivers redfish support for Power-On/Off, Reset
and Netboot Reinstall handling to maintenance.
Test Plan: (Testing Continues)
PASS: Verify Redfish Power-Off action handling
PASS: Verify Redfish Power-On action handling
PASS: Verify Redfish Reset action handling
PASS: Verify compute Redfish Reinstall action handling from controller-0
PASS: Verify compute Redfish Reinstall action handling from controller-1
PASS: Verify Redfish Power-Off Action failure handling
PASS: Verify Redfish Power-On action failure handling
PASS: Verify Redfish Reset action failure handling
PASS: Verify Redfish Re-Install action failure handling
PASS: verify Reset progression cycle does not leak memory.
PASS: Verify bmc_handler failure handling does not leak memory.
PASS: Verify Inservice BMC access (ping) failure and recovery handling.
PASS: Verify BMC access failure alarm handling
PASS: Verify BMC provisioning and deprovisioning soak (redfish - wolfpass)
PASS: Verify BMC provisioning and deprovisioning does not leak memory.
PASS: Verify BMC provisioning handling with bad ip and/or bad username
PASS: Verify BMC reprovisioning to same protocol
PASS: Verify BMC reprovisioning from ipmi host to redfish host
PASS: Verify BMC reprovisioning from redfish host to ipmi host
PASS: Verify mixed protocol support in same lab
PASS: Verify mixed server support in same lab
PASS: Verify Large System Install with BMCs provisioned (wp8-12)
PASS: Verify bmc access method (learn,ipmi,redfish) learned from mtc.init
PASS: Verify Swact with BMCs provisioned.
PASS: Verify no segfaults.
PASS: Verify AIO System Install in lab that supports redfish (WC3-6, WP8-12, Dell 720 3-7)
PASS: Verify AIO Simplex Install with Redfish Support (SM1, SM3)
PASS: Verify AIO Duplex Install with Redfish Support (SM 5-6, Dell 720 1-2)
Useability:
PASS: Verify handling of reprovisioning BMC between hosts that support
different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify fault insertion for both protocols and action handling.
PASS: Verify protocol select handover.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
has been created using that protocol.
PASS: Verify handling of missing bmc_access_method configuration select.
PASS: Verify inservice bmc_access_method service parameter modification handling.
Regression:
PASS: Verify redfish BMC info query logging.
PASS: Verify sensor monitoring and alarming still works.
PASS: Verify all power/reset/netboot commands for IPMI
PASS: Verify reprovisioning soak of Wolfpass servers
PASS: Verify reprovisioning soak of SM servers
Depends-on: https://review.opendev.org/#/c/679178/
Change-Id: I984057e04d7426e37d675cf4d334a4e35419f2e8
Story: 2005861
Task: 35826
Task: 36606
Task: 36467
Task: 36456
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Updated to read the host cluster-host parameter in the /etc/hosts
file.
Replaced references of infra network with cluster-host network
Story: 2004273
Task: 29473
Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
All rmon resource monitoring has been moved to collectd.
This update removes rmon from mtce and the load.
Story: 2002823
Task: 30045
Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d
Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Use Openstack Barbican API to retrieve BMC passwords stored by SysInv.
See SysInv commit for details on how to write password to Barbican.
MTCE finds the corresponding secret by host uuid and retrieves the
secret payload associated with it. mtcSecretApi_get is used to find
the secret reference, based on a hostname. mtcSecretApi_read is used
to read a password using the reference found in the previous step.
Also, did a little cleanup and removed old unused token handling code.
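A hedged outline of the two-step lookup ; the prototypes below are
placeholders around the mtcSecretApi_get / mtcSecretApi_read flow,
not the actual signatures:

    // Hypothetical flow outline ; not the actual mtce prototypes.
    #include <string>

    // step 1 (mtcSecretApi_get): find the secret reference by name
    std::string get_secret_reference ( const std::string & host_uuid )
    {
        // GET <barbican>/v1/secrets?name=<host_uuid> -> secret href
        return "<secret href>" ; /* placeholder */
    }

    // step 2 (mtcSecretApi_read): read the payload via that reference
    std::string read_secret_payload ( const std::string & secret_ref )
    {
        // GET <secret_ref>/payload -> BMC password
        return "<bmc password>" ; /* placeholder */
    }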
Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.
When a host reaches a particular failure mode's maximum failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way with no further recovery action until
it is manually locked and unlocked.
The thresholded Enable failure causes are:
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins
This update refactors the old auto recovery for AIO SX into this
more generic framework.
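A hedged sketch of the generic thresholding ; the counter and names
are illustrative, the threshold and interval values come from the
list above:

    // Hypothetical sketch ; not the mtcAgent's actual enable handler.
    struct enable_failure
    {
        int count          = 0 ;
        int threshold      = 2 ;  /* back-to-back tries allowed */
        int retry_interval = 30 ; /* secs between retries       */
    };

    // returns true when auto recovery should stop for this host
    bool enable_failure_handler ( enable_failure & f )
    {
        if ( ++f.count >= f.threshold )
        {
            // set unlocked-disabled-failed ; recover only on a
            // manual lock/unlock
            return true ;
        }
        // otherwise retry the enable after f.retry_interval seconds
        return false ;
    }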
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior ; reset progression, retry forever - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
After discussion with Eslimi, this patch disables DDNS in dhclient,
as network port 2105 used by dhclient conflicts with the same port
used by the mtcClient. We now change the port used by the mtcClient
from 2105 to 2118 to fix the conflict; this patch can then be removed.
Deployment test passed.
Story: 2003757
Task: 26445
Change-Id: I70559d73f51f85c840042cc4fc206fcd5bc3de27
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>