The existing /var/run/.node_locked flag file is volatile, meaning it
is lost over a host reboot, which has Dead Office Recovery (DOR)
implications. Service Management (SM) sometimes selects and activates
services on a locked controller following a DOR.
This update is part one of a two-part update that solves both
of the above problems. Part two is a change to SM in the ha git.
This update can be merged without part two.
This update maintains the existing volatile node locked file because
it is read by other system services. So, to minimize the change and
therefore the patchback impact, a new non-volatile 'backup' of the
existing node locked flag file is created.
This update modifies the mtcAgent and mtcClient to manage the new
backup file in sync with the existing one, guaranteeing their
simultaneous presence or absence.
Note: A design choice was made to directly manage both files rather
than add support to manage a symlink of one to the other. This
approach was chosen for its simplicity and reliability.
At some point in the future the volatile file could be
deprecated, contingent upon identifying and updating all
services that directly reference it.
This update also removes some dead code that was adjacent to this update.
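The paired-file management reduces to a sketch like the following ;
the volatile path is real, while the backup path and function names
are illustrative assumptions, not the actual mtce code:

    #include <fcntl.h>
    #include <unistd.h>

    static const char *VOLATILE_FLAG = "/var/run/.node_locked";
    static const char *BACKUP_FLAG   = "/etc/mtc/.node_locked"; // assumed

    static void touch(const char *path)
    {
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
    }

    // keep both flag files in lockstep so they are always present or
    // absent together
    static void manage_node_locked_files(bool locked)
    {
        if (locked)
        {
            touch(VOLATILE_FLAG);
            touch(BACKUP_FLAG);
        }
        else
        {
            unlink(VOLATILE_FLAG);
            unlink(BACKUP_FLAG);
        }
    }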
Test Plan: This test plan covers the maintenance management of
both files to ensure they always align and the expected
behavior exists.
PASS: Verify AIO DX Install.
PASS: Verify Storage System Install.
PASS: Verify Swact back and forth.
PASS: Verify mtcClient and mtcAgent logging.
PASS: Verify node lock/unlock soak.
Non-volatile (Nv) node locked management test cases:
PASS: Verify Nv node locked file is present when a node is locked.
Confirmed on all node types.
PASS: Verify any system node install comes up locked with both node
locked flag files present.
PASS: Verify mtcClient logs when a node is locked and unlocked.
PASS: Verify Nv node locked file present/absent state mirrors the
already existing /var/run/.node_locked flag file.
PASS: Verify node locked file is present on controller-0 during
ansible run following initial install and removed as part
of the self-unlock.
PASS: Verify the Nv node locked file is removed over the unlock
along with the administrative state change prior to the
unlock reboot.
PASS: Verify both node locked files are always present or absent
together.
PASS: Verify node locked file management while the management
interface is down. File is still managed over cluster network.
PASS: Verify node locked file management while the cluster interface
is down. File is still managed over management network.
PASS: Verify behavior if the new unlocked message is received by an
mtcClient process that does not support it ; unknown command log.
PASS: Verify a node locked state is auto corrected while not in a
locked/unlocked action change state.
... Manually remove either file on locked node and verify
they are both recreated within 5 seconds.
... Manually create either node locked file on unlocked worker
or storage node and verify the created files are removed
within 5 seconds.
Note: doing this to the new backup file on the active
controller will cause SM to shutdown as expected.
PASS: Verify Nv node locked file is auto created on a node that
spontaneously rebooted while it was unlocked. During the
reboot the node was administratively locked.
The node should come online with both node locked files present.
Partial-Bug: 2051578
Change-Id: I0c279b92491e526682d43d78c66f8736934221de
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Maintenance interfaces with sysinv, sm and the vim using http requests.
Request timeouts have an implicit delay between retries. However,
command failures or outright connection failures don't.
This has only become obvious in mtce's communication with the vim,
where a process startup timing change appears to leave the 'vim'
not yet ready to handle commands when the mtcAgent starts sending
them following a platform services group startup by SM.
This update adds a 10 second http retry wait as a configuration option
to mtc.conf. The mtcAgent loads this value at startup and uses it
in a new HTTP__RETRY_WAIT state of the http request work FSM.
The number of retries remains unchanged. This update is only forcing
a minimum wait time between retries, regardless of cause.
Failure path testing was done using Fault Insertion Testing (FIT).
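For illustration, the new wait state behavior amounts to a minimal
sketch like this, assuming the retry wait value was read from
mtc.conf at daemon startup (names are illustrative, not the actual
FSM code):

    #include <ctime>

    struct httpReq
    {
        int    retry_wait_secs; // from mtc.conf ; defaults to 10
        time_t wait_until;      // absolute expiry of the retry wait
    };

    // enter HTTP__RETRY_WAIT on any failure ; timeout, command error
    // or outright connection failure
    static void start_retry_wait(httpReq &req)
    {
        req.wait_until = time(nullptr) + req.retry_wait_secs;
    }

    // the request work FSM polls this before re-issuing the request
    static bool retry_wait_expired(const httpReq &req)
    {
        return (time(nullptr) >= req.wait_until);
    }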
Test Plan:
PASS: Verify the reported issue is resolved by this update.
PASS: Verify http retry config value load on process startup.
PASS: Verify the updated value is used following a process SIGHUP.
PASS: Verify default value if new mtc.conf config value is not found.
PASS: Verify http connection failure http retry handling.
PASS: Verify http request timeout failure retry handling.
PASS: Verify http request operation failure retry handling.
Regression:
PASS: Build and install ISO - Standard and AIO DX.
PASS: Verify http failures do not fail a lock operation.
PASS: Verify host unlock fails if its http done queue shows failures.
PASS: Verify host swact.
PASS: Verify handling of random and persistent http errors involving
the need for retries.
Closes-Bug: 2047958
Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.
This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.
This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are with
the immediate.
This update also removed a power-on handling case that could
have led to a stuck state. This case was virtually impossible
to hit based on the required sequence of intermittent command
failures, but that scenario handling was fixed up anyway.
Issues have been seen with power-off handling on some servers.
It is suspected that those servers need more time to power off.
So, this update introduces a 30 second delay following a power-off
command before issuing the power status query, to give the server
some time to power off before retrying the power-off command.
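The graceful-to-immediate split reduces to a selection like the
following sketch, shown with standard Redfish ResetType values for
illustration (the actual handler covers the IPMI equivalents too):

    // one initial try plus up to 10 retries ; the first try and
    // retries 1..5 use the graceful action, retries 6..10 immediate
    static const char *select_reset_action(int retry /* 0 = first try */)
    {
        return (retry <= 5) ? "GracefulRestart" : "ForceRestart";
    }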
Test Plan: Both IPMI and Redfish
PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on handling with/without retries (none/mid/max)
PASS: Verify power-off handling with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios
PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging
Regression:
PASS: verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on
Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
The luks-fs-mgr service creates and unseals the LUKS volume used to
store keys/secrets. This change handles the failure case where this
essential service is inactive. It introduces a new alarm,
LUKS_ALARM_ID, which is raised if the service is inactive, implying
an issue in creating or unsealing the LUKS volume.
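A minimal sketch of the intended assert/clear behavior follows ;
the alarm id is from this change, but the helper functions are
illustrative stubs, not the FM API:

    #include <cstdio>

    #define LUKS_ALARM_ID "200.016"

    // stubs standing in for the real FM alarm calls
    static void raise_alarm(const char *id) { printf("raise %s\n", id); }
    static void clear_alarm(const char *id) { printf("clear %s\n", id); }

    static void luks_audit(bool service_active, bool &alarm_raised)
    {
        if (!service_active && !alarm_raised)
        {
            raise_alarm(LUKS_ALARM_ID); // node degrades (or fails in-test)
            alarm_raised = true;
        }
        else if (service_active && alarm_raised)
        {
            clear_alarm(LUKS_ALARM_ID); // node returns to available
            alarm_raised = false;
        }
    }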
Test Plan:
PASS" build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with luks volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
volume status active. There should not be any alarm and node
status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
is inactive. Node availability should be failed. A critical
alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on controller-0. Node availability
should be failed. A critical alarm with id 200.016 should be
displayed with 'fm alarm-list'
PASS: Standard: node enable failure on the node (controller-0,
controller-1, storage-0, compute-1). Node availability
should be failed. A critical alarm with id 200.016 should be
displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In service volume inactive should be detected and a
critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: AIO-SX In service volume inactive status should be detected
and a critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service
volume inactive status should be detected and a
critical alarm should be raised with ID 200.016. Node
availability should be changed to degraded.
PASS: AIO-DX In service: If volume becomes active and a LUKS alarm
is active, alarm should be cleared. Node availability should
be changed to available.
PASS: AIO-SX In service: If volume becomes active and a LUKS alarm is
active, alarm should be cleared. Node availability should be
changed to available.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service:
If volume becomes active and a LUKS alarm is active, alarm
should be cleared. Node availability should be changed to
available.
PASS: AIO-SX, AIO-DX, Standard: If intest fails and node availability
is 'failed', then after fixing the volume issue a lock/unlock
should make the node available.
Story: 2010872
Task: 49108
Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
This update solves two issues involving bmc reset.
Issue #1: A race condition can occur if the mtcAgent finds an
unlocked-disabled or heartbeat failing node early in
its startup sequence, say over a swact or an SM service
restart and needs to issue a one-time-reset. If at that
point it has not yet established access to the BMC then
the one-time-reset request is skipped.
Issue #2: When the issue #1 race condition does not occur, i.e. BMC
access is established, the mtcAgent will issue its one-time
reset to a node. If this occurs as a result of a crashdump
then this one-time reset can interrupt the collection of
the vmcore crashdump file.
This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.
The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.
To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.
It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
almost-done recovery. The number of seconds until the pending reset
is issued is logged periodically as a countdown.
It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.
However, if the node is truly isolated or the shutdown sequence
hangs then, although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node's
recovery-to-online time is longer than the bmc reset delay.
This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.
Some consistency driven logging improvements were also implemented.
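The posted reset bookkeeping reduces to a sketch like this
(illustrative names ; the real change is the new 'reset send wait'
phase in the reset progression handler):

    #include <ctime>

    struct pendingReset
    {
        bool   posted;
        time_t fire_at;
    };

    static void post_bmc_reset(pendingReset &p, int delay_secs /* 300 */)
    {
        p.fire_at = time(nullptr) + delay_secs;
        p.posted  = true;
    }

    // periodic audit ; the countdown is logged elsewhere
    static bool reset_due(pendingReset &p, bool node_online)
    {
        if (!p.posted)
            return false;
        if (node_online)
        {
            p.posted = false; // recovered ; cancel the posted reset
            return false;
        }
        if (time(nullptr) >= p.fire_at)
        {
            p.posted = false;
            return true; // send the bmc reset now
        }
        return false;
    }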
Test Plan:
PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
reboot before bmc reset delay and still seeing mtcAlive on one
or more links. This handles the cluster-host only heartbeat loss
case. The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown
Failure Mode Cases:
PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host
Regression:
PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability
Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
vim notification
Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)
Story: 2010533
Task: 47338
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Board Management Controller) is provisioned.
The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level ; "/redfish/v1/Systems".
The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.
However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.
This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.
For more information refer to section 6.3, ActionInfo 1.3.0, of
the Redfish Data Model Specification linked in the launchpad report.
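As a sketch of the learning fallback, assuming a jsonUtil-style list
helper (get_json_list is a hypothetical stand-in, not the actual
mtce parser):

    #include <list>
    #include <string>

    // hypothetical stand-in for the real json list extraction utility
    static std::list<std::string> get_json_list(const std::string &url,
                                                const std::string &label)
    {
        (void)url; (void)label;
        return {};
    }

    static std::list<std::string> learn_reset_actions(const std::string &sys)
    {
        // level 1: Actions annotation at /redfish/v1/Systems/<id>
        auto actions =
            get_json_list(sys, "ResetType@Redfish.AllowableValues");

        // level 2 (this update): ActionInfo extension at
        // /redfish/v1/Systems/<id>/ResetActionInfo ->
        //     Parameters -> AllowableValues
        if (actions.empty())
            actions = get_json_list(sys + "/ResetActionInfo",
                                    "AllowableValues");
        return actions;
    }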
Test Plan:
PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from newly added second
level query from ActionInfo service.
Failure Handling: Significant failure path testing with this update
PASS: Verify Redfish protocol is periodically retried from start
when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
bm_type=dynamic but failed connect using redfish.
Connection failures in the above cases include
- redfish bmc root query fails
- redfish bmc info query fails
- redfish bmc load power/reset control actions fails
- missing second level Parameters label list
- missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi, from a failing
and retrying bm_type=redfish, after a switch to bm_type=dynamic
or bm_type=ipmi via the sysinv update command.
Regression:
PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging
Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The cause of the
second reboot was its reported uptime in the first mtcAlive
message, following the reboot, being greater than 10 minutes.
Maintenance has a long standing graceful recovery threshold of
10 minutes, meaning that if a host loses heartbeat and enters
Graceful Recovery, and the uptime value extracted from the first
mtcAlive message following the recovery of that host exceeds 10
minutes, then maintenance interprets that the host did not reboot.
If a host goes absent for longer than this threshold then for
reasons not limited to security, maintenance declares the host
as 'failed' and force re-enables it through a reboot.
With the introduction of containers and addition of new features
over the last few releases, boot times on some servers are
approaching the 10 minute threshold and in this case exceeded
the threshold.
The primary fix in this update is to increase this long standing
threshold to 15 minutes to account for evolution of the product.
During the debug of this issue a few other undesirable behaviors
related to Graceful Recovery were observed, and the following
additional changes were implemented.
- Remove hbsAgent process restart in ha service management
failover failure recovery handling. This change is in the
ha git with a loose dependency placed on this update.
Reason: https://review.opendev.org/c/starlingx/ha/+/788299
- Prevent the hbsAgent from sending heartbeat clear events
to maintenance in response to a heartbeat stop command.
Reason: Maintenance receiving these clear events while in
Graceful Recovery causes it to pop out of graceful
recovery only to re-enter as a retry and therefore
needlessly consumes one (of a max of 5) retry count.
- Prevent successful Graceful Recovery until all heartbeat
monitored networks recover.
Reason: If heartbeat of one network, say cluster, recovers but
another (management) does not, then it's possible the
max Graceful Recovery retries could be reached quite
quickly while the other network remains failed,
causing maintenance to fail the host and
force a full enable with reboot.
- Extend the wait for the hbsClient ready event in the graceful
recovery handler timeout from 1 minute to the worker config timeout.
Reason: To give the worker config time to complete before force
starting the recovery handler's heartbeat soak.
- Add Graceful Recovery Wait state recovery over process restart.
Reason: Avoid double reboot of Gracefully Recovering host over
SM service bounce.
- Add requirement for a valid out-of-band mtce flags value before
declaring configuration error in the subfunction enable handler.
Reason: Rebooting the active controller can sometimes result in
a falsely reported configuration error due to the
subfunction enable handler interpreting a zero value as
a configuration error.
- Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
Reason: To assist log analysis and issue debug
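The primary fix reduces to a threshold comparison, sketched here
with an illustrative constant name:

    // graceful recovery 'did the host reboot' test ; threshold raised
    // from 600 to 900 seconds by this update
    static const unsigned int RECOVERY_UPTIME_THRESHOLD = 900;

    static bool host_rebooted(unsigned int first_mtcAlive_uptime)
    {
        return (first_mtcAlive_uptime < RECOVERY_UPTIME_THRESHOLD);
    }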
Test Plan:
PASS: Verify handling active controller reboot
cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
cases: with and without timeout, with and without bmc
cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby
takeover where the standby has been up for less than 15 minutes.
Regression:
PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
controller' and worker nodes in AIO Plus.
PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing
Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.
Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.
This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for the AIO SX system type where
there is no controller-1 provisioned ; no send, no error, no flood.
Since this update needed to add a system type check, this update
also implemented a system type definition rename from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable.
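The fix amounts to a guard in the send utility, sketched below with
illustrative names:

    enum system_type_enum { SYSTEM_TYPE_AIO_SX, SYSTEM_TYPE_OTHER };

    static void send_mtcAlive(int controller, system_type_enum type)
    {
        // AIO SX has no controller-1 ; sending to its uninitialized
        // cluster socket is what triggered the reinit/log flood
        if ((type == SYSTEM_TYPE_AIO_SX) && (controller == 1))
            return; // no send, no error, no flood

        // ... transmit mtcAlive to 'controller' ...
    }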
Test Plan:
PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)
Regression:
PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity
Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A configuration failure alarm can get stuck asserted if
that node experiences an uncontrolled reboot that recovers
without a configuration failure.
This update adds an in-service test that audits host health
while there is a configuration failure alarm raised and
clears that alarm if the failure condition goes away. This
could be a result of an in-service manifest that runs and
corrects the configuration, or of the node rebooting and coming
back up in a healthy (properly configured) state.
Also fixed a bug that was clearing config alarm severity state
when a heartbeat clear event was received.
This update also goes a step further and introduces an
alarm state audit that detects and corrects maintenance
alarm state mismatches.
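The in-service test logic reduces to a sketch like the following ;
helper behavior is illustrative, not the mtce API:

    // helper state passed in for illustration ; the real test queries
    // the node's config state and mtce's alarm bookkeeping
    static void config_alarm_insv_test(bool config_ok, bool &alarm_raised)
    {
        if (config_ok && alarm_raised)
        {
            // failure went away (in-service manifest corrected it, or
            // the node rebooted back up properly configured) ; clear
            alarm_raised = false;
        }
        else if (!config_ok && !alarm_raised)
        {
            // new config failure ; raise and degrade (active
            // controller) or fail (other hosts)
            alarm_raised = true;
        }
    }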
Test Plan:
PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
... degrade - active controller
... fail - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch
Regression:
PASS: Verify enable sequence config failure handling
PASS: ... active controller - recoverable degrade
PASS: ... other nodes - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install
Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update adds the ability for SM to passively
request the mtcClient to BMC reset its peer controller
as a means to recover a severely loaded active controller.
To do this the mtcAgent is modified to keep the controllers'
mtcClients updated with the BMC info of its peer.
The mtcClient is modified to audit for the SM signal and then,
when asserted, issue a BMC reset of its peer controller using
an ipmitool system call.
The ability to command the peer mtcClient to 'sync'
prior to the BMC reset is implemented but configured
disabled for now.
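A sketch of the client-side audit follows. The signal plumbing and
command assembly are illustrative assumptions ; the real client
builds the ipmitool command from the peer BMC info pushed down by
the mtcAgent:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    // assumed signal check ; the real mechanism is an SM-owned
    // indication audited by the mtcClient
    static bool sm_requested_peer_reset(void) { return false; }

    static void peer_reset_audit(const std::string &bmc_ip,
                                 const std::string &bmc_user,
                                 const std::string &pw_file)
    {
        if (!sm_requested_peer_reset())
            return;
        std::string cmd = "ipmitool -I lanplus -H " + bmc_ip +
                          " -U " + bmc_user + " -f " + pw_file +
                          " chassis power reset";
        if (system(cmd.c_str()) != 0)
            fprintf(stderr, "peer bmc reset command failed\n");
    }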
Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
Partial-Bug: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
When Multi-Node Failure Avoidance (MNFA) occurs,
maintenance commands the Heartbeat Agent to slow
down by a factor of 4.
The rate recovery following an MNFA is not occurring.
Update https://review.opendev.org/#/c/701057 made
a condition check change that introduced this issue
by requiring mnfa_timeout to be non-zero before an
attempt is made to recover heartbeat period following
MNFA recovery.
This update switches that condition check to use the more
specific mnfa_backoff state tracker and, because MNFA
is a global maintenance mode feature rather than a
node specific feature, moves the recovery check code
from the node level fsm into a mnfa_recovery_handler
called in the main select loop.
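A minimal sketch of the moved check (names illustrative):

    // called from the main select loop ; MNFA is a global mode, not
    // a per-node fsm concern
    static void mnfa_recovery_handler(bool &mnfa_backoff,
                                      bool  mnfa_active)
    {
        if (mnfa_backoff && !mnfa_active)
        {
            // restore the normal heartbeat period (undo the 4x
            // backoff) ; e.g. send the period to the hbsAgent here
            mnfa_backoff = false;
        }
    }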
Test Plan:
PASS: Verify MNFA handling/recovery with mnfa_timeout!=0
that expires.
PASS: Verify MNFA handling/recovery when mnfa_timeout!=0
but before the timeout expires.
PASS: Verify MNFA handling/recovery when mnfa_timeout=0
PASS: Verify MNFA backoff rate recovery over mtcAgent
process restart.
PASS: Verify MNFA backoff rate is sent to hbsAgent if
hbsAgent restarts while MNFA is active.
Change-Id: I8da5a000ab503692c7cfa620233ed8aa772c50f8
Closes-Bug: #1893212
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.
This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in the already existing mtce_info database field.
The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.
This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. Towards that this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it at the system level.
A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered,
bm_type always comes in as 'bmc', which gets interpreted as 'dynamic'
to maintain the existing configuration.
The following additional issues were also fixed in this update.
1. The nodeTimers module defaults the 'ring' member of timers that are
not running to false when it should default to true.
2. Added a pingUtil_restart function to facilitate quicker sensor
monitoring following provisioning changes and bmc access failures.
3. Enhanced the hardware monitor sensor grouping filter to accommodate
non-standard Redfish readout labelling so that more sensors fall
into the existing canned groups ; leads to more monitored sensors.
4. Added a 'http security mode' to hardware monitor messaging. This
defaults to https as that is all that is supported by the Redfish
implementation today. This field can be used to specify non-secure
'http' mode in the future when that gets implemented.
5. Ensure the hardware monitor performs a bmc password re-fetch on every
provisioning change.
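The bm_type interpretation reduces to a mapping like this sketch
(illustrative ; not the actual provisioning code):

    #include <string>

    // 'bmc' is treated as 'dynamic' (learned) until the inventory
    // update that exposes host-level selection is delivered
    static std::string bmc_access_method(const std::string &bm_type)
    {
        if ((bm_type == "ipmi") || (bm_type == "redfish"))
            return bm_type; // forced protocol
        return "dynamic";   // learn it, then persist in mtce_info
    }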
Test Plan:
PASS: Verify bmc access protocol stored/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned
Hardware Monitor:
PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change
PASS: Verify sensor model is not relearned over a username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)
Regression:
PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses
PEND: Verify System Install
Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The maintenance alarm handling daemon (mtcalarmd) should not
drop alarm requests simply because the FM process is not running.
Instead it should retry, for that and other FM error cases that
will likely succeed in time if they are retried.
Some error cases however do need to be dropped such as those
that are unlikely to succeed with retries.
Reviewed FM return codes with the FM designer, which led to a list
of errors that should drop and others that should retry.
This update implements that handling with a posting and
servicing of a first-in / first-out alarm queue.
Typical retry case is the NOCONNECT error code which occurs
when FM is not running.
Alarm ordering and first try timestamp is maintained.
Retries and logs are throttled to avoid flooding.
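The queue servicing reduces to a sketch like the following, with a
stubbed FM call standing in for the real API:

    #include <queue>

    struct alarmRequest { /* alarm id, entity, severity, timestamp */ };
    enum fmResult { FM_OK, FM_RETRY /* e.g. NOCONNECT */, FM_DROP };

    // stub standing in for the real FM call
    static fmResult fm_send(const alarmRequest &) { return FM_OK; }

    static std::queue<alarmRequest> alarm_fifo; // preserves order

    static void service_alarm_queue(void)
    {
        if (alarm_fifo.empty())
            return;
        switch (fm_send(alarm_fifo.front()))
        {
            case FM_OK:   // delivered ; first-try timestamp preserved
            case FM_DROP: // unlikely to ever succeed ; drop it
                alarm_fifo.pop();
                break;
            case FM_RETRY: // e.g. FM not running ; leave queued
                break;     // (retries and logs are throttled)
        }
    }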
Test Plan:
PASS: Verify success path alarm handling End-to-End.
PASS: Verify retry handling while FM is not running.
PASS: Verify handling of all FM error codes (fit tool).
PASS: Verify alarm handling under stress (inject-alarm script) soak.
PASS: verify no memory leak over stress soak.
PASS: Verify logging (success, retry, failure)
PASS: Verify alarm posted date is maintained over retry success.
Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
Closes-Bug: 1841653
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update delivers redfish support for Power-On/Off, Reset
and Netboot Reinstall handling to maintenance.
Test Plan: (Testing Continues)
PASS: Verify Redfish Power-Off action handling
PASS: Verify Redfish Power-On action handling
PASS: Verify Redfish Reset action handling
PASS: Verify compute Redfish Reinstall action handling from controller-0
PASS: Verify compute Redfish Reinstall action handling from controller-1
PASS: Verify Redfish Power-Off Action failure handling
PASS: Verify Redfish Power-On action failure handling
PASS: Verify Redfish Reset action failure handling
PASS: Verify Redfish Re-Install action failure handling
PASS: verify Reset progression cycle does not leak memory.
PASS: Verify bmc_handler failure handling does not leak memory.
PASS: Verify Inservice BMC access (ping) failure and recovery handling.
PASS: Verify BMC access failure alarm handling
PASS: Verify BMC provisioning and deprovisioning soak (redfish - wolfpass)
PASS: Verify BMC provisioning and deprovisioning does not leak memory.
PASS: Verify BMC provisioning handling with bad ip and/or bad username
PASS: Verify BMC reprovisioning to same protocol
PASS: Verify BMC reprovisioning from ipmi host to redfish host
PASS: Verify BMC reprovisioning from redfish host to ipmi host
PASS: Verify mixed protocol support in same lab
PASS: Verify mixed server support in same lab
PASS: Verify Large System Install with BMCs provisioned (wp8-12)
PASS: Verify bmc access method (learn,ipmi,redfish) learned from mtc.init
PASS: Verify Swact with BMCs provisioned.
PASS: Verify no segfaults.
PASS: Verify AIO System Install in lab that supports redfish (WC3-6, WP8-12, Dell 720 3-7)
PASS: Verify AIO Simplex Install with Redfish Support (SM1, SM3)
PASS: Verify AIO Duplex Install with Redfish Support (SM 5-6, Dell 720 1-2)
Usability:
PASS: Verify handling of reprovisioning BMC between hosts that support
different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify fault insertion for both protocols and action handling.
PASS: Verify protocol select handover.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
has been created using that protocol.
PASS: Verify handling of missing bmc_access_method configuration select.
PASS: Verify inservice bmc_access_method service parameter modification handling.
Regression:
PASS: Verify redfish BMC info query logging.
PASS: Verify sensor monitoring and alarming still works.
PASS: Verify all power/reset/netboot commands for IPMI
PASS: Verify reprovisioning soak of Wolfpass servers
PASS: Verify reprovisioning soak of SM servers
Depends-on: https://review.opendev.org/#/c/679178/
Change-Id: I984057e04d7426e37d675cf4d334a4e35419f2e8
Story: 2005861
Task: 35826
Task: 36606
Task: 36467
Task: 36456
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Add a redfish hwmon thread function and related parse functions
for Power and Thermal sensor data.
Removed some unused old functions.
Renamed common functions and variables with a bmc prefix.
Testing done for this patch on a simplex bare metal setup:
system host-sensor-list
system host-sensor-show
system host-sensorgroup-list
system host-sensorgroup-show
system host-sensorgroup-relearn
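For context, a sketch of the Thermal parse shape follows ; the
Temperatures/ReadingCelsius fields are standard Redfish schema,
while the types and update call are illustrative assumptions:

    #include <string>
    #include <vector>

    // illustrative shape of one "Temperatures" array entry from
    // GET /redfish/v1/Chassis/<id>/Thermal
    struct redfishTemp
    {
        std::string name;            // "Name"
        double      reading_celsius; // "ReadingCelsius"
        std::string health;          // "Status" : { "Health" }
    };

    static void parse_thermal(const std::vector<redfishTemp> &temps)
    {
        for (const auto &t : temps)
        {
            // feed each reading into the hwmon sensor database ;
            // hwmon_update_sensor() is an assumed stand-in
            // hwmon_update_sensor(t.name, t.reading_celsius, t.health);
            (void)t;
        }
    }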
Story: 2005861
Task: 35815
Depends-on: https://review.opendev.org/#/c/671340
Change-Id: If8a35581d44df15749a049eda945f23d2323fd35
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
In some parts of the mtce code implicit fallthroughs are used.
This causes a warning in the compiler, and in OSes like openSUSE the
-Werror flag is enforced, leading to a build error.
In this commit the MTCE_FALLTHROUGH macro is used to tell the
compiler not to worry about these implicit fallthroughs as the code
works as intended.
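A sketch of such a macro and its use (the actual mtce definition may
differ):

    // guarded for older compilers
    #if defined(__GNUC__) && (__GNUC__ >= 7)
    #define MTCE_FALLTHROUGH __attribute__((fallthrough))
    #else
    #define MTCE_FALLTHROUGH ((void)0)
    #endif

    static int handler(int stage, int &retries)
    {
        switch (stage)
        {
            case 0:
                retries++;
                MTCE_FALLTHROUGH; // intentional ; silences the warning
            case 1:
                return retries;
            default:
                return -1;
        }
    }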
Change-Id: I608d80eaa7298d0613ffa62ee82e03463d193d87
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
This update
1. Refactors some of the common maintenance ipmi
definitions and utilities into a more generic
'bmcUtil' module to reduce code duplication and
improve code reuse with the introduction of a second
bmc communication protocol ; redfish.
2. Creates a new 'redFishUtil' module similar to the existing
'ipmiUtil' module but in support of common redfish
utilities and definitions that can be used by both
maintenance and the hardware monitor.
3. Moves the existing 'mtcIpmiUtil' module to a more common
'mtcBmcUtil' and renames the 'ipmi_command_send/recv' to
the more generic 'bmc_command_send/recv' which are enhanced
to support both ipmi and redfish bmc communication methods.
4. Renames the bmc info collection and connection monitor ;
'bm_handler' to 'bmc_handler' and adds support necessary
to learn if a host's bmc supports redfish.
5. Renames the existing 'mtcThread_ipmitool' to a more common
'mtcThread_bmc', adds redfishtool support for the now common
set of bmc thread commands, and adds the new
redfishtool bmc query, aka 'redfish root query', used to
detect if a host's bmc supports redfish.
Note: This aspect is the primary feature of this update.
Namely the ability to detect and print a log indicating
if a host's bmc supports redfish.
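The root query support check reduces to a sketch like this ; a
Redfish ServiceRoot payload (GET /redfish/v1/) advertises
'RedfishVersion' when supported, while the function name here is
illustrative:

    #include <string>

    static bool bmc_supports_redfish(const std::string &root_response)
    {
        // presence of the ServiceRoot's "RedfishVersion" property in
        // the root query response indicates redfish support ; absence
        // or a failed query means fall back to ipmi
        return (root_response.find("RedfishVersion")
                != std::string::npos);
    }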
Test Plan:
PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reinstall (netboot) command handling.
PASS: Verify logging when redfish is not supported.
PASS: Verify logging when redfish is supported.
PASS: Verify ipmitool is used regardless of redfish support.
PASS: Verify mtce thread error handling for both protocols.
Change-Id: I72e63958f61d10f5c0d4a93a49a7f39bdd53a76f
Story: 2005861
Task: 35825
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Long hostname support introduced a bug that causes the mtcAgent
to reject hardware monitor degrade requests due to the originating
service (daemon) not being recognized.
This update fixes the parsed parameters in mtcAgent and adds
a sensor parm to the degrade API so that the sensor name
accompanying the degrade event can be logged in mtcAgent.
Test Plan: for hwmond degrade handling
PASS: verify degrade assert and sensor name in mtcAgent degrade assert log
PASS: Verify degrade clear handling and log
Change-Id: I5c11cc5f679f21e6aadd4d5be25e6c08a241e80b
Closes-Bug: 1838020
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Maintenance's success path messaging does not depend on cluster
network messaging. However, there are a number of failure modes
that do depend on cluster network messaging to properly
diagnose and offer higher availability handling.
For instance, when the management interface goes down, remote hosts
can be isolated without cluster network messaging. Being able to
command-reboot a host over the cluster-host network offers higher
availability.
Maintenance is designed to use the cluster network, if provisioned, as a
backup path for mtcAlive, node locked, reboot and several other commands
and acknowledgements.
Unfortunately, it was recently observed that maintenance is using
the 'nfs-controller' label to resolve cluster network addressing
which resolves to management network IPs. As a result all messages
intended to be going over the cluster-host network are instead just
redundant management network messages.
During debug of this issue several additional cluster network
messaging related issues were observed and fixed.
This update implements the following fixes
1. since there is no floating address for the cluster network the
mtcClient was modified to send messages to both controllers where
only the active controller will be listening and acting.
2. fixes the port number mtce listens on for cluster-host network messages
3. fixes the port number mtce sends cluster-host network messages to.
4. mtcAlive messages are also sent on provisioned cluster network.
5. locked state notifications and acks sent on provisioned cluster network.
6. reboot request and acks sent on provisioned cluster network.
7. fixed command acknowledgement messaging.
This update also
1. envelopes the mtcAlive gate control to allow debug tracing of all gate
state changes.
2. moves graceful recovery handling heartbeat failure state clear to the
end of the recovery handler, just before heartbeat start.
3. adds sm unhealthy support to fail and automatically recover the
inactive controller from an SM UNHEALTHY state.
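Fix #1 reduces to a send loop like the following sketch ; the
socket plumbing names are illustrative assumptions:

    #include <cstddef>
    #include <sys/socket.h>

    struct mtcMsg { char buf[512]; size_t len; };

    extern int              clstr_sock[2];    // per-controller sockets
    extern struct sockaddr *clstr_addr[2];
    extern socklen_t        clstr_addrlen[2];

    static void send_clstr_msg(const mtcMsg &m)
    {
        // no floating cluster-host address exists ; send to both
        // controllers and let only the active one act on the message
        for (int c = 0; c < 2; ++c)
            sendto(clstr_sock[c], m.buf, m.len, 0,
                   clstr_addr[c], clstr_addrlen[c]);
    }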
----------
Test Plan:
----------
Functional:
PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization
System:
PASS: Verify compute system install
PASS: Verify AIO system install
Feature:
PASS: Verify sm node unhealthy handling (active:ignore, inactive:recover)
Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The enabled_count member inside nodeLinkClass is declared with a bool
datatype, but this variable is used as a counter, where an int data
type is a better choice. This change updates the datatype to int.
Also, there was a misleading-indentation warning in mtcNodeHdlrs.cpp,
which is also fixed in this change.
Change-Id: Ib154c5b6ae2e7068870733b5ee8971e20cedb43f
Story: 2005862
Task: 34163
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
Updated to read the host cluster-host parameter in the /etc/hosts
file.
Replaced references of infra network with cluster-host network
Story: 2004273
Task: 29473
Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
The introduction of Ansible requires the execution of a manifest
as part of the first controller's initial unlock.
Unfortunately maintenance issues the lazy self reboot immediately
upon receiving the unlock command, interrupting the in-progress
manifest apply.
This update identifies the initial self reboot of the only
provisioned host condition and waits for up to a timeout
period for an unlock ready signal that is provided by
successful completion of the 'initial-unlock-manifest'.
Seeing the unlock ready signal prior to the timeout allows
the unlock self reboot to proceed normally.
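The wait reduces to a bounded poll like this sketch ; the signal
file path is an illustrative assumption, not the actual signal
mechanism:

    #include <sys/stat.h>
    #include <unistd.h>

    // assumed signal file ; actually provided by successful
    // completion of the initial unlock manifest
    #define UNLOCK_READY_FLAG "/var/run/.unlock_ready"

    static bool wait_for_unlock_ready(int timeout_secs)
    {
        struct stat st;
        for (int i = 0; i < timeout_secs; ++i)
        {
            if (stat(UNLOCK_READY_FLAG, &st) == 0)
                return true; // proceed with the unlock self reboot
            sleep(1);
        }
        return false; // timeout ; allow retry
    }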
Depends-On: https://review.openstack.org/#/c/643914
Story: 2004695
Task: 30243
Test Plan:
PASS: Verify timeout handling - allowing retry
PASS: Verify with signal - immediate
PASS: Verify with signal - before timeout
Change-Id: I3633e772310c36af5df57364f66c14f037b2ea8f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
All rmon resource monitoring has been moved to collectd.
This update removes rmon from mtce and the load.
Story: 2002823
Task: 30045
Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d
Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Use the OpenStack Barbican API to retrieve BMC passwords stored by
SysInv. See the SysInv commit for details on how passwords are
written to Barbican. MTCE finds the corresponding secret by host
uuid and retrieves the secret payload associated with it.
mtcSecretApi_get is used to find the secret reference, based on a
hostname. mtcSecretApi_read is used to read a password using the
reference found in the previous step.
Depends-On: I7102a9662f3757c062ab310737f4ba08379d0100
Change-Id: I66011dc95bb69ff536bd5888c08e3987bd666082
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
This update replaces compute references to worker in mtce,
kickstarts, installer and bsp files.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked/enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.
Once a host reaches a particular failure mode's max failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way, with no further recovery action, until
it is manually locked and unlocked.
The thresholded Enable failure causes are:
Configuration Failure ....... threshold:2 retry interval:30 secs
In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
Start Host Services Failure . threshold:2 retry interval:30 secs
Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins
This update refactors the old auto recovery for AIO SX into this
more generic framework.
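The generic framework reduces to per-failure-mode bookkeeping like
this sketch (names illustrative):

    struct enableFailure
    {
        int count;       // back-to-back failures of this mode
        int threshold;   // e.g. 2
        int retry_delay; // e.g. 30 secs or 10 mins per the table above
    };

    // true means stop auto recovery ; the host is then left
    // unlocked-disabled-failed until a manual lock/unlock
    static bool enable_failure(enableFailure &f)
    {
        return (++f.count >= f.threshold);
    }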
Story: 2003576
Task: 24905
Test Plan:
PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling
PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)
PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message
Regression:
PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT
Corner Cases:
PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.
Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
A few small issues were found during integration testing with SM.
This update delivers those integration tested fixes.
1. Send cluster event to SM only after the first 10 heartbeat
pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full.
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.
These fixes were both UT and IT tested in multiple labs.
Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces mtce changes to support Active-Active Heartbeating.
The purpose of Active-Active Heartbeating is to help avoid Split-Brain.
Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.
This is referred to as the 'heartbeat cluster history'.
Each controller then includes its cluster history in each heartbeat
pulse request message.
The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.
So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.
Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. Same for controller-1 with controller-0's history.
The hbsAgent is then further enhanced to support a query request
for this information.
So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide
which controller to make active.
This involved removing the hbsAgent process from SM control and
monitoring, and adding a new hbsAgent LSB init script for process
launch, a service file to run the init script, and a pmon config
file for hbsAgent process monitoring.
With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.
The hbsAgent running on the inactive controller
- does not send heartbeat events to maintenance
- does not raise or clear alarms or produce customer logs
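The criss-cross reduces to a sketch like the following (types
illustrative):

    // illustrative types ; the real cache is the mtce cluster vault
    struct clusterHistory { /* per-network view, last 5 seconds */ };

    static clusterHistory saved[2]; // indexed by controller (0/1)

    static clusterHistory pulse_response_history(
        int requesting_controller, const clusterHistory &view)
    {
        saved[requesting_controller] = view;     // cache sender's view
        return saved[requesting_controller ^ 1]; // reply with peer's
    }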
Test Plan:
Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms
Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump
Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push
Story: 2003576
Task: 24907
Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This is part one of a two part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view that
is reflective of each controller's connectivity to each host
including its peer controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.
General flow of cluster feature wrt hbsAgent:
hbs_cluster_init: general data init
hbs_cluster_nums: set controller and network numbers
forever:
select:
hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
hbs_sm_handler -> hbs_cluster_send: - send cluster to SM
heartbeating:
hbs_cluster_append: add controller cluster to pulse request
hbs_cluster_update: get controller cluster data from pulse responses
hbs_cluster_save: save other controller cluster view in cluster vault
hbs_cluster_log: log cluster state changes (clog)
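For context, a sketch of the kind of published structure involved ;
the real definition in the shared mtce header is versioned and has
more fields:

    // illustrative layout only
    #define HBS_MAX_NETWORKS  2   /* e.g. mgmnt and infra */
    #define HBS_HISTORY_SIZE 20   /* 20 heartbeat periods ~ 2 secs */

    typedef struct
    {
        unsigned short version;
        unsigned short monitored_hosts;
        /* per-network counts of responding hosts over the window */
        unsigned short hosts_responding[HBS_MAX_NETWORKS]
                                       [HBS_HISTORY_SIZE];
    } hbs_cluster_summary_type;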
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>