59 Commits

Author SHA1 Message Date
Eric MacDonald 649e94c8da Add pxeboot mtcAlive messaging alarm handling
This update adds alarm handling to the recently introduced pxeboot
network mtcAlive messaging; see the Depends-On reviews below.

A new 200.003 maintenance alarm is introduced with the second depends
on update below. This new alarm is MINOR but also Management Affecting
because the pxeboot network is required for node installation.

This update enhances the new pxeboot_mtcAlive_monitor FSM for the
purpose of detecting pxeboot mtcAlive message loss, alarming and
then clearing the alarm once pxeboot mtcAlive messaging resumes.

The new alarm assertion and clear is debounced:
 - alarm is asserted if message loss persists to the accumulation of
   12 missed messages or after 2 minutes of complete message loss.
 - alarm is cleared after decrementing the message missed counter to
   zero or 1 minute of loss-less messaging.
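
The debounce rules above can be sketched as follows. This is a minimal
illustration and not the mtcAgent implementation: the class, its field
names and the one-call-per-monitor-period model are assumptions. Only the
12 miss, 2 minute and 1 minute figures, and the 10 second monitor period
listed in the test plan, come from this commit message.

```python
from dataclasses import dataclass

ASSERT_MISS_THRESHOLD = 12   # accumulated missed messages (from the commit message)
ASSERT_LOSS_SECS = 120       # 2 minutes of complete message loss
CLEAR_GOOD_SECS = 60         # 1 minute of loss-less messaging
MONITOR_PERIOD_SECS = 10     # pxeboot mtcAlive monitor period


@dataclass
class AlarmDebounce:
    """Hypothetical debounce state; call on_period() once per monitor period."""
    missed: int = 0      # accumulated misses, decremented on each good period
    loss_secs: int = 0   # seconds of uninterrupted loss
    good_secs: int = 0   # seconds of uninterrupted loss-less messaging
    alarmed: bool = False

    def on_period(self, received: bool) -> bool:
        """Update the counters for one period and return the alarm state."""
        if received:
            self.loss_secs = 0
            self.good_secs += MONITOR_PERIOD_SECS
            if self.missed > 0:
                self.missed -= 1
            if self.alarmed and (self.missed == 0
                                 or self.good_secs >= CLEAR_GOOD_SECS):
                self.alarmed = False
                self.missed = 0
        else:
            self.good_secs = 0
            self.missed += 1
            self.loss_secs += MONITOR_PERIOD_SECS
            if not self.alarmed and (self.missed >= ASSERT_MISS_THRESHOLD
                                     or self.loss_secs >= ASSERT_LOSS_SECS):
                self.alarmed = True
        return self.alarmed
```

With these numbers, 12 consecutive missed periods raise the alarm and 6
consecutive good periods clear it again.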

Upgrades are supported with the addition of a features list to the
mtcClient ready event. All new mtcClients that support pxeboot network
messaging now publish pxeboot mtcAlive support through this new
features list. This is rendered in the logs like this:

    <hostname> mtcClient ready ; with pxeboot mtcAlive support

The mtcAgent does not expect/monitor pxeboot mtcAlive messages from
hosts that don't publish the feature support.
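
The feature gating described above might look roughly like the sketch
below. The JSON layout of the ready event, the `features` key and the
`pxeboot_mtcAlive` feature string are all hypothetical; the commit message
only states that a features list was added to the mtcClient ready event.

```python
import json

PXEBOOT_FEATURE = "pxeboot_mtcAlive"  # hypothetical feature name


def supports_pxeboot_mtcalive(ready_event: str) -> bool:
    """Return True only if the ready event publishes the pxeboot feature.

    Events from older mtcClients carry no features list, so they are
    treated as not supporting pxeboot mtcAlive and are not monitored.
    """
    try:
        event = json.loads(ready_event)
    except json.JSONDecodeError:
        return False
    if not isinstance(event, dict):
        return False
    features = event.get("features", [])
    return isinstance(features, list) and PXEBOOT_FEATURE in features
```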

Test Plan:

PASS: Verify mtcAlive period is 5 seconds.
PASS: Verify pxeboot mtcAlive monitor period is 10 seconds.
PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every
      mtcAlive monitor miss.
PASS: Verify pxeboot mtcAlive alarm is not raised while a node is
      locked.

Alarm attributes:

PASS: Verify severity is minor.
PASS: Verify alarm is cleared while node is locked.
PASS: Verify alarm can be suppressed while unlocked.
PASS: Verify asserted alarm is management affecting.
PASS: Verify alarm-show output format including cause and repair
      action text.

Process Restart Handling:

PASS: Verify alarm is maintained over a mtcAgent process restart.
PASS: Verify pxeboot monitoring resumes with or without asserted alarm
      immediately following a mtcAgent process restart.
PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging
      immediately following mtcClient process restart for locked or
      unlocked nodes.

Alarm Debounce Handling:

PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss.
PASS: Verify alarm clear after 1 minute of mtcAlive recovery.
PASS: Verify assertion and recovery debounce logging.
PASS: Verify alarm management miss and loss controls handle all
      boundary conditions exercised by a 12 hr soak with randomized
      period between message loss and recovery.

Host Action Handling:

PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable.
PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery.
PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On.
PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset.
PASS: Verify mtcAlive alarm is not raised over a Host Reinstall.
PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling.
PASS: Verify pxeboot alarm handling for node that does not send
      pxeboot mtcAlive after unlock.

Stuck Alarm Avoidance Handling:

PASS: Verify typical alarm assertion and clear handling.
PASS: Verify alarm is maintained or cleared over node reboot if the
      messaging issue persists or resolves over the reboot recovery.
PASS: Verify mtcAlive alarm is maintained over a Swact and cleared
      if the messaging is ok on the newly active controller.
PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled
      Swact due to active controller reboot.
PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot
      messaging recovers over that reboot.

Upgrades Case:

PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients
      that actually support pxeboot network mtcAlive monitoring.

PASS: Verify mtcClient new features list parsing, which enables
      pxeboot mtcAlive monitoring for that node.

PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled
      towards nodes whose mtcClient does not publish pxeboot mtcAlive
      messaging feature support.
PROG: Verify AIO DX upgrade from 22.12 to current master branch.
      Focus on pxeboot messaging over the upgrade process.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654
Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660
Story: 2010940
Task: 49542
Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-04-09 14:13:23 +00:00
Eric MacDonald 14bb67789e Add pxeboot network mtcAlive messaging to Maintenance
The introduction of the new pxeboot network requires maintenance
verify and report on messaging failures over that network.

Towards that, this update introduces periodic mtcAlive messaging
between the mtcAgent and mtcClient.

Test Plan:

PASS: Verify install and provision each system type with a mix
             of networking modes ; ethernet, bond and vlan
             - AIO SX, AIO DX, AIO DX plus
             - Standard System 2+1
             - Storage System 2+1+1
PASS: Verify feature with physical on management interface
PASS: Verify feature with vlan on management interface
PASS: Verify feature with bonded management interface
PASS: Verify feature with bonded vlans on management interface
PASS: Verify in bonded cases handling with 2, 1 or no slaves found
PASS: Verify mgmt-combined or separate cluster-host network
PASS: Verify mtcClient pxeboot interface address learning
             - for worker and storage nodes       ; dhcp leases file
             - for controller nodes before unlock ; dhcp leases file
             - for controller nodes after unlock  ; static from ifcfg
             - from controller within 10 seconds of process restart
PASS: Verify mtcAgent pxeboot interface address learning from
             dnsmasq.hosts file
PASS: Verify pxeboot mtcAlive initiation, handling, loss detection
             and recovery
PASS: Verify success and failure handling of all new pxeboot ip
             address learning functions ;
             - dhcp - all system node installs.
             - dnsmasq.hosts - active controller for all hosts.
             - interfaces.d - controller's mtcClient pxeboot address.
             - pxeboot req mtcAlive - mtcAgent mtcAlive request message.
PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot'
             command handling for ethernet, vlan and bond configs.
PASS: Verify mtcAlive sequence number monitoring, out-of-sequence
             detection, handling and logging.
PASS: Verify pxeboot rx socket binding and non-blocking attribute
PASS: Verify mtcAgent handling stress soaking of sustained incoming
             500+ msgs/sec ; batch handling and logging.
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging,
             failure recovery handling and logging.
PASS: Verify pxeboot receiver is not setup on the oam interface on
             controller-0 first install until after initial config
             complete.

Regression:

PASS: Verify mtcAgent/mtcClient online and offline state management
PASS: Verify mtcAgent/mtcClient command handling
      - over management network
      - over cluster-host network
PASS: Verify mtcClient interface chain log for all iface types
      - bond    : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9
      - vlan    : vlan123 -> enp0s8
      - ethernet: enp0s8
PASS: Verify mtcAgent/mtcClient handling and logging including debug
      logging for standard operations
      - node install and unlock
      - node lock and unlock
      - node reinstall, reboot, reset
PASS: Verify graceful recovery handling of heartbeat loss failure.
      - node reboot
      - management interface down
PASS: Verify systemcontroller and subcloud install with dc-libvirt
PASS: Verify no log flooding, coredumps, memory leaks

Story: 2010940
Task: 49541
Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-28 15:28:27 +00:00
Eric MacDonald d9982a3b7e Mtce: Create non-volatile backup of node locked flag file
The existing /var/run/.node_locked flag file is volatile,
meaning it is lost over a host reboot, which has DOR implications.

Service Management (SM) sometimes selects and activates services
on a locked controller following a DOR (Dead Office Recovery).

This update is part one of a two-part update that solves both
of the above problems. Part two is a change to SM in the ha git.
This update can be merged without part two.

This update maintains the existing volatile node locked file because
it is looked at by other system services. So to minimize the change
and therefore patchback impact, a new non-volatile 'backup' of the
existing node locked flag file is created.

This update incorporates modifications to the mtcAgent and mtcClient,
introducing a new backup file and ensuring their synchronized
management to guarantee their simultaneous presence or absence.
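
A minimal sketch of the synchronized file management described above.
The function names are hypothetical and the file paths are passed in by the
caller; the real path of the new backup file is not given in this commit
message, and this is not the mtcClient code.

```python
import tempfile
from pathlib import Path


def sync_node_locked_files(locked: bool, volatile: Path, backup: Path) -> None:
    """Create both flag files when locked, remove both when unlocked."""
    for flag in (volatile, backup):
        if locked:
            flag.touch(exist_ok=True)
        else:
            flag.unlink(missing_ok=True)


def files_aligned(volatile: Path, backup: Path) -> bool:
    """True when both flag files are present or both are absent."""
    return volatile.exists() == backup.exists()


if __name__ == "__main__":
    # Demonstrate the audit in a scratch directory with stand-in file names.
    with tempfile.TemporaryDirectory() as scratch:
        volatile = Path(scratch) / "volatile_flag"
        backup = Path(scratch) / "backup_flag"
        backup.touch()                      # simulate a misaligned state
        sync_node_locked_files(True, volatile, backup)
        print(files_aligned(volatile, backup))
```

Running the same routine from a periodic audit would give the auto
correction behavior exercised in the test plan, where a manually removed
or created file is brought back into line.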

Note: A design choice was made to manage both files directly rather
      than make one a symlink of the other and add symlink management
      support to the code. This approach was chosen for its simplicity
      and reliability. At some point in the future the volatile file
      could be deprecated, contingent upon identifying and updating
      all services that directly reference it.

This update also removes some dead code that was adjacent to my update.

Test Plan: This test plan covers the maintenance management of
           both files to ensure they always align and the expected
           behavior exists.

PASS: Verify AIO DX Install.
PASS: Verify Storage System Install.
PASS: Verify Swact back and forth.
PASS: Verify mtcClient and mtcAgent logging.
PASS: Verify node lock/unlock soak.

Non-volatile (Nv) node locked management test cases:

PASS: Verify Nv node locked file is present when a node is locked.
      Confirmed on all node types.
PASS: Verify any system node install comes up locked with both node
      locked flag files present.
PASS: Verify mtcClient logs when a node is locked and unlocked.
PASS: Verify Nv node locked file present/absent state mirrors the
      already existing /var/run/.node_locked flag file.
PASS: Verify node locked file is present on controller-0 during
      ansible run following initial install and removed as part
      of the self-unlock.
PASS: Verify the Nv node locked file is removed over the unlock
      along with the administrative state change prior to the
      unlock reboot.
PASS: Verify both node locked files are always present or absent
      together.
PASS: Verify node locked file management while the management
      interface is down. File is still managed over cluster network.
PASS: Verify node locked file management while the cluster interface
      is down. File is still managed over management network.
PASS: Verify behavior if the new unlocked message is received by a
      mtcClient process that does not support it ; unknown command log.
PASS: Verify a node locked state is auto corrected while not in a
      locked/unlocked action change state.
      ... Manually remove either file on locked node and verify
          they are both recreated within 5 seconds.
      ... Manually create either node locked file on unlocked worker
          or storage node and verify the created files are removed
          within 5 seconds.
          Note: doing this to the new backup file on the active
                controller will cause SM to shutdown as expected.
PASS: Verify Nv node locked file is auto created on a node that
      spontaneously rebooted while it was unlocked. During the
      reboot the node was administratively locked.
      The node should come online with both node locked files present.

Partial-Bug: 2051578
Change-Id: I0c279b92491e526682d43d78c66f8736934221de
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 00:54:11 +00:00
Eric Macdonald 50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are with
the immediate.
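
The retry policy above can be sketched as a simple mapping from retry
number to command type. The function name and the "graceful" / "immediate"
return strings are hypothetical; the counts come from this commit message
and its test plan, which also states that the initial try is graceful.

```python
MAX_RETRIES = 10        # retries allowed for power-on, power-off and reset
GRACEFUL_RETRIES = 5    # the first 5 retries use the graceful command


def command_mode(retry: int) -> str:
    """Return the command type for a given try.

    retry 0 is the initial try; retries 1..MAX_RETRIES follow it.
    """
    if not 0 <= retry <= MAX_RETRIES:
        raise ValueError(f"retry must be between 0 and {MAX_RETRIES}")
    return "graceful" if retry <= GRACEFUL_RETRIES else "immediate"
```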

This update also removed a power on handling case that could
have led to a stuck state. This case was virtually impossible
to hit based on the required sequence of intermittent command
failures but that scenario handling was fixed up anyway.

Issues have been seen with the power-off handling on some servers.
Those servers are suspected of needing more time to power-off, so this
update introduces a 30 second delay following a power-off command before
issuing the power status query to give the server some time to
power-off before retrying the power-off command.

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling  with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: Verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
Zuul 125601c2f9 Merge "Failure case handling of LUKS service" 2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra 1210ed450a Failure case handling of LUKS service
luks-fs-mgr service creates and unseals the LUKS volume used to store
keys/secrets. This change handles the failure case if this essential
service is inactive. It introduces an alarm LUKS_ALARM_ID which is
raised if service is inactive which implies that there is an issue in
creating or unsealing the LUKS volume.

Test Plan:
PASS: build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
      volume status active. There should not be any alarm and node
      status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
      is inactive. Node availability should be failed. A critical
      alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on the controller-0. Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list'
PASS: Standard- node enable failure on the node (controller-0,
      controller-1, storage-0, compute-1). Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In service volume inactive should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-SX In service volume inactive  status should be detected
      and a critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service
      volume inactive status should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-DX In service: If volume becomes active and a LUKS alarm
      is active, alarm should be cleared. Node availability should
      be changed to available.
PASS: AIO-SX In service: If volume becomes active and a  LUKS alarm is
      active, alarm should be cleared. Node availability should be
      changed to available.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service:
      If volume becomes active and a LUKS alarm is active, alarm
      should be cleared. Node availability should be changed to
      available.
PASS: AIO-SX, AIO-DX, Standard- If intest fails and node availability
      is 'failed'. After fixing the volume issue, a lock/unlock should
      make the node available.

Story: 2010872
Task: 49108

Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-12-06 00:34:02 -05:00
Teresa Ho 36814db843 Increase timeout for runtime manifest
In management network reconfiguration for AIO-SX, the runtime manifest
executed during host unlock could take more than five minutes to complete.
This commit is to extend the timeout period from five minutes to eight
minutes.

Test Plan:
PASS: AIO-SX subcloud mgmt network reconfiguration

Story: 2010722
Task: 49133

Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2023-11-23 16:51:22 -05:00
Eric MacDonald 79d8644b1e Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat failing node early in
          its startup sequence, say over a swact or an SM service
          restart and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC then
          the one-time-reset request is skipped.

Issue #2: When the issue #1 race condition does not occur before BMC
          access is established the mtcAgent will issue its one-time
          reset to a node. If this occurs as a result of a crashdump
          then this one-time reset can interrupt the collection of
          the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.

To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
almost done recovery. The number of seconds remaining until the
pending reset is logged periodically.

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs then although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node
recovery-to-online time is longer than the bmc reset delay.
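
The posted reset handling described above can be sketched as a small
decision function. The function name, its arguments and the returned
strings are hypothetical. Using "uptime shorter than the delay" as the
sign that the node actually rebooted is an assumption drawn from the test
plan below; this is not the reset progression handler itself.

```python
BMC_RESET_DELAY_SECS = 300  # default of 5 minutes, per this commit message


def posted_reset_action(elapsed_secs: int, online: bool, uptime_secs: int,
                        delay_secs: int = BMC_RESET_DELAY_SECS) -> str:
    """Decide what to do with a posted 'in waiting' bmc reset.

    elapsed_secs: time since the reset was posted
    online:       whether the node has recovered online
    uptime_secs:  uptime reported by the node (ignored when not online)
    """
    if elapsed_secs >= delay_secs:
        return "reset"    # delay expired, send the bmc reset
    if online and uptime_secs < delay_secs:
        return "cancel"   # node rebooted and recovered, reset not needed
    return "wait"         # keep counting down
```

A node that comes back online with a long uptime did not reboot, so in
this sketch the posted reset is left in place as a backup.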

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.

Some consistency driven logging improvements were also implemented.

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
      reboot before bmc reset delay and still seeing mtcAlive on one
      or more links. Handles the cluster-host only heartbeat loss case.
      The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-11-02 20:58:00 +00:00
Kyale, Eliud 502662a8a7 Cleanup mtcAgent error logging during startup
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
  is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
  vim notification

Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)

Story: 2010533
Task: 47338

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
2023-02-14 14:18:02 -05:00
Eric MacDonald da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm requires that no mtcAlive messages
be seen for the full offline_threshold count.

During a slow shutdown the mtcClient runs for longer than
it should and as a result can lead to maintenance seeing
the node as recovered before it should.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
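
A minimal sketch of the search counter management described above. The
class and method names are hypothetical and the threshold is supplied by
the caller, since its configured value is not stated here; the two
attribute names mirror the counters named in this commit message.

```python
class OfflineSearch:
    """Counts consecutive audits in which no mtcAlive message was seen."""

    def __init__(self, offline_threshold: int) -> None:
        self.offline_threshold = offline_threshold
        self.offline_search_count = 0

    def on_audit(self, mtcalive_seen: bool) -> bool:
        """Return True once the node can be declared offline."""
        if mtcalive_seen:
            # Any mtcAlive seen during the search restarts the count.
            self.offline_search_count = 0
            return False
        self.offline_search_count += 1
        return self.offline_search_count >= self.offline_threshold
```

A slow shutdown that keeps emitting mtcAlive messages therefore keeps
pushing the offline declaration out until the messages stop for the full
threshold.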

This update also
1. Adjusts the reset retry cadence from 7 to 12 secs
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify issue does not occur over lock/unlock soak (100+)
      - where the same test without update did show issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Eric MacDonald 3f4c2cbb45 Mtce: Add ActionInfo extension support for reset operations.
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Board Management Controller) is provisioned.

The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level: "/redfish/v1/Systems".

The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.

However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.

This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.
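
The two-level learning described above can be sketched as follows. The
property names follow the Redfish schema ("ResetType@Redfish.AllowableValues",
"@Redfish.ActionInfo", and the ActionInfo "Parameters" / "AllowableValues"
lists). The function name and the `fetch` callable, which stands in for
whatever performs the second level query, are hypothetical; maintenance
itself learns these values through redfishtool, not through this code.

```python
from typing import Callable, List, Optional

RESET_ACTION = "#ComputerSystem.Reset"
INLINE_KEY = "ResetType@Redfish.AllowableValues"
ACTION_INFO_KEY = "@Redfish.ActionInfo"


def learn_reset_types(system: dict,
                      fetch: Callable[[str], dict]) -> Optional[List[str]]:
    """Return the allowable ResetType values, or None if none are found.

    system: decoded Systems member payload
    fetch:  hypothetical helper that GETs a URI and returns decoded JSON
    """
    action = system.get("Actions", {}).get(RESET_ACTION, {})

    # First level: values annotated directly on the Reset action.
    inline = action.get(INLINE_KEY)
    if inline:
        return list(inline)

    # Second level: follow the ActionInfo reference, if present.
    info_uri = action.get(ACTION_INFO_KEY)
    if not info_uri:
        return None
    info = fetch(info_uri)
    for param in info.get("Parameters", []):
        if param.get("Name") == "ResetType":
            values = param.get("AllowableValues")
            return list(values) if values else None
    return None
```

Returning None for a missing Parameters or AllowableValues list
corresponds to the connection failure cases listed in the test plan.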

For more information refer to the section 6.3 ActionInfo 1.3.0 of
the Redfish Data Model Specification link in the launchpad report.

Test Plan:

PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from newly added second
      level query from ActionInfo service.

Failure Handling: Significant failure path testing with this update

PASS: Verify Redfish protocol is periodically retried from start
      when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
      bm_type=dynamic but failed connect using redfish.
      Connection failures in the above cases include
      - redfish bmc root query fails
      - redfish bmc info query fails
      - redfish bmc load power/reset control actions fails
      - missing second level Parameters label list
      - missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi from failed and
      retried with bm_type=redfish after switch to bm_type=dynamic
      or bm_type=ipmi by sysinv update command.

Regression:

PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
      update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
      hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
      and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
      alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging

Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
2022-10-13 17:40:05 +00:00
Eric MacDonald aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process which
then segfaults for a similar reason.

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Eric MacDonald fd5dd4254a Clear bmc alarm over mtcAgent process restart for ALL system types
If a host's BMC is provisioned and the mtcAgent process
is restarted then remove the gating condition that avoids
clearing the BMC access alarm in AIO SX.

Change-Id: I0734c2203a7acaee27c40c3c0d259b4cc5726b5d
Closes-Bug: 1931906
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-06-14 16:46:41 -04:00
Eric MacDonald ba6c61584d Refactor background in-service start host services handling
The maintenance add_handler fsm loads inventory and recovers
host state over a process restart. If the active controller's
uptime is less than 15 minutes the restart event is treated as
a Dead Office Recovery (DOR) and is more forgiving to host
recovery by scheduling the 'start host services' as a
background operation so as to not hold up the add operation.

The current implementation of the background handling of
'start host services' is not handling the AIO subfunction
case properly in DOR mode as well as being difficult to
follow and therefore fix and maintain. This mishandling
leads to maintenance incorrectly failing the node with a
subfunction configuration error over the DOR case.

This update refactors the background handling of 'start host
services' to fix the issue and improve its clarity and
maintainability.

Test Cases:

PASS: Verify AIO DX DOR handling
PASS: Verify AIO DX active controller reboot handling
      - standby with uptime ; < 15 min and > 15 min
PASS: Verify AIO DX standby controller reboot handling
PASS: Verify subfunction configuration error handling

Regression:

PASS: Verify start host services wait/retry handling.
PASS: Verify start host services failure handling.
PASS: Verify DOR of Standard system
PASS: Verify DOR of AIO Plus system
PASS: Verify AIO System Install
PASS: Verify Standard System Install
PASS: Verify AIO plus system install

Change-Id: Ia4683672e3a2852b5b4837167b2dcd2a1e4e6d57
Closes-Bug: 1928095
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-05-11 12:25:27 -04:00
Eric MacDonald 48978d804d Improved maintenance handling of spontaneous active controller reboot
Performing a forced reboot of the active controller sometimes
results in a second reboot of that controller. The second reboot
was caused by the controller reporting an uptime greater than
10 minutes in its first mtcAlive message following the reboot.

Maintenance has a long standing graceful recovery threshold of
10 minutes. If a host loses heartbeat and enters Graceful
Recovery, and the uptime value extracted from the first
mtcAlive message following the recovery of that host exceeds 10
minutes, then maintenance concludes that the host did not reboot.
If a host goes absent for longer than this threshold then for
reasons not limited to security, maintenance declares the host
as 'failed' and force re-enables it through a reboot.

With the introduction of containers and addition of new features
over the last few releases, boot times on some servers are
approaching the 10 minute threshold and in this case exceeded
the threshold.

The primary fix in this update is to increase this long standing
threshold to 15 minutes to account for evolution of the product.
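
The uptime check described above can be sketched as follows. This is
an illustration only, not the mtcAgent code: the function name and
the constant are hypothetical, and the comparison is inferred from
the description (an uptime that exceeds the threshold is read as
"the host did not reboot").

```python
# Hypothetical constant; the description says the threshold was
# raised from 10 minutes to 15 minutes by this update.
GRACEFUL_RECOVERY_THRESHOLD_SECS = 15 * 60


def host_rebooted(uptime_secs: int,
                  threshold_secs: int = GRACEFUL_RECOVERY_THRESHOLD_SECS) -> bool:
    """Classify the uptime reported in the first mtcAlive after recovery.

    An uptime above the threshold is interpreted as the host never
    having rebooted; anything at or below it is treated as a reboot.
    """
    return uptime_secs <= threshold_secs


# A server that needs 11 minutes to boot is misread as "did not reboot"
# with the old 10 minute threshold, but is classified correctly with
# the new 15 minute threshold.
slow_boot_uptime = 11 * 60
print(host_rebooted(slow_boot_uptime, threshold_secs=10 * 60))  # False
print(host_rebooted(slow_boot_uptime))                          # True
```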

During the debug of this issue a few other related undesirable
behaviors related to Graceful Recovery were observed with the
following additional changes implemented.

 - Remove hbsAgent process restart in ha service management
   failover failure recovery handling. This change is in the
   ha git with a loose dependency placed on this update.
   Reason: https://review.opendev.org/c/starlingx/ha/+/788299

 - Prevent the hbsAgent from sending heartbeat clear events
   to maintenance in response to a heartbeat stop command.
   Reason: Maintenance receiving these clear events while in
           Graceful Recovery causes it to pop out of graceful
           recovery only to re-enter as a retry and therefore
           needlessly consumes one (of a max of 5) retry count.

 - Prevent successful Graceful Recovery until all heartbeat
   monitored networks recover.
   Reason: If heartbeat of one network, say cluster, recovers but
           another (management) does not, then it's possible the
           max Graceful Recovery Retries could be reached quite
           quickly, causing maintenance to fail the host and
           force a full enable with reboot.

 - Extend the graceful recovery handler's wait for the hbsClient
   ready event from 1 minute to the worker config timeout.
   Reason: To give the worker config time to complete before force
           starting the recovery handler's heartbeat soak.

 - Add Graceful Recovery Wait state recovery over process restart.
   Reason: Avoid double reboot of Gracefully Recovering host over
           SM service bounce.

 - Add requirement for a valid out-of-band mtce flags value before
   declaring configuration error in the subfunction enable handler.
   Reason: Rebooting the active controller can sometimes result in
           a falsely reported configuration error due to the
           subfunction enable handler interpreting a zero value as
           a configuration error.

 - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
   Reason: To assist log analysis and issue debug

Test Plan:

PASS: Verify handling active controller reboot
             cases: AIO DC, AIO DX, Standard, and Storage
PASS: Verify Graceful Recovery Wait behavior
             cases: with and without timeout, with and without bmc
             cases: uptime > 15 mins and 10 < uptime < 15 mins
PASS: Verify Graceful Recovery continuation over mtcAgent restart
             cases: peer controller, compute, MNFA 4 computes
PASS: Verify AIO DX and DC active controller reboot to standby
             takeover that has been up for less than 15 minutes.

Regression:

PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
PASS: Verify cluster network only heartbeat loss handling
             cases: worker and standby controller in all systems.
PASS: Verify Dead Office Recovery (DOR)
             cases: AIO DC, AIO DX, Standard, Storage
PASS: Verify system installations
             cases: AIO SX/DC/DX and 8 node Storage system
PASS: Verify heartbeat and graceful recovery of both 'standby
             controller' and worker nodes in AIO Plus.

PASS: Verify logging and no coredumps over all of testing
PASS: Verify no missing or stuck alarms over all of testing

Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
Closes-Bug: 1922584
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-30 15:35:53 +00:00
Eric MacDonald 7539d36c3f Prevent mtcClient from sending to uninitialized socket in AIO SX
The mtcClient will perform a socket reinit if it detects a socket
failure. The mtcClient also avoids setting up its controller-1
cluster network socket for the AIO SX system type ; because there
is no controller-1 provisioned.

Most AIO SX systems have the management/cluster networks set to
the 'loopback' interface. However, when an AIO SX system is setup
with its management and cluster networks on physical interfaces,
with or without vlan, the mtcAlive send message utility will try
to send to the uninitialized controller-1 cluster socket. This
leads to a socket error that triggers a socket reinitialization
loop which causes log flooding.

This update adds a check to the mtcAlive send utility to avoid
sending mtcAlive to controller-1 for AIO SX system type where
there is no controller-1 provisioned; no send, no error, no flood.
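
A minimal sketch of the kind of check described above, assuming a
simple system type enumeration. The names SystemType and
mtcalive_cluster_destinations are hypothetical and are not taken
from the maintenance code base.

```python
from enum import Enum
from typing import List


class SystemType(Enum):
    """Hypothetical system type values used only for this sketch."""
    STANDARD = "standard"
    AIO_DX = "aio-dx"
    AIO_SX = "aio-sx"


def mtcalive_cluster_destinations(system_type: SystemType) -> List[str]:
    """Return the controllers that should be sent mtcAlive on the cluster network.

    AIO SX has no controller-1 provisioned, so its socket is never set
    up and must not be used as a send target.
    """
    if system_type is SystemType.AIO_SX:
        return ["controller-0"]
    return ["controller-0", "controller-1"]


print(mtcalive_cluster_destinations(SystemType.AIO_SX))
print(mtcalive_cluster_destinations(SystemType.AIO_DX))
```

Skipping the send, rather than catching the resulting error, avoids
entering the socket reinitialization path in the first place.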

Since this update needed to add a system type check, this update
also implemented a system type definition rename from CPE to AIO.
Other related definitions and comments were also changed to make
the code base more understandable and maintainable.

Test Plan:

PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
PASS: Verify AIO SX locked-disabled-online state
PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)

Regression:

PASS: Verify AIO SX Lock and Unlock (lazy reboot)
PASS: Verify AIO DX and DC install with pv regression and sanity
PASS: Verify Standard system install with pv regression and sanity

Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
Closes-Bug: 1897334
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-04-21 10:20:10 -04:00
Eric MacDonald 031818e55b Add in-service test to clear stale config failure alarm
A configuration failure alarm can get stuck asserted if
that node experiences an uncontrolled reboot that recovers
without a configuration failure.

This update adds an in-service test that audits host health
while there is a configuration failure alarm raised and
clears that alarm if the failure condition goes away. This
could be a result of an in-service manifest that runs and
corrects the configuration or if the node reboots and comes
back up in a healthy (properly configured) state.

Fixed bug that was clearing config alarm severity state
when a heartbeat clear event is received.

This update also goes a step further and introduces an
alarms state audit that detects and corrects maintenance
alarm state mismatches.

Test Plan:

PASS: Verify the add handler loads config alarm state
PASS: Verify in-service test clears stale config alarm
PASS: Verify in-service test acts on new config failure
      ... degrade - active controller
      ... fail    - other hosts
PASS: Verify audit fixes mtce alarm state mismatches
PASS: Verify audit handles fm not running case
PASS: Verify audit handling behavior with valid alarm cases
PASS: Verify locked alarm management over process restart
PASS: Verify audit only logs active alarms list changes
PASS: Verify audit runs for both locked/unlocked nodes
PASS: Verify update as a patch

Regression:

PASS: Verify enable sequence config failure handling
PASS: ... active controller     - recoverable degrade
PASS: ... other nodes           - threshold fail
PASS: ... auto recovery disable - config failure
PASS: Verify mtcAgent process logging
PASS: Verify heartbeat handling and alarming
PASS: Verify Standard system install
PASS: Verify AIO system install

Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
Closes-Bug: 1918195
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-29 16:39:52 -04:00
Eric MacDonald 5c83453fdf Fix Graceful Recovery handling while in Graceful Recovery handling
The current Graceful Recovery handler is not properly handling
back-to-back Multi Node Failure Avoidance (MNFA) events.

There are two phases to MNFA:

 phase 1: waiting for the number of failed nodes to fall below
          mnfa_threshold as each affected node's heartbeat
          is recovered.
 phase 2: then a Graceful Recovery Wait period which is an
          11 second heartbeat soak to verify that a stable
          heartbeat is regained before declaring the MNFA
          event complete.

The Graceful Recovery Wait status has been seen to be left
uncleared (stuck) on one or more of the affected nodes if
phase 2 of MNFA is interrupted by another MNFA event ; aka
MNFA Nesting.

Although this stuck status is not service affecting it does leave
one or more nodes' host.task field, as observed under host-show,
with "Graceful Recovery Wait" rather than empty.

This update makes Multi Node Failure Avoidance (MNFA) handling
changes to ensure that, upon MNFA exit, the recovery handler
is properly restarted if MNFA Nesting occurs.

Two additional Graceful Recovery phase issues were identified
and fixed by this update.

 1. Cut Graceful recovery handling in half

    - Found and removed a redundant 11 second heartbeat soak
      at the very end of the recovery handler.
    - This cuts the graceful recovery handling time down from
      22 to 11 seconds thereby cutting potential for nesting
      in half.

 2. Increased supported Graceful Recovery nesting from 3 to 5

    - Found that some links bounce more than others so a nesting
      count of 3 can lead to an occasional single node failure.
    - This adds a bit more resiliency to MNFA handling of cases
      that exhibit more link messaging bounce.
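
The nesting limit can be illustrated with the short sketch below. It
is an assumption-laden model, not the recovery handler itself: the
function and constant names are hypothetical, and the boundary (5
nests recover gracefully, a 6th leads to a full enable) is taken
from the test plan of this update.

```python
# Hypothetical constant; raised from 3 to 5 by this update.
MAX_GRACEFUL_RECOVERY_NESTS = 5


def recovery_action(nest_count: int,
                    max_nests: int = MAX_GRACEFUL_RECOVERY_NESTS) -> str:
    """Choose the recovery action for the current nesting count."""
    if nest_count <= max_nests:
        return "graceful-recovery"
    return "full-enable"


for nests in (1, 5, 6):
    print(nests, recovery_action(nests))
```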

Test Plan: Verified 60+ MNFA occurrences across 4 different
           system types including AIO plus, Standard and Storage

PASS: Verify Single Node Graceful Recovery Handling
PASS: Verify Multi Node Graceful Recovery Handling
PASS: Verify Single Node Graceful Recovery Nesting Handling
PASS: Verify Multi Node Graceful Recovery Nesting Handling
PASS: Verify MNFA of up to 5 nests can be gracefully recovered
PASS: Verify MNFA of 6 nests lead to full enable of affected nodes
PASS: Verify update as a patch
PASS: Verify mtcAgent logging

Regression:

PASS: Verify standard system install
PASS: Verify product verification maintenance regression (4 runs)
PASS: Verify MNFA threshold increase and below threshold behavior
PASS: Verify MNFA with reduced timeout behavior for
      ... nested case that does not timeout
      ... case that does not timeout
      ... case that does timeout

Closes Bug: 1892877
Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-03-17 14:25:19 -04:00
Eric MacDonald 9ab726b0eb Add support for peer controller reset via mtcClient
This update adds the ability for SM to passively
request the mtcClient to BMC reset its peer controller
as a means to recover a severely loaded active controller.

To do this the mtcAgent is modified to keep each controller's
mtcClient updated with the BMC info of its peer.

The mtcClient is modified to audit for the SM signal
and, when it is asserted, issue a BMC reset of its peer
controller using an ipmitool system call.

The ability to command the peer mtcClient to 'sync'
prior to the BMC reset is implemented but configured
disabled for now.

Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
Partial-Bug: #1895350
Co-Authored-By: Bin.Qian@windriver.com
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-01-14 16:44:14 -05:00
Eric MacDonald 1350502720 Make Mtce Power-Off FSM verify power-off
If a host's BMC server accepts a power-off command without
error but does not actually power-off the host, the power-off
FSM reports success yet the host power is still on.

This update adds a verification component to the power-off
FSM. Once the power-off command is issued and succeeds at the
command level, the power-off FSM will now query power status
and retry the power-off command until the server is verified
to be powered-off or the retry max (10) is reached and the
power-off command is failed.
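
The verify-and-retry behavior can be sketched as below. This is a
simplified model under stated assumptions: the two callables stand
in for the BMC power-off command and power status query, the retry
max of 10 is treated as the total number of attempts, and the FSM's
timing between steps is omitted. None of the names come from the
maintenance code.

```python
from typing import Callable

POWER_OFF_RETRY_MAX = 10  # retry max quoted in the description


def power_off_verified(send_power_off: Callable[[], bool],
                       power_is_on: Callable[[], bool],
                       retry_max: int = POWER_OFF_RETRY_MAX) -> bool:
    """Issue power-off and confirm the host is actually powered off.

    send_power_off returns True when the BMC accepts the command.
    power_is_on returns True while the host still reports power on.
    Returns True once power-off is verified, False if attempts run out.
    """
    for _ in range(retry_max):
        if not send_power_off():
            continue  # command-level failure; use up one attempt
        if not power_is_on():
            return True  # verified powered off
    return False


# Fake BMC that accepts every command but only powers off on the
# fourth request.
requests = {"count": 0}


def fake_send() -> bool:
    requests["count"] += 1
    return True


def fake_power_is_on() -> bool:
    return requests["count"] < 4


print(power_off_verified(fake_send, fake_power_is_on))  # True
```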

Test Plan:

PASS: Verify 200+ Mtce Power Off/On cycles (ipmi & redfish)
PASS: Verify 100+ Mtce Reinstalls with FIT (ipmi & redfish)

Change-Id: Iddd120d89d1152fc0b26915df123f586c38b909b
Closes-Bug: 1865087
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-22 13:38:33 +00:00
Eric MacDonald 1196056612 Disable Redfish BMC audit and improve reinstall failure handling
The Mtce Reinstall Handler can collide with the BMC Redfish
audit resulting in reinstall failure. The BMC handler's 2 minute
connection audit can collide with other BMC commands.

The reinstall handler, with 4 bmc command operations, is
particularly susceptible.

Two additional bmc communication improvements are implemented:

1. Add 'retry' handling to all BMC requests in the Maintenance
   Reinstall Handler FSM to handle transient command failures.

   Note: There are already retries to all but the power status
   query and the netboot requests in that handler and retries
   in other administrative commands that involve bmc requests.

2. Switch BMC power control command management from 'static' to
   'learned' lists. Some BMCs don't support both graceful and
   immediate power commands; Graceful Restart and Force Restart.
   To remove the possibility of using an unsupported BMC command,
   this update switches from static to learned power command lists
   with a log produced if a server is missing command support.

   Power commands escalate from graceful to immediate in the
   presence of retries.
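
The escalation from graceful to immediate commands over learned
lists might look like the sketch below. It is illustrative only:
select_reset_command is a hypothetical helper, the command strings
follow the "Graceful Restart" and "Force Restart" names mentioned
above, and falling back to the other list when the preferred one is
empty is an assumption rather than documented behavior.

```python
from typing import Optional, Sequence


def select_reset_command(graceful: Sequence[str],
                         immediate: Sequence[str],
                         retries: int) -> Optional[str]:
    """Pick a reset command from the lists learned from the BMC.

    The first attempt prefers a graceful command; any retry prefers an
    immediate one. If the preferred list is empty the other list is
    used (assumed fallback). Returns None if nothing was learned.
    """
    if retries == 0:
        preferred, fallback = graceful, immediate
    else:
        preferred, fallback = immediate, graceful
    for commands in (preferred, fallback):
        if commands:
            return commands[0]
    return None


learned_graceful = ["GracefulRestart"]
learned_immediate = ["ForceRestart"]
print(select_reset_command(learned_graceful, learned_immediate, retries=0))
print(select_reset_command(learned_graceful, learned_immediate, retries=1))
```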

Test Cases:

PASS: Verify bmc handler redfish audit is disabled
PASS: Verify reinstall soak using redfish
PASS: Verify reinstall netboot and power status retry handling
PASS: Verify all power control commands using redfish
PASS: Verify graceful operations are used if available
PASS: Verify immediate operations are used for retries

Regression:

PASS: Verify bmc ping audit success and failure handling

PASS: Verify Reset        Handling soak (redfish and ipmi)
PASS: Verify Power-Off/On Handling soak (redfish and ipmi)
PASS: Verify Reinstall    Handling soak (redfish and ipmi)
PASS: Verify Standard System Install    (redfish and ipmi)
PASS: Verify AIO DX   System Install    (redfish and ipmi)

PASS: Verify this update as a patch

Change-Id: Idb484512ccb1b16e2d0ea9aff4ab7965347b1322
Closes-Bug: 1880578
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-11-16 15:15:22 +00:00
Eric MacDonald 16fcba3976 Remove 15 second delay following Swact Request status update
The Maintenance swact handler posts swact actions to the host
controller's task status field. The Swact Request posting was
followed by a 15 second wait so that Horizon would be
displaying "Swact: Request" while the swact occurred.

Unfortunately, this delayed the actual swact request for that
entire wait period thereby adding 15 seconds to the overall
manual swact operation.

Since it's better to run the swact faster than to wait for
the status, this update removes that delay at the risk that the
"Swact: Request" status is not displayed prior to the swact
taking place.

Change-Id: I635c896327dca2312efbe02dec67d3e920fa3e90
Closes-Bug: 1895767
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-09-21 12:29:42 -04:00
Eric MacDonald da7b2e94f1 Modify Mtce Reinstall FSM to first power-off BMC provisioned hosts
This update only applies to servers that support and are provisioned
for Board Management Control (BMC).

The BMCs of some servers silently reject the 'set next boot device'
command while the server is executing BIOS.

The current reinstall algorithm when the BMC is provisioned starts by
detecting the power state of the target server. If the power is off
it will 'first power it on' and then proceed to 'set next boot device'
to pxe followed by a reset. For the initial power off state case, the
timing of these operations is such that the server is in BIOS when the
'set next boot device' command is issued.

This update modifies the host reinstall algorithm to first power-off
a server followed by setting the next boot device while the server is
confirmed to be powered off, then powered on. This ensures the server
gets and handles the set next boot device command operation properly.

This update also fixes a race condition between the bmc_handler and
power_handler by moving the final power state update in the power
handler to the power done phase.

Test Plan:

Verify all new reinstall failure path handling via fault insertion testing
Verify reinstall of powered off host
Verify reinstall of powered on host
Verify reinstall of Wildcat server with ipmi
Verify reinstall of Supermicro server with ipmi and redfish
Verify reinstall of Ironpass server with ipmi
Verify reinstall of WolfPass server with redfish and ipmi
Verify reinstall of Dell server with ipmi

Over 30 reinstalls were performed across all server types, with initial
power on and off using both ipmi and redfish (where supported).

Change-Id: Iefb17e9aa76c45f2ceadf83f23b1231ae82f000f
Closes-Bug: 1862065
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-02-12 15:44:26 +00:00
Eric MacDonald 9bf231a286 Fix BMC access loss handling
Recent refactoring of the BMC handler FSM introduced a code change that
prevents the BMC Access alarm from being raised after initial BMC
accessibility was established and is then lost.

This update ensures BMC access alarm management is working properly.

This update also implements ping failure debounce so that a single ping
failure does not trigger full reconnection handling. Instead that now
requires 3 ping failures in a row. This has the effect of adding a minute
to ping failure action handling before the usual 2 minute BMC access failure
alarm is raised. Ping failure logging is also reduced/improved.
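
A minimal sketch of the ping failure debounce, assuming a simple
consecutive-failure counter. The PingDebounce class is hypothetical
and does not mirror the pingUtil implementation; only the threshold
of 3 failures in a row comes from the description above.

```python
class PingDebounce:
    """Trigger reconnection only after N consecutive ping failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ping_ok: bool) -> bool:
        """Record one ping result.

        Returns True when full reconnection handling should start.
        A successful ping resets the failure count.
        """
        if ping_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


debounce = PingDebounce()
results = [debounce.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

A single miss, or two misses followed by a success, never reaches
the threshold, so only a persistent failure starts reconnection.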

Test Plan: for both hwmond and mtcAgent

PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
PASS: Verify BMC ping failure debounce handling, recovery and logging
PASS: Verify BMC ping persistent failure handling
PASS: Verify BMC ping periodic miss handling
PASS: Verify BMC ping and access failure recovery timing
PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

Regression:

PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
PASS: Verify BMC power-off request handling with BMC ping failing & recovering
PASS: Verify BMC power-on request handling with BMC ping failing & recovering
PASS: Verify BMC reset request handling with BMC ping failing & recovering
PASS: Verify BMC sensor group read failure handling & recovery
PASS: Verify sensor monitoring after ping failure handling & recovery

Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
Closes-Bug: 1858110
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-01-03 09:34:37 -05:00
Eric MacDonald c4b8171ddd Refactor BMC provisioning in Maintenance
The current mechanism used to preserve the learned bmc protocol in
the filesystem on the active controller is problematic over swact.

This update removes the file storage method in favor of preserving
the learned protocol in the system inventory database as a key/value
pair at the host level in the already existing mtce_info database field.

The specified or learned bmc access protocol is then shared with the
hardware monitor through inter-daemon maintenance messaging.

This update refactors bmc provisioning to accommodate bmc protocol
selection at the host rather than system level. Towards that this
update removes system level bmc_access_method selection in favor of
host level selection through bm_type. A bm_type of 'bmc' specifies
that the bmc access protocol for that host be learned. This has the
effect of making it the same as what is delivered today but without
support for changing it at the system level.

A system inventory update will be delivered shortly that enables bmc
access protocol selection at the host level. That update allows the
customer to specify the bmc access protocol at the host level to be
either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
That system inventory update delivers that information to maintenance
through bm_type via bmc provisioning. Until that update is delivered
bm_type always comes in as 'bmc' which gets interpreted as 'dynamic'
to maintain existing configuration.
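
The bm_type handling described above can be sketched as follows. The
function name and the use of None for "not yet learned" are
assumptions made for illustration; only the bm_type values ('bmc',
'dynamic', 'redfish', 'ipmi') come from the text.

```python
from typing import Optional


def resolve_bmc_protocol(bm_type: str,
                         learned: Optional[str] = None) -> Optional[str]:
    """Return the BMC access protocol to use for a host.

    'redfish' and 'ipmi' force that protocol. 'bmc' is interpreted the
    same as 'dynamic': use the previously learned protocol, or None if
    it still has to be learned.
    """
    if bm_type in ("redfish", "ipmi"):
        return bm_type
    if bm_type in ("bmc", "dynamic"):
        return learned
    raise ValueError(f"unsupported bm_type: {bm_type!r}")


print(resolve_bmc_protocol("ipmi", learned="redfish"))  # ipmi
print(resolve_bmc_protocol("bmc", learned="redfish"))   # redfish
print(resolve_bmc_protocol("bmc"))                      # None
```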

The following additional issues were also fixed in this update.

1. The nodeTimers module defaults the 'ring' member of timers that are
   not running to false but should be true.

2. Added a pingUtil_restart function to facilitate quicker sensor
   monitoring following provisioning changes and bmc access failures.

3. Enhanced the hardware monitor sensor grouping filter to accommodate
   non-standard Redfish readout labelling so that more sensors fall
   into the existing canned groups ; leads to more monitored sensors.

4. Added a 'http security mode' to hardware monitor messaging. This
   defaults to https as that is all that is supported by the Redfish
   implementation today. This field can be used to specify non-secure
   'http' mode in the future when that gets implemented.

5. Ensure the hardware monitor performs a bmc password re-fetch on every
   provisioning change.

Test Plan:

PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
PASS: Verify bmc provisioning of ipmi and redfish servers
PASS: Verify learned bmc protocol persists over process restart and swact
PASS: Verify process startup with protocol already learned

Hardware Monitor:

PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
PASS: Verify sensor model delete and relearn over ip address change
PASS: Verify sensor model delete and relearn over bm_type change
PASS: Verify sensor model not relearned over username change
PASS: Verify bm pw is re-fetched over any (re)provisioning change
PASS: Verify bmc re-provisioning soak (test-bmc-reprovisioning.sh 50 loops)
PASS: Verify protocol change handling, file cleanup, model recreation
PASS: Verify End-2-End behavior for bm_type change from redfish to ipmi
PASS: Verify End-2-End behavior for bm_type change from ipmi to redfish
PASS: Verify End-2-End behavior for bm_type change from redfish to dynamic
PASS: Verify End-2-End behavior for bm_type change from ipmi to dynamic
PASS: Verify End-2-End behavior for bm_type change from dynamic to ipmi
PASS: Verify End-2-End behavior for bm_type change from dynamic to redfish
PASS: Verify sensor model creation waits for server power to be on
PASS: Verify sensor relearn by provisioning change during model creation. (soak)

Regression:

PASS: Verify host power off and on.
PASS: Verify BMC access alarm handling (assert and clear)
PASS: Verify mtcAgent and hwmond logs add value
PASS: Verify no core dumps / seg faults.
PASS: Verify no mtcAgent and hwmond memory leak.
PASS: Verify delete of BMC provisioned host
PASS: Verify sensor monitoring, alarming, degrade and then clear cycle
PASS: Verify static analysis report of changed modules.
PASS: Verify host level bm_type=bmc functions as would dynamic selection
PASS: Verify batch provisioning and deprovisioning (7 nodes)
PASS: Verify batch provisioning to different protocol (5 nodes)
PASS: Verify handling of flaky Redfish responses

PEND: Verify System Install

Change-Id: Ic224a9c33e0283a611725b33c90009132cab3382
Closes-Bug: #1853471
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-12-09 09:39:49 -05:00
Zuul 069daf1e22 Merge "Add mtcAgent support for sm_node_unhealthy condition" 2019-10-16 19:15:03 +00:00
Eric MacDonald 675f49d556 Add mtcAgent support for sm_node_unhealthy condition
When heartbeat over both networks fails, mtcAgent
provides a 5 second grace period for heartbeat to
recover before failing the node.

However, when heartbeat fails over only one of the
networks (management or cluster) the mtcAgent does
not honour that 5 second grace period ; a bug.

When it comes to peer controller heartbeat failure
handling, SM needs that 5 second grace period to handle
swact before mtcAgent declares the peer controller as
failed, resets the node and updates the database.

This update implements a change that forces a 2 second
wait time between each fast enable and fixes the fast
enable threshold count to be the intended 3 retries.
This ensures that at least 5 seconds, actually 6 in
the case of single network heartbeat loss, passes
before declaring the node as failed.
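
The timing claim above is simple arithmetic, shown here as a small
check. The constant names are hypothetical; the values (2 second
wait, 3 retries, 5 second grace period) are the ones quoted in this
description.

```python
FAST_ENABLE_WAIT_SECS = 2   # forced wait between fast enables
FAST_ENABLE_RETRIES = 3     # intended fast enable threshold count
SM_GRACE_PERIOD_SECS = 5    # time SM needs to handle swact


def min_secs_before_failure(wait_secs: int, retries: int) -> int:
    """Minimum time spent in fast enable retries before failing the node."""
    return wait_secs * retries


elapsed = min_secs_before_failure(FAST_ENABLE_WAIT_SECS, FAST_ENABLE_RETRIES)
print(elapsed)                          # 6
print(elapsed >= SM_GRACE_PERIOD_SECS)  # True
```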

In addition to that, a special condition is added to
detect and stop work if the active controller is
sm_node_unhealthy. We don't want mtcAgent to make
any database updates while in this failure mode.
This gives SM the time to handle the failure
according to the system's controllers' high
availability handling feature.

Test Plan:

PASS: Verify mtcAgent behavior on set and clear of
      SM node unhealthy state.
PASS: Verify SM has at least 5 seconds to shut down
      mtcAgent when heartbeat to peer controller fails
      for one or both networks.
PASS: Test real case scenario with link pull.
PASS: Verify logging in presence of real failure condition.

Change-Id: I8f8d6688040fe899aff6fc40aadda37894c2d5e9
Closes-Bug: 1847657
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-15 15:24:34 -04:00
Eric MacDonald df50847580 Ensure hbsClient ready event is cleared over a reboot.
A host sometimes (rarely) fails heartbeat immediately
following unlock.

The hbsClient sends its ready event every 5 seconds.
Mtce uses this event message as a clue that the target
host is ready to start heartbeat following Graceful
Recovery or in this case Enable sequence.

This update fixes a potential race condition where the
hbsClient ready event snuck through immediately following
the unlock reboot. This tricked mtc into starting heartbeat
too early following the online event that follows a reboot
which led to a heartbeat failure.

Test Plan:
PASS: compute system install
PASS: standby controller lock/unlock soak (25 loops)
PASS: 2 compute async locked/unlock soak (50 loops each)

Regression:
PASS: inservice heartbeat failure detection and handling

Change-Id: I21699dbb2f0ab7355a9384d78b47a1fd1cea496d
Closes-Bug: 1847656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-15 18:52:42 +00:00
Eric MacDonald 4c541f50d4 Maintenance Redfish support useability enhancements.
This update is a result of changes made during a suite of
end-to-end provisioning, reprovisioning and deprovisioning
customer experience testing of the maintenance RedFish support
feature.

1. Force reconnection and password fetch on provisioning changes
2. Force reconnection and password fetch on persistent connection failures
3. Fix redfish protocol learning (string compare) in hardware monitor
4. Improve logging for some typical error paths.

Test Plan:

PASS: Verify handling of reprovisioning BMC between hosts that support
             different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
             different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
             the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
             has been created using that protocol.
PASS: Verify handling of BMC reprovision - ip address change only
PASS: Verify handling of BMC reprovision - username change only
FAIL: Verify handling of BMC reprovision - password change only
             https://bugs.launchpad.net/starlingx/+bug/1846418

Change-Id: I4bf52a5dc3c97d7794ff623c881dff7886234e79
Closes-Bug: #1846212
Story: 2005861
Task: 36606
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-10-03 11:57:58 -04:00
Eric MacDonald df9343b0cc Add redfish power/reset/reinstall bmc support to maintenance
This update delivers redfish support for Power-On/Off, Reset
and Netboot Reinstall handling to maintenance.

Test Plan: (Testing Continues)

PASS: Verify Redfish Power-Off action handling
PASS: Verify Redfish Power-On action handling
PASS: Verify Redfish Reset action handling
PASS: Verify compute Redfish Reinstall action handling from controller-0
PASS: Verify compute Redfish Reinstall action handling from controller-1
PASS: Verify Redfish Power-Off Action failure handling
PASS: Verify Redfish Power-On action failure handling
PASS: Verify Redfish Reset action failure handling
PASS: Verify Redfish Re-Install action failure handling
PASS: Verify Reset progression cycle does not leak memory.
PASS: Verify bmc_handler failure handling does not leak memory.
PASS: Verify Inservice BMC access (ping) failure and recovery handling.
PASS: Verify BMC access failure alarm handling
PASS: Verify BMC provisioning and deprovisioning soak (redfish - wolfpass)
PASS: Verify BMC provisioning and deprovisioning does not leak memory.
PASS: Verify BMC provisioning handling with bad ip and/or bad username
PASS: Verify BMC reprovisioning to same protocol
PASS: Verify BMC reprovisioning from ipmi host to redfish host
PASS: Verify BMC reprovisioning from redfish host to ipmi host
PASS: Verify mixed protocol support in same lab
PASS: Verify mixed server support in same lab
PASS: Verify Large System Install with BMCs provisioned (wp8-12)
PASS: Verify bmc access method (learn,ipmi,redfish) learned from mtc.init
PASS: Verify Swact with BMCs provisioned.
PASS: Verify no segfaults.
PASS: Verify AIO System Install in lab that supports redfish (WC3-6, WP8-12, Dell 720 3-7)
PASS: Verify AIO Simplex Install with Redfish Support (SM1, SM3)
PASS: Verify AIO Duplex Install with Redfish Support (SM 5-6, Dell 720 1-2)

Useability:

PASS: Verify handling of reprovisioning BMC between hosts that support
             different protocols.
PASS: Verify handling of reprovisioning ip address to host that leads to a
             different protocol select.
PASS: Verify manual relearn handling to recover from errors that result from
             the above case.
PASS: Verify host BMC deprovisioning handling and cleanup.
PASS: Verify sensor monitoring.
PASS: Verify fault insertion for both protocols and action handling.
PASS: Verify protocol select handover.
PASS: Verify hwmond sticks with a selected protocol once a sensor model
             has been created using that protocol.
PASS: Verify handling of missing bmc_access_method configuration select.
PASS: Verify inservice bmc_access_method service parameter modification handling.

Regression:

PASS: Verify redfish BMC info query logging.
PASS: Verify sensor monitoring and alarming still works.
PASS: Verify all power/reset/netboot commands for IPMI
PASS: Verify reprovisioning soak of Wolfpass servers
PASS: Verify reprovisioning soak of SM servers

Depends-on: https://review.opendev.org/#/c/679178/
Change-Id: I984057e04d7426e37d675cf4d334a4e35419f2e8
Story: 2005861
Task: 35826
Task: 36606
Task: 36467
Task: 36456
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-26 15:59:35 -04:00
Eric MacDonald 0d63a16d8d Improve BMC password first fetch handling in hwmon
Trying to get the BMC password through barbican before
the ping succeeds leads to an early bmc access lost
failure that
 1. produces a misleading bmc access lost failure log ;
    bmc access had not even been established yet.
 2. imposes a retry wait that delays re-establishing
    bmc access and therefore overall sensor monitoring.

This update also

  1. adds hostname to some of the secretUtil API
     interfaces so that logs are reported against the
     correct host rather than always the current
     controller hostname.

  2. changes some success path logging to dlogs to
     reduce log noise.

  3. simplifies a ping ok log.

Change-Id: Ib3b7de212294d6dc350ee17d363f4009b3b0dcb0
Story: 2005861
Task: 36595
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-17 18:57:08 +00:00
zhipengl 67d4ba105f Redfish support for Sensor Monitoring in hwmond
Add redfish hwmon thread function and related parse function
for Power and Thermal sensor data.
Remove some unused old functions.
Rename common functions and variables with a bmc prefix.

Test done for this patch on simplex bare metal setup.
system host-sensor-list
system host-sensor-show
system host-sensorgroup-list
system host-sensorgroup-show
system host-sensorgroup-relearn

Story: 2005861
Task: 35815

Depends-on: https://review.opendev.org/#/c/671340
Change-Id: If8a35581d44df15749a049eda945f23d2323fd35
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2019-09-12 01:56:42 +08:00
Zuul 1e7b6aa8d9 Merge "Supress implicit-fallthrough warnings." 2019-09-09 19:34:29 +00:00
Eric MacDonald 4d2383818f Add bmc protocol select to maintenance
This update adds BMC Info Query command handling and
info logging to maintenance.

Examples of the logs produced by the BMC Query are:

  compute-2 manufacturer is Intel Corporation
  compute-2 model number:<str>  part number:<str>  serial number:<str>
  compute-2 BIOS firmware version is SE5C620.86B.00.01.0013.030920180427
  compute-2 BMC  firmware version is unavailable
  compute-2 power is on
  compute-2 has 2 processors
  compute-2 has 192 GiB of memory

Please note that the default protocol remains IPMI even
if Redfish support is detected. This is because
power/reset/netboot control for Redfish has not yet
been implemented.

Test Plan:

PASS: Verify redfish BMC info query logging.
PASS: Verify IPMI remains the default selected protocol.

Regression:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reboot command handling.
PASS: Verify reinstall (netboot) command handling.

Change-Id: I654056119018a1751a70495e3df8b541d9e00b93
Story: 2005861
Task: 35826
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-09-08 14:14:15 -04:00
Erich Cordoba 7af0a828c1 Supress implicit-fallthrough warnings.
In some parts of the mtce code some implicit fallthrough are used.
This causes a warning in the compiler and in OSes like openSUSE the
-Werror flag is enforced leading to a build error.

In this commit the MTCE_FALLTHROUGH macro is used to tell the
compiler not to worry about these implicit fallthroughs, as the code
works as intended.
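
A minimal sketch of how such a marker can be written so that it builds on both old and new compilers. The actual MTCE_FALLTHROUGH definition is not shown in this log, so DEMO_FALLTHROUGH and stages_run_from below are hypothetical stand-ins used only for illustration:

```cpp
#include <cassert>

// Hypothetical stand-in for a fallthrough marker; this is an assumption,
// not the actual mtce definition. GCC 7 and later understand the
// statement attribute; older compilers get a harmless no-op.
#if defined(__GNUC__) && (__GNUC__ >= 7)
#define DEMO_FALLTHROUGH __attribute__ ((fallthrough))
#else
#define DEMO_FALLTHROUGH ((void)0)
#endif

// Counts how many stages run when entering at 'stage'. Each case
// intentionally falls through into the next one.
int stages_run_from(int stage)
{
    assert(stage >= 0);
    int count = 0;
    switch (stage)
    {
        case 0:
            count++;
            DEMO_FALLTHROUGH;
        case 1:
            count++;
            DEMO_FALLTHROUGH;
        case 2:
            count++;
            break;
        default:
            break;
    }
    return count;
}
```

Entering at stage 0 runs all three stages, while entering at stage 2 runs only the last one. The marker documents that intent to both the compiler and the reader.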

Change-Id: I608d80eaa7298d0613ffa62ee82e03463d193d87
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-08-26 12:24:22 -05:00
Eric MacDonald 804ec52227 Add redfish support detection to maintenance
This update

1. Refactors some of the common maintenance ipmi
   definitions and utilities into a more generic
   'bmcUtil' module to reduce code duplication and
   improve code reuse with the introduction of a second
   bmc communication protocol ; redfish.

2. Creates a new 'redFishUtil' module similar to the existing
   'ipmiUtil' module but in support of common redfish
   utilities and definitions that can be used by both
   maintenance and the hardware monitor.

3. Moves the existing 'mtcIpmiUtil' module to a more common
   'mtcBmcUtil' and renames the 'ipmi_command_send/recv' to
   the more generic 'bmc_command_send/recv' which are enhanced
   to support both ipmi and redfish bmc communication methods.

4. Renames the bmc info collection and connection monitor ;
   'bm_handler' to 'bmc_handler' and adds support necessary
   to learn if a host's bmc supports redfish.

5. Renames the existing 'mtcThread_ipmitool' to a more common
   'mtcThread_bmc' and adds redfishtool support for the now common
   set of bmc thread commands, along with the new
   redfishtool bmc query, aka 'redfish root query', used to
   detect if a host's bmc supports redfish.

   Note: This aspect is the primary feature of this update.

         Namely the ability to detect and print a log indicating
         if a host's bmc supports redfish.

Test Plan:

PASS: Verify sensor monitoring and alarming still works.
PASS: Verify power-off command handling.
PASS: Verify power-on command handling.
PASS: Verify reset command handling.
PASS: Verify reinstall (netboot) command handling.
PASS: Verify logging when redfish is not supported.
PASS: Verify logging when redfish is supported.
PASS: Verify ipmitool is used regardless of redfish support.
PASS: Verify mtce thread error handling for both protocols.

Change-Id: I72e63958f61d10f5c0d4a93a49a7f39bdd53a76f
Story: 2005861
Task: 35825
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-08-19 14:03:37 +00:00
Eric MacDonald 62532a7eac Fix maintenance cluster-host messaging
Maintenance's success path messaging does not depend on cluster
network messaging. However, there are a number of failure mode
cases that do depend on cluster network messaging to properly
diagnose and offer a higher availability handling for some
failure cases.

For instance, when the management interface goes down, without cluster
network messaging remote hosts can be isolated. Being able to command-
reboot a host over cluster-host network offers higher availability.

Maintenance is designed to use the cluster network, if provisioned, as a
backup path for mtcAlive, node locked, reboot and several other commands
and acknowledgements.

Unfortunately, it was recently observed that maintenance is using
the 'nfs-controller' label to resolve cluster network addressing
which resolves to management network IPs. As a result all messages
intended to be going over the cluster-host network are instead just
redundant management network messages.

During debug of this issue several additional cluster network
messaging related issues were observed and fixed.

This update implements the following fixes

1. since there is no floating address for the cluster network the
   mtcClient was modified to send messages to both controllers where
   only the active controller will be listening and acting.
2. fixes port number mtce listens for cluster-host network messages
3. fixes port number mtce sends cluster-host network messages to.
4. mtcAlive messages are also sent on provisioned cluster network.
5. locked state notifications and acks sent on provisioned cluster network.
6. reboot request and acks sent on provisioned cluster network.
7. fixed command acknowledgement messaging.

This update also

1. envelopes the mtcAlive gate control to allow debug tracing of all gate
   state changes.
2. moves graceful recovery handling heartbeat failure state clear to the
   end of the recovery handler, just before heartbeat start.
3. adds sm unhealthy support to fail and automatically recover the
   inactive controller from an SM UNHEALTHY state.

----------
Test Plan:
----------

Functional:

PASS: Verify management network messaging
PASS: Verify cluster-host network messaging
PASS: Verify cluster-host messages with tcpdump
PASS: Verify cluster-host network mtcAlive messaging
PASS: Verify reboot request and ack reply over management network
PASS: Verify reboot request and ack reply over cluster-host network
PASS: Verify lock state notification and ack reply over management network
PASS: Verify lock state notification and ack reply over cluster-host network
PASS: Verify acknowledgement messaging
PASS: Verify maintenance daemon logging
PASS: Verify maintenance socket initialization

System:

PASS: Verify compute system install
PASS: Verify AIO system install

Feature:

PASS: Verify sm node unhealthy handling (active:ignore, inactive:recover)

Change-Id: I092596d3e22438dd8a613a073614c188f6f5721d
Closes-Bug: #835268
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-07-18 14:54:45 -04:00
Zuul b6fd92cada Merge "Change enabled_count variable from bool to int in nodeLinkClass" 2019-06-24 17:20:25 +00:00
Zuul fb01944750 Merge "Revert "Supress implicit-fallthrough warnings."" 2019-06-20 23:07:30 +00:00
Erich Cordoba 05a7f2f4d3 Revert "Supress implicit-fallthrough warnings."
This reverts commit 717f285c64.

This proposed change is not compatible with the old GCC 4.3; a better patch that is compatible with both gcc versions is needed.

Change-Id: I657a4b92f1d30dac16d3a611baec8140ade3ad80
2019-06-20 21:56:00 +00:00
Zuul 04495f5015 Merge "Supress implicit-fallthrough warnings." 2019-06-20 18:44:47 +00:00
Erich Cordoba 1292fd8ca0 Change enabled_count variable from bool to int in nodeLinkClass
The enabled_count member inside nodeLinkClass has a declared bool
datatype, but this variable is used as a counter, where an int data
type is a better choice. This change updates the datatype to int.
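
A small illustration of why a bool makes a poor counter. The two structs and helper functions below are hypothetical and only mirror the member name from this commit; they are not the nodeLinkClass code:

```cpp
#include <cassert>

// Hypothetical miniature of the problem: the same member name declared
// once as bool and once as int.
struct BoolCounter { bool enabled_count = false; };
struct IntCounter  { int  enabled_count = 0; };

// Adds one per event to a bool member. The sum is converted back to
// bool on every assignment, so the value can never exceed 1.
int count_with_bool(int events)
{
    assert(events >= 0);
    BoolCounter c;
    for (int i = 0; i < events; i++)
        c.enabled_count = c.enabled_count + 1;
    return c.enabled_count;
}

// Adds one per event to an int member, which counts as expected.
int count_with_int(int events)
{
    assert(events >= 0);
    IntCounter c;
    for (int i = 0; i < events; i++)
        c.enabled_count = c.enabled_count + 1;
    return c.enabled_count;
}
```

With three events the bool version reports 1 while the int version reports 3, so any comparison against a threshold above 1 could never succeed with the bool type.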

Also, there was a misleading-indentation warning in mtcNodeHdlrs.cpp,
which is also fixed in this change.

Change-Id: Ib154c5b6ae2e7068870733b5ee8971e20cedb43f
Story: 2005862
Task: 34163
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-06-17 16:59:27 -05:00
Erich Cordoba 717f285c64 Supress implicit-fallthrough warnings.
In some parts of the mtce code some implicit fallthrough are used.
This causes a warning in the compiler and in OSes like openSUSE the
-Werror flag is enforced leading to a build error.

In this commit the __attribute__ ((fallthrough)) is used to tell the
compiler not to worry about these implicit fallthroughs, as the code
works as intended.

Change-Id: I219ff9d490f3f86ad045e0f0e891f40467baaf06
Story: 2005862
Task: 33667
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-06-17 16:30:35 -05:00
Saul Wold 67025c3bb2 metal: Convert wrsroot -> sysadmin
This also changes the group wrs_protected to sys_protected
to de-brand the user and group names.

Depends-On: I887464a20fc17d66529caea03be2b445156f9426
Change-Id: Icfd2faec0ba8236762c8045f5c244eaf13008ee4
Story: 2004716
Task: 28749
Signed-off-by: Saul Wold <sgw@linux.intel.com>
2019-06-14 15:12:03 -07:00
Eric MacDonald 1011fd8a1a Add network boot support to mtce reinstall handling
The current maintenance 'reinstall' handling requires
a host to be booted and online in order to perform
a reinstall by asking the mtcClient to wipe the
disks and self reboot thereby forcing a network boot
and reinstall.

This re-install process is problematic for hosts that
don't install properly and never come online or on new
system installs where the existing boot image on disk
is still valid ; local disk as the first boot device.

Getting around these issues prior to this update
required manual BIOS intervention to force-select
a network boot.

This update continues to support the online-wipedisk
method for hosts that are not BMC provisioned and
adds offline reinstall support through IPMI commands
for hosts that are BMC provisioned.

For hosts that have the BMC provisioned, the re-install
handler will wait up to 10 minutes for maintenance to
establish connectivity to the BMC if it has not already.

Then it will issue a devboot pxe IPMI command to tell the
BMC to boot from the network on the 'next' reset and then
maintenance proceeds to reset that host by a second IPMI
command. This way the host will boot from the network and
perform a local install even if the current image on disk
is valid. No manual BIOS actions required.
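
The ordering described above can be sketched as a simple step planner. The function name, its parameters and the step strings are hypothetical labels for illustration; they are not mtce identifiers or log text:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the ordered reinstall steps for a host, following the
// description in this commit message. All names here are assumptions.
std::vector<std::string> reinstall_steps(bool bmc_provisioned,
                                         bool bmc_accessible)
{
    std::vector<std::string> steps;
    if (!bmc_provisioned)
    {
        // Online method: requires the host to be booted and online.
        steps.push_back("request wipedisk and self reboot via mtcClient");
    }
    else
    {
        // Offline method: only wait if BMC access is not yet established.
        if (!bmc_accessible)
            steps.push_back("wait up to 10 minutes for BMC access");
        steps.push_back("send netboot (pxe) request to BMC");
        steps.push_back("send reset request to BMC");
    }
    assert(!steps.empty());
    return steps;
}
```

The netboot request comes before the reset so that the boot device override is already in place when the host restarts.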

This update requires a small system inventory update to
relax the online requirement for BMC provisioned hosts so
that the reinstall can proceed. That update depends on this.

This update also does some minor cleanup in the unused
mtcAgent test head to fix a static analysis error.

Test Plan:
With BMC Test Cases: Success
----------------------------
PASS: Verify install requiring power on with valid image on disk ; pass case
PASS: Verify install while powered on but offline with invalid image on disk ; pass case
PASS: Verify install while powered on but offline with valid image on disk ; pass case
PASS: Verify install with UEFI boot
PASS: Verify BMC Reinstall on Dell (720)
PASS: Verify BMC Reinstall on WC
PASS: Verify BMC Reinstall on HP (hp380)
PEND: Verify BMC Reinstall on SM
PEND: Verify BMC Reinstall on WP
COND: Verify install Secure boot - 430 1-2 fails

With BMC Test Cases: Failure
---------------------------
PASS: Verify reinstall handling during install during online wait ; restarts the install
PASS: Verify reinstall handling during install before online wait ; no install interruption
PASS: Verify BMC not accessible at ReInstall start ; recovery
PASS: Verify BMC not accessible at ReInstall start ; timeout
PASS: Verify BMC accessibility loss over Install process
PASS: Verify netboot request failure handling ; no/bad response ; max retry
PASS: Verify reset request failure handling ; no retries
PASS: Verify BMC de-provisioning over install ; failure handling
PASS: Verify BMC re-provisioning over install ; BMC initially not accessible
PASS: Verify BMC re-provisioning over install ; BMC initially accessible
PASS: Verify install requiring power on but gets power-on receive failure
PASS: Verify install requiring power on but gets power-on request failure

No BMC Test Cases: Success
--------------------------
PASS: Verify install when host is powered on and online

No BMC Test Cases: Failure
--------------------------
PASS: Verify reinstall action handling during reinstall ; no install interruption
PASS: Verify install when host is powered off ; install fails
PASS: Verify install when host is powered on and offline ; install fails

Regression:
-----------
PASS: Verify host reset
PASS: Verify host power-off
PASS: Verify host power-on
PASS: Verify host sensor model and monitoring

Change-Id: Ic8c8232167c570e4f75c0bbe1604697966157184
Story: 2005650
Task: 30935
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-23 18:30:04 -04:00
Teresa Ho 8e51a1660a Refactor infrastructure network in mtce code
Updated to read the host cluster-host parameter in /etc/hosts
file.
Replaced references of infra network with cluster-host network

Story: 2004273
Task: 29473

Change-Id: I199fb82e5f6b459b181196d0802f1a74220b796e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2019-04-18 09:32:41 -04:00
Zuul 94126225d0 Merge "Add mtce support for manifest apply over initial controller unlock" 2019-04-11 18:00:48 +00:00
Alex Kozyrev aeb2c1f20a Fix for MTCE race condition in BMC secret handling
There is an intermittent issue in getting the BMC password in MTCE.
The process of obtaining a secret from Barbican stops after
a secret reference is received. No attempt is made to retrieve
the actual payload. This happens when the secret
reference reply is received right after BMC queries are
initiated. That was fine before, when we had a one-stage
process of getting a password from keyring. We cannot
allow it now because of the two-stage Barbican process.
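
A minimal sketch of the two-stage flow, showing that a reference reply alone must not be treated as a finished fetch. The stage names, struct and functions are hypothetical and are not taken from the mtce sources:

```cpp
#include <cassert>
#include <string>

// Hypothetical stages of a two-stage secret fetch.
enum class SecretStage { Idle, WaitReference, WaitPayload, Ready };

struct SecretFetch
{
    SecretStage stage = SecretStage::Idle;
    std::string reference; // stage 1 result: where the secret lives
    std::string payload;   // stage 2 result: the actual password
};

// Stage 1: start by asking for the secret reference.
void start_fetch(SecretFetch & f)
{
    f.reference.clear();
    f.payload.clear();
    f.stage = SecretStage::WaitReference;
}

// A reference reply only advances the fetch to stage 2.
void on_reference(SecretFetch & f, const std::string & ref)
{
    assert(f.stage == SecretStage::WaitReference);
    f.reference = ref;
    f.stage = SecretStage::WaitPayload;
}

// Stage 2: the payload reply completes the fetch.
void on_payload(SecretFetch & f, const std::string & data)
{
    assert(f.stage == SecretStage::WaitPayload);
    f.payload = data;
    f.stage = SecretStage::Ready;
}

// BMC queries should only use the password once the payload has arrived.
bool password_ready(const SecretFetch & f)
{
    return f.stage == SecretStage::Ready;
}
```

Under this model, a caller that starts BMC queries just before the reference reply arrives still sees password_ready return false until the payload has been received.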

Change-Id: I381f69ab6a1a54118b22dd31feefcd93698120ad
Closes-bug: 1818284
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-04-05 11:10:13 -04:00
Eric MacDonald 543a89eaf6 Add mtce support for manifest apply over initial controller unlock
The introduction of Ansible requires the execution of a manifest
as part of the first controller's initial unlock.

Unfortunately maintenance issues the lazy self reboot immediately
upon receiving the unlock command, interrupting the in-progress
manifest apply.

This update identifies the initial self reboot of the only
provisioned host condition and waits for up to a timeout
period for an unlock ready signal that is provided by
successful completion of the 'initial-unlock-manfest'.

Seeing the unlock ready signal prior to the timeout allows
the unlock self reboot to proceed normally.
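
The wait described above can be sketched as a decision evaluated on each pass of a handler. The enum, function name and the timeout value passed in are hypothetical; the actual timeout is not stated in this log:

```cpp
#include <cassert>

// Hypothetical outcomes of one evaluation of the initial unlock wait.
enum class UnlockAction { KeepWaiting, ProceedWithReboot, TimedOut };

// Decides what to do on each pass while waiting for the unlock ready
// signal. The signal is checked first so that a signal seen on the
// same pass as the deadline still lets the reboot proceed.
UnlockAction evaluate_unlock_wait(bool ready_signal_seen,
                                  int  elapsed_secs,
                                  int  timeout_secs)
{
    assert(timeout_secs > 0);
    if (ready_signal_seen)
        return UnlockAction::ProceedWithReboot;
    if (elapsed_secs >= timeout_secs)
        return UnlockAction::TimedOut;
    return UnlockAction::KeepWaiting;
}
```

A TimedOut result corresponds to the timeout handling case in the test plan below, where the unlock can be retried.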

Depends-On:https://review.openstack.org/#/c/643914
Story:2004695
Task:30243

Test Plan:
PASS: Verify timeout handling - allowing retry
PASS: Verify with signal - immediate
PASS: Verify with signal - before timeout

Change-Id: I3633e772310c36af5df57364f66c14f037b2ea8f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-29 09:09:07 -04:00
Eric MacDonald f55ef546a7 Remove Resource Monitor ; aka rmon, from the load
All rmon resource monitoring has been moved to collectd.

This update removes rmon from mtce and the load.

Story: 2002823
Task: 30045

Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d

Change-Id: I7cf1fa071eac89274e7fae1f307e14d548cc945b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-19 16:12:38 -04:00