Commit Graph

255 Commits

Author SHA1 Message Date
Zuul 975e868431 Merge "Change mtcInfo log in mtcCtrlMsg.cpp to a dlog" 2024-04-17 12:10:16 +00:00
Eric MacDonald 97092bd38b Change mtcInfo log in mtcCtrlMsg.cpp to a dlog
The mtcInfo message log was enhanced to include the payload in a
previous update, without realizing that the message contained the
target BMC's username and password.

This update switches that log to a debug log (not enabled by default)
to avoid revealing provisioned BMC credentials in the mtce logs.

Test Plan:

PASS: Verify mtce package build
PASS: Verify mtcInfo log with bmc info payload is no longer logged.

Story: 2010940
Task: 49857
Change-Id: I35db04e9292471d92c24c98922350cfb72b5035e
2024-04-11 17:17:45 +00:00
Eric MacDonald 649e94c8da Add pxeboot mtcAlive messaging alarm handling
This update adds alarm handling to the recently introduced pxeboot
network mtcAlive messaging, see depends on review below.

A new 200.003 maintenance alarm is introduced with the second depends
on update below. This new alarm is MINOR but also Management Affecting
because the pxeboot network is required for node installation.

This update enhances the new pxeboot_mtcAlive_monitor FSM for the
purpose of detecting pxeboot mtcAlive message loss, alarming and
then clearing the alarm once pxeboot mtcAlive messaging resumes.

The new alarm assertion and clear is debounced:
 - alarm is asserted if message loss persists to the accumulation of
   12 missed messages or after 2 minutes of complete message loss.
 - alarm is cleared after decrementing the message missed counter to
   zero or 1 minute of loss-less messaging.
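The debounce described above can be sketched as a small state holder. This is an illustrative Python sketch of the logic only, not the actual mtce C++ implementation; the class and method names are hypothetical, while the thresholds (12 misses, 2 minutes to assert, 1 minute to clear) and the assumed 10 second monitor period come from the commit message.

```python
MISS_THRESHOLD = 12   # accumulated missed messages before assert
LOSS_SECS = 120       # complete message loss window before assert
CLEAR_SECS = 60       # loss-less messaging window before clear

class PxebootAlarmDebounce:
    """Hypothetical sketch of the pxeboot mtcAlive alarm debounce."""
    def __init__(self):
        self.missed = 0        # missed message accumulator
        self.loss_secs = 0     # consecutive seconds with no messages
        self.clean_secs = 0    # consecutive seconds with messages
        self.asserted = False

    def tick(self, got_message, period=10):
        """Called once per monitor period (assumed 10 seconds)."""
        if got_message:
            self.missed = max(0, self.missed - 1)   # decrement miss counter
            self.clean_secs += period
            self.loss_secs = 0
            if self.asserted and (self.missed == 0
                                  or self.clean_secs >= CLEAR_SECS):
                self.asserted = False               # clear alarm
        else:
            self.missed += 1
            self.loss_secs += period
            self.clean_secs = 0
            if not self.asserted and (self.missed >= MISS_THRESHOLD
                                      or self.loss_secs >= LOSS_SECS):
                self.asserted = True                # raise alarm
```

With a 10 second period, 12 consecutive misses also satisfies the 2 minute loss window, so either condition asserts the alarm.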

Upgrades are supported with the addition of a features list to the
mtcClient ready event. All new mtcClients that support pxeboot network
messaging now publish pxeboot mtcAlive support through this new
features list. This is rendered in the logs like this:

    <hostname> mtcClient ready ; with pxeboot mtcAlive support

The mtcAgent does not expect/monitor pxeboot mtcAlive messages from
hosts that don't publish the feature support.
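The feature gating described above can be sketched as a check on the ready event's features list. This is a hypothetical Python sketch, not the mtce implementation; the JSON field names and the feature string are assumptions, not the actual wire format.

```python
import json

def pxeboot_mtcalive_supported(ready_event_json):
    """Return True if a mtcClient ready event advertises pxeboot
    mtcAlive support via its (assumed) 'features' list."""
    event = json.loads(ready_event_json)
    # A pre-upgrade client has no features list; .get() defaults to
    # an empty list, so such hosts are simply not monitored.
    return "pxeboot mtcAlive" in event.get("features", [])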

Test Plan:

PASS: Verify mtcAlive period is 5 seconds.
PASS: Verify pxeboot mtcAlive monitor period is 10 seconds.
PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every
      mtcAlive monitor miss.
PASS: Verify pxeboot mtcAlive alarm is not raised while a node is
      locked.

Alarm attributes:

PASS: Verify severity is minor.
PASS: Verify alarm is cleared while node is locked.
PASS: Verify alarm can be suppressed while unlocked.
PASS: Verify asserted alarm is management affecting.
PASS: Verify alarm-show output format including cause and repair
      action text.

Process Restart Handling:

PASS: Verify alarm is maintained over a mtcAgent process restart.
PASS: Verify pxeboot monitoring resumes with or without asserted alarm
      immediately following a mtcAgent process restart.
PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging
      immediately following mtcClient process restart for locked or
      unlocked nodes.

Alarm Debounce Handling:

PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss.
PASS: Verify alarm clear after 1 minute of mtcAlive recovery.
PASS: Verify assertion and recovery debounce logging.
PASS: Verify alarm management miss and loss controls handle all
      boundary conditions exercised by a 12 hr soak with randomized
      period between message loss and recovery.

Host Action Handling:

PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable.
PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery.
PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On.
PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset.
PASS: Verify mtcAlive alarm is not raised over a Host Reinstall.
PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling.
PASS: Verify pxeboot alarm handling for node that does not send
      pxeboot mtcAlive after unlock.

Stuck Alarm Avoidance Handling:

PASS: Verify typical alarm assertion and clear handling.
PASS: Verify alarm is maintained or cleared over node reboot if the
      messaging issue persists or resolves over the reboot recovery.
PASS: Verify mtcAlive alarm is maintained over a Swact and cleared
      if the messaging is ok on the newly active controller.
PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled
      Swact due to active controller reboot.
PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot
      messaging recovers over that reboot.

Upgrades Case:

PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients
      that actually support pxeboot network mtcAlive monitoring.

PASS: Verify parsing of the new mtcClient features list, which
      enables pxeboot mtcAlive monitoring for that node.

PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled
      towards nodes whose mtcClient does not publish pxeboot mtcAlive
      messaging feature support.
PROG: Verify AIO DX upgrade from 22.12 to current master branch.
      Focus on pxeboot messaging over the upgrade process.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654
Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660
Story: 2010940
Task: 49542
Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-04-09 14:13:23 +00:00
Eric MacDonald 14bb67789e Add pxeboot network mtcAlive messaging to Maintenance
The introduction of the new pxeboot network requires maintenance to
verify and report on messaging failures over that network.

Towards that, this update introduces periodic mtcAlive messaging
between the mtcAgent and mtcClient.

Test Plan:

PASS: Verify install and provision each system type with a mix
             of networking modes ; ethernet, bond and vlan
             - AIO SX, AIO DX, AIO DX plus
             - Standard System 2+1
             - Storage System 2+1+1
PASS: Verify feature with physical on management interface
PASS: Verify feature with vlan on management interface
PASS: Verify feature with bonded management interface
PASS: Verify feature with bonded vlans on management interface
PASS: Verify in bonded cases handling with 2, 1 or no slaves found
PASS: Verify mgmt-combined or separate cluster-host network
PASS: Verify mtcClient pxeboot interface address learning
             - for worker and storage nodes       ; dhcp leases file
             - for controller nodes before unlock ; dhcp leases file
             - for controller nodes after unlock  ; static from ifcfg
             - from controller within 10 seconds of process restart
PASS: Verify mtcAgent pxeboot interface address learning from
             dnsmasq.hosts file
PASS: Verify pxeboot mtcAlive initiation, handling, loss detection
             and recovery
PASS: Verify success and failure handling of all new pxeboot ip
             address learning functions ;
             - dhcp - all system node installs.
             - dnsmasq.hosts - active controller for all hosts.
             - interfaces.d - controller's mtcClient pxeboot address.
             - pxeboot req mtcAlive - mtcAgent mtcAlive request message.
PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot'
             command handling for ethernet, vlan and bond configs.
PASS: Verify mtcAlive sequence number monitoring, out-of-sequence
             detection, handling and logging.
PASS: Verify pxeboot rx socket binding and non-blocking attribute
PASS: Verify mtcAgent handling stress soaking of sustained incoming
             500+ msgs/sec ; batch handling and logging.
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging,
             failure recovery handling and logging.
PASS: Verify pxeboot receiver is not setup on the oam interface on
             controller-0 first install until after initial config
             complete.

Regression:

PASS: Verify mtcAgent/mtcClient online and offline state management
PASS: Verify mtcAgent/mtcClient command handling
      - over management network
      - over cluster-host network
PASS: Verify mtcClient interface chain log for all iface types
      - bond    : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9
      - vlan    : vlan123 -> enp0s8
      - ethernet: enp0s8
PASS: Verify mtcAgent/mtcClient handling and logging including debug
      logging for standard operations
      - node install and unlock
      - node lock and unlock
      - node reinstall, reboot, reset
PASS: Verify graceful recovery handling of heartbeat loss failure.
      - node reboot
      - management interface down
PASS: Verify systemcontroller and subcloud install with dc-libvirt
PASS: Verify no log flooding, coredumps, memory leaks

Story: 2010940
Task: 49541
Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-28 15:28:27 +00:00
Eric MacDonald 3c94b0e552 Avoid creating non-volatile node locked file while in simplex mode
It is possible to lock controller-0 on a DX system before controller-1
has been configured/enabled. Due to the following recent updates this
can lead to SM disabling all controller services on that now locked
controller-0 thereby preventing any subsequent controller-0 unlock
attempts.

https://review.opendev.org/c/starlingx/metal/+/907620
https://review.opendev.org/c/starlingx/ha/+/910227

This update modifies the mtce node locked flag file management so that
the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only
created on a locked host after controller-1 is installed, provisioned
and configured.

This prevents SM from shutting down if the administrator locks
controller-0 before controller-1 is configured.
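The gating described above can be sketched as: create the non-volatile flag only when the node is locked and the system is no longer simplex, and remove it otherwise. The file paths come from the commit message; the helper itself (and the `root` parameter, used here only to make the sketch testable) is hypothetical, not the mtce C++ code.

```python
import os

SIMPLEX_FLAG = "etc/platform/simplex"
NV_LOCKED_FLAG = "etc/mtc/tmp/.node_locked"

def manage_nv_locked_flag(node_locked, root="/"):
    """Create the non-volatile node locked flag only when locked and
    not in simplex mode; remove it when unlocked or simplex."""
    simplex = os.path.exists(os.path.join(root, SIMPLEX_FLAG))
    nv_flag = os.path.join(root, NV_LOCKED_FLAG)
    if node_locked and not simplex:
        os.makedirs(os.path.dirname(nv_flag), exist_ok=True)
        open(nv_flag, "w").close()   # create flag file
    elif os.path.exists(nv_flag) and (not node_locked or simplex):
        os.remove(nv_flag)           # flag must not exist in this state
```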

Test Plan:

PASS: Verify AIO DX Install.
PASS: Verify Standard System Install.
PASS: Verify Swact back and forth.
PASS: Verify lock/unlock of controller-0 prior to controller-1 config
PASS: Verify the non-volatile node locked flag file is not created
      while the /etc/platform/simplex file exists on the active
      controller.
PASS: Verify lock and delete of controller-1 puts the system back
      into simplex mode where the non-volatile node locked flag file
      is once again not created if controller-0 is then unlocked.
PASS: Verify an existing non-volatile node locked flag file is removed
      if present on a node that is locked without new persist option.
PASS: Verify original reported issue is resolved for DX systems.

Closes-Bug: 2051578
Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-09 12:45:54 +00:00
Eric MacDonald d9982a3b7e Mtce: Create non-volatile backup of node locked flag file
The existing /var/run/.node_locked flag file is volatile,
meaning it is lost over a host reboot, which has DOR implications.

Service Management (SM) sometimes selects and activates services
on a locked controller following a DOR (Dead Office Recovery).

This update is part one of a two-part update that solves both
of the above problems. Part two is a change to SM in the ha git.
This update can be merged without part two.

This update maintains the existing volatile node locked file because
it is looked at by other system services. So to minimize the change
and therefore patchback impact, a new non-volatile 'backup' of the
existing node locked flag file is created.

This update incorporates modifications to the mtcAgent and mtcClient,
introducing a new backup file and ensuring their synchronized
management to guarantee their simultaneous presence or absence.

Note: A design choice was made to directly manage both files rather
      than add support to manage a symlink of one to the other.
      This approach was chosen for its simplicity and reliability.
      At some point in the future the volatile file could be
      deprecated, contingent upon identifying and updating all
      services that directly reference it.
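The lockstep management described above can be sketched as a reconcile step: when the node is locked both flag files must exist, otherwise neither may. Per the test plan, such a reconcile runs periodically so a manually removed or created file is corrected within a few seconds. Paths are from the commit message; the function (and its `root` parameter, added only for testability) is an illustrative sketch, not the mtce implementation.

```python
import os

LOCKED_FLAGS = (
    "var/run/.node_locked",      # existing volatile flag
    "etc/mtc/tmp/.node_locked",  # new non-volatile backup flag
)

def reconcile_locked_flags(node_locked, root="/"):
    """Force both node locked flag files to be simultaneously
    present (locked) or absent (unlocked)."""
    for rel in LOCKED_FLAGS:
        path = os.path.join(root, rel)
        if node_locked:
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "a").close()   # ensure present
        elif os.path.exists(path):
            os.remove(path)           # ensure absent
```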

This update also removes some dead code that was adjacent to my update.

Test Plan: This test plan covers the maintenance management of
           both files to ensure they always align and the expected
           behavior exists.

PASS: Verify AIO DX Install.
PASS: Verify Storage System Install.
PASS: Verify Swact back and forth.
PASS: Verify mtcClient and mtcAgent logging.
PASS: Verify node lock/unlock soak.

Non-volatile (Nv) node locked management test cases:

PASS: Verify Nv node locked file is present when a node is locked.
      Confirmed on all node types.
PASS: Verify any system node install comes up locked with both node
      locked flag files present.
PASS: Verify mtcClient logs when a node is locked and unlocked.
PASS: Verify Nv node locked file present/absent state mirrors the
      already existing /var/run/.node_locked flag file.
PASS: Verify node locked file is present on controller-0 during
      ansible run following initial install and removed as part
      of the self-unlock.
PASS: Verify the Nv node locked file is removed over the unlock
      along with the administrative state change prior to the
      unlock reboot.
PASS: Verify both node locked files are always present or absent
      together.
PASS: Verify node locked file management while the management
      interface is down. File is still managed over cluster network.
PASS: Verify node locked file management while the cluster interface
      is down. File is still managed over management network.
PASS: Verify behavior if the new unlocked message is received by a
      mtcClient process that does not support it ; unknown command log.
PASS: Verify a node locked state is auto corrected while not in a
      locked/unlocked action change state.
      ... Manually remove either file on locked node and verify
          they are both recreated within 5 seconds.
      ... Manually create either node locked file on unlocked worker
          or storage node and verify the created files are removed
          within 5 seconds.
          Note: doing this to the new backup file on the active
                controller will cause SM to shutdown as expected.
PASS: Verify Nv node locked file is auto created on a node that
      spontaneously rebooted while it was unlocked. During the
      reboot the node was administratively locked.
      The node should come online with both node locked files present.

Partial-Bug: 2051578
Change-Id: I0c279b92491e526682d43d78c66f8736934221de
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 00:54:11 +00:00
Eric MacDonald 191c0aa6a8 Add a wait time between http request retries
Maintenance interfaces with sysinv, sm and the vim using http requests.
Request timeouts have an implicit delay between retries. However,
command failures or outright connection failures don't.

This has only become obvious in mtce's communication with the vim,
where a process startup timing change can leave the 'vim' not yet
ready to handle commands when the mtcAgent begins sending them
following a platform services group startup by sm.

This update adds a 10 second http retry wait as a configuration option
to mtc.conf. The mtcAgent loads this value at startup and uses it
in a new HTTP__RETRY_WAIT state of the http request work FSM.

The number of retries remains unchanged. This update is only forcing
a minimum wait time between retries, regardless of cause.
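The retry pacing can be sketched as a loop that waits a fixed interval between attempts regardless of failure cause. The 10 second default and the unchanged retry count are from the commit message; the function itself is a hypothetical Python sketch of the FSM behavior, not the mtcAgent code.

```python
import time

def http_request_with_retries(send, retries=3, retry_wait=10,
                              sleep=time.sleep):
    """Call send() up to retries+1 times, waiting retry_wait seconds
    between attempts (the HTTP__RETRY_WAIT equivalent). Any failure
    type -- connection, timeout or command error -- waits the same."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return send()
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                sleep(retry_wait)   # minimum wait between retries
    raise last_exc
```

Injecting `sleep` keeps the sketch testable; the real process presumably drives the wait from its FSM tick rather than blocking.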

Failure path testing was done using Fault Insertion Testing (FIT).

Test Plan:

PASS: Verify the reported issue is resolved by this update.
PASS: Verify http retry config value load on process startup.
PASS: Verify updated value is used over a process -sighup.
PASS: Verify default value if new mtc.conf config value is not found.
PASS: Verify http connection failure http retry handling.
PASS: Verify http request timeout failure retry handling.
PASS: Verify http request operation failure retry handling.

Regression:

PASS: Build and install ISO - Standard and AIO DX.
PASS: Verify http failures do not fail a lock operation.
PASS: Verify host unlock fails if its http done queue shows failures.
PASS: Verify host swact.
PASS: Verify handling of random and persistent http errors involving
      the need for retries.

Closes-Bug: 2047958
Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-07 20:33:01 +00:00
Eric Macdonald 50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are with
the immediate.
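The graceful-then-immediate split can be sketched as a simple selection on the retry count. This is an illustrative Python sketch under the numbers stated above (10 retries, first 5 graceful, last 5 immediate); the function name is hypothetical.

```python
def select_action(retry, graceful_cmd, immediate_cmd, max_retries=10):
    """retry 0 is the first attempt; retries 1-5 reuse the graceful
    BMC action, retries 6-10 switch to the immediate variant."""
    if retry > max_retries:
        raise RuntimeError("retries exhausted")
    return graceful_cmd if retry <= max_retries // 2 else immediate_cmd
```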

This update also removed a power on handling case that could
have led to a stuck state. This case was virtually impossible
to hit based on the required sequence of intermittent command
failures but that scenario handling was fixed up anyway.

Issues have been seen with power-off handling on some servers that
appear to need more time to power off. This update therefore
introduces a 30 second delay following a power-off command, before
issuing the power status query, to give the server time to power
off before the power-off command is retried.

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
Zuul 125601c2f9 Merge "Failure case handling of LUKS service" 2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra 1210ed450a Failure case handling of LUKS service
luks-fs-mgr service creates and unseals the LUKS volume used to store
keys/secrets. This change handles the failure case where this
essential service is inactive. It introduces an alarm, LUKS_ALARM_ID,
which is raised if the service is inactive, implying that there is an
issue in creating or unsealing the LUKS volume.
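The alarm condition can be sketched as a check on the service's reported state. The 200.016 alarm id and the luks-fs-mgr unit name are from the commit message; the helper is a hypothetical Python sketch (in practice the state would come from something like `systemctl is-active luks-fs-mgr.service`), not the actual mtce code.

```python
LUKS_ALARM_ID = "200.016"   # alarm id from the commit message

def luks_alarm_needed(unit_state):
    """Return True when the luks-fs-mgr service state implies the
    LUKS alarm should be asserted. unit_state is the service state
    string, e.g. 'active', 'inactive' or 'failed'."""
    return unit_state != "active"
```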

Test Plan:
PASS: build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
      volume status active. There should not be any alarm and node
      status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
      is inactive. Node availability should be failed. A critical
      alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on the controller-0. Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list'
PASS: Standard- node enable failure on the node (controller-0,
      controller-1, storage-0, compute-1). Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In service volume inactive should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-SX In service volume inactive status should be detected
      and a critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service
      volume inactive status should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-DX In service: If volume becomes active and a LUKS alarm
      is active, alarm should be cleared. Node availability should
      be changed to available.
PASS: AIO-SX In service: If volume becomes active and a LUKS alarm is
      active, alarm should be cleared. Node availability should be
      changed to available.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service:
      If volume becomes active and a LUKS alarm is active, alarm
      should be cleared. Node availability should be changed to
      available.
PASS: AIO-SX, AIO-DX, Standard- If intest fails and node availability
      is 'failed'. After fixing the volume issue, a lock/unlock should
      make the node available.

Story: 2010872
Task: 49108

Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-12-06 00:34:02 -05:00
Zuul 1332ebb7a7 Merge "Replace a file test from fsmond" 2023-12-04 14:03:11 +00:00
Teresa Ho 36814db843 Increase timeout for runtime manifest
In management network reconfiguration for AIO-SX, the runtime manifest
executed during host unlock could take more than five minutes to
complete. This commit extends the timeout period from five minutes
to eight minutes.

Test Plan:
PASS: AIO-SX subcloud mgmt network reconfiguration

Story: 2010722
Task: 49133

Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2023-11-23 16:51:22 -05:00
Erickson Silva de Oliveira 16181a2ce8 Replace a file test from fsmond
fsmond tries to create a test file in "/.fs-test" but
it is not possible because "/" is blocked by ostree.

So the fix is to replace this path from fsmond monitoring
with /sysroot/.fs_test.

Below is a comparison of the logs:
  - Before change:
  ( 196) fsmon_service : Warn : File (/.fs-test) test failed

  - After change:
  ( 201) fsmon_service : Info : tests passed
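The test itself amounts to attempting a small write at the monitored path. This is an illustrative Python sketch of that check, not the fsmond C code; the /sysroot/.fs_test path is from the commit, and the function name is hypothetical.

```python
import os

def fs_test(test_file="/sysroot/.fs_test"):
    """Attempt to create and remove a test file; False mirrors the
    'File test failed' case, e.g. when the path sits on a read-only
    ostree-managed root."""
    try:
        with open(test_file, "w") as f:
            f.write("test\n")
        os.remove(test_file)
        return True    # tests passed
    except OSError:
        return False   # write blocked -> test failed
```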

Test Plan:
  - PASS: Build mtce package
  - PASS: Replace fsmond binary on AIO-SX
  - PASS: Check fsmond.log output

Closes-Bug: 2043712

Change-Id: Ib4bad73448735bce1dff598151fce86f867f4db7
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2023-11-17 08:15:28 -03:00
Eric MacDonald 79d8644b1e Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat failing node early in
          its startup sequence, say over a swact or an SM service
          restart and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC then
          the one-time-reset request is skipped.

Issue #2: When the issue #1 race condition does not occur before BMC
          access is established, the mtcAgent will issue its one-time
          reset to a node. If this occurs as a result of a crashdump
          then this one-time reset can interrupt the collection of
          the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.

To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time so that, in the typical case where there is
no crashdump happening, the node is not reset late in an almost
complete recovery. The number of seconds remaining in the pending
reset countdown is logged periodically.
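The posted-reset behavior described above can be sketched as a countdown that is cancelled if the node recovers first. The 5 minute default is from the commit message; the class and method names are hypothetical, an illustration of the 'reset send wait' idea rather than the mtcAgent implementation.

```python
class PendingReset:
    """Sketch of a posted 'in waiting' bmc reset with cancel."""
    def __init__(self, delay_secs=300):   # 5 minute default delay
        self.remaining = delay_secs
        self.cancelled = False

    def node_recovered_online(self):
        """Node came back before the delay expired; reset not needed."""
        self.cancelled = True

    def tick(self, secs=1):
        """Advance the countdown; returns True when the reset
        should actually be sent."""
        if self.cancelled:
            return False
        self.remaining -= secs
        return self.remaining <= 0
```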

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs then, although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node's
recovery-to-online time is longer than the bmc reset delay.

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.

Some consistency driven logging improvements were also implemented.

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
      reboot before bmc reset delay and still seeing mtcAlive on one
      or more links. Handles the cluster-host only heartbeat loss case.
      The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-11-02 20:58:00 +00:00
Enzo Candotti 23143abbca Update crashDumpMgr to source config from envfile
This commit updates the crashDumpMgr service in order to:
- Cleanup of current service naming and packaging to follow the
  standard Linux naming convention:
    - Repackage /etc/init.d/crashDumpMgr to
      /usr/sbin/crash-dump-manager
    - Rename crashDumpMgr.service to crash-dump-manager.service
- Add EnvironmentFile to crash-dump-manager service file to source
  configuration from /etc/default/crash-dump-manager.
- Update ExecStart of crash-dump-manager service to use parameters
  from EnvironmentFile
- Update crash-dump-manager service dependencies to run after
  config.service.
- Update logrotate configuration to support the retention policies
  for the maximum number of files. The “rotate 1” option was removed
  to permit crash-dump-manager to manage pruning old files.
- Modify the crash-dump-manager script to enable updates to the
  max_files parameter to a lower value. If there are currently more
  files than the new max_files value, the oldest files will be
  deleted the next time a crash dump file needs to be stored, thus
  adhering to the new max_files values.
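The pruning behavior in the last bullet can be sketched as: before storing a new dump, delete the oldest files until the directory fits within the (possibly lowered) max_files limit. This is a hypothetical Python sketch; the actual crash-dump-manager is a shell script, and the function name here is invented.

```python
import os

def prune_crash_dumps(crash_dir, max_files):
    """Delete the oldest crash dump files so that, after one more
    dump is stored, at most max_files remain."""
    files = sorted(
        (os.path.join(crash_dir, name) for name in os.listdir(crash_dir)),
        key=os.path.getmtime)            # oldest first
    # Keep room for the incoming dump: retain the newest max_files - 1.
    excess = len(files) - (max_files - 1)
    for path in files[:max(0, excess)]:
        os.remove(path)
```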

Test Plan:

PASS: Build ISO and perform a fresh install. Verify the new
crash-dump-manager service is enabled and working as expected.
PASS: Add and apply new crashdump service parameters and force a kernel
panic. Verify that after the reboot, the max_files, max_used,
min_available and max_size values are updated accordingly to the service
parameters values.
PASS: Verify that the crashdump files are rotated as expected.

Story: 2010893
Task: 48910

Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-10-06 23:06:54 +00:00
Enzo Candotti a120cc5fea Add new configuration parameters to crashDumpMgr
This commit updates crashDumpMgr in order to add three new parameters
and enhance an existing one.

1. Maximum Files: Added 'max-files' parameter to specify the maximum
   number of saved crash dump files. The default value is 4.
2. Maximum Size: Updated the 'max-size' parameter to support
   the 'unlimited' value. The default value is 5GiB.
3. Maximum Used: Included 'max-used' parameter to limit the maximum
   storage used by saved crash dump files. It supports 'unlimited'
   and has a default value of unlimited.
4. Minimum Available: Implemented 'min-available' parameter, enabling
   the definition of a minimum available storage threshold on the
   crash dump file system. The value is restricted to a minimum of
   1GB and defaults to 10%.

These enhancements refine the crash dump management process and
offer more control over storage usage and crash dump file retention.
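Taken together, the parameters imply a pre-save check like the following. This is an illustrative Python sketch, not the crashDumpMgr shell script; the defaults (5GiB max-size, unlimited max-used, at least 1GB min-available) are from the list above, and the function name is hypothetical.

```python
def may_save_dump(dump_size, used, available,
                  max_size=5 * 1024**3,        # 5GiB default max-size
                  max_used=None,               # None models 'unlimited'
                  min_available=1 * 1024**3):  # floor of 1GB
    """Return True if a crash dump of dump_size bytes may be saved
    given current used/available bytes on the crash file system."""
    if dump_size > max_size:
        return False   # dump exceeds max-size
    if max_used is not None and used + dump_size > max_used:
        return False   # saving would exceed max-used
    if available - dump_size < min_available:
        return False   # saving would drop below min-available
    return True
```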

Story: 2010893
Task: 48676

Test Plan:
1) max-files parameter:
  PASS: don't set max-files param. Ensure the default value is used.
  Create 5 directories inside /var/crash. Each of them contains
  dmesg.<date> and dump.<date>. run the crashDumpMgr script.
  Verify:
    PASS: the vmcore_first.tar.1.gz is created when the first
          directory is read.
    PASS: 4 more vmcore_<date>.tar files are created.
    PASS: There will be 1 vmcore_first.tar.1.gz and 4
          vmcore_<date>.tar inside /var/log/crash.
    PASS: There will be one summary file for each directory:
          <date>_dmesg.<date> inside /var/crash
2) max-size parameter
  PASS: don't set max-size param. Ensure the default value is used
        (5GiB).
  PASS: Set a fixed max-size param. Create a dump.<date> file greater
        than the max-size param. Run the crashDumpMgr script. Verify
        that the crash dump file is not generated and a log
        message is displayed.
3) max-used parameter:
  PASS: don't set max-used param. Ensure the default value is used
        (unlimited).
  PASS: Set a fixed max-used param. Create a dump.<date> file that
        causes the used space to exceed the max-used param. Run the
        crashDumpMgr script. Verify that the crash dump file is not
        generated, a log message is displayed and the directory is
        deleted.
4) min-available parameter:
  PASS: don't set min-available param. Ensure the default value is
        used (10% of /var/log/crash).
  PASS: Set a fixed 'min-available' param. Generate a 'dump.<date>'
        file to simulate a situation where the remaining space is
        less than the 'min-available' parameter. Run the crashDumpMgr
        script and ensure that it does not create the crashdump file,
        displays a log message, and deletes the entry.
5) PASS: Since the crashDumpMgr.service file is not being modified,
         verify that the script takes the default values.

Note: All tests have also been conducted by generating a kernel panic
and ensuring the crashDumpMgr script follows the correct workflow.

Change-Id: I8948593469dae01f190fd1ea21da3d0852bd7814
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-09-18 19:22:09 +00:00
Eric MacDonald d863aea172 Increase mtce host offline threshold to handle slow host shutdown
Mtce polls/queries the remote host for mtcAlive messages
for 42 x 100 ms intervals over unlock or host failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.

However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where 7 or 8
second wait time is required to properly declare offline.

To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
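Expressed as an illustrative config fragment (the exact mtce config file name and option placement are assumed here, only the threshold values come from this commit):

```shell
# Illustrative mtce config fragment -- file name/section assumed.
# Offline window = offline_threshold x 100 ms mtcAlive poll interval:
#   42 x 100 ms ~=  5 seconds (old)
#   90 x 100 ms ~= 10 seconds (new, approx)
offline_threshold=90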

Test Plan:

PASS: Verify the unchallenged failed-to-offline period is ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
      anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
      successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
      failure handling cases.

Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-16 18:14:08 +00:00
Matheus Guilhermino a0e270b51b Add mpath support to wipedisk script
The wipedisk script was not able to find the boot device
when using multipath disks. This is due to the fact that
multipath devices are not listed under /dev/disk/by-path/.

To add support to multipath devices, the script should look
for the boot device under /dev/disk/by-id/ as well.
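A minimal sketch of that lookup, assuming a hypothetical helper; the real wipedisk logic differs in detail:

```shell
# Resolve the persistent symlink for a boot disk, checking
# /dev/disk/by-path/ first and falling back to /dev/disk/by-id/,
# where multipath (dm-*) devices appear.
find_boot_link() {
    local boot_disk=$1; shift
    local dir link
    for dir in "$@"; do
        for link in "$dir"/*; do
            [ -e "$link" ] || continue
            if [ "$(readlink -f "$link")" = "$(readlink -f "$boot_disk")" ]; then
                echo "$link"
                return 0
            fi
        done
    done
    return 1
}

# Typical call on a real system (device name illustrative):
find_boot_link /dev/sda /dev/disk/by-path /dev/disk/by-id \
    || echo "boot device link not found"
```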

Test Plan
PASS: Successfully run wipedisk on an AIO-SX with multipath
PASS: Successfully run wipedisk on a AIO-SX w/o multipath

Closes-bug: 2013391

Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Change-Id: I3af76cd44f22795784a9184daf75c66fc1b9874f
2023-04-10 17:10:22 -03:00
Al Bailey 37c5910a62 Update mtce debian package ver based on git
Update debian package versions to use git commits for:
 - mtce         (old 9, new 30)
 - mtce-common  (old 1, new 9)
 - mtce-compute (old 3, new 4)
 - mtce-control (old 7, new 10)
 - mtce-storage (old 3, new 4)

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p mtce
  PASS: build-pkgs -p mtce-common
  PASS: build-pkgs -p mtce-compute
  PASS: build-pkgs -p mtce-control
  PASS: build-pkgs -p mtce-storage

Story: 2010550
Task: 47401
Task: 47402
Task: 47403
Task: 47404
Task: 47405

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587
2023-03-02 14:50:35 +00:00
Kyale, Eliud 502662a8a7 Cleanup mtcAgent error logging during startup
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
  is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
  vim notification

Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)

Story: 2010533
Task: 47338

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
2023-02-14 14:18:02 -05:00
Christopher Souza 56ab793bc5 Change hostwd emergency log to write to /dev/kmsg
The hostwd emergency logs were written to /dev/console.
This change adds the prefix "hostwd:" to the log message
and writes it to /dev/kmsg instead.

Test Plan:

Pass: AIO-SX and AIO DX full deployment.
Pass: kill pmond and wait for the emergency log to be written.
Pass: check if the emergency log was written to /dev/kmsg.
Pass: Verify logging for quorum report missing failure.
Pass: Verify logging for quorum process failure.
Pass: Verify emergency log crash dump logging to mesg and
      console logging for each of the 2 cases above with
      stress-ng overloading the server (CPU, FS and Memory);
      stress-ng --vm-bytes 4000000000 --vm-keep -m 30 -i 30 -c 30

Story: 2010533
Task: 47216

Co-authored-by: Eric MacDonald <eric.macdonald@windriver.com>
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Co-authored-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Signed-off-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Change-Id: I0da82f964dd096840259c4d0ed4e5f558debdf22
2023-02-01 23:41:14 +00:00
Eric MacDonald a3cba57a1f Adapt Host Watchdog to use kdump-tools
The Debian package for kdump changed from kdump to kdump-tools

Test Plan:

PASS: Verify build and install AIO DX system
PASS: Verify host watchdog detects kdump as active in debian

Closes-Bug: 2001692
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie1ac29d3d29f3d9c843789cdedf85081fe790616
2023-01-04 12:57:19 -05:00
Robert Church 1796ed8740 Update wipedisk for LVM based rootfs
Now that the root filesystem is based on an LVM logical volume, discover
the root disk by searching for the boot partition.

Changes include:
 - remove detection of rootfs_part/rootfs and adjust rootfs related
   references with boot_disk.
 - run bashate on the script and resolve indentation and syntax related
   errors. Leave long-line errors alone for improved readability.
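The partition-to-disk mapping this relies on can be sketched as follows (hypothetical helper, simplified suffix handling; the real script inspects the running system):

```shell
# Map a boot partition node back to its parent disk by stripping the
# partition suffix, e.g. /dev/sda3 -> /dev/sda and
# /dev/nvme0n1p2 -> /dev/nvme0n1. Simplified: expects a partition
# node, not a whole disk or a device-mapper name.
part_to_disk() {
    echo "$1" | sed -E 's/p?[0-9]+$//'
}

boot_part=/dev/sda3    # in practice: the partition mounted on /boot
echo "boot disk: $(part_to_disk "$boot_part")"
```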

Test Plan:
PASS - run 'wipedisk', answer prompts, and ensure all partitions are
       cleaned up except for the platform backup partition
PASS - run 'wipedisk --include-backup', answer prompts, and ensure all
       partitions are cleaned up
PASS - run 'wipedisk --include-backup --force' and ensure all partitions
       are cleaned up

Change-Id: I036ce745353b6a26bc2615ffc6e3b8955b4dd1ec
Closes-Bug: #1998204
Signed-off-by: Robert Church <robert.church@windriver.com>
2022-11-29 05:04:38 -06:00
Eric MacDonald da398e0c5f Debian: Make Mtce offline handler more resilient to slow shutdowns
The current offline handler assumes the node is offline after
'offline_search_count' reaches 'offline_threshold' count
regardless of whether mtcAlive messages were received during
the search window.

The offline algorithm requires that no mtcAlive messages
be seen for the full offline_threshold count.

During a slow shutdown the mtcClient runs for longer than
it should, which can lead to maintenance seeing the node
as recovered before it actually is.

This update manages the offline search counter to ensure that
it only reaches the count threshold after seeing no mtcAlive
messages for the full search count. Any mtcAlive message seen
during the count triggers a count reset.
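The corrected search-count behaviour can be sketched in shell pseudocode (illustrative only; the real handler is C++ inside mtcAgent and the names here are made up):

```shell
# Return "offline" only if 'threshold' consecutive polls saw no
# mtcAlive; any mtcAlive sample resets the search counter.
offline_after() {
    local threshold=$1; shift
    local count=0 seen
    for seen in "$@"; do              # 1 = mtcAlive received this poll
        if [ "$seen" -eq 1 ]; then
            count=0                   # restart the search window
        else
            count=$((count + 1))
        fi
        if [ "$count" -ge "$threshold" ]; then
            echo offline
            return 0
        fi
    done
    echo "searching (count=$count)"
}

offline_after 3 0 0 0         # -> offline
offline_after 3 0 0 1 0 0     # -> searching (count=2)
```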

This update also
1. Adjusts the reset retry cadence from 7 to 12 secs
   to prevent unnecessary reboot thrash during
   the current shutdown.
2. Clears the hbsClient ready event at the start of the
   subfunction handler so the heartbeat soak is only
   started after seeing heartbeat client ready events
   that follow the main config.

Test Plan:

PASS: Debian and CentOS Build and DX install
PASS: Verify search count management
PASS: Verify issue does not occur over lock/unlock soak (100+)
      - where the same test without update did show issue.
PASS: Monitor alive logs for behavioral correctness
PASS: Verify recovery reset occurs after expected extended time.

Closes-Bug: 1993656
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
2022-10-24 15:57:43 +00:00
Eric MacDonald 3f4c2cbb45 Mtce: Add ActionInfo extension support for reset operations.
StarlingX Maintenance supports host power and reset control through
both IPMI and Redfish Platform Management protocols when the host's
BMC (Board Management Controller) is provisioned.

The power and reset action commands for Redfish are learned through
HTTP payload annotations at the Systems level: "/redfish/v1/Systems".

The existing maintenance implementation only supports the
"ResetType@Redfish.AllowableValues" payload property annotation at
the #ComputerSystem.Reset Actions property level.

However, the Redfish schema also supports an 'ActionInfo' extension
at /redfish/v1/Systems/1/ResetActionInfo.

This update adds support for the 'ActionInfo' extension for Reset
and power control command learning.

For more information refer to the section 6.3 ActionInfo 1.3.0 of
the Redfish Data Model Specification link in the launchpad report.

Test Plan:

PASS: Verify CentOS build and patch install.
PASS: Verify Debian build and ISO install.
PASS: Verify with Debian redfishtool 1.1.0 and 1.5.0
PASS: Verify reset/power control cmd load from the newly added
      second level ActionInfo service query.

Failure Handling: Significant failure path testing with this update

PASS: Verify Redfish protocol is periodically retried from start
      when bm_type=redfish fails to connect.
PASS: Verify BMC access protocol defaults to IPMI when
      bm_type=dynamic but failed connect using redfish.
      Connection failures in the above cases include
      - redfish bmc root query fails
      - redfish bmc info query fails
      - redfish bmc load power/reset control actions fails
      - missing second level Parameters label list
      - missing second level AllowableValues label list
PASS: Verify sensor monitoring is relearned to ipmi from failed and
      retried with bm_type=redfish after switch to bm_type=dynamic
      or bm_type=ipmi by sysinv update command.

Regression:

PASS: Verify with CentOS redfishtool 1.1.0
PASS: Verify switch back and forth between ipmi and redfish using
      update bm_type=ipmi and bm_type=redfish commands
PASS: Verify switch from ipmi to redfish using bm_type=dynamic for
      hosts that support redfish
PASS: Verify redfish protocol is preferred in bm_type=dynamic mode
PASS: Verify IPMI sensor monitoring when bm_type=ipmi
PASS: Verify IPMI sensor monitoring when bm_type=dynamic
      and redfish connect fails.
PASS: Verify redfish sensor event assert/clear handling with
      alarm and degrade condition for both IPMI and redfish.
PASS: Verify reset/power command learn by single level query.
PASS: Verify mtcAgent.log logging

Closes-Bug: 1992286
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie8cdbd18104008ca46fc6edf6f215e73adc3bb35
2022-10-13 17:40:05 +00:00
Zuul 8fd1bcbb97 Merge "Alarm Hostname controller function has in-service failure reported" 2022-10-06 20:47:06 +00:00
Al Bailey dd5a24037d Fix bashate failure in zuul
This review allows this repo to pass zuul.

When tox is run locally it pulls in an older
bashate 0.6.0 but the zuul jobs are pulling in
the higher version.

Bashate 2.1.1 was released Oct 6, 2022

Changed the upper constraints to allow developers
to pull in dependencies that are more aligned with zuul.

Fixed the new bashate error.
Also cleaned up the yamllint syntax.

Closes-Bug: 1991971
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I9cda349a20c63f9d222a3c3fc3645c5ceb4c2751
2022-10-06 17:22:12 +00:00
Girish Subramanya 86681b7598 Alarm Hostname controller function has in-service failure reported
When compute services remain healthy:
 - listing alarms shall not refer to the below Obsoleted alarm
 - 200.012 alarm hostname controller function has an in-service failure

This update deletes the definition of the obsoleted 200.012
alarm from the events.yaml file and removes any remaining
references to this alarm definition.
Need to also raise a Bug to track the Doc change.

Test Plan:
Verify on a Standard configuration no alarms are listed for
hostname controller in-service failure
Code (removal) changes exercised with fix prior to ansible bootstrap
and host-unlock and verify no unexpected alarms
Regression:
There is no need to test the alarm referred to here as it is obsolete

Closes-Bug: 1991531

Signed-off-by: Girish Subramanya <girish.subramanya@windriver.com>

Change-Id: I255af68155c5392ea42244b931516f742fa838c3
2022-10-05 10:30:01 -04:00
Zuul 6bcd8333b2 Merge "Debian: Remove conf files from etc-pmon.d" 2022-09-30 19:41:16 +00:00
Leonardo Fagundes Luz Serrano d1c0d04719 Debian: Remove conf files from etc-pmon.d
Removed conf files from /etc/pmon.d/
as they are being moved to another location.

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

The change is debian-only, since centos support
will be dropped soon.
Centos' pmon conf files remain in /etc/pmon.d/

Test Plan:
PASS - deb doesn't install anything to /etc/pmon.d/
PASS - rpm files unchanged
PASS - AIOSX unlocked-enabled-available
PASS - Standard 2+2 unlocked-enabled-available

Story: 2010211
Task: 46306

Depends-On: https://review.opendev.org/c/starlingx/metal/+/855095

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I086db0750df5626d2a8ba1010153ce4f45535ca5
2022-09-26 13:41:40 +00:00
Charles Short 3935abf187 mtcAgent: Run in active mode
Run the mtcAgent in active mode by default. This was done
because mtcAgent was observed to cause increased CPU load
under Debian.

Story: 2009964
Task: 46202

Test-Plan
PASS Build playbookconfig package
PASS Boot ISO
PASS Bootstrap simplex
PASS Check for running mtcAgent
PASS Install and provision CentOS 2+3 Standard System

Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: If4278ab6e14cd30c995ce5004004fab955ad23eb
2022-09-13 21:38:50 +00:00
Davi Frossard 646192989d Remove sm-watchdog residues
Due to the change
bd9e560d4b
which removed the sm-watchdog, we also need to remove its
residues from the kickstart config.

Story: 2010087
Task: 46007

Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Change-Id: I17911773ec4db1549df32a77acd43cd4615b28ee
2022-09-01 12:35:06 +00:00
Leonardo Fagundes Luz Serrano a5e7a108f5 Duplicate pmon.d conf files to another location
Created a duplicate install of /etc/pmon.d/*.conf files
to /usr/share/starlingx/pmon.d/
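An illustrative debhelper install fragment for such a duplicate install (the actual package .install file contents are assumed, not quoted):

```shell
# Illustrative debian/<pkg>.install fragment: dh_install copies each
# source to every listed destination, so one conf file can be staged
# in both the runtime path and the selection path.
etc/pmon.d/*.conf    etc/pmon.d
etc/pmon.d/*.conf    usr/share/starlingx/pmon.d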

This is part of an effort to allow pmon conf files
to be selected at runtime by kickstarts.

Test Plan:
PASS: duplicate conf on deb

Story: 2010211
Task: 46112

Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Ie07c1bfa370da5b2ec71fe3fce948d59be1dd098
2022-08-26 16:21:18 -03:00
Andy Ning 162398acbc Add pmon configuration file for sssd
This is part of the change to replace nslcd with sssd to
support multiple secure ldap backends.

This change added pmon configuration file for sssd so that it
is monitored by pmon.

Test Plan on Debian (SX and DX):
PASS: Package build, image build.
PASS: System deployment.
PASS: After controller is unlocked, sssd is running.
PASS: ldap user creation by ldapadduser and ldapusersetup.
PASS: ldap user login on console.
PASS: ldap user remote login by oam IP address:
      ssh <ldapuser>@<controller-oam-ip-address>
PASS: ldap user login by local ldap domain within controllers:
      ssh <ldapuser>@controller
PASS: For DX system, same ldap functions still work properly after
      swact.
PASS: Kill sssd process, verify that it is brought up by pmon.

Story: 2009834
Task: 46064
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I701a4cbbda0f900dafd0456aad63132b62d8424f
2022-08-24 14:42:25 -04:00
Eric MacDonald 038eb198fd Re-enable sensor suppression support in Mtce Hardware Monitor
Sensor and sensorgroup suppression was temporarily disabled in
Debian while System Inventory was modified to align API types
with database types.

That update is now merged so this update removes the Debian only
gate on sensor and sensorgroup suppression.

Test Plan:

PASS: Verify Debian build and install
PASS: Verify CentOS build and install
PASS: Verify multiple individual sensor suppression/unsuppression
PASS: Verify sensorgroup suppression/unsuppression
PASS: Verify host degrade and alarm mgmt for each above 2 cases

Story: 2009968
Task: 45964
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: I34abf6bd7c72df2f7da743e4f20300956248c6d7
2022-08-06 00:02:29 +00:00
Eric MacDonald f7f552ad8e Debian: Fix mtcAgent segfault on SM host state change requests
The mtcAgent communicates with Service Management using libEvent.

The host state change notification requests are all blocking
requests. Both the common and service manager handlers are
freeing the object. This double free results in a segmentation
fault with the newer version of libEvent in Debian.

The bug is fixed by removing the free in the service handler
to allow the dispatch handler to manage the object free as it
does for other blocking requests for other services.

Test Plan:

PASS: Verify mtcAgent does not crash on SM state change request
PASS: Verify all blocking state change requests
PASS: Verify no memory leak (before ; request stress ; after)

Regression:

PASS: Verify Debian Build and Install (duplex/duplex)
PASS: Verify CentOS Build and Patch (duplex)
PASS: Verify CentOS Swact
PASS: Verify Logging

Story: 2009968
Task: 45675
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Iad27a0e77cb9d2233a2f2e1b6f8216b93964335b
2022-06-26 20:18:20 +00:00
Jiping Ma c031a990f2 Debian: modify crashDumpMgr to adapt to the vmcore name format.
This commit modifies crashDumpMgr to support the current vmcore name
format for debian.

They are dmesg.202206101633 and dump.202206101633 in Debian.
They are vmcore-dmesg.txt and vmcore in CentOS.
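The naming distinction can be sketched as follows (patterns taken from the message above; the helper name is hypothetical):

```shell
# Classify a crash directory's dump file by its naming convention:
# Debian kdump-tools writes dump.<timestamp>, CentOS writes vmcore.
dump_style() {
    case "$1" in
        dump.[0-9]*) echo debian ;;
        vmcore)      echo centos ;;
        *)           echo unknown ;;
    esac
}

dump_style dump.202206101633   # -> debian
dump_style vmcore              # -> centos
```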

Test Plan:
PASS: Image builds successfully.
PASS: vmcore are put to /var/log/crash successfully.
PASS: Create dump files manually in /var/crash with the format of
      CentOS, then run the crashDumpMgr.
PASS: Create dump files manually in /var/crash with the format of
      Debian, then run the crashDumpMgr.

Story: 2009964
Task: 45629

Depends-On: https://review.opendev.org/c/starlingx/integ/+/845883

Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
Change-Id: Ic540f7004a4fffd3ce7c008968ac10dca4d1c4d0
2022-06-17 11:31:29 -04:00
Eric MacDonald aaf9d08028 Mtce: Fix bmc password fetch error handling
The mtcAgent process sometimes segfaults while trying to fetch
the bmc password from a failing barbican process.

With that issue fixed, the mtcAgent sends the bmc access
credentials to the hardware monitor (hwmond) process, which
then segfaults for a similar reason.

In cases where the process does not segfault but also does not
get a bmc password, the mtcAgent will flood its log file.

This update

 1. Prevents the segfault case by properly managing acquired
    json-c object releases. There was one in the mtcAgent and
    another in the hardware monitor (hwmond).

    The json_object_put object release api should only be called
    against objects that were created with very specific apis.
    See new comments in the code.

 2. Avoids log flooding error case by performing a password size
    check rather than assume the password is valid following the
    secret payload receive stage.

 3. Simplifies the secret fsm and error and retry handling.

 4. Deletes useless creation and release of a few unused json
    objects in the common jsonUtil and hwmonJson modules.

Note: This update temporarily disables sensor and sensorgroup
      suppression support for the debian hardware monitor while
      a suppression type fix in sysinv is being investigated.

Test Plan:

PASS: Verify success path bmc password secret fetch
PASS: Verify secret reference get error handling
PASS: Verify secret password read error handling
PASS: Verify 24 hr provision/deprov success path soak
PASS: Verify 24 hr provision/deprov error path soak
PASS: Verify no memory leak over success and failure path soaking
PASS: Verify failure handling stress soak ; reduced retry delay
PASS: Verify blocking secret fetch success and error handling
PASS: Verify non-blocking secret fetch success and error handling
PASS: Verify secret fetch is set non-blocking
PASS: Verify success and failure path logging
PASS: Verify all of jsonUtil module manages object release properly
PASS: Verify hardware monitor sensor model creation, monitoring,
             alarming and relearning. This test requires suppress
             disable in order to create sensor groups in debian.
PASS: Verify both ipmi and redfish and switch between them with
             just bm_type change.
PASS: Verify all above tests in CentOS
PASS: Verify over 4000 provision/deprovision cycles across both
             failure and success path handling with no process
             failures

Closes-Bug: 1975520
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ibbfdaa1de662290f641d845d3261457904b218ff
2022-06-01 15:21:05 +00:00
Zuul 9af975fbdb Merge "Fix pmon scripts path (Debian)" 2022-03-17 23:32:09 +00:00
Roberto Luiz Martins Nogueira bb06207bd2 debian: correct bindir for maintenance services - mtce
The maintenance (mtce) services were being deployed to
/usr/local/bin instead of /usr/bin, detected in Debian 11
(bullseye). Without this patch the ocf script will fail to run.

Test Plan:
PASS Build package
PASS Build ISO
PASS Bootstrap in VM
PASS Fresh new build

Story: 2009101
Task: 44699

Signed-off-by: Roberto Luiz Martins Nogueira <robertoluiz.martinsnogueira@windriver.com>
Change-Id: I60471ff51e9e9770de41f67ee1f48a08408eec7d
2022-03-10 11:13:47 +00:00
Lucas Cavalcante d39f461031 Fix pmon scripts path (Debian)
Puppet expects pmon-* executables to be found at /usr/local/sbin,
therefore Debian should install these files at the correct location.

Test Plan:

PASS: Unlock controller (Debian)
SKIPPED: Unlock controller (Centos)

Story: 2009101
Task: 44711
Change-Id: I5abe5a4c79b58c0a58649f74f54475cca8d29593
Signed-off-by: Lucas Cavalcante <lucasmedeiros.cavalcante@windriver.com>
2022-03-09 11:48:36 -03:00
Matheus Machado Guilhermino 360a344370 Fix remaining failing mtce services on Debian
Modified mtce to address the following
failing services on Debian:
crashDumpMgr.service
fsmon.service
goenabled.service
hostw.service
hwclock.service
mtcClient.service
pmon.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are behaving as desired
PASS: Services work as expected on CentOS
PASS: Bootstrap and host-unlock successful on CentOS

Story: 2009101
Task: 44323

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: Ie61cedac24f84baea80cab6a69772f8b2e9e1395
2022-01-25 12:10:39 -03:00
Matheus Machado Guilhermino 4c8abe18d3 Fix failing mtce services on Debian
Modified mtce and mtce-control to address the following
failing services on Debian:
hbsAgent.service
hbsClient.service
hwmon.service
lmon.service
mtcalarm.service
mtclog.service
runservices.service

Applied fix:
- Included modified .service files for debian
directly into the deb_folder.
- Changed the init files to account for the different
locations of the init-functions and service daemons
on Debian and CentOS
- Included "override_dh_installsystemd" section
to rules in order to start services at boot.

Test Plan:

PASS: Package installed and ISO built successfully
PASS: Ran "systemctl list-units --failed" and verified that the
services are not failing
PASS: Ran "systemctl status <service_name>" for
each service and verified that they are active

Story: 2009101
Task: 44192

Signed-off-by: Matheus Machado Guilhermino <Matheus.MachadoGuilhermino@windriver.com>
Change-Id: I50915c17d6f50f5e20e6448d3e75bfe54a75acc0
2022-01-14 10:50:09 -03:00
Zuul 13d18b98ad Merge "Reduce log rates for daemon-ocf" 2021-11-08 16:39:50 +00:00
Tracey Bogue 0551c665cb Add Debian packaging for mtce packages
Some of the code used TRUE instead of true which did not compile
for Debian. These instances were changed to true.
Some #define constants generated narrowing errors because their
values are negative in a 32 bit integer. These values were
explicitly casted to int in the case statements causing the errors.

Story: 2009101
Task: 43426

Signed-off-by: Tracey Bogue <tracey.bogue@windriver.com>
Change-Id: Iffc4305660779010969e0c506d4ef46e1ebc2c71
2021-10-29 09:17:00 -05:00
Delfino Curado 366b68d3c7 Add option --include-backup to wipedisk
The option --include-backup offers the possibility to wipe the
directory /opt/platform-backup by ignoring the "protected" partition
GUID.

Test Plan:

PASS: Verify that wipedisk without parameters keeps the
contents of /opt/platform-backup
PASS: Verify that wipedisk with parameter --include-backup
removes the contents of /opt/platform-backup

Story: 2009291
Task: 43719
Signed-off-by: Delfino Curado <delfinogomes.curadofilho@windriver.com>
Change-Id: I1a7c0b284a4c229d6ea59433fd7db296745ead2f
2021-10-22 11:58:39 -04:00
jmusico 5138cb12e4 Reduce log rates for daemon-ocf
This change demotes a few info logs to debug level.
To be able to check these logs, the HA_debug=1 variable shall be
added to each process ocf script.

Test Plan:

PASS: Verify that selected logs to be changed to debug are not logged
as info anymore
PASS: Verify after enabling debug level logs these logs are correctly
logged as debug

Failure Path:

PASS: Verify logs are not logged if variable is removed or set to 0

Regression:

PASS: Verify system install
PASS: Verify all log levels, other than debug, are still being
generated (related to task 43606)

Story: 2009272
Task: 43728

Signed-off-by: jmusico <joaopaulotavares.musico@windriver.com>
Change-Id: Ie58683054fd6e60ee5ae496cb823d9ae956251cd
2021-10-21 21:54:42 +00:00
M. Vefa Bicakci 2d25f71f2a pmon.h: Ensure compat. with v5.10 kernel
The v5.10 kernel no longer guards the task_state_notify_info data
structure with #ifdef CONFIG_SIGEXIT, which causes a
redefinition-related compilation error. Work around this by checking for
the existence of the PR_DO_NOTIFY_TASK_STATE macro, and only define the
PR_DO_NOTIFY_TASK_STATE and the task_state_notify_info structure if the
kernel does not do so.

Story: 2008921
Task: 42915

Change-Id: I4bb499e2b52e20542f202dea1c2c55d88bb8ba61
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
2021-07-29 17:36:31 -04:00
Eric MacDonald 74bfeba7d3 Increase maximum preserved crash dump vmcore file size to 5Gi
The current crashDumpMgr service has several filesystem
protection methods that can result in the auto deletion
of a crashdump vmcore file. One is a hard cap of 3Gi.

This max vmcore size is too small for some applications.
Crash dump vmcore files can get large on servers that have
a lot of memory and large applications.

This update modifies the crashDumpMgr service file
max_size override to 5Gi.
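As an illustrative fragment only (the option spelling inside the crashDumpMgr service file is assumed here, not quoted from the source):

```shell
# Illustrative crashDumpMgr.service override fragment.
# 5Gi = 5 * 1024 * 1024 * 1024 = 5368709120 bytes
ExecStart=/usr/sbin/crashDumpMgr --max-size=5368709120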

Test Plan:

PASS: Verify change functions as expected
PASS: Verify change is inserted after patch apply
PASS: Verify crash dump under-size threshold handling
PASS: Verify crash dump over-size threshold handling
PASS: Verify change is reverted after patch removal

Change-Id: I867600460ba9311818ace466986603f5bffe4cd7
Closes-Bug: 1936976
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2021-07-21 01:46:30 +00:00
Zuul 469cc3ba06 Merge "Clear bmc alarm over mtcAgent process restart for ALL system types" 2021-06-15 20:44:12 +00:00