13 Commits

Author SHA1 Message Date
Eric MacDonald 7d8be4bc1f Add auto-versioning to starlingx/metal mtce packages
This update makes use of the PKG_GITREVCOUNT variable
to auto-version the mtce packages in this repo.

Change-Id: Ifb4da4570e0261bbdcf0d7af79b8add7cfc133ac
Story: 2006166
Task: 39822
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2020-05-21 15:18:43 -04:00
Kristine Bujold bee31d98c8 Remove wrs-guest-heartbeat SDK Module
With the StarlingX move to supporting pure upstream OpenStack, the
majority of the SDK Modules are related to functionality no longer
supported. The remaining SDK Modules will be moved to StarlingX
documentation.

Story: 2005275
Task: 30565

Change-Id: Ifc560a6865d045ab3bd93923811aeb5f8ac7f030
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2019-04-17 13:38:18 -04:00
Eric MacDonald 7e8be89143 Make Mtce default to Simplex system type if label is missing
This update refactors the daemon_system_type function so that it
returns a SIMPLEX system type if it is unable to properly
find and parse the system_mode/system_type from platform.conf.

This is needed for Ansible Bootstrap Deployment where mtcAgent
and mtcClient need to run and function as they would in a
simplex system prior to the system type being added to the
platform.conf file.
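
The fallback described above can be illustrated with a minimal Python sketch. This is not the mtce implementation (which is C++); the function name `parse_system_mode` and the set of accepted mode values are assumptions made for illustration only.

```python
# Hypothetical sketch of the "default to simplex" fallback.
DEFAULT_MODE = "simplex"
KNOWN_MODES = {"simplex", "duplex", "duplex-direct"}  # assumed value set


def parse_system_mode(conf_text: str) -> str:
    """Return system_mode from platform.conf-style text, else simplex."""
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "system_mode":
            mode = value.strip()
            # An unparsable value falls back to the default as well.
            return mode if mode in KNOWN_MODES else DEFAULT_MODE
    # Label missing entirely, e.g. early in a bootstrap deployment.
    return DEFAULT_MODE
```

With this sketch, an empty file or a file without the label yields `simplex`, which matches the behavior the commit describes for the bootstrap case.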

Change-Id: Ib0130f3559ee3aa8d8d8203ea59d4896a571944f
Story: 2004695
Task: 28714
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-04 14:15:40 +00:00
Eric MacDonald ff8ef3ea8a Change Mtce token endpoint lookup to be 'platform'.
The maintenance token request's response parser looks for the
nova compute endpoint, a day one implementation from when
mtce actually managed nova. That has long since changed but
this endpoint lookup remained.

In the new containerized environment the nova compute
endpoint is not always present and when it's not mtce
fails to get its token.

Since mtce needs the token for communication with sysinv
this update changes the endpoint lookup type to 'platform'
to match that of sysinv.
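
A rough Python sketch of a lookup by service type follows. It assumes a Keystone v3 style catalog (a list of services, each with a `type` and a list of `endpoints`); the helper name `find_endpoint`, the `compute` type string for nova, and the sample URL are illustrative assumptions, not mtce code.

```python
from typing import Optional


def find_endpoint(catalog: list, service_type: str,
                  interface: str = "internal") -> Optional[str]:
    """Return the URL of the first matching endpoint, or None if absent."""
    for service in catalog:
        if service.get("type") != service_type:
            continue
        for endpoint in service.get("endpoints", []):
            if endpoint.get("interface") == interface:
                return endpoint.get("url")
    return None


# Hypothetical catalog with no compute service present.
sample_catalog = [
    {
        "type": "platform",
        "name": "sysinv",
        "endpoints": [
            {"interface": "internal", "url": "http://example.invalid:6385/v1"},
        ],
    },
]

compute_url = find_endpoint(sample_catalog, "compute")    # None here
platform_url = find_endpoint(sample_catalog, "platform")  # sysinv URL
```

Looking up the `platform` type succeeds whether or not a compute service is listed, which is the dependency this change removes.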

Change-Id: I389b64d345e47f7d7bc062671da7c7cc51ac398f
Story: 2004695
Task: 29213
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-30 12:55:55 -05:00
Eric MacDonald 3a5c578355 Mtce: Add Thresholded Maintenance Enable Recovery support
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.

When a host reaches a particular failure mode's max failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way with no further recovery action until
it is manually locked and unlocked.

The thresholded Enable failure causes are

 Configuration Failure ....... threshold:2 retry interval:30 secs
 In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
 Start Host Services Failure . threshold:2 retry interval:30 secs
 Heartbeat Soak Failure ...... threshold:2 retry interval:10 mins

This update refactors the old auto recovery for AIO SX into this
more generic framework.
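
The thresholded behavior can be sketched in Python as below. The class and method names are hypothetical, and the reading that the second back-to-back failure (threshold 2) disables auto recovery is an assumption drawn from the description above, not from the mtce source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryPolicy:
    threshold: int            # back-to-back failures before giving up
    retry_interval_secs: int  # wait before the next enable attempt


# Values taken from the table in this commit message.
POLICIES = {
    "config": RecoveryPolicy(2, 30),
    "goenabled": RecoveryPolicy(2, 30),
    "host_services": RecoveryPolicy(2, 30),
    "heartbeat_soak": RecoveryPolicy(2, 600),
}


class EnableRecovery:
    """Hypothetical per-host tracker of enable failures."""

    def __init__(self) -> None:
        self.failures = {cause: 0 for cause in POLICIES}
        self.disabled = False  # True models 'unlocked-disabled-failed'

    def on_enable_failure(self, cause: str) -> Optional[int]:
        """Return seconds until retry, or None once recovery is disabled."""
        self.failures[cause] += 1
        if self.disabled or self.failures[cause] >= POLICIES[cause].threshold:
            self.disabled = True
            return None
        return POLICIES[cause].retry_interval_secs

    def on_lock_unlock(self) -> None:
        """A manual lock and unlock re-arms auto recovery."""
        self.failures = {cause: 0 for cause in POLICIES}
        self.disabled = False
```

In this sketch a first Configuration Failure schedules a retry in 30 seconds, a second one leaves the host disabled, and only `on_lock_unlock` clears that state.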

Story: 2003576
Task: 24905

Test Plan:

PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling

PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)

PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message

Regression:

PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT

Corner Cases:

PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.

Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-12 08:11:36 -05:00
Eric MacDonald dc531dc815 Fix mtce guest build failure
A recent update to stx-metal/mtce-common removed a daemon_config
structure member that the stx-nfv/mtce-guest git depends on.
This was not detected during UT of the mtce-common change because
of a missing build dependency that should force a rebuild of the
mtce guest.

Delivering the code fix to unblock the community.
Will deliver the build dependency change shortly.

Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20
Closes-Bug: 1804579
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-22 10:26:18 -05:00
Jim Gauld 6a5e10492c Decouple Guest-server/agent from stx-metal
This decouples the build and packaging of guest-server, guest-agent from
mtce, by splitting guest component into stx-nfv repo.

This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.

Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.

The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
  service-mgmt, sm, and sm-api

mtce-common:
- contains common and daemon shared source utility code

mtce-common-dev:
- based on mtce-common, contains devel package required to build
  mtce-guest and mtce
- contains common library archives and headers

mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
  maintenance, mtclog, pmon, public, rmon

mtce-guest:
- contains guest component guest-server, guest-agent

Story: 2002829
Task: 22748

Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
2018-09-18 17:15:08 -04:00
Eric MacDonald 316032b904 Mtce: Improve non-blocking http request dispatch
Maintenance is seen to intermittently fail Swact requests early
after initial system provisioning, without logging an error
reason, only to always succeed later on.

The issue is difficult to reproduce so this update adds extra
logging to this code path and implements a speculative fix.

The event_base_loop call's non-zero return code is never being
logged. The libevent documentation states that this API will
return 1 while the target has not yet provided any data.

Theory is, because the call is local, that normally it returns
with data even on the first dispatch case. However, during early
system configuration, when the system is busy, that first dispatch
does not complete immediately like it normally does later on.

Speculation is, instead it returns a 1 stating retry but the
existing code path treats that as a failure.

This update modifies the code to return a PASS if the command
dispatch returns a 1 while the error case of -1 gets enhanced
logging and continues to be treated as a failure.
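
The return code handling described above can be summarized with a small Python sketch. It only models the mapping from dispatch return code to result as this commit message states it; `classify_dispatch` and the string results are hypothetical names, and the real code is C++ built on libevent.

```python
import logging

log = logging.getLogger("http_dispatch_sketch")

PASS, FAIL = "PASS", "FAIL"


def classify_dispatch(rc: int) -> str:
    """Map a dispatch return code to a result, per the commit message."""
    if rc == 0:
        return PASS  # dispatch completed with data
    if rc == 1:
        # Previously treated as a failure; now taken to mean "not done yet".
        log.info("dispatch returned 1; treating as in progress")
        return PASS
    # -1 (and anything unexpected) stays a failure, with extra logging.
    log.error("dispatch failed (rc=%d)", rc)
    return FAIL
```

Treating unexpected codes other than -1 as failures is a choice made for this sketch; the commit message only discusses 1 and -1.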

Test Plan:
PASS: Swact 5 times
PASS: Lock/Unlock Host
PASS: Large System DOR

Related Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Change-Id: I19b22e07d3224b2e9dd3f3569ecbe9aed7d9402f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-09-10 19:02:42 +00:00
Eric MacDonald 74c5f89ab4 Mtce: Make Heartbeat Failure Action Configurable
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.

This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can configure
this new parameter with any one of the following 4 actions in order of
decreasing impact.

   fail - Host is failed and gracefully recovered.
        - Current Network specific alarms continue to be raised/cleared.
          Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
        - Current Network specific alarms continue to be raised/cleared.
        - heartbeat degrade reason is cleared as are the alarms when
          heartbeat responses resume.
  alarm - The only indication of a heartbeat failure is by alarm.
        - Same set of alarms as in above action cases
        - Only in this case no degrade, no failure, no reboot/reset
   none - Heartbeat is disabled ; no multicast heartbeat message is sent.
        - All existing heartbeat alarms are cleared.
        - The heartbeat soak as part of the enable sequence is bypassed.

The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
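
As an illustration, the four actions and their effects could be modeled in Python as below. The enum, the `Effects` fields, and `parse_action` are hypothetical names; the effect flags are one reading of the list above, not a transcription of the mtce code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HeartbeatFailureAction(Enum):
    FAIL = "fail"
    DEGRADE = "degrade"
    ALARM = "alarm"
    NONE = "none"


@dataclass(frozen=True)
class Effects:
    fails_host: bool         # host is failed and gracefully recovered
    degrades_host: bool      # host is degraded while failing heartbeat
    raises_alarms: bool      # heartbeat alarms are raised/cleared
    heartbeat_enabled: bool  # heartbeat messages are sent at all


# Ordered by decreasing impact, as in the commit message.
EFFECTS = {
    HeartbeatFailureAction.FAIL: Effects(True, False, True, True),
    HeartbeatFailureAction.DEGRADE: Effects(False, True, True, True),
    HeartbeatFailureAction.ALARM: Effects(False, False, True, True),
    HeartbeatFailureAction.NONE: Effects(False, False, False, False),
}


def parse_action(value: Optional[str]) -> HeartbeatFailureAction:
    """Parse a heartbeat_failure_action value; missing means legacy 'fail'."""
    if value is None or not value.strip():
        return HeartbeatFailureAction.FAIL
    # Raises ValueError for anything outside the four known actions.
    return HeartbeatFailureAction(value.strip().lower())
```

Because the setting is system wide, a single parsed value would apply to every host, including the Multi-Node Failure Avoidance case.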

This update also

 1. Removes redundant inservice failure alarm for MNFA case in support
    of degrade only action. Keeping it would make that alarm handling
    case unnecessarily complicated.
 2. Removes the no longer used 'hbs calibration' code (cleanup).
 3. Performs a small amount of heartbeat logging cleanup.

Test Plan:
PASS:    fail: Verify MNFA and recovery
PASS:    fail: Verify Single Host heartbeat failure and recovery
PASS:    fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS:   alarm: Verify MNFA and recovery
PASS:   alarm: Verify Single Host heartbeat failure and recovery
PASS:   alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS:    none: Verify heartbeat disable, fail ignore and no recovery
PASS:    none: Verify Single Host heartbeat ignore and no recovery
PASS:    none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)

Depends-On: https://review.openstack.org/601264
Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-09-10 13:03:30 -04:00
Eric MacDonald 82e851d651 Mtce: Make Multi-Node Failure Avoidance Configurable
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or after a set period of time.

This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the hosts threshold set to 3 and timeout set to 100 secs.

This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.

The new service parameters are listed under platform:maintenance and
are displayed with the following command

> system service-parameter-list

mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that are
required to fail heartbeat as a group; within the heartbeat
failure window (heartbeat_failure_threshold) after which maintenance
activates MNFA Mode.

This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.

mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drops below the mnfa_threshold or this
timer expires. The MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout making the
deactivation criteria solely host based.

This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid timer range from 100 to 86400 secs (1 day).
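
The ranges and the deactivation rule described above can be sketched in Python as follows. The function names are hypothetical; the numeric limits and defaults come from this commit message, and the real validation lives elsewhere in the platform.

```python
MNFA_THRESHOLD_DEFAULT = 2
MNFA_THRESHOLD_RANGE = range(2, 101)    # 2 to 100 hosts inclusive
MNFA_TIMEOUT_DEFAULT = 0                # 0 means no timeout
MNFA_TIMEOUT_RANGE = range(100, 86401)  # 100 to 86400 secs inclusive


def valid_threshold(value: int) -> bool:
    return value in MNFA_THRESHOLD_RANGE


def valid_timeout(value: int) -> bool:
    # Zero is accepted as the special "no timeout" value.
    return value == 0 or value in MNFA_TIMEOUT_RANGE


def mnfa_should_deactivate(failing_hosts: int, elapsed_secs: int,
                           threshold: int = MNFA_THRESHOLD_DEFAULT,
                           timeout: int = MNFA_TIMEOUT_DEFAULT) -> bool:
    """MNFA ends on whichever comes first: host count drop or timer expiry."""
    if failing_hosts < threshold:
        return True
    return timeout != 0 and elapsed_secs >= timeout
```

With the default timeout of 0, only the failing host count dropping below the threshold ends MNFA mode in this sketch.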

Test Plan:

PASS - Verify duplex and 4 compute DOR
PASS - Verify default MNFA - 1 inactive controller and 4 computes
PASS - Verify default MNFA - 4 computes
PASS - Verify default MNFA - 1 active controller and 3 computes and failed host
PASS - Verify Single host heartbeat failure handling - fail host
PASS - Verify Multi Node failure below mnfa_threshold - fail hosts
PASS - Verify MNFA handling with timeout of zero and threshold of 3
PASS - Verify MNFA timeout handling with timeout set at 100 sec
PASS - Verify MNFA service parameter listing, default value and mtc.ini
PASS - Verify MNFA service parameter change and inservice apply
PASS - Verify MNFA timeout service parameter change from value to 0
PASS - Verify MNFA timeout service parameter change from 0 to in-range value
PASS - Verify MNFA service parameter out of range change handling
PASS - Verify MNFA timeout change from No-Timeout to 100 sec (while active)

DocImpact
Story: 2003576
Task: 24903

Change-Id: Ib56dd79b38c3726e042cf34aae361f229c89940b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-08-31 15:35:08 -04:00
Alex Kozyrev 00520ac78c Moving PMON script for NTP from MTCE to Puppet
Introduction of the PTP service requires the NTP service to be disabled.
Process monitoring of the NTP daemon must be turned off as well.
There is no way to start/stop process monitoring from MTCE.
Puppet can check NTP status at startup and enable/disable monitoring.
So the NTP-related PMON script needs to move from MTCE to Puppet.
This is the first step: removing NTP references from MTCE.

Change-Id: I1ca6045af8c5169220b7332d45b843fdb4960f01
Story: 2002935
Task: 24520
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2018-08-09 16:04:57 -04:00
Kam Nasim 5e725a7a0a Multi-Region: Support shared LDAP service
Decouple NSLCD from the open-ldap SM service and manage it by PMOND
instead. This is needed because in the Shared LDAP case, we deprovision
the open-ldap service on the Secondary Region which renders NSLCD
unmanaged.

Additionally, we allow the Secondary Region or Sub Clouds to bind
anonymously, but still need to support LDAP read operations in these
regions such as ldapfinger or lsldap. For this purpose, the ldapscripts
runtime library has been modified to allow anonymous binds during LDAP
search operations.

Change-Id: Ic01a8097e8124348d493c9e0c82fda94700e28e2
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-06-28 15:49:45 -04:00
Dean Troyer 18922761a6 StarlingX open source release updates
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2018-05-31 07:36:43 -07:00