199 Commits

Author SHA1 Message Date
OpenDev Sysadmins f9ab1df900 OpenDev Migration Patch
This commit was bulk generated and pushed by the OpenDev sysadmins
as a part of the Git hosting and code review systems migration
detailed in these mailing list posts:

http://lists.openstack.org/pipermail/openstack-discuss/2019-March/003603.html
http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004920.html

Attempts have been made to correct repository namespaces and
hostnames based on simple pattern matching, but it's possible some
were updated incorrectly or missed entirely. Please reach out to us
via the contact information listed at https://opendev.org/ with any
questions you may have.
2019-04-19 19:52:32 +00:00
Dean Troyer 4cdbe5fa26 Update .gitreview for f/ceph_mimic_upgrade
Change-Id: I55e5adec7a639228e76bcdb4a3203f3315020b4c
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2019-01-30 08:13:39 -06:00
Jack Ding 9ececd7623 Remove nova storage aggregates
Remove the automated creation of storage host aggregates and host
population in inventory.

Story: 2004607
Task: 29068
Change-Id: I4a74a1ee1f8b3bc8dc6293a5c971d9c7ed1442b5
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2019-01-25 09:56:09 -05:00
Zuul 9c271569d6 Merge "Clean up and standardize landing pages" 2019-01-23 14:23:28 +00:00
Zuul 887bd34471 Merge "Add NTP server monitoring as a collectd plugin" 2019-01-11 21:00:05 +00:00
Eric MacDonald f7031cf5fb Add NTP server monitoring as a collectd plugin
This update disables rmon NTP monitoring, which is now done
as a collectd plugin delivered by the Depends-On update below.

Story: 2002823
Task: 22859

Depends-On: https://review.openstack.org/#/c/628685/
Change-Id: I736703542c8a6ba3dd9e9db2d6fb7ccbdc906643
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-11 09:15:58 -05:00
Zuul b1a7d73ee8 Merge "Automatically create cgts-vg volume group on worker nodes" 2019-01-10 22:36:09 +00:00
Kristal Dale 3522ead301 Clean up and standardize landing pages
doc index.rst:
1. Update intro sentence to read as a complete sentence
2. Remove unused toctree
3. Correct heading levels (impacting side nav and correct rendering
of content)
4. Remove "Indices and Tables" section: genindex page not used,
search searches only index (not useful here)

api-ref index.rst:
1. Update intro sentence to read as a complete sentence
2. Update text around search link for consistency (move to
follow intro)
3. Add heading before toctree for consistency with other pages

releasenotes index.rst:
1. Standardize page title reST markup
2. Remove search (make consistent with other openstack release
note pages)

Story: 2004737
Task: 28805

Change-Id: I388cc5d69db56e6e94bf034ece2478933c9d9c1e
Signed-off-by: Kristal Dale <kristal.dale@intel.com>
2019-01-09 09:34:38 -08:00
Mingyuan Qi 4273c21af7 Add devstack plugin
Add maintenance services as stx-metal plugin.
Enable services by both node type and metal components.

Target:
Mtce services are installed and active(running) in devstack.

Story: 2003161
Task: 23296

Change-Id: I2123c64fb1b70bd135e8945d7ff7f4f3691bdbcc
Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com>
2019-01-09 19:11:18 +08:00
Wei Zhou fe397d5d27 Automatically create cgts-vg volume group on worker nodes
This commit creates cgts-vg volume group automatically on worker
nodes by kickstart. This cgts-vg volume group reserves space for
log-lv, scratch-lv, docker-lv and ceph-mon-lv.

This commit reserves space in cgts-vg volume group for 30G
docker-lv and 20G ceph-mon-lv for AIO configuration.

Story: 2004520
Task: 28663
Change-Id: Ic77d00c354da1070e2c4c2da4545d70ab4a93d91
Signed-off-by: Wei Zhou <wei.zhou@windriver.com>
2019-01-07 22:03:03 -05:00
Eric MacDonald 64c1d400b9 Implement collectd startup in manifest apply post stage
Starting collectd too early in the manifest apply is seen
to occasionally fail because the hostname resolution that
FQDNLookup depends on is not yet configured.

Since influxdb is used by collectd and is a controller-only
service, this update also moves it to the manifest apply
post stage and filters it out of non-controller load types.

This issue is fixed by the following multi-git changes.

stx-metal: This update.
   Filter influxdb out of storage and compute only loads.
   No real inter git merge dependency

stx-integ:
   Add startup Before=pmond dependency

stx-config:
   Move collectd config and startup to manifest apply post stage
   Move influxdb config and startup to manifest apply post stage

Test Plan:
PASS: Build iso
PASS: verify install storage system and collectd startup
PASS: Verify Storage system DOR
PASS: Verify influxdb and extensions excluded in non-controller loads
PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
PASS: Verify collectd pmond process monitoring and recovery
PASS: Verify influxdb pmond process monitoring and recovery

PEND: Verify collectd statistics storage and fetch to/from influxdb
PEND: Install AIO DX and verify collectd and influxdb startup

Change-Id: I8c71f36978620e0650062cc848bfb9d85f6810b2
Closes-Bug: 1797909
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-02 09:55:42 -05:00
zhipengl 68ab0560cf Fix trivial issue found during code review for hbs related code
1. Build-iso - PASS
2. Install iso and unlock all hosts - PASS
3. Force reboot on unlocked host to verify heartbeat failure detection
and graceful recovery - PASS
4. Verify hbsAgent logs for unexpected logs - PASS

Change-Id: Ia4f52d3ffa52152914f3c221fa6eb860d127724b
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2018-12-27 07:56:23 +00:00
Zuul 351cc87c9c Merge "Remove version from installer" 2018-12-21 15:32:32 +00:00
Zuul 7512c6b105 Merge "Mtce: Improve robustness of heartbeat Loss reporting" 2018-12-21 14:59:11 +00:00
Zuul a4a5a86a08 Merge "Mtce: fix hbsClient active monitoring over config reload" 2018-12-21 14:32:52 +00:00
Eric MacDonald 4fb3ce1121 Mtce: Improve robustness of heartbeat Loss reporting
Closes-Bug: 1806963

In the case where the active controller experiences a
spontaneous reboot failure there is the potential for
a race condition in the new Active-Active Heartbeat
model between the inactive hbsAgent and mtcAgent
starting up on the newly active controller.

The inactive hbsAgent can report a heartbeat Loss before
SM starts up the mtcAgent. As a result, the heartbeat
failed host goes undetected.

This update modifies the hbsAgent to continue to report
heartbeat Loss at a throttled rate while the hbsAgent
continues to experience heartbeat loss of enabled monitored
hosts. This change is implemented in nodeClass.cpp.
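
The throttled reporting described above might look roughly like the following sketch. The class name, the audit-tick model and the reporting period are assumptions made for illustration; none of them are taken from nodeClass.cpp.

```cpp
#include <cassert>

// Hypothetical throttle: while heartbeat loss persists, report it on the
// first audit tick and then once every `period` ticks.
class LossReportThrottle {
public:
    explicit LossReportThrottle(unsigned period) : period_(period) {
        assert(period > 0);
    }

    // Call once per audit tick. Returns true when a Loss report should be sent.
    bool on_audit(bool loss_present) {
        if (!loss_present) {
            ticks_in_loss_ = 0;  // loss cleared; the next loss is reported at once
            return false;
        }
        const bool report = (ticks_in_loss_ % period_) == 0;
        ++ticks_in_loss_;
        return report;
    }

private:
    unsigned period_;
    unsigned ticks_in_loss_ = 0;
};
```

Because the report repeats for as long as the loss persists, a receiver that starts late (here, the mtcAgent) still sees a later report instead of missing the only one.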

Debug of this issue also revealed another undesirable race
condition and logging issue when a controller is locked. This
issue is remedied with the introduction of a control structure
'locked' state that is set on controller lock and checked in
the hbs_cluster_update utility (hbsCluster.cpp).

Two additional hbsAgent logging changes were implemented with
this update.

  1. Only print "missing peer controller cluster view" on a
     state change event. Otherwise, this becomes excessive
     whenever the inactive controller fails.
     hbsAgent.cpp

  2. Don't print the full heartbeat inventory and state banner
     with hbsInv.print_node_info on every heartbeat Loss event.
     Otherwise, this becomes excessive in larger systems.
     hbsCluster.cpp

Test Plan:
PASS: Verify hbsAgent log stream for implemented improvements.
PASS: Verify Lock inactive controller several times.
PASS: Fail inactive controller several times. verify detect.
PASS: Reboot active controller several times. verify detect.
PASS: DOR System several times. Verify proper recovery.
PASS: DOR system but prevent power-up of several hosts. Verify detect.

Change-Id: I36e6309e141e9c7844b736cce0cf0cddff3eb588
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-20 15:46:03 -05:00
Angie Wang 45da23bbce Increase the partition size for docker distribution
This increases the default docker distribution partition size from
1G to 16G. This also increases the minimum disk requirements from
130G to 145G for small disk, 170G to 185G for large disk.

Story: 2004520
Task: 28526
Change-Id: I898cfac45757ff1f9e6ce7c4928bbd9a42dca77d
Signed-off-by: Angie Wang <angie.wang@windriver.com>
2018-12-18 20:52:12 -05:00
Tao Liu 9661e49411 Change compute node to worker node personality
This update replaces compute references with worker in mtce,
kickstarts, installer and bsp files.

Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration

Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked-enabled with no alarms

Story: 2004022
Task: 27013

Depends-On: https://review.openstack.org/#/c/624452/

Change-Id: I225f7d7143d841f80459603b27b95ac3f846c46f
Signed-off-by: Tao Liu <tao.liu@windriver.com>
2018-12-13 13:08:48 -05:00
Zuul 8eb55b2b03 Merge "Mtce: Add Thresholded Maintenance Enable Recovery support" 2018-12-13 15:57:44 +00:00
Eric MacDonald 4e132af308 Mtce: fix hbsClient active monitoring over config reload
The maintenance process monitor is failing the hbsClient
process over config or process reload operations.

The issue relates to the hbsClient's subfunction being
'last-config' without pmon properly gating the active
monitoring FSM from starting until the passive monitoring
phase is complete and in the MANAGE state.

Test Plan

PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure

PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.

Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-12 13:53:18 -05:00
Zuul 373f21e5cd Merge "Set SHELL in Makefiles that use bash constructs" 2018-12-12 14:27:53 +00:00
Eric MacDonald 3a5c578355 Mtce: Add Thresholded Maintenance Enable Recovery support
This update stops trying to recover hosts that have failed the
Enable sequence after a thresholded number of back-to-back tries.

When a host reaches a particular failure mode's max failure
threshold, maintenance puts it into an 'unlocked-disabled-failed'
state and leaves it that way with no further recovery action until
it is manually locked and unlocked.

The thresholded Enable failure causes are:

 Configuration Failure ....... threshold:2 retry interval:30 secs
 In-Test GoEnabled Failure ... threshold:2 retry interval:30 secs
 Start Host Services Failure . threshold:2 retry interval:30 secs
 Heartbeat Soak Failure ...... threshold:2 retry interval:10 minutes

This update refactors the old auto recovery for AIO SX into this
more generic framework.
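
As a rough illustration of the thresholded recovery described above, the sketch below counts back-to-back enable failures per cause and stops retrying once the threshold is reached. The type and function names are hypothetical, and reading "threshold:2" as "the second back-to-back failure disables recovery" is an assumption; only the thresholds and retry intervals come from the commit message.

```cpp
#include <cassert>

// Failure causes as listed in the commit message.
enum class EnableFailure { Config, GoEnabled, HostServices, HeartbeatSoak };

struct RecoveryLimit {
    unsigned threshold;            // back-to-back failures before giving up
    unsigned retry_interval_secs;  // wait before the next enable attempt
};

inline RecoveryLimit limit_for(EnableFailure cause) {
    switch (cause) {
        case EnableFailure::Config:        return {2, 30};
        case EnableFailure::GoEnabled:     return {2, 30};
        case EnableFailure::HostServices:  return {2, 30};
        case EnableFailure::HeartbeatSoak: return {2, 600};
    }
    assert(false && "unknown failure cause");
    return {0, 0};
}

// Hypothetical per-host tracker; not the mtcAgent data structure.
class AutoRecovery {
public:
    // Records one enable failure. Returns true if a retry is still allowed.
    bool record_failure(EnableFailure cause) {
        if (disabled_) return false;
        unsigned& count = counts_[static_cast<int>(cause)];
        if (++count >= limit_for(cause).threshold) {
            disabled_ = true;  // host stays 'unlocked-disabled-failed'
            return false;
        }
        return true;
    }

    // Call after a successful enable or a manual lock/unlock.
    void reset() {
        for (unsigned& c : counts_) c = 0;
        disabled_ = false;
    }

    bool disabled() const { return disabled_; }

private:
    unsigned counts_[4] = {0, 0, 0, 0};
    bool disabled_ = false;
};
```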

Story: 2003576
Task: 24905

Test Plan:

PASS: Verify AIO DX System Install
PASS: Verify AIO SX DOR
PASS: Verify Auto recovery disabled state is maintained over AIO SX DOR
PASS: Verify Lock/Unlock recovers host from Auto recovery disabled state
PASS: Verify AIO SX Main Config Failure handling
PASS: Verify AIO SX Main Config Timeout handling
PASS: Verify AIO SX Main GoEnabled Failure Handling
PASS: Verify AIO SX Main Host Services Failure handling
PASS: Verify AIO SX Main Host Services Timeout handling
PASS: Verify AIO SX Subf Config Failure handling
PASS: Verify AIO SX Subf Config Timeout handling
PASS: Verify AIO SX Subf GoEnabled Failure Handling
PASS: Verify AIO SX Subf Host Services Failure handling

PASS: Verify AIO DX System Install
PASS: Verify AIO DX DOR
PASS: Verify AIO DX DOR ; one time active controller GoEnabled failure ; swact requested
PASS: Verify AIO DX Main First Unlock Failure handling
PASS: Verify AIO DX Main Config Failure handling (inactive ctrl)
PASS: Verify AIO DX Main one time Config Failure handling
PASS: Verify AIO DX Main one time GoEnabled Failure handling.
PASS: Verify AIO DX SUBF Inactive Controller 1 GoEnable Failure handling.
PASS: Verify AIO DX Inactive Controller 1 GoEnable Failure with recovery on retry.
PASS: Verify AIO DX Active controller Enable failure with no or locked peer controller.
PASS: Verify AIO DX Reboot Active controller with peer in auto recovery disabled state.
PASS: Verify AIO DX Active controller failure with peer in auto recovery disabled state. (vswitch process)
PASS: Verify AIO DX Active controller failure then recovery after reboot with peer in auto recovery disabled state. (goenabled)
PASS: Verify AIO DX Inactive Controller Enable Heartbeat Soak Failure handling.
PASS: Verify AIO DX Active controller unhealthy detection and handling. (degrade)
PASS: Verify AIO DX Inactive controller unhealthy detection and handling. (fail)

PASS: Verify Normal System Install
PASS: Verify Compute Enable Configuration Failure handling (wc71-75)
PASS: Verify Compute Enable GoEnabled Failure handling (recover after 1)
PASS: Verify Compute Enable Start Host Services Failure handling
PASS: Verify Compute Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Enable Heartbeat Soak Failure handling
PASS: Verify Inactive Controller Configuration Failure handling
PASS: Verify Inactive Controller GoEnabled Failure handling
PASS: Verify Inactive Controller Host Services Failure handling
PASS: Verify goEnabled failure after active controller reboot with no peer controller (C0 rebooted with C1 locked) - no SM startup
PASS: Verify auto recovery threshold number is configurable
PASS: Verify auto recovery retry interval is configurable
PASS: Verify auto recovery host state and status message

Regression:

PASS: Verify Swact behavior, over and back
PASS: Verify 5 node DOR
PASS: Verify 3 host MNFA behavior
PASS: verify in-service heartbeat failure handling
PASS: verify no segfaults during UT

Corner Cases:

PASS: Verify mtcAlive boot failure behavior. reset progression. retry forever. - sleep in config script
PASS: Verify AIO SX mtcAgent process restart while in autorecovery disabled state
PASS: Verify autorecovery disabled state is preserved over mtcAgent process restart.

Change-Id: I7098f16243caef27c5295971ef3c9de5be975755
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-12 08:11:36 -05:00
Zuul 42ad23ae83 Merge "No json_object_put() for the json_obj created by json_object_object_get_ex()." 2018-12-11 21:10:30 +00:00
Eric MacDonald 9d7a4bf92c Implement Active-Active Heartbeat as HA Improvement Fix
A few small issues were found during integration testing with SM.

This update delivers those integration tested fixes.

1. Send cluster event to SM only after the first 10 heartbeat
   pulses are received.
2. Only send inventory to hbsAgent on provisioned controllers.
3. Add new OOB SM_UNHEALTHY flag to detect and act on an SM
   declared unhealthy controller.
4. Network monitoring enable fix.
5. Fix oldest entry tracking when a network history is not full.
6. Prevent clearing local uptime for a host that is being enabled.
7. Refactor cluster state change notification logging and handling.

These fixes were both UT and IT tested in multiple labs

Change-Id: I28485f241ac47bb3ed3ec1e2a8f4c09a1ca2070a
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-10 09:57:34 -05:00
Dean Troyer 58b987239f Set SHELL in Makefiles that use bash constructs
A number of Makefiles use '[[' in their test to set
STATIC_ANALYSIS_TOOL_EXISTS.  Set SHELL=/bin/bash

Change-Id: Ie9536d7cafd518f3e65acf38ac5b30aa7536ea79
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
2018-12-07 14:09:48 -06:00
Erich Cordoba 9490a2b1fb Remove version from installer
The stx-2.0 version was removed from the required filenames.
Also, the files should now be placed in the stx-installer folder.

Story: 2004126
Task: 28336

Change-Id: I1d8667472c7dfa6d48ce120626fe202a16f41c28
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2018-12-07 11:06:47 -06:00
Yan Chen 1c38aff32a No json_object_put() for the json_obj created by json_object_object_get_ex().
It is stated in the json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271

In json-c 0.11 there is no assert that checks the ref_count, so we won't
  crash. But in json-c 0.13.1 (latest release), json_object_put checks
  the ref_count first, so mtcAgent will crash.

Test Done:
Run mtcAgent with json-c version 0.13.1 with this patch, no crash found.

Closes-Bug: 1807097

Change-Id: I35e5c1cad2e16ee0b6fc639380f1bdd3b64a7018
Signed-off-by: Yan Chen <yan.chen@intel.com>
2018-12-08 00:32:44 +08:00
Zuul 286577940f Merge "SysInv Decoupling: Create Inventory Service" 2018-12-06 21:28:41 +00:00
Eric MacDonald c0d26f5907 Revert "No json_object_put() for the json_obj created by json_object_object_get_ex()."
This reverts commit a92c543fd5.

Change-Id: I972e083ac91bd1ecd13b900b417685eda5a4add0
2018-12-06 20:40:19 +00:00
John Kung bd998017d5 SysInv Decoupling: Create Inventory Service
Create host inventory services (api, conductor and agent) and
python-inventoryclient.

The inventory service collects the host resources and provides a
REST API and client to expose the host resources.

Create plugin for integration with system configuration (sysinv)
service.

This is the initial inventory service infrastructure commit.
Puppet configuration, SM integration and host integration with
sysinv(systemconfig) changes are pending and planned to be
delivered in future commits.

Tests Performed:
 Verify the changes are inert on config_controller installation
 and provisioning.
     Puppet and spec changes are required in order to create keystone,
     database and activate inventory services.

 Unit tests performed (when puppet configuration for keystone, database
 is applied):
     Trigger host configure_check, configure signals into
         systemconfig(sysinv).

     Verify python-inventoryclient and api service:
         Disks and related storage resources are pending.
         inventory host-cpu-list/show
         inventory host-device-list/show/modify
         inventory host-ethernetport-list/show
         inventory host-lldp-neighbor-list
         inventory host-lldp-agent-list/show
         inventory host-memory-list/show
         inventory host-node-list/show
         inventory host-port-list/show

     Tox Unit tests:
         inventory: pep8
         python-inventoryclient: py27, pep8, cover, pylint

Change-Id: I744ac0de098608c55b9356abf180cc36601cfb8d
Story: 2002950
Task: 22952
Signed-off-by: John Kung <john.kung@windriver.com>
2018-12-06 13:17:35 -05:00
Yan Chen a92c543fd5 No json_object_put() for the json_obj created by json_object_object_get_ex().
It is stated in the json_object.h from version 0.11:
https://github.com/json-c/json-c/blob/json-c-0.11/json_object.h#L271

In json-c 0.11 there is no assert that checks the ref_count, so we won't
  crash. But in json-c 0.13.1 (latest release), json_object_put checks
  the ref_count first, so mtcAgent will crash.

Test Done:
Run mtcAgent with json-c version 0.13.1 with this patch, no crash found.

Closes-Bug: 1807097

Change-Id: I7f954c97804ae01f831c94a36b9dbdbb34dbf083
Signed-off-by: Yan Chen <yan.chen@intel.com>
2018-12-07 00:47:21 +08:00
zhipengl b3eee43630 Remove NSUPDATE patch for dhcp package
After discussion with Eslimi: the NSUPDATE patch disables DDNS on dhclient
because the network port 2105 used by dhclient conflicts with the same
port used by mtcClient. The port used by mtcClient is now changed from
2105 to 2118 to fix the conflict, so the patch can be removed.

Deployment test pass.

Story: 2003757
Task: 26445

Change-Id: I70559d73f51f85c840042cc4fc206fcd5bc3de27
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2018-12-05 05:41:46 +00:00
Zuul 0fe41e4184 Merge "fix the wrong code to set the terminate char" 2018-11-29 15:12:42 +00:00
zhipengl b8a9342b42 Refactor patches for openstack-aodh package
Use openstack-aodh-config package to package service and config
files for openstack-aodh package.
The openstack-aodh-config package needs to be set to the same node
filter as openstack-aodh.

Deployment test pass and related service/script file check pass.

Story: 2003768
Task: 28044

Change-Id: I454762abc1dbd6c5db639be2b1f046e23c131d91
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2018-11-29 00:12:38 +08:00
Zuul 1ee175baef Merge "Prevent early active monitoring of compute processes in AIO" 2018-11-27 23:00:17 +00:00
Zuul 536206274d Merge "Filter out Barbican from compute and storage hosts." 2018-11-27 15:29:39 +00:00
Zuul ae6ce39608 Merge "PXE Boot Server robustness" 2018-11-27 00:37:41 +00:00
Kristine Bujold 0715862fa5 PXE Boot Server robustness
This commit adds the creation of a symlink for EFI/grub.cfg. It also
allows the user to overwrite an existing directory; the user will be
prompted to confirm this action.

Closes-Bug: 1794863

Change-Id: I566a3b39c601921cd73cca3f291f43e5dc0ef626
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2018-11-26 15:40:28 -05:00
Alex Kozyrev 3aa612adab Filter out Barbican from compute and storage hosts.
Barbican is going to be running on controller node only.
So, I'm removing all Barbican rpms from compute and storage nodes.

Change-Id: Ib00d697a2c9816cc7c3f181bb0f4d298bba973bd
Story: 2003108
Task: 27700
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2018-11-26 14:36:34 -05:00
zhipengl 74c0eafffe Refactor patches for net-snmp package.
Use net-snmp-config package to package script and service file for
net-snmp-config package.
Like net-snmp, net-snmp-config will also not be installed on compute
and storage nodes.

Deployment test and ping test between VMs pass.
Config, service and script files check pass.

Story: 2003768
Task: 27586

Change-Id: I2f9d1bfdbe0b27fbd58137df8a3fd36d3053defa
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
2018-11-24 12:15:15 +00:00
Yan Chen b2290e4fd9 fix the wrong code to set the terminate char
Description:

In mtce/src/hwmon/hwmonThreads.cpp, line 266:
    ++dst_ptr = '\0' ;
should be modified as
    *(++dst_ptr) = '\0' ;
Otherwise the code is useless and will generate a compile error in
  higher version gcc.

Reproduce:

Compiling the mtce code with gcc 8.2.1 causes a compile error.
After the fix, the error is gone.
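
The corrected pattern can be shown in isolation with a minimal sketch. The helper below is hypothetical and is not the code from hwmonThreads.cpp; it only demonstrates that `*(++dst_ptr) = '\0'` advances the pointer and then stores the terminator in the buffer, whereas `++dst_ptr = '\0'` tries to assign to the pointer itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: copies at most max_len characters of src into dst and
// null-terminates the result. dst must have room for max_len + 1 bytes.
std::size_t copy_and_terminate(const char* src, char* dst, std::size_t max_len) {
    assert(src != nullptr && dst != nullptr);
    if (max_len == 0 || *src == '\0') {
        *dst = '\0';
        return 0;
    }
    char* dst_ptr = dst;
    *dst_ptr = *src++;  // dst_ptr always points at the last character written
    std::size_t copied = 1;
    while (copied < max_len && *src != '\0') {
        *(++dst_ptr) = *src++;
        ++copied;
    }
    *(++dst_ptr) = '\0';  // advance, then store the terminator in the buffer
    return copied;
}
```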

Closes-Bug: 1804599
Change-Id: I25df255fb14aa3d96c62927eeb7d3e23ae29af2b
Signed-off-by: Yan Chen <yan.chen@intel.com>
2018-11-23 02:01:45 +08:00
Eric MacDonald dc531dc815 Fix mtce guest build failure
A recent update to stx-metal/mtce-common removed a daemon_config
structure member that the stx-nfv/mtce-guest git depends on.
This was not detected during UT of the mtce-common change because
of a missing build dependency that should force a rebuild of the
mtce guest.

Delivering the code fix to unblock the community.
Will deliver the build dependency change shortly.

Change-Id: Ice08424f156ffc84e38651fbc40ebc184170eb20
Closes-Bug: 1804579
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-22 10:26:18 -05:00
Zuul abf0ff3986 Merge "Implement Active-Active Heartbeat as HA Improvement" 2018-11-21 16:42:56 +00:00
Eric MacDonald 6d0cc6a2a8 Prevent early active monitoring of compute processes in AIO
The commit shown below introduced a main loop audit that
mistakenly registers subfunction processes that are in the
waiting for /var/run/.compute_config_complete 'polling'
state during unlock enable.

Doing so inadvertently changes their monitor FSM stage
from 'Poll' to 'Manage' before configuration is complete.

Since config is not complete, the hbsClient has not initialized
its socket interface and is unable to service active monitoring
requests. This leads to quorum failure and watchdog reboot.

commit 537935bb0c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jul 9 08:36:22 2018 -0400
Reorder process restart operations to prevent pmond futex deadlock

The Fix: Don't run the audit for processes that are in the
waiting for 'polling' state.

Test Plan:

Provision AIO , verify no quorum failure and inspect logs for
correct behavior.

Change-Id: I179c78309517a34285783ee99bbb3d699915cb83
Closes-Bug: 1804318
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-21 10:04:00 -05:00
Eric MacDonald 0b922227ac Implement Active-Active Heartbeat as HA Improvement
This update introduces mtce changes to support Active-Active Heartbeating.

The purpose of Active-Active Heartbeating is to help avoid Split-Brain.

Active-Active heartbeating has each controller maintain a 5 second
heartbeat response history cache of each network for all monitored
hosts as well as the on-going health of storage-0 if provisioned and
enabled.

This is referred to as the 'heartbeat cluster history'

Each controller then includes its cluster history in each heartbeat
pulse request message.

The hbsClient, now modified to handle heartbeat from both controllers,
saves each controller's heartbeat cluster history in a local cache and
criss-crosses the data in its pulse responses.

So when the hbsClient receives a pulse request from controller-0 it
saves its reported history and then replaces that history information
in its response to controller-0 with what it saved from controller-1's
last pulse request ; i.e. its view of the system.

Controller-0, receiving a host's pulse response, saves its peer's
heartbeat cluster history so that it has a summary of heartbeat
cluster history for the last 5 seconds for each monitored network
of every monitored host in the system from both controllers'
perspectives. Same for controller-1 with controller-0's history.
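
The criss-cross exchange described above can be sketched as follows. This is an illustration under stated assumptions, not the hbsClient implementation: the class and method names are hypothetical, and the cluster history is modelled as an opaque string rather than the real message layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the hbsClient side of the exchange.
class PulseResponder {
public:
    // Handles a pulse request from controller 0 or 1: saves the sender's
    // reported history and returns the history last saved from its peer.
    std::string on_pulse_request(std::size_t controller,
                                 const std::string& history) {
        assert(controller < 2);
        saved_[controller] = history;
        return saved_[1 - controller];
    }

private:
    std::array<std::string, 2> saved_;  // last history seen per controller
};
```

Each controller therefore learns its peer's view through the monitored hosts' pulse responses. In this sketch the first response is empty until the peer has sent at least one pulse request.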

The hbsAgent is then further enhanced to support a query request
for this information.

So now SM, when it needs to make a decision to avoid Split-Brain
or otherwise, can query either controller for its heartbeat cluster
history and get the last 5 second summary view of heartbeat (network)
responsiveness from both controllers' perspectives to help decide which
controller to make active.

This involved removing the hbsAgent process from SM control and monitor
and adding a new hbsAgent LSB init script for process launch, service
file to run the init script and pmon config file for hbsAgent process
monitoring.

With hbsAgent now running on both controllers, changes to maintenance
were required to send inventory to hbsAgent on both controllers,
listen for hbsAgent event messages over the management interface
and inform both hbsAgents which controller is active.

The hbsAgent running on the inactive controller
 - does not send heartbeat events to maintenance
 - does not raise or clear alarms or produce customer logs

Test Plan:

Feature:
PASS: Verify hbsAgent runs on both controllers
PASS: Verify hbsAgent as pmon monitored process (not SM)
PASS: Verify system install and cluster collection in all system types (10+)
PASS: Verify active controller hbsAgent detects and handles heartbeat loss
PASS: Verify inactive controller hbsAgent detects and logs heartbeat loss
PASS: Verify heartbeat cluster history collection functions properly.
PASS: Verify storage-0 state tracking in cluster info.
PASS: Verify storage-0 not responding handling
PASS: Verify heartbeat response is sent back to only the requesting controller.
PASS: Verify heartbeat history is correct from each controller
PASS: Verify MNFA from active controller after install to controller-0
PASS: Verify MNFA from active controller after swact to controller-1
PASS: Verify MNFA for 80%+ of the hosts in the storage system
PASS: Verify SM cluster query operation and content from both controllers
PASS: Verify restart of inactive hbsAgent doesn't clear existing heartbeat alarms

Logging:
PASS: Verify cluster info logs.
PASS: Verify feature design logging.
PASS: Verify hbsAgent and hbsClient design logs on all hosts add value
PASS: Verify design logging from both controllers in heartbeat loss case
PASS: Verify design logging from both controllers in MNFA case
PASS: Verify clog  logs cluster info vault status and updates for controllers
PASS: Verify clog1 logs full cluster state change for all hosts
PASS: Verify clog2 logs cluster info save/append logs for controllers
PASS: Verify clog3 memory dumps a cluster history
PASS: Verify USR2 forces heartbeat and cluster info log dump
PASS: Verify hourly heartbeat and cluster info log dump
PASS: Verify loss events force heartbeat and cluster info log dump

Regression:
PASS: Verify Large System DOR
PASS: Verify pmond regression test that now includes hbsAgent
PASS: Verify Lock/Unlock of inactive controller (x3)
PASS: Verify Swact behavior (x10)
PASS: Verify compute Lock/Unlock
PASS: Verify storage-0 Lock/Unlock
PASS: Verify compute Host Failure and Graceful Recovery
PASS: Verify Graceful Recovery Retry to Max:3 then Full Enable
PASS: Verify Delete Host
PASS: Verify Patching hbsAgent and hbsClient
PASS: Verify event driven cluster push

Story: 2003576
Task: 24907

Change-Id: I5baf5bcca23601a99473d039356d58250ffb01b5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-11-20 19:57:18 +00:00
Zuul 4380c0ca13 Merge "refactor lighttpd" 2018-11-20 01:17:55 +00:00
Zuul 21d31c2b2b Merge "refactor openldap" 2018-11-20 01:12:21 +00:00
Zuul 07818aac5e Merge "Increase disk size requirement from 10G to 16G for docker" 2018-11-16 19:22:12 +00:00
Al Bailey 73edb1fdf1 Increase disk size requirement from 10G to 16G for docker
Base disk increases by 6G, from 124G to 130G for
small disk and from 164G to 170G for large disk.

Story: 2002876
Task: 27948
Change-Id: I3d987c0d70bf18a91cb4c977ac16fcdabe2cb9fc
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
2018-11-16 12:05:25 -06:00
slin14 026eea5ef4 refactor openldap
Package openldap-config is added to hold the customized config files
of openldap. These config files were previously packaged in
openldap-servers, so the same filter as openldap-servers is applied.

Story: 2003768
Task: 26462

Change-Id: Id40967bcbed40998649602c209e8608532584058
Signed-off-by: slin14 <shuicheng.lin@intel.com>
2018-11-16 21:56:59 +08:00