This commit is the initial submission of the bootstrap playbook, which
enables the bootstrap of the initial controller. The playbook
defaults are meant for configuring localhost in a vbox
development environment. A custom hosts file and user overrides
are required for configuring multiple hosts and lab-specific setups.
A secret file and SSH keys are required for a production test environment.
Tests performed:
- installation
- config_controller complete to ensure the current method of
configuring the first controller is intact
- localhost bootstrap with default hosts file
- multiple remote hosts bootstrap with custom hosts file
- reconfigurations with user overrides
- stx-application applied in AIOSX and AIODX
- Failure & skip play cases (invalid config inputs, incorrect load,
connection failure, no changes replay, etc.)
TODO:
- Support for standard & storage configurations
- Docker proxy/custom registry related tests
- Package bootstrap playbook in SDK
- Config_controller cleanup
Change-Id: If553f1eeed32606bacc690ef277e60606e9d93ea
Story: 200476
Task: 29686
Task: 29687
Co-Authored-By: Ovidiu Poncea <ovidiu.poncea@windriver.com>
Signed-off-by: Tee Ngo <tee.ngo@windriver.com>
All rmon resource monitoring has been moved to collectd.
This update removes rmon from mtce and the load.
Story: 2002823
Task: 30045
Test Plan:
PASS: Build and install a standard system.
PASS: Inspect mtce rpm list
PASS: Inspect logs
PASS: Check pmon.d
Depends-On: https://review.openstack.org/#/c/643739
Change-Id: I7572a1d0a9cf746abfba3d67352534d96f60c5a7
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Cleanup unwanted openstack setup on bare metal.
Preparing the manifests to have the services removed from SM.
Bypass setting up openstack services on controller, worker and
storage.
Cleanup haproxy ports for services that will not be running
on bare metal.
Cleanup upgrade, remote logging, postgres, and anything else
related to openstack services that no longer run on bare
metal.
Remove all manifests and templates that are no longer being used.
Strip out any static hiera data that is no longer needed.
Story: 2004764
Task: 29850
Depends-On: Ice10fe6da6b34f1d9206f26e112eb555e2088932
Depends-On: I3c1cc8673be5cf6ab15f9158199bc24fccb44f17
Depends-On: Ie43cf11ebf1edcf3a8bb357205c4c59d2962b4fa
Change-Id: I2be8e9ab418835125ff433d06d2930df37534501
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
https://review.openstack.org/#/c/628687/ stopped packaging the
query_ntp_servers.sh script. However, since no other files were
being packaged into that directory, the spec file chose
not to create an empty directory.
When config controller called the mtce.pp manifest to install
dynamic files into /etc/rmonfiles.d, the directory did not exist
and the install failed.
This update adds a directory check block to the mtce.pp file
to create the directory if it is not present.
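
The "create the directory if it is missing" check can be illustrated with
a minimal Python sketch. This is not the Puppet code added to mtce.pp; the
function name ensure_directory and the 0o755 mode are assumptions made
only for illustration.

```python
import os


def ensure_directory(path, mode=0o755):
    """Create path if it is missing; leave it alone if it already exists.

    Hypothetical helper: the real fix is a directory check block in the
    mtce.pp manifest, not Python.
    """
    # exist_ok=True makes the call idempotent, so repeated runs are harmless.
    os.makedirs(path, mode=mode, exist_ok=True)
    return os.path.isdir(path)
```

Calling ensure_directory on the directory before installing the dynamic
files removes the dependency on the rpm having created it.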
Testing: Install AIO SX in SM1
Change-Id: Ib2dfadb261be6f9ebbaa7213eb6669b25158c779
Closes-Bug: 1811693
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update renames the compute personality and subfunction
to worker, and updates internal and customer-visible
references.
In addition, the compute-huge package has been renamed to
worker-utils, as it contains various scripts/services that are
used to affine running tasks or interface IRQs to specific CPUs.
The worker_reserved.conf is now installed to /etc/platform.
The cpu function 'VM' has also been renamed to 'Application'.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked-enabled with no alarms
Story: 2004022
Task: 27013
Change-Id: I0e0be6b3a6f25f7fb8edf64ea4326854513aa396
Signed-off-by: Tao Liu <tao.liu@windriver.com>
The mtc.ini file is updated a second time in AIO config.
Because the SM port parameters were scoped to controllers only
and had no defaults, the SM port assignments were missing in
AIO configs.
This update defaults the SM port numbers and changes the scope
of the parameters so that they get set on all node types for
all system types.
Testing included provisioning an AIO system.
Change-Id: Ib53921c4b59a9e67ed136a03504bdf0775de6dff
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Create the platform openrc file in /etc/platform, while
leaving existing /etc/nova/openrc file alone for now.
New platform/client.pp file is created and most of the
contents of openstack/client.pp moved there.
openstack/client.pp can be removed once kubernetes is the
default.
Change-Id: Ib6de59da6dfc9f34a24054405b6cda30d0b74ac1
Story: 2002876
Task: 27499
Signed-off-by: Kevin Smith <kevin.smith@windriver.com>
In support of the HA Improvements feature, maintenance is required to
send SM, upon request, a summary of its heartbeat responsiveness
during the last 20 heartbeat periods.
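
One way to keep such a rolling summary is a fixed-length history, sketched
below in Python. The class name HeartbeatHistory, its methods, and the
summary fields are hypothetical; the actual maintenance implementation and
the message format sent to SM are not described in this commit.

```python
from collections import deque

HISTORY_PERIODS = 20  # from the text: the last 20 heartbeat periods


class HeartbeatHistory:
    """Hypothetical tracker of per-period heartbeat responsiveness."""

    def __init__(self, periods=HISTORY_PERIODS):
        # A bounded deque silently discards the oldest period when full.
        self._periods = deque(maxlen=periods)

    def record(self, responded):
        """Record whether the heartbeat was answered in one period."""
        self._periods.append(bool(responded))

    def summary(self):
        """Return counts covering only the retained periods."""
        total = len(self._periods)
        answered = sum(self._periods)
        return {"periods": total, "responded": answered,
                "missed": total - answered}
```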
This update adds the required port assignments to the mtc.ini file
in support of said communications.
With this update the mtc.ini file will be updated to contain the
following entries.
; Communication ports between SM and maintenance
sm_server_port = 2124 ; port sm receives mtce commands from
sm_client_port = 2224 ; port mtce receives sm commands from
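
As a rough illustration of how these entries can be read back, the Python
sketch below parses them with the standard configparser module. The
section name "agent" and the helper read_sm_ports are assumptions; the
commit does not state which section of mtc.ini holds the new labels.

```python
import configparser

# Sample text mirroring the entries above; the [agent] header is assumed.
SAMPLE = """
[agent]
; Communication ports between SM and maintenance
sm_server_port = 2124 ; port sm receives mtce commands from
sm_client_port = 2224 ; port mtce receives sm commands from
"""


def read_sm_ports(text, section="agent"):
    """Return (sm_server_port, sm_client_port) parsed from ini text."""
    # inline_comment_prefixes strips the trailing "; ..." remarks.
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(text)
    return (parser.getint(section, "sm_server_port"),
            parser.getint(section, "sm_client_port"))
```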
Change-Id: I05c022f7e4dcdeaea71bc0020641baa331daae57
Story: 2003576
Task: 26837
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.
This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can configure
this new parameter with any one of the following 4 actions in order of
decreasing impact.
fail - Host is failed and gracefully recovered.
- Current Network specific alarms continue to be raised/cleared.
Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
- Current Network specific alarms continue to be raised/cleared.
- heartbeat degrade reason is cleared as are the alarms when
heartbeat responses resume.
alarm - The only indication of a heartbeat failure is by alarm.
- Same set of alarms as in above action cases
- Only in this case no degrade, no failure, no reboot/reset
none - Heartbeat is disabled; no multicast heartbeat message is sent.
- All existing heartbeat alarms are cleared.
- The heartbeat soak as part of the enable sequence is bypassed.
The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
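
The four actions and their described effects can be summarized in a small
Python sketch. The names HeartbeatFailureAction, Effects, EFFECTS and
parse_action are hypothetical and only restate the descriptions above;
they are not the maintenance code.

```python
from enum import Enum
from typing import NamedTuple


class HeartbeatFailureAction(Enum):
    """Values of heartbeat_failure_action, in order of decreasing impact."""
    FAIL = "fail"
    DEGRADE = "degrade"
    ALARM = "alarm"
    NONE = "none"


class Effects(NamedTuple):
    heartbeat_enabled: bool  # is the multicast heartbeat sent at all
    alarms_raised: bool      # are heartbeat alarms raised/cleared
    host_impact: str         # what happens to a host failing heartbeat


# Restates the action descriptions above; wording is illustrative only.
EFFECTS = {
    HeartbeatFailureAction.FAIL: Effects(True, True, "failed and recovered"),
    HeartbeatFailureAction.DEGRADE: Effects(True, True, "degraded"),
    HeartbeatFailureAction.ALARM: Effects(True, True, "none"),
    HeartbeatFailureAction.NONE: Effects(False, False, "none"),
}

DEFAULT_ACTION = HeartbeatFailureAction.FAIL  # the legacy behavior


def parse_action(value):
    """Map a configured string to an action; raises ValueError if invalid."""
    return HeartbeatFailureAction(value)
```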
This update also
1. Removes the redundant inservice failure alarm for the MNFA case in
support of the degrade-only action. Keeping it would make that alarm
handling case unnecessarily complicated.
2. Removes the no-longer-used 'hbs calibration' code (cleanup).
3. Cleans up a small amount of heartbeat logging.
Test Plan:
PASS: fail: Verify MNFA and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS: alarm: Verify MNFA and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS: none: Verify heartbeat disable, fail ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: Verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)
Change-Id: I198505fb7a923cc760b12082acff1e5bac929ef2
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The maintenance system implements a high availability (HA) feature
designed to detect the simultaneous heartbeat failure of a group
of hosts and avoid failing all those hosts until heartbeat resumes
or a set period of time elapses.
This feature is called Multi-Node Failure Avoidance, aka MNFA, and
currently has the hosts threshold set to 3 and timeout set to 100 secs.
This update implements enhancements to that existing feature by
making the 'number-of-hosts threshold' and 'timeout period'
customer configurable service parameters.
The new service parameters are listed under platform:maintenance and
are displayed with the following command
> system service-parameter-list
mnfa_threshold: This new label and value is added to the puppet
managed /etc/mtc.ini and represents the number of hosts that
must fail heartbeat as a group, within the heartbeat
failure window (heartbeat_failure_threshold), before maintenance
activates MNFA mode.
This update changes the default number of failing hosts from
3 to 2 while allowing a configurable range from 2 to 100.
mnfa_timeout: This new label and value is added to the puppet
managed /etc/mtc.ini. While MNFA mode is active, it will remain active
until the number of failing hosts drops below the mnfa_threshold or this
timer expires. The MNFA mode deactivates on the first occurrence of
either case. Upon deactivation the remaining failed hosts are no
longer treated as a failure group but instead are all Gracefully
Recovered individually. A value of zero imposes no timeout making the
deactivation criteria solely host based.
This update changes the default 100 second timer to 0 (no timeout)
while permitting a valid timeout range from 100 to 86400 secs (1 day).
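
The parameter ranges and the deactivation rule described above can be
sketched in Python as follows. The function names are hypothetical, and
treating 0 as the only valid timeout value below 100 is an interpretation
of the text, not a confirmed detail of the implementation.

```python
MNFA_THRESHOLD_DEFAULT = 2
MNFA_THRESHOLD_RANGE = (2, 100)      # hosts
MNFA_TIMEOUT_DEFAULT = 0             # 0 means no timeout
MNFA_TIMEOUT_RANGE = (100, 86400)    # seconds, up to 1 day


def valid_mnfa_threshold(value):
    low, high = MNFA_THRESHOLD_RANGE
    return low <= value <= high


def valid_mnfa_timeout(value):
    # Assumption: 0 (no timeout) is accepted in addition to the range.
    low, high = MNFA_TIMEOUT_RANGE
    return value == 0 or low <= value <= high


def mnfa_should_deactivate(failing_hosts, threshold, timeout, elapsed):
    """Deactivate on whichever case occurs first.

    failing_hosts: hosts currently failing heartbeat
    timeout: mnfa_timeout in seconds, 0 disables the timer
    elapsed: seconds since MNFA mode was activated
    """
    if failing_hosts < threshold:
        return True
    if timeout != 0 and elapsed >= timeout:
        return True
    return False
```

With the new defaults (threshold 2, timeout 0) MNFA mode ends only when
fewer than 2 hosts are still failing heartbeat.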
DocImpact
Story: 2003576
Task: 24903
Change-Id: I2fb737a4cd3c235845b064449949fcada303d6b2
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Making initial changes to enable new upgrades. Most
of the changes are related to removing older upgrade code that
is no longer necessary (i.e. all the packstack to mattstack
conversion code).
Change-Id: I8fe4c8c0d3f12fd7b4fc45b226bf969ffda72dc7
Story: 2002886
Task: 22847
Signed-off-by: Jack Ding <jack.ding@windriver.com>