Platform Single Core Tuning

Adjustments to allow the execution of StarlingX services on only one
core. Reducing StarlingX resource consumption to just one core allows
the system to use the remaining resources for a larger workload,
increasing the availability of resources for end user applications.

Story: 2010087
Task: 45594

Signed-off-by: Alyson Deives Pereira <alyson.deivespereira@windriver.com>
Change-Id: Ibd65aa80c3e0b9f12e67e857a54f070a525a9c20


..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License. http://creativecommons.org/licenses/by/3.0/legalcode
=====================================
StarlingX Platform Single-Core Tuning
=====================================
Storyboard: `#2010087`_
The objective of this spec is to identify and make changes required for the
StarlingX Platform to enable its operation on a single processor core.
Problem description
===================
StarlingX platform services consume a significant amount of compute resources
on hosts with multiple cores and processors. Reducing StarlingX resource
consumption to just one core allows the system to use the remaining resources
for a larger workload, increasing the availability of resources for end user
applications.
To identify the changes needed to run the platform on a single core, we
performed a proof of concept with a minimal set of modifications.
To characterize the system behavior and to identify required product changes,
detailed system profiling was performed for key services.
The objective was not only to measure the individual services, but also to
identify potential system bottlenecks or performance changes arising from
competition for CPU resources.
Below is a brief analysis of the most CPU-intensive services and their impact
on the system's steady-state operation when running on a single platform core.
The objective of this spec is to address the identified issues by implementing
the changes described in sections :ref:`Proposed Change` and :ref:`Work Items`.
Top CPU Consumers
-----------------
kube-apiserver
**************
kube-apiserver health checks show a high number of readyz requests (a
Kubernetes API health endpoint), indicating that some pods could be taking a
long time to respond to a request or to terminate.
From the investigation done on kube-apiserver, most requests come from the
cert-manager injector, which is at a legacy version (v0.15).
The requests are caused by its leader election process, which stays enabled
even when only a single replica is running.
sm-watchdog
***********
We executed different test scenarios to analyze the process behavior when
pods were created and deleted. During all the tests, we observed periodic CPU
spikes every 10 seconds.
This period is controlled by the SM_WATCHDOG_NFS_CHECK_IN_MS parameter,
which defines how often the watchdog verifies |NFS| and recovers it in case of
any anomaly.
The reason for the high CPU consumption is related to the mechanism used by
sm-watchdog to check the |NFS|.
To find all the :command:`nfsd` threads, the watchdog code looks at every
process within the proc file system: it scans every numbered process folder
and reads its stat file to check whether the process is an nfsd thread.
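A rough Python equivalent of this scan (illustrative only; the actual
sm-watchdog check is implemented differently) shows why its cost grows with
the total number of processes on the host:

.. code-block:: python

   import os

   def find_nfsd_threads():
       """Scan every PID entry under /proc looking for nfsd kernel threads.

       This is O(number of processes): each audit cycle opens and parses one
       stat file per process, which is why the 10-second audit shows up as a
       periodic CPU spike.
       """
       nfsd_pids = []
       for entry in os.listdir('/proc'):
           if not entry.isdigit():
               continue
           try:
               with open('/proc/%s/stat' % entry) as f:
                   # Field 2 of /proc/<pid>/stat is the command name, e.g. (nfsd)
                   comm = f.read().split()[1]
           except (IOError, OSError):
               continue  # the process exited between listdir() and open()
           if comm == '(nfsd)':
               nfsd_pids.append(int(entry))
       return nfsd_pids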
beam.smp
********
During a test that created and deleted some pods, the beam.smp process
(from `RabbitMQ`_) ran with constant CPU usage and some spikes. Three pairs
of messages are repeated throughout the logs, with only the id of the
"publish" and "deliver" routes changing. The system logs related to
`RabbitMQ`_ contain AMQP calls to synchronize the request/reply messages
generated by the sysinv-conductor. The behavior observed from `RabbitMQ`_
indicates it is serving as an |RPC| service for sysinv.
sysinv-agent
************
The behavior presented by the sysinv-agent process reflects the sysinv logs and
aligns with what was expected. Every minute, the sysinv-agent wakes up to verify
if there is a need to modify the system configuration, e.g., memory or storage.
In short, the sysinv-agent is not a significant concern for overall system
performance. One possible optimization to evaluate is related to the periodic
task and its interval: increasing the time between requests, optimizing the
periodic operations, or converting the task to on-demand may bring some CPU
time benefits.
sysinv-conductor
****************
In the sysinv-conductor process test, we observed two scenarios:
During the first scenario, which was the most frequently observed, the process
showed a typical daemon behavior with continuous low CPU usage. In the second
scenario, the process showed CPU spikes every 60 seconds. Skimming the
source code, we found some periodic task definitions controlled
by the audit interval. The overall impact of sysinv-conductor on CPU load is
low. Optimizing the code
could decrease the spikes during the system's steady-state operation. One
option, whenever possible, is to change the periodic tasks to on-demand tasks.
When this approach is not possible, there is still the option to optimize the
interval of the periodic tasks, increasing it, after evaluating and concluding
it does not impact the system stability.
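As an illustration of the mechanism (a sketch only, assuming the usual oslo
periodic-task pattern; the class, method and interval below are illustrative,
not the actual sysinv definitions), the audit interval directly controls how
often these spikes occur:

.. code-block:: python

   from oslo_service import periodic_task

   class ConductorManager(periodic_task.PeriodicTasks):
       """Sketch of a conductor-style daemon with a 60-second audit."""

       @periodic_task.periodic_task(spacing=60)
       def _example_audit(self, context):
           # Hypothetical audit body: re-check some piece of system state.
           # Raising ``spacing`` (or replacing the audit with an on-demand
           # call triggered by the event that changes that state) reduces the
           # steady-state CPU spikes described above.
           pass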
Use Cases
---------
As an end user, I want to improve system performance by enabling StarlingX
to run only within the compute resources of a single CPU core, leaving the
remaining cores for my application workload.
.. _Proposed Change:
Proposed change
===============
Platform Core Adjustments
-------------------------
The following set of changes must be applied to reduce physical
core usage from 2 cores to 1.
System Inventory
****************
Changes to sysinv cpu_utils.py and the stx-puppet platform params manifest
file are required to allow the platform to be configured with only a single
physical core via the :command:`system host-cpu-modify` command.
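A minimal sketch of the kind of validation change involved (the function name,
constant and message are illustrative; the real check lives in sysinv's
cpu_utils.py):

.. code-block:: python

   # Assumed minimum platform core count: the effective minimum is currently 2;
   # this spec relaxes it to 1.
   MIN_PLATFORM_CORES = 1

   def check_platform_core_assignment(platform_cores):
       """Reject a host-cpu-modify request only when no platform core is left."""
       if platform_cores < MIN_PLATFORM_CORES:
           raise ValueError("At least %d platform core(s) must be assigned"
                            % MIN_PLATFORM_CORES)
       return platform_cores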
Scale Down Services
*******************
Many platform services scale their number of threads or worker processes in
direct proportion to the number of platform cores configured. However, many
also enforce a minimum number of threads under the assumption that they must
support a minimum scale.
Changes to the worker-count logic in the stx-puppet platform params manifest
file are required to allow these services to use only a single core.
Moreover, this change will also reduce the amount of memory allocated to each
service.
The scale-down will take place when a single core is allocated, respecting the
existing worker allocation rules.
For small footprints (AIO), the system defines the number of workers based on
the number of platform cores, with a maximum of 2 for AIO-SX and 3 for AIO-DX
(linear scaling with the number of platform cores).
The proposed changes do not alter this rule; with the minimum limit relaxed
from 2 to 1, the system simply scales the number of threads down to the new
minimum.
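A sketch of the resulting worker-count rule (the caps of 2 for AIO-SX and 3
for AIO-DX come from the existing rule described above; the function itself is
illustrative, not the stx-puppet code):

.. code-block:: python

   def platform_worker_count(platform_cores, system_mode):
       """Scale service workers with platform cores on AIO systems.

       With a single platform core the result is 1, which is the relaxation
       this spec introduces (previously the minimum was 2).
       """
       cap = 2 if system_mode == 'simplex' else 3  # AIO-SX vs AIO-DX
       return max(1, min(platform_cores, cap))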
The following services shall be impacted:
.. list-table:: Impacted Services
:widths: 50 50
   :header-rows: 1

   * - Service
- Description
* - postgres
- Object-Relational database management
* - etcd
- Distributed key-value store
* - containerd
- Container Runtime
* - memcached
- Distributed Memory Object Cache
* - armada
- `Armada`_ Application Management
* - keystone
- Identity Management
* - barbican
- Secret Management
* - docker-registry
- Docker Container Registry
* - docker-token-server
- Docker Token Server
* - kube-apiserver
- Kubernetes API Server
* - kubelet
- Kubernetes Node Agent
Kubernetes Tuning
-----------------
These changes adjust some Kubernetes and etcd parameters and tune the number
of parallel requests Kubernetes can handle based on the platform cores
allocated. Additional tests may be required to define the best tuning values;
a sketch of per-core parameter scaling follows the list below.
* kube-apiserver:
  * max-requests-inflight: Limits the number of API calls that will be
    processed in parallel, which is an important control point for
    kube-apiserver memory consumption. The API server can be very CPU
    intensive when processing many requests in parallel.
* kube-controller-manager, kube-scheduler, kubelet, kube-proxy:
  * kube-api-burst/kube-api-qps: These two flags set the sustained and burst
    rates at which each of these components talks to kube-apiserver.
* etcd:
  * heartbeat-interval: This is the frequency with which the leader notifies
    followers that it is still the leader.
* election-timeout: The election timeout should be set based on the
heartbeat interval and average round-trip time between members.
  * snapshot-count: etcd appends all key changes to a log file that grows into
    a complete linear history of every change made to the keys; snapshot-count
    defines how many committed entries trigger a snapshot, which allows that
    log to be compacted.
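A sketch of how these values might be derived from the number of platform
cores (the numbers below are placeholders for illustration; the actual values
will come out of the tuning tests mentioned above):

.. code-block:: python

   def kubernetes_tuning(platform_cores):
       """Illustrative per-core scaling of the parameters listed above."""
       return {
           # kube-apiserver: fewer parallel requests on a single core
           'max-requests-inflight': 100 * platform_cores,
           # kube-controller-manager / kube-scheduler / kubelet / kube-proxy
           'kube-api-qps': 5 * platform_cores,
           'kube-api-burst': 10 * platform_cores,
           # etcd: keep election-timeout a multiple of heartbeat-interval
           'heartbeat-interval': 100,   # milliseconds
           'election-timeout': 1000,    # milliseconds
           'snapshot-count': 10000,
       }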
Postgres Tuning
----------------
During our analysis, we identified many parameters related to parallel workers
and the vacuum process as potential tuning targets for Postgres.
This change adjusts these parameters based on the platform cores allocated.
Additional tests may be required to define the best tuning values.
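A sketch of the kind of per-core adjustment being considered (the values are
placeholders and the parameter names assume a Postgres version with
parallel-query support; the actual numbers depend on the tuning tests):

.. code-block:: python

   def postgres_tuning(platform_cores):
       """Illustrative postgresql.conf settings scaled with platform cores."""
       single_core = platform_cores == 1
       return {
           'max_worker_processes': max(1, platform_cores),
           'max_parallel_workers': max(1, platform_cores),
           'max_parallel_workers_per_gather': 1 if single_core else 2,
           # Autovacuum: one worker and a longer naptime on a single core
           'autovacuum_max_workers': 1 if single_core else 3,
           'autovacuum_naptime': '5min' if single_core else '1min',
       }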
Service Management Watchdog
----------------------------
Enhance the sm-watchdog process on two different fronts:
* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
configuration).
* Optimize the |NFS| monitoring to avoid the overhead on the proc file system
  while looking for :command:`nfsd` threads (see the sketch below).
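One possible direction for this optimization (a sketch, assuming the nfsd
control filesystem is mounted at its usual /proc/fs/nfsd location) is to read
the kernel's own thread count instead of walking every /proc entry:

.. code-block:: python

   def nfsd_thread_count():
       """Read the nfsd thread count from the nfsd control filesystem.

       Replaces the per-process /proc scan with a single file read.
       Returns 0 if nfsd is not loaded or the filesystem is not mounted.
       """
       try:
           with open('/proc/fs/nfsd/threads') as f:
               return int(f.read().strip())
       except (IOError, OSError, ValueError):
           return 0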
System Inventory
----------------
Periodic and Runtime Tasks
**************************
Currently, sysinv-conductor and sysinv-agent have many periodic tasks that
should be reviewed and, if possible, redesigned. The main focus is to reduce
sysinv's regular CPU spikes by:
* Refactoring legacy code;
* Increasing time intervals when possible;
* Converting periodic tasks to on-demand tasks, when possible.
Remote Procedure Calls
**********************
System Inventory Remote Procedure Calls (|RPCs|) are performed using
`RabbitMQ`_ as a communication transport layer between the different processes.
The target is to convert the System Inventory internal |RPC| calls from
`RabbitMQ`_ to the brokerless `ZeroMQ`_ library; a minimal sketch of the
pattern follows the module list below.
Affected sysinv modules:
* agent
* api
* conductor
* cmd
* fpga_agent
* helm
* Scripts/manage-partitions
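A minimal sketch of the brokerless request/reply pattern (using pyzmq; the
endpoint and payload are illustrative and do not represent the sysinv wire
format):

.. code-block:: python

   import zmq

   def rpc_server(endpoint="tcp://127.0.0.1:5555"):
       """Reply side (e.g. sysinv-conductor): answer one request at a time."""
       sock = zmq.Context.instance().socket(zmq.REP)
       sock.bind(endpoint)
       while True:
           request = sock.recv_json()      # e.g. {"method": ..., "args": ...}
           sock.send_json({"result": "ok", "echo": request})

   def rpc_call(method, args, endpoint="tcp://127.0.0.1:5555"):
       """Request side (e.g. sysinv-api): direct call, no broker in between."""
       sock = zmq.Context.instance().socket(zmq.REQ)
       sock.connect(endpoint)
       sock.send_json({"method": method, "args": args})
       return sock.recv_json()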
Alternatives
------------
An alternative is to use `gRPC`_ instead of `ZeroMQ`_. This option should be
analyzed further if the proposed solution proves unusable.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Other end-user impact
---------------------
The default configuration for platform cores will be changed to 1 core, and
system recommendations will be adjusted to reflect the minimum required
platform cores per processor/use case.
The end user must be aware of the hardware requirements and limitations, and
configure the system according to their workload scenario.
.. _Performance Impact:
Performance Impact
------------------
To maintain system stability while operating with fewer compute resources,
it may be required to adjust the priority of critical system and platform
processes during the execution of this spec. If process starvation is
occurring, the system may reboot or declare specific services as failed and
attempt recovery. If this is experienced, the starved process priority will
need to be increased.
.. list-table:: Potential Service Impacts
:widths: 50 50
   :header-rows: 1

   * - Service
- Description
* - hostwd
- Host Watchdog
* - pmond
- Process Monitor
* - sm
- Service Manager
* - kubelet
- Kubernetes Node Agent
* - hbsAgent
- Heartbeat Service Agent
* - hbsClient
- Heartbeat Service Client
* - mtcAgent
- Maintenance Agent
* - mtcClient
- Maintenance Client
In a distributed cloud scenario, some timing impact is expected on subcloud
operations due to resource limitations, but no impact on scalability is
expected.
Other deployer impact
-----------------------
Automated deployment technologies should be aware of the new library `ZeroMQ`_.
Developer impact
----------------
We assume that there is no visible developer impact.
Upgrade impact
--------------
The new messaging library (`ZeroMQ`_) being added into sysinv by this spec
could impact backup and restore, upgrade, and rollback.
Tests should be performed to guarantee both the new and the old behavior.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
* Guilherme Batista Leite (guilhermebatista)
Other contributors:
* Alexandre Horst (ahorst)
* Alyson Deives Pereira (adeivesp)
* Bruno Costa (bdacosta)
* Caio Cesar Ferreira (ccesarfe)
* Davi Frossard (dbarrosf)
* Eduardo Alberti (ealberti)
* Guilherme Alberici de Santi (galberic)
* Isac Sacchi e Souza (isouza)
* Marcos Paulo Oliveira Silva (mpaulool)
* Romão Martines (rmartine)
* Thiago Antonio Miranda (tamiranda)
Repos Impacted
--------------
The following StarlingX repositories are impacted by this spec:
* starlingx/ansible-playbooks
* starlingx/config
* starlingx/config-files
* starlingx/integ
* starlingx/stx-puppet
* starlingx/docs
.. _Work Items:
Work Items
----------
Scale Down Services
*******************
* Adjust the following platform services to account for the minimum number of
threads/processes based on the system configuration and the number of
platform cores: barbican, containerd, docker-registry,
docker-token-server, keystone, kube-apiserver, kubelet, memcached, postgres.
System Inventory
****************
* Adjust sysinv check to allow 1 platform core utilization
* Change default behavior to 1 platform core utilization
* Legacy code refactoring
* Review existing periodic tasks converting them to on-demand if possible
* Adjust periodic tasks' timing interval based on each task's needs
* Refactor sysinv-fpga-agent to be launched into context only when it
is required
* Clean up/review the existing |RPCs| to adopt a more consistent |RPC|
  usage model and to reduce the number of different calls that need to be
  supported.
* Convert internal |RPC| calls from `RabbitMQ`_ to the brokerless
  solution `ZeroMQ`_.
Kubernetes
**********
* Adjust overall Kubernetes configuration parameters based on the platform
cores allocated
* Investigate/enhance the number of parallel requests Kubernetes can handle
  based on the platform cores allocated.
etcd
****
* Adjust etcd configuration parameters based on the platform cores allocated
Postgres
********
* Adjust overall Postgres configuration parameters based on the platform cores
allocated
* Evaluate and tweak the vacuum process
Service Management Watchdog
***************************
* Confirm that the |NFS| audit condition is still present in the system and
  that this audit is still required before optimizing the solution
* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
  configuration) or remove it entirely if its audit is unnecessary
* Optimize the |NFS| monitoring (if it is still required) to avoid the overhead
  on the proc file system while looking for :command:`nfsd` threads
Overall Performance Evaluation
******************************
* After all proposed changes are implemented, evaluate the minimum hardware
  requirements (processor frequency, cache size and number of cores) and the
  supported workload scenarios for StarlingX operation on a single platform core
* Verify if process starvation is occurring. If that is the case, adjust the
priority of critical system and platform processes,
as mentioned in :ref:`Performance Impact`
* Update the documentation with the minimum hardware requirements.
Dependencies
============
* Postgres should be up-versioned to 9.4.X or higher
Testing
=======
System Configurations
---------------------
The system configurations that we are assuming for testing are:
* Standalone - AIO-SX
* Standalone - AIO-DX
* Distributed Cloud
Test Scenarios
--------------
We selected some tests that should be defined or changed to
cover this spec:
* The usual unit testing in the impacted code areas
* Full system regression of all StarlingX applications functionality (system
application commands, lifecycle actions, etc)
* Performance testing to identify and address any performance impacts.
* Backup and restore tests
* Upgrade and rollback tests
* Sysinv RPC communication tests
* Distributed Cloud evaluation of scalability and parallel operations
* In addition, this spec changes the way a StarlingX system is installed
and configured, which will require changes in existing automated
installation and testing tools.
Documentation Impact
====================
The End User documentation will need to be updated to indicate the minimum
hardware requirements (number of cores, frequency and cache-sizes) and
workload scenarios when using a single platform core for StarlingX.
For instance, since the more pods are running, the more CPU processing is
needed to manage them (by processes such as kubelet and containerd-shim),
the documentation should be reviewed to state the minimum number of platform
CPU cores based on the number of pods.
Documentation should also be updated to reflect the replacement of
`RabbitMQ`_ with `ZeroMQ`_ for |RPC| communication between sysinv processes
(sysinv-agent, sysinv-api and sysinv-conductor).
If any new limitations, recommendations, or other requirement updates are
identified during the development of the proposed changes in this spec, they
shall be included in the documentation as well.
References
==========
#. `Armada`_
#. `FluxCD`_
#. `RabbitMQ`_
#. `Firehose`_
#. `ZeroMQ`_
#. `gRPC`_
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
- Description
* - stx-8.0
- Introduced
.. Abbreviations
.. |NFS| replace:: :abbr:`NFS (Network File System)`
.. |RPC| replace:: :abbr:`RPC (Remote Procedure Call)`
.. |RPCs| replace:: :abbr:`RPCs (Remote Procedure Calls)`
.. Links
.. _#2010087: https://storyboard.openstack.org/#!/story/2010087
.. _Armada: https://airship-armada.readthedocs.io/en/latest/
.. _FluxCD: https://fluxcd.io/docs/
.. _RabbitMQ: https://www.rabbitmq.com/documentation.html
.. _Firehose: https://www.rabbitmq.com/firehose.html
.. _ZeroMQ: https://zguide.zeromq.org/docs/
.. _gRPC: https://grpc.io/docs/