diff --git a/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst b/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst new file mode 100644 index 0000000..df56d70 --- /dev/null +++ b/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst @@ -0,0 +1,533 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. http://creativecommons.org/licenses/by/3.0/legalcode + + + +===================================== +StarlingX Platform Single-Core Tuning +===================================== + +Storyboard: `#2010087`_ + +The objective of this spec is to identify and make changes required for the +StarlingX Platform to enable its operation on a single processor core. + +Problem description +=================== + +Resource usage is very intensive on platforms with multiple cores and +processors. Reducing StarlingX resource consumption to just one core allows +the system to use the remaining resources for a larger workload, increasing +the availability of resources for end user applications. + +To identify the required changes to address the usage of a single +core platform, we performed a proof-of-concept with minimal required changes. +To characterize the system behavior and to identify required product changes, +detailed system profiling was performed for key services. +The objective was to measure the individual services, but also to identify +potential system bottlenecks or performance changes based on the competing +CPU resources. + +Below, there is a brief analysis of critical CPU-consuming services and +their impact on the system's steady-state operation when running on a single +platform core. +The object of this spec is to address the issues identified by implementing the +changes described in sections :ref:`Proposed change` and :ref:`Work Items`. 
+
+Top CPU Consumers
+-----------------
+
+kube-apiserver
+**************
+
+kube-apiserver health check requests show a high number of readyz requests
+(a type of Kubernetes API endpoint), indicating some pods could be taking a
+long time to respond to a request or to terminate.
+
+From the investigation done on kube-apiserver, most requests come from the
+cert-manager injector, which is at a legacy version (v0.15).
+The requests are due to the leader election process, which remains enabled
+even when only a single replica is running.
+
+sm-watchdog
+***********
+
+We executed different test scenarios to analyze the process behavior when
+pods were created and deleted. During all the tests, we observed periodic CPU
+spikes every 10 seconds.
+This periodic task is controlled by the SM_WATCHDOG_NFS_CHECK_IN_MS parameter,
+which defines the cycle to verify |NFS| and recover it in case of any
+anomalies.
+The high CPU consumption is caused by the mechanism sm-watchdog uses to
+check the |NFS|.
+To find all the :command:`nfsd` threads, the watchdog code looks at every
+process within the proc file system: it scans every folder with a process
+number, looking for a stat file that identifies an |NFS| thread.
+
+beam.smp
+********
+
+During a test that created and deleted some pods, the beam.smp process
+(from `RabbitMQ`_) ran with constant CPU usage and some spikes. Three pairs
+of messages are repeated throughout the logs, with only the id of the
+"publish" and "deliver" routes changing. The system logs related to
+`RabbitMQ`_ contain AMQP calls to sync the request/reply messages
+generated by the sysinv-conductor. The behavior observed from `RabbitMQ`_
+indicates it is serving as an |RPC| service for sysinv.
+
+sysinv-agent
+************
+The behavior presented by the sysinv-agent process reflects the sysinv logs
+and aligns with what was expected.
+Every minute, the sysinv-agent wakes up to verify whether the system
+configuration needs to be modified, e.g., memory or storage.
+In short, the sysinv-agent does not represent a significant concern for
+overall system performance. One possible optimization to be evaluated is
+related to the periodic task and its timeframe. Increasing the time between
+the requests, optimizing the periodic operations, or converting it to an
+on-demand task may bring some benefits in CPU time.
+
+sysinv-conductor
+****************
+In the sysinv-conductor process test, we observed two scenarios. In the
+first, and most frequently observed, scenario the process showed typical
+daemon behavior with continuous low CPU usage. In the second scenario, the
+process showed CPU spikes every 60 seconds. Reviewing the source code, we
+found several periodic task definitions controlled by the audit interval.
+The overall impact of sysinv-conductor on CPU load is low, but optimizing
+the code could decrease the spikes during the system's steady-state
+operation. One option, whenever possible, is to change the periodic tasks to
+on-demand tasks. When this approach is not possible, there is still the
+option to increase the interval of the periodic tasks, after evaluating and
+concluding that this does not impact system stability.
+
+Use Cases
+---------
+
+As an end user, I want to improve system performance by enabling StarlingX
+to run within the compute resources of a single CPU core, leaving the
+remaining cores for my application workload.
+
+.. _Proposed Change:
+
+Proposed change
+===============
+
+Platform Core Adjustments
+-------------------------
+
+The following set of changes must be applied to reduce physical
+core usage from 2 cores to 1.
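+
+The scale-down changes below hinge on deriving per-service worker counts
+from the number of platform cores. As a rough illustration only, the sketch
+below shows a linear scaling rule with a relaxed floor of 1 and a
+per-configuration cap; the function name and caps are hypothetical
+stand-ins, not the actual stx-puppet logic:
+
+.. code-block:: python
+
+    def platform_workers(platform_cores: int, max_workers: int) -> int:
+        """Scale workers linearly with platform cores, with a floor of 1
+        (the relaxed minimum) and a per-configuration cap."""
+        return max(1, min(platform_cores, max_workers))
+
+    # Illustrative caps: 2 for AIO-SX, 3 for AIO-DX.
+    single_core = platform_workers(1, max_workers=2)  # single-core platform
+    aio_sx_full = platform_workers(4, max_workers=2)  # capped for AIO-SX
+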
+
+System Inventory
+****************
+Changes to sysinv cpu_utils.py and the stx-puppet platform params manifest
+file are required to allow the platform to be configured with only a
+single physical core via the :command:`system host-cpu-modify` command.
+
+Scale Down Services
+*******************
+Many platform services have a scaled number of threads or worker processes
+that is directly proportional to the number of platform cores configured.
+However, many have a minimum number of threads under the assumption that
+they must support a minimum scale.
+Changes to the number-of-workers logic in the stx-puppet platform params
+manifest file are required to allow these services to use only a single
+core. Moreover, this change will also reduce the amount of memory allocated
+to a single service.
+
+When a single core is allocated, the scale down will take place while
+respecting the existing worker allocation rules.
+On small footprints (AIO), the system defines the number of workers based on
+the number of platform cores, scaling linearly up to a maximum of 2 for
+AIO-SX and 3 for AIO-DX.
+The proposed changes do not alter this rule; with the minimum limit relaxed
+from 2 to 1, the system can scale the number of threads down to the minimum.
+
+The following services shall be impacted:
+
+.. list-table:: Impacted Services
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Service
+     - Description
+   * - postgres
+     - Object-Relational database management
+   * - etcd
+     - Distributed key-value store
+   * - containerd
+     - Container Runtime
+   * - memcached
+     - Distributed Memory Object Cache
+   * - armada
+     - `Armada`_ Application Management
+   * - keystone
+     - Identity Management
+   * - barbican
+     - Secret Management
+   * - docker-registry
+     - Docker Container Registry
+   * - docker-token-server
+     - Docker Token Server
+   * - kube-apiserver
+     - Kubernetes API Server
+   * - kubelet
+     - Kubernetes Node Agent
+
+
+Kubernetes Tuning
+-----------------
+
+These changes adjust some Kubernetes and etcd parameters and enhance the
+number of parallel requests Kubernetes can handle based on the platform
+cores allocated. Additional tests may be required to define the best tuning
+values.
+
+* kube-apiserver:
+
+  * max-requests-inflight: Limits the number of API calls processed in
+    parallel, which is a key control point for kube-apiserver memory
+    consumption. The API server can be very CPU intensive when processing
+    many requests in parallel.
+
+* kube-controller-manager, kube-scheduler, kubelet, kube-proxy:
+
+  * kube-api-burst/kube-api-qps: These two flags set the sustained and burst
+    rates at which each component can talk to kube-apiserver.
+
+* etcd:
+
+  * heartbeat-interval: The frequency with which the leader notifies
+    followers that it is still the leader.
+  * election-timeout: The election timeout should be set based on the
+    heartbeat interval and the average round-trip time between members.
+  * snapshot-count: etcd appends all key changes to a log file that would
+    otherwise grow forever as a complete linear history of every change
+    made to the keys; this parameter defines how many committed
+    transactions are applied before a snapshot is taken and the log is
+    compacted.
+
+Postgres Tuning
+---------------
+
+During our analysis, we identified many parameters related to parallel
+workers and the vacuum process as potential tuning sources for Postgres.
+This change adjusts the overall parameters based on the platform cores
+allocated. Additional tests may be required to define the best tuning values.
+
+Service Management Watchdog
+---------------------------
+
+Enhance the sm-watchdog process on two different fronts:
+
+* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
+  configuration).
+* Optimize the |NFS| monitoring to avoid the overhead on the proc file
+  system while looking for |NFS|.
+
+System Inventory
+----------------
+
+Periodic and Runtime Tasks
+**************************
+
+Currently sysinv-conductor and sysinv-agent have many periodic tasks that
+should be reviewed and, if possible, redesigned. The main focus is to reduce
+sysinv's regular CPU spikes by:
+
+* Refactoring legacy code;
+* Increasing time intervals when possible;
+* Converting periodic tasks to on-demand tasks, when possible.
+
+Remote Procedure Calls
+**********************
+
+System Inventory Remote Procedure Calls (|RPCs|) are performed using
+`RabbitMQ`_ as a communication transport layer between the different
+processes.
+The target is to convert the System Inventory internal |RPC| calls from
+`RabbitMQ`_ to a brokerless solution, `ZeroMQ`_.
+
+Affected sysinv modules:
+
+* agent
+* api
+* conductor
+* cmd
+* fpga_agent
+* helm
+* Scripts/manage-partitions
+
+
+Alternatives
+------------
+
+There is an alternative to use `gRPC`_ instead of `ZeroMQ`_. This solution
+should be analyzed further if the proposed solution proves unusable.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Other end-user impact
+---------------------
+
+The default configuration for platform cores will be changed to 1 core, and
+system recommendations will be adjusted to comply with the minimum required
+platform cores based on processor/use case.
+The end user must be aware of hardware requirements and limitations, and
+configure the system according to their workload scenario.
+
+.. _Performance Impact:
+
+Performance Impact
+------------------
+
+To maintain system stability while operating with fewer compute resources,
+it may be required to adjust the priority of critical system and platform
+processes during the execution of this spec. If process starvation occurs,
+the system may reboot or declare specific services as failed and attempt
+recovery. If this is experienced, the starved process priority will need
+to be increased.
+
+.. list-table:: Potential Service Impacts
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Service
+     - Description
+   * - hostwd
+     - Host Watchdog
+   * - pmond
+     - Process Monitor
+   * - sm
+     - Service Manager
+   * - kubelet
+     - Kubernetes Node Agent
+   * - hbsAgent
+     - Heartbeat Service Agent
+   * - hbsClient
+     - Heartbeat Service Client
+   * - mtcAgent
+     - Maintenance Agent
+   * - mtcClient
+     - Maintenance Client
+
+In a distributed cloud scenario, some timing impact is expected on subcloud
+operations due to resource limitation, but no impact on scalability is
+expected.
+
+Other deployer impact
+---------------------
+
+Automated deployment technologies should be aware of the new `ZeroMQ`_
+library dependency.
+
+Developer impact
+----------------
+
+We assume that there is no visible developer impact.
+
+Upgrade impact
+--------------
+
+According to this spec, the new message queue library using `ZeroMQ`_
+being added into sysinv could impact backup and restore, upgrade, and
+rollback. Tests should be done to guarantee that both the new and the old
+behavior work correctly across these operations.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+ * Guilherme Batista Leite (guilhermebatista)
+
+Other contributors:
+ * Alexandre Horst (ahorst)
+ * Alyson Deives Pereira (adeivesp)
+ * Bruno Costa (bdacosta)
+ * Caio Cesar Ferreira (ccesarfe)
+ * Davi Frossard (dbarrosf)
+ * Eduardo Alberti (ealberti)
+ * Guilherme Alberici de Santi (galberic)
+ * Isac Sacchi e Souza (isouza)
+ * Marcos Paulo Oliveira Silva (mpaulool)
+ * Romão Martines (rmartine)
+ * Thiago Antonio Miranda (tamiranda)
+
+
+Repos Impacted
+--------------
+
+List of StarlingX repositories impacted by this spec:
+ * starlingx/ansible-playbooks
+ * starlingx/config
+ * starlingx/config-files
+ * starlingx/integ
+ * starlingx/stx-puppet
+ * starlingx/docs
+
+.. _Work Items:
+
+Work Items
+----------
+
+Scale Down Services
+*******************
+
+* Adjust the following platform services to account for the minimum number
+  of threads/processes based on the system configuration and the number of
+  platform cores: barbican, containerd, docker-registry,
+  docker-token-server, keystone, kube-apiserver, kubelet, memcached,
+  postgres.
+
+System Inventory
+****************
+
+* Adjust the sysinv check to allow 1 platform core utilization
+* Change the default behavior to 1 platform core utilization
+* Refactor legacy code
+* Review existing periodic tasks, converting them to on-demand if possible
+* Adjust periodic tasks' timing intervals based on each task's needs
+* Refactor sysinv-fpga-agent to be launched into context only when it
+  is required
+* Clean up/review the existing |RPCs| to adopt a more consistent |RPC|
+  usage model and to reduce the number of different calls that need to be
+  supported
+* Convert internal calls from |RPC| using `RabbitMQ`_ to the brokerless
+  solution `ZeroMQ`_.
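+
+A sysinv call over `ZeroMQ`_ would follow a request/reply pattern: the
+conductor binds a reply socket, the agents connect request sockets, and the
+messages are serialized with no broker in between. The sketch below shows
+only the conductor-side dispatch half of such a scheme, with the transport
+stubbed out; the method name and payload shapes are hypothetical, not
+actual sysinv |RPC| definitions:
+
+.. code-block:: python
+
+    import json
+
+    # Hypothetical conductor-side endpoints (not actual sysinv RPCs).
+    HANDLERS = {
+        "get_ihost": lambda args: {"hostname": args["hostname"], "cores": 1},
+    }
+
+    def handle_request(raw: bytes) -> bytes:
+        """Decode one request frame, dispatch it, and encode the reply.
+
+        With ZeroMQ this would sit behind rep.recv()/rep.send();
+        no RabbitMQ broker is involved.
+        """
+        msg = json.loads(raw)
+        result = HANDLERS[msg["method"]](msg["args"])
+        return json.dumps({"result": result}).encode()
+
+    # Agent side: the frame that would be sent over a zmq.REQ socket.
+    req = json.dumps(
+        {"method": "get_ihost", "args": {"hostname": "controller-0"}}
+    ).encode()
+    reply = json.loads(handle_request(req))
+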
+
+Kubernetes
+**********
+
+* Adjust overall Kubernetes configuration parameters based on the platform
+  cores allocated
+* Investigate/enhance the number of parallel requests Kubernetes can handle
+  based on the platform cores allocated
+
+etcd
+****
+
+* Adjust etcd configuration parameters based on the platform cores allocated
+
+Postgres
+********
+
+* Adjust overall Postgres configuration parameters based on the platform
+  cores allocated
+* Evaluate and tweak the vacuum process
+
+Service Management Watchdog
+***************************
+
+* Confirm that the |NFS| audit condition is still present in our system and
+  that this audit is required before optimizing the solution
+
+* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
+  configuration) or remove it entirely in case its audit is unnecessary
+
+* Optimize the |NFS| monitoring (if it is required) to avoid the overhead
+  on the proc file system while looking for |NFS|
+
+Overall Performance Evaluation
+******************************
+
+* After all proposed changes are implemented, evaluate the minimum hardware
+  requirements (processor frequency, cache size, and number of cores) and
+  the workload scenarios that enable StarlingX operation on a single
+  platform core
+
+* Verify whether process starvation is occurring. If it is, adjust the
+  priority of critical system and platform processes,
+  as mentioned in :ref:`Performance Impact`
+
+* Update the documentation with the minimum hardware requirements.
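+
+To make the |NFS| monitoring cost concrete, the sketch below contrasts the
+current full proc file system scan with a possible direct check. The
+``/proc/fs/nfsd/threads`` path is an assumption about where the loaded nfsd
+module exposes its thread count, and must be validated as part of this
+work item:
+
+.. code-block:: python
+
+    import os
+
+    def nfsd_running_full_scan(proc="/proc"):
+        """Current approach: read the stat file of every PID folder and
+        look for an nfsd thread. Cost grows with the process count."""
+        for entry in os.listdir(proc):
+            if not entry.isdigit():
+                continue
+            try:
+                with open(os.path.join(proc, entry, "stat")) as f:
+                    if "(nfsd)" in f.read():
+                        return True
+            except OSError:
+                continue  # process exited while scanning
+        return False
+
+    def nfsd_running_direct(threads_file="/proc/fs/nfsd/threads"):
+        """Possible optimization: a single read, independent of the
+        number of processes on the host."""
+        try:
+            with open(threads_file) as f:
+                return int(f.read()) > 0
+        except (OSError, ValueError):
+            return False  # nfsd not loaded or file not exported
+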
+
+Dependencies
+============
+
+* Postgres should be up-versioned to 9.4.X or higher
+
+Testing
+=======
+
+System Configurations
+---------------------
+
+The system configurations that we are assuming for testing are:
+
+* Standalone - AIO-SX
+* Standalone - AIO-DX
+* Distributed Cloud
+
+Test Scenarios
+--------------
+
+We selected some tests that should be defined or changed to cover this spec:
+
+* The usual unit testing in the impacted code areas
+* Full system regression of all StarlingX applications functionality
+  (system application commands, lifecycle actions, etc.)
+* Performance testing to identify and address any performance impacts
+* Backup and restore tests
+* Upgrade and rollback tests
+* Sysinv |RPC| communication tests
+* Distributed Cloud evaluation of scalability and parallel operations
+* In addition, this spec changes the way a StarlingX system is installed
+  and configured, which will require changes to existing automated
+  installation and testing tools.
+
+
+Documentation Impact
+====================
+
+The End User documentation will need to be updated to indicate the minimum
+hardware requirements (number of cores, frequency, and cache sizes) and
+workload scenarios when using a single platform core for StarlingX.
+For instance, assuming that the more pods are running, the more CPU
+processing is needed for their management (processes such as kubelet and
+containerd-shim), the documentation should be reviewed to state the minimum
+number of platform CPU cores based on the number of pods.
+
+Documentation should also be reviewed to inform about the replacement of
+`RabbitMQ`_ with `ZeroMQ`_ for |RPC| communication between sysinv processes
+(sysinv-agent, sysinv-api and sysinv-conductor).
+
+If any new limitations, recommendations, or other requirement updates are
+identified during the development of the changes proposed in this spec,
+they shall be included in the documentation as well.
+
+References
+==========
+
+#. `Armada`_
+#. `FluxCD`_
+#. `RabbitMQ`_
+#. `Firehose`_
+#. `ZeroMQ`_
+#. `gRPC`_
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - stx-8.0
+     - Introduced
+
+.. Abbreviations
+.. |NFS| replace:: :abbr:`NFS (Network File System)`
+.. |RPC| replace:: :abbr:`RPC (Remote Procedure Call)`
+.. |RPCs| replace:: :abbr:`RPCs (Remote Procedure Calls)`
+
+.. Links
+.. _#2010087: https://storyboard.openstack.org/#!/story/2010087
+.. _Armada: https://airship-armada.readthedocs.io/en/latest/
+.. _FluxCD: https://fluxcd.io/docs/
+.. _RabbitMQ: https://www.rabbitmq.com/documentation.html
+.. _Firehose: https://www.rabbitmq.com/firehose.html
+.. _ZeroMQ: https://zguide.zeromq.org/docs/
+.. _gRPC: https://grpc.io/docs/