From f618ca6764bef239d20df0053bc8c7ac404c1387 Mon Sep 17 00:00:00 2001 From: Alyson Deives Pereira Date: Wed, 22 Jun 2022 16:16:10 -0300 Subject: [PATCH] Platform Single Core Tuning Adjustments to allow the execution of StarlingX services in only one core. Reducing StarlingX resource consumption to just one core allows the system to use the remaining resources for a larger workload, increasing the availability of resources for end user applications. Story: 2010087 Task: 45594 Signed-off-by: Alyson Deives Pereira Change-Id: Ibd65aa80c3e0b9f12e67e857a54f070a525a9c20 --- ...gx-2010087-platform-single-core-tuning.rst | 533 ++++++++++++++++++ 1 file changed, 533 insertions(+) create mode 100644 doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst diff --git a/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst b/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst new file mode 100644 index 0000000..df56d70 --- /dev/null +++ b/doc/source/specs/stx-8.0/approved/starlingx-2010087-platform-single-core-tuning.rst @@ -0,0 +1,533 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. http://creativecommons.org/licenses/by/3.0/legalcode + + + +===================================== +StarlingX Platform Single-Core Tuning +===================================== + +Storyboard: `#2010087`_ + +The objective of this spec is to identify and make changes required for the +StarlingX Platform to enable its operation on a single processor core. + +Problem description +=================== + +Resource usage is very intensive on platforms with multiple cores and +processors. Reducing StarlingX resource consumption to just one core allows +the system to use the remaining resources for a larger workload, increasing +the availability of resources for end user applications. 
+
+To identify the changes required to run the platform on a single
+core, we performed a proof of concept with the minimal required changes.
+To characterize the system behavior and to identify required product changes,
+detailed system profiling was performed for key services.
+The objective was not only to measure the individual services, but also to
+identify potential system bottlenecks or performance changes caused by the
+competition for CPU resources.
+
+Below is a brief analysis of critical CPU-consuming services and
+their impact on the system's steady-state operation when running on a single
+platform core.
+The objective of this spec is to address the issues identified by implementing
+the changes described in sections :ref:`Proposed change` and :ref:`Work Items`.
+
+Top CPU Consumers
+-----------------
+
+kube-apiserver
+**************
+
+kube-apiserver health checks show a high number of readyz requests
+(a type of Kubernetes endpoint API), indicating that some pods could be taking
+a long time to respond to a request or to terminate.
+
+From the investigation done on kube-apiserver, most requests come from the
+cert-manager injector, which is at a legacy version (v0.15).
+The requests are due to the leader election process, which remains enabled
+even when running as a single process.
+
+sm-watchdog
+***********
+
+We executed different test scenarios to analyze the process behavior when
+pods were created and deleted. During all the tests, we observed periodic CPU
+spikes every 10 seconds.
+The period of this task is defined by the SM_WATCHDOG_NFS_CHECK_IN_MS
+parameter, which controls the cycle used to verify |NFS| and recover it in
+case of any anomalies.
+The high CPU consumption is caused by the mechanism sm-watchdog uses to check
+the |NFS|.
+To find all the :command:`nfsd` threads, the watchdog code looks at every
+process within the proc file system, which means it scans every folder with
+a process number, looking for a stat file that identifies an |NFS| daemon
+thread.
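The scan described above is roughly equivalent to the following sketch. This is illustrative Python, not the actual sm-watchdog C implementation, and it shows why the cost grows with the total number of processes on the host:

```python
# Illustrative sketch of the costly scan described above: visit every
# /proc/<pid>/stat file and check the process name.  This is a Python
# approximation of the behavior, not the sm-watchdog C code itself.
import os

def find_nfsd_pids(proc="/proc"):
    """Return the PIDs whose stat file reports the comm name 'nfsd'."""
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                # Field 2 of /proc/<pid>/stat is the comm name in parentheses.
                comm = f.read().split()[1]
        except OSError:
            continue  # the process exited while we were scanning
        if comm == "(nfsd)":
            pids.append(int(entry))
    return pids

if os.path.isdir("/proc"):
    print(len(find_nfsd_pids()))
```

Every audit cycle re-reads one file per running process, so the cost is proportional to the process count rather than to the (usually small) number of nfsd threads.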
+
+beam.smp
+********
+
+During a test that created and deleted some pods, the beam.smp process
+(from `RabbitMQ`_) ran with constant CPU usage and some spikes. Three pairs
+of messages are repeated throughout the logs, but the IDs of the "publish"
+and "deliver" routes change. The system logs related to `RabbitMQ`_ contain
+AMQP calls to synchronize the request/reply messages generated by the
+sysinv-conductor. The behavior observed from `RabbitMQ`_ indicates it is
+serving as an |RPC| service for sysinv.
+
+sysinv-agent
+************
+The behavior presented by the sysinv-agent process matches the sysinv logs
+and aligns with what was expected. Every minute, the sysinv-agent wakes up to
+verify whether the system configuration, e.g., memory or storage, needs to be
+modified.
+In short, the sysinv-agent does not represent a significant concern for
+overall system performance. One possible optimization to be evaluated is
+related to the periodic task and its timeframe. Increasing the time between
+the requests, optimizing the periodic operations, or converting it to an
+on-demand task may bring some benefits in terms of CPU time.
+
+sysinv-conductor
+****************
+In the sysinv-conductor process test, we observed two scenarios.
+In the first scenario, which was the most frequently observed, the process
+showed typical daemon behavior with continuous low CPU usage. In the second
+scenario, the process showed CPU spikes every 60 seconds. Skimming the
+source code, we found some periodic task definitions controlled
+by the audit interval. The overall impact of sysinv-conductor on CPU load is
+low. Optimizing the code could decrease the spikes during the system's
+steady-state operation. One option, whenever possible, is to change the
+periodic tasks to on-demand tasks.
+When this approach is not possible, there is still the option to increase the
+interval of the periodic tasks, after evaluating and confirming that doing so
+does not impact system stability.
+
+Use Cases
+---------
+
+As an end user, I want to improve system performance by enabling StarlingX
+to run only within the compute resources of a single CPU core, leaving the
+remaining cores for my application workload.
+
+.. _Proposed Change:
+
+Proposed change
+===============
+
+Platform Core Adjustments
+-------------------------
+
+The following set of changes must be applied to reduce the platform's
+physical core usage from 2 cores to 1.
+
+System Inventory
+****************
+Changes to the sysinv cpu_utils.py and the stx-puppet platform params
+manifest file are required to allow the platform to be configured with only a
+single physical core via the :command:`system host-cpu-modify` command.
+
+Scale Down Services
+*******************
+Many platform services have a number of threads or worker processes that
+scales in direct proportion to the number of platform cores configured.
+However, many have a minimum number of threads under the assumption that they
+must support a minimum scale.
+Changes to the worker-count logic in the stx-puppet platform params manifest
+file are required to allow these services to use only a single core.
+Moreover, this change will also reduce the amount of memory allocated to a
+single service.
+
+The scale-down will take place when a single core is allocated, respecting
+the existing worker allocation rules.
+On small footprints (AIO), the system defines the number of workers based on
+the number of platform cores, with a maximum limit of 2 for AIO-SX
+and 3 for AIO-DX (linear scaling based on the number of platform cores).
+The proposed changes do not alter this existing rule. With the relaxation of
+the minimum limit (from 2 to 1), the system will scale down the number of
+threads to the minimum.
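The worker allocation rule described above can be sketched as follows. The function name and its shape are illustrative only; the real logic lives in the stx-puppet platform params manifest, and only the limits stated in this section (minimum relaxed from 2 to 1, maximum 2 for AIO-SX and 3 for AIO-DX) are taken from the spec:

```python
# Illustrative sketch of the worker-count rule described above.
# The function name is hypothetical; the limits come from this section.

def platform_workers(platform_cores: int, system_type: str) -> int:
    """Scale workers linearly with platform cores, clamped per system type."""
    max_workers = {"AIO-SX": 2, "AIO-DX": 3}.get(system_type, 3)
    # The minimum is relaxed from 2 to 1 to support single-core platforms.
    return max(1, min(platform_cores, max_workers))

print(platform_workers(1, "AIO-SX"))  # single-core platform -> 1 worker
print(platform_workers(4, "AIO-SX"))  # clamped at the AIO-SX maximum -> 2
```

With one platform core configured, every such service drops to a single worker, which also reduces its memory footprint.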
+
+The following services shall be impacted:
+
+.. list-table:: Impacted Services
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Service
+     - Description
+   * - postgres
+     - Object-Relational database management
+   * - etcd
+     - Distributed key-value store
+   * - containerd
+     - Container Runtime
+   * - memcached
+     - Distributed Memory Object Cache
+   * - armada
+     - `Armada`_ Application Management
+   * - keystone
+     - Identity Management
+   * - barbican
+     - Secret Management
+   * - docker-registry
+     - Docker Container Registry
+   * - docker-token-server
+     - Docker Token Server
+   * - kube-apiserver
+     - Kubernetes API Server
+   * - kubelet
+     - Kubernetes Node Agent
+
+
+Kubernetes Tuning
+-----------------
+
+These changes adjust some Kubernetes and etcd parameters and enhance the
+number of parallel requests Kubernetes can handle based on the platform cores
+allocated. Additional tests may be required to define the best tuning values.
+
+* kube-apiserver:
+
+  * max-requests-inflight: Limits the number of API calls that will be
+    processed in parallel, which is a great control point for kube-apiserver
+    memory consumption. The API server can be very CPU intensive when
+    processing a lot of requests in parallel.
+
+* kube-controller-manager, kube-scheduler, kubelet, kube-proxy:
+
+  * kube-api-burst/kube-api-qps: These two flags set the normal and burst
+    rates at which these components can talk to kube-apiserver.
+
+* etcd:
+
+  * heartbeat-interval: The frequency with which the leader notifies
+    followers that it is still the leader.
+  * election-timeout: The election timeout should be set based on the
+    heartbeat interval and the average round-trip time between members.
+  * snapshot-count: etcd appends all key changes to a log file. This log
+    grows forever and is a complete linear history of every change made to
+    the keys. The snapshot-count parameter defines the number of committed
+    transactions that trigger a snapshot to disk, after which the log can be
+    compacted.
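As an illustration only, a core-based tuning table could take a shape like the following. All scaling factors and values are placeholders to be determined by the tests mentioned above; only the parameter names (max-requests-inflight, kube-api-qps, kube-api-burst, heartbeat-interval, election-timeout, snapshot-count) come from this section:

```python
# Hypothetical helper deriving tuning values from the platform core count.
# The factors and defaults are placeholders, not values StarlingX will ship;
# only the parameter names are real Kubernetes/etcd options.

def k8s_tuning(platform_cores: int) -> dict:
    return {
        "kube-apiserver": {
            # fewer cores -> fewer parallel requests admitted
            "max-requests-inflight": 100 * platform_cores,
        },
        "kube-controller-manager": {
            "kube-api-qps": 10 * platform_cores,
            "kube-api-burst": 20 * platform_cores,
        },
        "etcd": {
            "heartbeat-interval": "100ms",  # typically fixed, not core-based
            "election-timeout": "1000ms",   # set relative to the heartbeat
            "snapshot-count": 10000,
        },
    }

print(k8s_tuning(1)["kube-apiserver"]["max-requests-inflight"])  # -> 100
```

Whether each parameter should scale linearly, stay fixed, or follow another curve is exactly what the additional tests need to establish.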
+
+Postgres Tuning
+---------------
+
+During our analysis, we identified many parameters related to parallel workers
+and the vacuum process as potential tuning sources for Postgres.
+This change adjusts the overall parameters based on the platform cores
+allocated. Additional tests may be required to define the best tuning values.
+
+Service Management Watchdog
+---------------------------
+
+Enhance the sm-watchdog process on two different fronts:
+
+* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
+  configuration).
+* Optimize the |NFS| monitoring to avoid the overhead on the proc file system
+  while looking for |NFS|.
+
+System Inventory
+----------------
+
+Periodic and Runtime Tasks
+**************************
+
+Currently, sysinv-conductor and sysinv-agent have many periodic tasks that
+should be reviewed and, if possible, redesigned. The main focus is to reduce
+sysinv's regular spikes by:
+
+* Refactoring legacy code;
+* Increasing time intervals when possible;
+* Converting periodic tasks to on-demand tasks, when possible.
+
+Remote Procedure Calls
+**********************
+
+System Inventory Remote Procedure Calls (|RPCs|) are performed using
+`RabbitMQ`_ as a communication transport layer between the different
+processes.
+The target is to convert the internal System Inventory |RPC| calls from
+`RabbitMQ`_ to `ZeroMQ`_, a brokerless messaging solution.
+
+Affected sysinv modules:
+
+* agent
+* api
+* conductor
+* cmd
+* fpga_agent
+* helm
+* Scripts/manage-partitions
+
+
+Alternatives
+------------
+
+An alternative is to use `gRPC`_ instead of `ZeroMQ`_. This option should be
+analyzed further if the proposed solution proves unsuitable.
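In its simplest form, the RabbitMQ-to-ZeroMQ conversion described above replaces the brokered AMQP request/reply path with a direct REQ/REP socket pair. The sketch below uses pyzmq; the endpoint, method, and argument names are illustrative only, and the real sysinv RPC layer is considerably more involved:

```python
# Minimal REQ/REP sketch of a brokerless sysinv-style RPC using pyzmq.
# Endpoint and method names are hypothetical, for illustration only.
import threading
import zmq

ENDPOINT = "inproc://sysinv-rpc-demo"  # tcp://... between real processes
ctx = zmq.Context.instance()
ready = threading.Event()

def conductor():
    """Server side: answers one RPC request (sysinv-conductor role)."""
    rep = ctx.socket(zmq.REP)
    rep.bind(ENDPOINT)
    ready.set()
    request = rep.recv_json()
    rep.send_json({"result": f"handled {request['method']}"})
    rep.close()

t = threading.Thread(target=conductor)
t.start()
ready.wait()

# Client side (sysinv-agent role): talks to the server directly,
# with no broker process in the middle.
req = ctx.socket(zmq.REQ)
req.connect(ENDPOINT)
req.send_json({"method": "get_host", "args": {"hostname": "controller-0"}})
reply = req.recv_json()
t.join()
req.close()
print(reply["result"])
```

Removing the broker eliminates the beam.smp CPU cost observed in the profiling, at the price of having to handle addressing, retries, and fan-out patterns in the library or application layer.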
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Other end-user impact
+---------------------
+
+The default configuration for platform cores will be changed to 1 core, and
+the system recommendations will be adjusted to comply with the minimum
+required platform cores based on the processor/use case.
+End users must be aware of the hardware requirements and limitations, and
+configure the system according to their workload scenario.
+
+.. _Performance Impact:
+
+Performance Impact
+------------------
+
+To maintain system stability while operating with fewer compute resources,
+it may be required to adjust the priority of critical system and platform
+processes during the execution of this spec. If process starvation occurs,
+the system may reboot or declare specific services as failed and
+attempt recovery. If this is experienced, the starved process's priority will
+need to be increased.
+
+.. list-table:: Potential Service Impacts
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Service
+     - Description
+   * - hostwd
+     - Host Watchdog
+   * - pmond
+     - Process Monitor
+   * - sm
+     - Service Manager
+   * - kubelet
+     - Kubernetes Node Agent
+   * - hbsAgent
+     - Heartbeat Service Agent
+   * - hbsClient
+     - Heartbeat Service Client
+   * - mtcAgent
+     - Maintenance Agent
+   * - mtcClient
+     - Maintenance Client
+
+In a distributed cloud scenario, some timing impact is expected on subcloud
+operations due to the resource limitation, but no impact on scalability is
+expected.
+
+Other deployer impact
+---------------------
+Automated deployment technologies should be aware of the new `ZeroMQ`_
+library.
+
+Developer impact
+----------------
+We assume that there is no visible developer impact.
+
+Upgrade impact
+--------------
+The new `ZeroMQ`_ message queue library being added to sysinv by this spec
+could impact backup and restore, upgrade, and rollback.
+Tests should be performed to validate both the new and the old behavior.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  * Guilherme Batista Leite (guilhermebatista)
+
+Other contributors:
+  * Alexandre Horst (ahorst)
+  * Alyson Deives Pereira (adeivesp)
+  * Bruno Costa (bdacosta)
+  * Caio Cesar Ferreira (ccesarfe)
+  * Davi Frossard (dbarrosf)
+  * Eduardo Alberti (ealberti)
+  * Guilherme Alberici de Santi (galberic)
+  * Isac Sacchi e Souza (isouza)
+  * Marcos Paulo Oliveira Silva (mpaulool)
+  * Romão Martines (rmartine)
+  * Thiago Antonio Miranda (tamiranda)
+
+
+Repos Impacted
+--------------
+
+Repositories in StarlingX that are impacted by this spec:
+  * starlingx/ansible-playbooks
+  * starlingx/config
+  * starlingx/config-files
+  * starlingx/integ
+  * starlingx/stx-puppet
+  * starlingx/docs
+
+.. _Work Items:
+
+Work Items
+----------
+
+Scale Down Services
+*******************
+
+* Adjust the following platform services to account for the minimum number of
+  threads/processes based on the system configuration and the number of
+  platform cores: barbican, containerd, docker-registry,
+  docker-token-server, keystone, kube-apiserver, kubelet, memcached, postgres.
+
+System Inventory
+****************
+
+* Adjust the sysinv check to allow 1 platform core utilization
+* Change the default behavior to 1 platform core utilization
+* Refactor legacy code
+* Review existing periodic tasks, converting them to on-demand if possible
+* Adjust the periodic tasks' timing intervals based on each task's needs
+* Refactor sysinv-fpga-agent so that it is launched only when it is required
+* Clean up/review the existing |RPC| calls to adopt a more consistent |RPC|
+  usage model and to reduce the number of different calls that need to be
+  supported
+* Convert internal |RPC| calls from `RabbitMQ`_ to the brokerless solution
+  `ZeroMQ`_
+
+Kubernetes
+**********
+
+* Adjust the overall Kubernetes configuration parameters based on the
+  platform cores allocated
+* Investigate/enhance the number of parallel requests Kubernetes can handle
+  based on the platform cores allocated
+
+etcd
+****
+
+* Adjust the etcd configuration parameters based on the platform cores
+  allocated
+
+Postgres
+********
+
+* Adjust the overall Postgres configuration parameters based on the platform
+  cores allocated
+* Evaluate and tweak the vacuum process
+
+Service Management Watchdog
+***************************
+
+* Evaluate whether the |NFS| audit condition is still present in the system
+  and whether this audit is required, before optimizing the solution
+
+* Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX
+  configuration), or remove it entirely in case its audit is unnecessary
+
+* Optimize the |NFS| monitoring (if it is still required) to avoid the
+  overhead on the proc file system while looking for |NFS|
+
+Overall Performance Evaluation
+******************************
+
+* After all proposed changes are implemented, evaluate the minimum hardware
+  requirements (processor frequency, cache size, and number of cores) and the
+  supported workload scenarios for StarlingX operation on a single platform
+  core
+
+* Verify whether process starvation is occurring. If that is the case, adjust
+  the priority of critical system and platform processes,
+  as mentioned in :ref:`Performance Impact`
+
+* Update the documentation with the minimum hardware requirements
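As one illustration of the |NFS| monitoring optimization above, the per-process /proc scan could be replaced by a single read of the nfsd thread count. The use of /proc/fs/nfsd/threads is an assumption here (it is only available when the nfsd filesystem is mounted), not the committed design:

```python
# Hypothetical optimized NFS check: read the nfsd thread count directly
# instead of walking every /proc/<pid> directory.  Availability of
# /proc/fs/nfsd/threads depends on the nfsd filesystem being mounted.
from pathlib import Path

def nfsd_thread_count() -> int:
    """Return the number of running nfsd threads, or 0 if unavailable."""
    try:
        return int(Path("/proc/fs/nfsd/threads").read_text().strip())
    except (OSError, ValueError):
        return 0  # nfsd filesystem not mounted, or the file is unreadable

print(nfsd_thread_count())
```

This turns each audit cycle into a constant-cost file read, independent of how many processes are running on the host.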
+
+Dependencies
+============
+
+* Postgres should be up-versioned to 9.4.X or higher
+
+Testing
+=======
+
+System Configurations
+---------------------
+The system configurations that we are assuming for testing are:
+
+* Standalone - AIO-SX
+* Standalone - AIO-DX
+* Distributed Cloud
+
+Test Scenarios
+--------------
+We selected some tests that should be defined or changed to
+cover this spec:
+
+* The usual unit testing in the impacted code areas
+* Full system regression of all StarlingX application functionality (system
+  application commands, lifecycle actions, etc.)
+* Performance testing to identify and address any performance impacts
+* Backup and restore tests
+* Upgrade and rollback tests
+* sysinv |RPC| communication tests
+* Distributed Cloud evaluation of scalability and parallel operations
+
+In addition, this spec changes the way a StarlingX system is installed
+and configured, which will require changes in the existing automated
+installation and testing tools.
+
+
+Documentation Impact
+====================
+The end user documentation will need to be updated to indicate the minimum
+hardware requirements (number of cores, frequency, and cache sizes) and the
+workload scenarios when using a single platform core for StarlingX.
+For instance, since the more pods are running, the more CPU processing is
+needed for their management (by processes such as kubelet and
+containerd-shim), the documentation should be reviewed to state the minimum
+number of platform CPU cores based on the number of pods.
+
+Documentation should also be reviewed to inform users of the replacement of
+`RabbitMQ`_ with `ZeroMQ`_ for |RPC| communication between the sysinv
+processes (sysinv-agent, sysinv-api and sysinv-conductor).
+
+If any new limitations, recommendations, or requirement updates are
+identified during the development of the proposed changes in this spec, they
+shall be included in the documentation as well.
+
+References
+==========
+#. `Armada`_
+#. `FluxCD`_
+#. `RabbitMQ`_
+#. `Firehose`_
+#. `ZeroMQ`_
+#. `gRPC`_
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - stx-8.0
+     - Introduced
+
+.. Abbreviations
+.. |NFS| replace:: :abbr:`NFS (Network File System)`
+.. |RPC| replace:: :abbr:`RPC (Remote Procedure Call)`
+.. |RPCs| replace:: :abbr:`RPCs (Remote Procedure Calls)`
+
+.. Links
+.. _#2010087: https://storyboard.openstack.org/#!/story/2010087
+.. _Armada: https://airship-armada.readthedocs.io/en/latest/
+.. _FluxCD: https://fluxcd.io/docs/
+.. _RabbitMQ: https://www.rabbitmq.com/documentation.html
+.. _Firehose: https://www.rabbitmq.com/firehose.html
+.. _ZeroMQ: https://zguide.zeromq.org/docs/
+.. _gRPC: https://grpc.io/docs/