Merge "Add spec for Infrastructure and Cluster Monitoring"

Zuul 2019-08-16 14:29:56 +00:00 committed by Gerrit Code Review
commit 4a55572e52
1 changed files with 290 additions and 0 deletions

@ -0,0 +1,290 @@
..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License. http://creativecommons.org/licenses/by/3.0/legalcode

..

======================================================
System Monitor - Cluster and Infrastructure Monitoring
======================================================
This spec describes the StarlingX System Monitor, which allows for the
persistence and analysis of Cluster and Infrastructure Monitoring data.
https://storyboard.openstack.org/#!/story/2005733
Problem description
===================
The monitoring service needs to be Kubernetes Cluster aware so that it can
provide monitoring of both the infrastructure and pods/containers.
The logs and metrics of the containerized system are not persisted when
a pod is deleted.
Analysis of logs and metrics requires a tool that supports searching,
filtering and aggregation in order to mine information.
Use Cases
---------
A system administrator wants:

* Realtime collection, monitoring and analysis across the infrastructure,
  cluster and application
* Persistent and scalable storage of both structured (metrics) and
  unstructured (logs) data
* Metadata enrichment of collected data to ensure context is not lost
  (i.e. container and host information)
* Query and visualization for both realtime and historical data analysis
* On-box and off-box deployment options. In the off-box deployment, the
  collection components still need to be deployed on the system,
  but the storage and visualization components are optional.
* Automatic system configuration for the different deployment
  configurations (AIO-SX, AIO-DX, Standard), e.g. via system Helm overrides
Proposed change
===============
The Elastic (www.elastic.co) set of software components will be deployed
as an optional application in containers to provide full-stack monitoring.

The Elastic 7.x components are deployed from the "oss" images so that all
included components are under the Apache-2.0 license. Compared with
Elasticsearch 6.x, the 7.x release brings improvements including
configurable index lifecycle management and security features.
The existing collectd custom metrics and monitoring data will be
integrated with Elasticsearch. This will supplant the current storage
in InfluxDB.
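
As a rough illustration of what integrating a collectd sample with
Elasticsearch involves, the sketch below renders collectd-style samples as
the newline-delimited JSON body accepted by the Elasticsearch bulk API. The
index name 'collectd-metrics' and the sample field names are assumptions made
for illustration; this spec does not define them, and in the proposed design
this integration is the role of Logstash rather than custom code.

```python
import json
from datetime import datetime, timezone


def collectd_to_bulk(samples, index="collectd-metrics"):
    """Render collectd-style samples as an Elasticsearch bulk request body.

    The index name and the sample field names are illustrative assumptions.
    """
    lines = []
    for sample in samples:
        doc = {
            # Convert the epoch timestamp to an ISO 8601 string in UTC.
            "@timestamp": datetime.fromtimestamp(
                sample["time"], tz=timezone.utc
            ).isoformat(),
            "host": sample["host"],
            "plugin": sample["plugin"],
            "type": sample["type"],
            "value": sample["value"],
        }
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = collectd_to_bulk(
        [{"time": 1565965796, "host": "controller-0",
          "plugin": "cpu", "type": "percent", "value": 12.5}]
    )
    print(body, end="")
```

The resulting body is what a client would send to the '_bulk' endpoint with
the 'application/x-ndjson' content type.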
The following minimum components are required:
* Elasticsearch - https://www.elastic.co/products/elasticsearch
* Filebeat - https://www.elastic.co/products/filebeat
* Metricbeat - https://www.elastic.co/products/metricbeat
* Kibana - https://www.elastic.co/products/kibana
* Logstash - https://www.elastic.co/products/logstash
  (for providing off-box data streams and the integration of collectd
  metrics)
Alternatives
------------
The existing 'collect' Log Collection Tool can capture logs on demand, but
not in real time. If a pod is deleted before its data is captured, the data
is lost. This alternative also does not include correlated Kubernetes events
such as pod lifecycle events.
Other technologies were considered for metric collection:
* Fluentd
* collectd
* Prometheus
* InfluxData
Elastic was chosen because it covers the full observability spectrum in a
single, unified solution.
Elastic has a large and active open source community that provides many
integrations and community contributions, making it well suited for a
flexible monitoring solution.
Data model impact
-----------------
The sysinv application and system helm-overrides framework is leveraged
to allow deployment of the stx-monitor application. The Helm plugins for
stx-monitor are added and subclassed from the base Helm class.
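
The subclassing pattern described above might look roughly like the sketch
below. Everything named here is an assumption for illustration: 'BaseHelm' is
a local stand-in rather than the actual sysinv base class, and the method
name 'get_overrides', the 'monitor' namespace and the override keys are
hypothetical.

```python
class BaseHelm:
    """Hypothetical stand-in for the sysinv base Helm class."""

    CHART = None
    SUPPORTED_NAMESPACES = []

    def get_overrides(self, namespace=None):
        raise NotImplementedError


class ElasticsearchHelm(BaseHelm):
    """Hypothetical plugin that produces system overrides for one chart."""

    CHART = "elasticsearch"
    SUPPORTED_NAMESPACES = ["monitor"]

    def __init__(self, controller_count):
        self.controller_count = controller_count

    def get_overrides(self, namespace=None):
        overrides = {
            "monitor": {
                # Illustrative value only: one master per controller.
                "master": {"replicas": self.controller_count},
            }
        }
        if namespace is None:
            # No namespace requested: return overrides for all namespaces.
            return overrides
        if namespace not in self.SUPPORTED_NAMESPACES:
            raise ValueError("unsupported namespace: %s" % namespace)
        return overrides[namespace]


if __name__ == "__main__":
    print(ElasticsearchHelm(controller_count=2).get_overrides("monitor"))
```

Keeping the system-derived values inside the plugin lets the same chart be
deployed on different system configurations without the user having to
supply those values by hand.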
REST API impact
---------------
The deployed Kibana and Elasticsearch containers expose REST API services.
These stable Elastic APIs are documented with each release. Breaking changes
to these APIs should occur only in major versions and are documented.
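
As an example of the kind of request the Elasticsearch REST API accepts, the
sketch below builds a Query DSL body that selects recent log entries for one
pod. The field name 'kubernetes.pod.name', the pod name and the one-hour
default window are assumptions for illustration; the actual field names
depend on how the collectors are configured.

```python
import json


def build_pod_log_query(pod_name, since="now-1h", size=50):
    """Build a search body for recent log entries of a single pod.

    The 'kubernetes.pod.name' field is an assumed metadata field name.
    """
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter context: exact match on the pod, bounded in time.
                "filter": [
                    {"term": {"kubernetes.pod.name": pod_name}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pod_log_query("my-app-0"), indent=2))
```

Such a body would be sent as JSON to the '_search' endpoint of the relevant
index or index pattern.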
Security impact
---------------
Elasticsearch data is exposed via the Kibana ingress port. Because the logs
of the system are collected, a security layer is provided at ingress to
restrict access to the admin.
The Elastic 7.1+ security features will be leveraged for Role Based Access
Control.
Other end user impact
---------------------
* The admin may interact with System Monitoring via the Kibana GUI
* The admin may interact with System Monitoring via the Elasticsearch API
Performance Impact
------------------
The stx-monitor application must be applied in order to perform Elastic
System Monitoring; it is inert when the application is not applied.
The stx-monitor Helm chart configuration is generated by a sysinv Helm
plugin, which supports system overrides.
There will be an impact on the management network and management cores due to
the overhead of periodically collecting system resources. Additionally,
the Elasticsearch indexing and searching will increase memory and CPU usage of
the control nodes (or any other nodes labeled to serve Elastic components).
The stx-monitor application is configured by the system with engineered
defaults that depend on the system configuration, such as the available
Elastic master nodes (controllers) and the storage available on Elastic data
nodes. The defaults can be overridden via user Helm overrides during
stx-monitor deployment.
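
The layering of user overrides on top of system engineered defaults can be
pictured as a recursive merge in which user values win. The sketch below is
only an illustration of that idea; the keys and values are hypothetical and
are not the engineered defaults of stx-monitor.

```python
import copy


def merge_overrides(system_defaults, user_overrides):
    """Return system defaults with user overrides layered on top.

    Nested dictionaries are merged key by key; any other value supplied
    by the user replaces the system default.
    """
    merged = copy.deepcopy(system_defaults)
    for key, value in user_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical defaults for a two-controller system.
    defaults = {
        "master": {"replicas": 2},
        "data": {"replicas": 2, "storage": "100Gi"},
    }
    # Hypothetical user override that only changes the data volume size.
    user = {"data": {"storage": "200Gi"}}
    print(merge_overrides(defaults, user))
```

Because the merge is recursive, a user who overrides a single nested value
keeps every other engineered default.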
Other deployer impact
---------------------
The versioned stx-monitor application is built as part of StarlingX build and
is available for download on a CENGN server.
The OAM Network must be enabled and allow access to:

* docker.elastic.co
* k8s.gcr.io

The application is applied via 'system application-apply'. The nodes on which
the application runs are controlled by 'system host-label-assign'.
Optionally, the administrator may also use 'system helm-override-update'
to customize the Helm application, e.g. the log and metrics filters and the
index lifecycle policies.
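
The deployment steps above can be sketched as a small helper that assembles
the command lines without running them. Only the three 'system' subcommands
come from this spec. The argument layout, the '--set' option usage, the host
names, the label 'elastic-controller=enabled', the chart and namespace names,
and the ordering of the steps are all assumptions for illustration and should
be checked against the CLI help of the target release.

```python
import shlex


def monitor_deploy_commands(hosts, label, app="stx-monitor", overrides=None):
    """Assemble (but do not run) the commands for deploying the application.

    'overrides' maps (chart, namespace) to a dict of values. The argument
    layout of each command is an assumption, not taken from the spec.
    """
    # Label the hosts that should run the application components.
    commands = [["system", "host-label-assign", host, label] for host in hosts]
    # Optional user overrides, one command per chart.
    for (chart, namespace), values in sorted((overrides or {}).items()):
        cmd = ["system", "helm-override-update", app, chart, namespace]
        for key, value in sorted(values.items()):
            cmd += ["--set", "%s=%s" % (key, value)]
        commands.append(cmd)
    # Finally, apply the application.
    commands.append(["system", "application-apply", app])
    return commands


if __name__ == "__main__":
    cmds = monitor_deploy_commands(
        hosts=["controller-0", "controller-1"],
        label="elastic-controller=enabled",  # hypothetical label
        overrides={("elasticsearch", "monitor"): {"data.replicas": 2}},
    )
    for cmd in cmds:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the commands as argument lists keeps quoting explicit and makes it
easy to review the planned steps before running anything on a system.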
Developer impact
----------------
Developers may apply the stx-monitor application to gain insights via logs
and metrics of their developed application.
Developers may create user overrides to customize the Helm charts
configuration.
Upgrade impact
--------------
Not applicable, as this is the initial release of System Monitoring.
stx-monitor is a versioned application tarball which deploys the Elastic
Monitoring service via Armada Helm charts as Kubernetes containers.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
john.kung@windriver.com
kevin.smith@windriver.com
Architect/contributors:
matt.peters@windriver.com
Repos Impacted
--------------
The following StarlingX repositories are impacted by this spec:
* starlingx/config
* starlingx/tools
* starlingx/upstream
* starlingx/docs
Work Items
----------
* Add helm-charts for Elastic components to mirror and build tarball-dl.lst
* Add Armada manifest to deploy Elastic components
* Add sysinv application handling for Elastic components
* Add sysinv application system overrides for Elastic components
* Set system engineered defaults for System Monitoring
Dependencies
============
* Elasticsearch 7.x stable Helm charts. This feature can be based on updates
to the current stable Helm charts (6.7.0) at https://github.com/helm/charts
* Kube-State-Metrics - https://k8s.gcr.io/kube-state-metrics
Testing
=======
Verify cluster pod logs and lifecycle events are persisted.
Verify infrastructure logs and metrics are persisted.
Verify that the data index lifecycle policy is sufficient to persist data.
Verify system engineering of the configured System Monitoring components.
Verify system configuration for AIO-SX, AIO-DX and Standard deployments.
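
When checking whether an index lifecycle policy retains enough data, a
back-of-the-envelope estimate can help. The sketch below assumes that disk
usage is roughly the daily ingest multiplied by the number of copies (primary
plus replicas), with a fraction of the disk kept free as headroom. It ignores
compression and index overhead, and all numbers are hypothetical.

```python
def retention_days(storage_gb, daily_ingest_gb, replicas=1, headroom=0.25):
    """Estimate how many whole days of data fit in the available storage.

    Assumes stored size is daily ingest times (1 + replicas) and that a
    'headroom' fraction of the storage is kept free.
    """
    if daily_ingest_gb <= 0:
        raise ValueError("daily_ingest_gb must be positive")
    usable_gb = storage_gb * (1.0 - headroom)
    per_day_gb = daily_ingest_gb * (1 + replicas)
    return int(usable_gb // per_day_gb)


if __name__ == "__main__":
    # 400 GB of storage, 10 GB ingested per day, one replica, 25% headroom:
    # 300 GB usable / 20 GB per day = 15 days.
    print(retention_days(400, 10))
```

If the estimate is lower than the maximum age configured in the lifecycle
policy, data would fill the disk before the policy deletes it.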
Documentation Impact
====================
This story affects the StarlingX installation, configuration and
system engineering documentation.
References
==========
* StarlingX Metrics https://storyboard.openstack.org/#!/board/145
* Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
* Filebeat https://www.elastic.co/guide/en/beats/filebeat/current/index.html
* Kibana https://www.elastic.co/guide/en/kibana/current/index.html
* Logstash https://www.elastic.co/guide/en/logstash/current/index.html
* Metricbeat https://www.elastic.co/guide/en/beats/metricbeat/current/index.html
* Licensing https://www.elastic.co/subscriptions
  The StarlingX stx-monitor application, when optionally applied, deploys
  the OSS (Apache-2.0) version.
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-3.0
     - Introduced