Merge "Add spec for Infrastructure and Cluster Monitoring"
commit 4a55572e52

..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License. http://creativecommons.org/licenses/by/3.0/legalcode

..

======================================================
System Monitor - Cluster and Infrastructure Monitoring
======================================================

This spec describes the StarlingX System Monitor, which enables the
persistence and analysis of cluster and infrastructure monitoring data.

https://storyboard.openstack.org/#!/story/2005733


Problem description
===================

The monitoring service needs to be Kubernetes cluster aware so that it can
monitor both the infrastructure and the pods/containers.

The logs and metrics of the containerized system are not persisted when
a pod is deleted.

Analyzing logs and metrics requires a tool that supports searching, filtering
and aggregation to extract information.


Use Cases
---------

A system administrator wants:

* Real-time collection, monitoring and analysis across the infrastructure,
  cluster and application

* Persistent and scalable storage of both structured (metrics) and
  unstructured (logs) data

* Metadata enrichment of collected data to ensure context is not lost
  (i.e. container and host information)

* Query and visualization for both real-time and historical data analysis

* On-box and off-box deployment options. In the off-box deployment, the
  collection components would still need to be deployed on the system,
  but the storage and visualization components would be optional.

* Automatic configuration for the different deployment configurations
  (AIO-SX, AIO-DX, Standard), e.g. via system Helm overrides

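
The metadata enrichment use case can be illustrated with a minimal sketch.
It is not the mechanism used by the Elastic collectors; it only shows the
idea of attaching host and container context to an event before it is
stored. The function name ``enrich`` and all field names and values are
assumptions chosen for illustration.

.. code-block:: python

   from copy import deepcopy


   def enrich(event, host, container=None):
       """Return a copy of ``event`` annotated with host and container context."""
       enriched = deepcopy(event)
       enriched["host"] = {"name": host}
       if container is not None:
           enriched["container"] = dict(container)
       return enriched


   # Hypothetical raw event and context, for illustration only.
   raw = {"message": "disk usage at 91%", "@timestamp": "2019-06-01T12:00:00Z"}
   doc = enrich(raw, host="controller-0",
                container={"name": "example-app", "pod": "example-app-0"})

Because the context travels with each stored document, the event can still
be attributed to its host and container after the pod has been deleted.
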

Proposed change
===============

The Elastic (www.elastic.co) set of software components will be deployed
in containers as an optional application to provide full-stack monitoring.

The Elastic 7.x components are deployed from the "oss" images so that all
included components are under the Apache-2.0 License.

The 7.x release brings improvements over Elasticsearch 6.x, including
configurable index lifecycle management and security features.

The existing collectd custom metrics and monitoring data will be
integrated with Elasticsearch. This would supplant the current storage
in InfluxDB.

The following minimum components are required:

* Elasticsearch - https://www.elastic.co/products/elasticsearch

* Filebeat - https://www.elastic.co/products/filebeat

* Metricbeat - https://www.elastic.co/products/metricbeat

* Kibana - https://www.elastic.co/products/kibana

* Logstash - https://www.elastic.co/products/logstash (provides off-box
  data streams and the integration of collectd metrics)


Alternatives
------------

Logs can be captured on demand, rather than in real time, via the 'collect'
Log Collection Tool. However, if a pod is deleted before its data is
captured, the data is lost. This alternative also does not include correlated
Kubernetes events such as pod lifecycle events.

Other technologies were considered for metric collection:

* fluentd
* collectd
* prometheus
* influxdata

Elastic was chosen because it covers the full observability spectrum in a
common, unified solution.

Elastic has a large and active open source community that provides
many integrations and community contributions, making it well suited for a
flexible monitoring solution.


Data model impact
-----------------

The sysinv application and system helm-overrides framework is leveraged
to allow deployment of the stx-monitor application. The Helm plugins for
stx-monitor are added and subclassed from the base Helm class.

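
The plugin pattern described above can be sketched as follows. This is a
self-contained illustration, not sysinv code: ``BaseHelm`` here is a local
stand-in for the base Helm class, and the plugin name, chart name,
namespace and override keys are all hypothetical.

.. code-block:: python

   class BaseHelm:
       """Local stand-in for the base Helm plugin class (not sysinv's)."""

       CHART = None
       SUPPORTED_NAMESPACES = []

       def get_overrides(self, namespace=None):
           """Return system overrides for one namespace, or for all of them."""
           raise NotImplementedError


   class ElasticsearchHelm(BaseHelm):
       """Hypothetical plugin that generates system overrides for one chart."""

       CHART = "elasticsearch"            # assumed chart name
       SUPPORTED_NAMESPACES = ["monitor"]  # assumed namespace

       def __init__(self, controller_count):
           self.controller_count = controller_count

       def get_overrides(self, namespace=None):
           # One override set per supported namespace, derived from system state.
           overrides = {
               ns: {"master": {"replicas": self.controller_count}}
               for ns in self.SUPPORTED_NAMESPACES
           }
           if namespace is None:
               return overrides
           if namespace not in overrides:
               raise ValueError("unsupported namespace: %s" % namespace)
           return overrides[namespace]

Each chart gets its own subclass, so system-derived values stay isolated
per chart while sharing a common interface.
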


REST API impact
---------------

The deployed Kibana and Elasticsearch containers expose REST API services.

These Elastic stable APIs are documented with each release. Breaking changes
to these APIs should only occur in major versions and are documented.

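
As an illustration of how an administrator might use the Elasticsearch
search API, the sketch below builds a request body in the Elasticsearch
Query DSL that selects recent log events for a single pod. It only
constructs the JSON body; the body would be sent to the ``_search``
endpoint of an index. The field names ``kubernetes.pod.name`` and
``@timestamp``, the pod name and the helper name are assumptions about how
the collected documents are indexed.

.. code-block:: python

   import json


   def build_pod_log_query(pod_name, since="now-15m", size=50):
       """Build a search body for the most recent log events of one pod."""
       return {
           "size": size,
           # Newest events first.
           "sort": [{"@timestamp": {"order": "desc"}}],
           "query": {
               "bool": {
                   # Filter clauses select documents without scoring them.
                   "filter": [
                       {"term": {"kubernetes.pod.name": pod_name}},
                       {"range": {"@timestamp": {"gte": since}}},
                   ]
               }
           },
       }


   body = build_pod_log_query("example-app-0")  # hypothetical pod name
   print(json.dumps(body, indent=2))
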


Security impact
---------------

Elasticsearch data is exposed via the Kibana ingress port. Because the system
logs are collected, a security layer is provided at ingress to restrict
access to the admin.

The Elastic 7.1+ security features are leveraged for Role Based Access
Control.


Other end user impact
---------------------

* The admin may interact with System Monitoring via the Kibana GUI

* The admin may interact with System Monitoring via the Elasticsearch API


Performance Impact
------------------

stx-monitor must be applied in order to perform Elastic System Monitoring;
it has no performance impact when the application is not applied.

The stx-monitor Helm chart configuration is provided by a sysinv Helm plugin
which supports system overrides.

There will be an impact on the management network and management cores due to
the overhead of periodically collecting system resource data. Additionally,
Elasticsearch indexing and searching will increase the memory and CPU usage of
the controller nodes (or any other nodes labeled to serve Elastic components).

The stx-monitor application is configured by the system with engineered
defaults that depend on the system configuration, such as the number of
available Elastic master nodes (controllers) and the storage available on
Elastic data nodes. The defaults can be overridden via user Helm overrides
during stx-monitor deployment.

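
The precedence between engineered defaults and user overrides can be
sketched as a recursive merge in which user values win. This is only an
illustration of the layering idea, not the logic used by sysinv or Helm;
the function name and all keys and values are hypothetical.

.. code-block:: python

   def merge_overrides(defaults, user):
       """Recursively merge ``user`` onto ``defaults`` without mutating either."""
       merged = dict(defaults)
       for key, value in user.items():
           if isinstance(value, dict) and isinstance(merged.get(key), dict):
               # Merge nested sections key by key instead of replacing them.
               merged[key] = merge_overrides(merged[key], value)
           else:
               merged[key] = value
       return merged


   # Hypothetical engineered defaults and a user override of one value.
   system_defaults = {"master": {"replicas": 2}, "data": {"storage": "100Gi"}}
   user_overrides = {"data": {"storage": "200Gi"}}
   effective = merge_overrides(system_defaults, user_overrides)

Settings the user does not mention keep their engineered values, so an
override file only needs to list what differs from the defaults.
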


Other deployer impact
---------------------

The versioned stx-monitor application is built as part of the StarlingX build
and is available for download from a CENGN server.

The OAM network must be enabled and allow access to:

* docker.elastic.co

* k8s.gcr.io

The application is applied via 'system application-apply'. The nodes on which
the application runs are controlled by 'system host-label-assign'.

Optionally, the administrator may also use 'system helm-override-update' to
customize the Helm application, e.g. the log and metrics filters and the
index lifecycle policies.

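
The deployment steps above might be ordered as in the following sketch,
which only composes the command lines and does not execute them. The
argument order and the ``--values`` option are assumptions that should be
checked against the ``system`` CLI help on the target release; the label,
chart name, namespace and file name are hypothetical.

.. code-block:: python

   import shlex


   def deployment_commands(app, hosts, label, chart, namespace,
                           values_file=None):
       """Return the CLI invocations, in order, as argument lists."""
       # Label the hosts first so the application knows where it may run.
       commands = [["system", "host-label-assign", host, label]
                   for host in hosts]
       # Optional user overrides are set before the application is applied.
       if values_file is not None:
           commands.append(["system", "helm-override-update", app, chart,
                            namespace, "--values", values_file])
       commands.append(["system", "application-apply", app])
       return commands


   for argv in deployment_commands(
           app="stx-monitor",
           hosts=["controller-0", "controller-1"],
           label="elastic-controller=enabled",  # hypothetical label
           chart="elasticsearch",               # hypothetical chart name
           namespace="monitor",                 # hypothetical namespace
           values_file="es-overrides.yaml"):    # hypothetical file
       print(" ".join(shlex.quote(arg) for arg in argv))
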


Developer impact
----------------

Developers may apply the stx-monitor application to gain insights, via logs
and metrics, into their developed applications.

Developers may create user overrides to customize the Helm chart
configuration.


Upgrade impact
--------------

Not applicable, as this is the initial release of System Monitoring.

stx-monitor is a versioned application tarball which deploys the Elastic
Monitoring service via Armada Helm charts as Kubernetes containers.


Implementation
==============

Assignee(s)
-----------

Primary assignees:
  john.kung@windriver.com
  kevin.smith@windriver.com

Architect/contributors:
  matt.peters@windriver.com


Repos Impacted
--------------

The following StarlingX repositories are impacted by this spec:

* starlingx/config
* starlingx/tools
* starlingx/upstream
* starlingx/docs


Work Items
----------

* Add Helm charts for Elastic components to mirror and build tarball-dl.lst
* Add Armada manifest to deploy Elastic components
* Add sysinv application handling for Elastic components
* Add sysinv application system overrides for Elastic components
* Set system engineered defaults for System Monitoring


Dependencies
============

* Elasticsearch 7.x stable Helm charts. This feature can be based on updates
  to the current stable Helm charts (6.7.0) at https://github.com/helm/charts

* Kube-State-Metrics - https://k8s.gcr.io/kube-state-metrics


Testing
=======

Verify cluster pod logs and lifecycle events are persisted.

Verify infrastructure logs and metrics are persisted.

Verify that the data index lifecycle policy is sufficient to persist data.

Verify system engineering of the configured System Monitoring components.

Verify system configuration for AIO-SX, AIO-DX and Standard deployments.


Documentation Impact
====================

This story affects the StarlingX installation, configuration and
system engineering documentation.


References
==========

* StarlingX Metrics https://storyboard.openstack.org/#!/board/145

* Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
* Filebeat https://www.elastic.co/guide/en/beats/filebeat/current/index.html
* Kibana https://www.elastic.co/guide/en/kibana/current/index.html
* Logstash https://www.elastic.co/guide/en/logstash/current/index.html
* Metricbeat https://www.elastic.co/guide/en/beats/metricbeat/current/index.html

* Licensing https://www.elastic.co/subscriptions
  The StarlingX stx-monitor application, when optionally applied, deploys the
  OSS (Apache-2.0) version.


History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - stx-3.0
     - Introduced