Merge "Spec for alarming expiring certificates"

Zuul 2021-07-21 13:10:10 +00:00 committed by Gerrit Code Review
commit f738144690
1 changed files with 288 additions and 0 deletions

@ -0,0 +1,288 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License. http://creativecommons.org/licenses/by/3.0/legalcode
=========================================================
Alarm ExpiringSoon and Expired Certificates on StarlingX
=========================================================
Storyboard:
https://storyboard.openstack.org/#!/story/2008946
This feature introduces alarms, using the existing Fault Management (FM)
framework, for certificates that have expired or are about to expire.
Problem Description
===================
Expired certificates prevent the proper operation of the platform. The
platform currently supports certificates that are created manually
off-platform and then installed and updated by the user, as well as
certificates that are managed and auto-renewed by cert-manager.
When certificates are installed and managed manually, they have to be
closely monitored to avoid expiry. Certificates managed by cert-manager
auto-renew, but renewal may fail because of errors or a failure to
communicate with external CAs. In both cases the user needs an appropriate
warning mechanism.
Example Use Cases
-----------------
* The Docker registry certificate is expiring soon and the user may be
  unaware that the expiry date is approaching.
* The ssl certificate has expired and did not auto-renew as expected.
  The user is unable to communicate securely over HTTPS.
In the use cases described above, the end user would have been unaware of
potential problems on the platform. With this proposed feature, the user
will be forewarned, so corrective action can be taken.
Proposed change
===============
A new service called 'cert-alarm' will be introduced on the platform for
auditing certificates' expiry dates and communicating with the fault
management system to raise and clear alarms. The service will run as a
controller service (managed by service manager (sm) as active-standby)
on all active controllers.
The platform currently supports the following three methods of
certificate management:
* Using Cert-Manager (a k8s resource)
* Using a k8s TLS secret that is not managed by cert-manager (a k8s resource)
* Using the 'system certificate-install' command, where the certificate
  resides as a PEM file on the filesystem and in the sysinv database
  (not a k8s resource)
By default, all certificate entities will be alarmed, and the user will be
able to opt out if the certificate is a k8s resource. This control will
be provided via Kubernetes annotations. Kubernetes annotations will also
allow some of the alarm settings to be customized, as listed below.
The configurable options will include the ability to:
* Enable/disable the alarm (default=enabled)
* Change the alarm-before number of days (default=15d)
* Change the alarm severity (default=depends on alarm type)
* Set a custom alarm text (default=None)
Customization of alarms for certificates managed by
'system certificate-install' (PEM file) will not be allowed/supported.
StarlingX intends to move all certificates to be configured via k8s resources
in future releases, so customization effort for non-k8s certificate
configuration will not be included as part of this feature.
Audit CertExpiry
----------------
A full audit on all certificate resources will be performed on service
startup, restart and periodically. The periodic timer will run every
24 hours.
* The auditing mechanism will iterate over the cert-manager managed
  certificates (and their associated k8s TLS secrets), and will only raise
  an alarm if the user-configured renew-before of the certificate has
  passed, i.e., cert-manager has attempted to renew the certificate, but
  failed.
* cert-alarm will then iterate over all k8s TLS secrets that are not managed
by cert-manager.
* cert-alarm will then process the rest of the non-k8s resources that reside
in sysinv DB (stored as PEM files on filesystem).
In case of active alarms, another audit will run every hour only on those
entities.
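
The per-certificate decision made by the audit could look roughly like the
sketch below. It is only an illustration: ``CertState`` and
``classify_certificate`` are hypothetical names, not part of cert-alarm, and
the way alarm-before and renew-before are combined for cert-manager managed
certificates (both thresholds must have been crossed) is an assumption.

.. code-block:: python

  from datetime import datetime, timedelta
  from enum import Enum
  from typing import Optional


  class CertState(Enum):
      OK = "ok"
      EXPIRING_SOON = "expiring-soon"
      EXPIRED = "expired"


  def classify_certificate(not_after: datetime,
                           now: datetime,
                           alarm_before: timedelta = timedelta(days=15),
                           renew_before: Optional[timedelta] = None) -> CertState:
      """Classify one certificate for the audit (hypothetical helper).

      renew_before is only passed for cert-manager managed certificates.
      """
      if now >= not_after:
          return CertState.EXPIRED
      if now < not_after - alarm_before:
          # Still outside the alarm-before window.
          return CertState.OK
      if renew_before is not None and now < not_after - renew_before:
          # Assumption: cert-manager has not reached its renewal point yet,
          # so there is no failed renewal to report.
          return CertState.OK
      return CertState.EXPIRING_SOON

With the default of 15 days, a PEM-installed certificate that expires in 10
days would be classified as expiring soon, while a cert-manager certificate
with a renew-before of 5 days would not be alarmed until that renewal point
has passed.
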
Alternatives
------------
Part of the solution could be implemented with cronjobs and/or
KubeCronJobs that audit the expiry dates, but these do not provide enough
control or customized coding options.
Another alternative to introducing the cert-alarm service is to introduce a
new k8s application as a k8s deployment to perform the monitoring. This is
not being pursued here since containerization of the StarlingX flock is a
feature on its own.
For maintaining the list of certificates to monitor, updating the database
entries and extending the tables for customization was also considered.
However, the user can have new applications that are unknown to the
platform, and those certificates should also be monitored. The proposed
solution therefore uses the existing Certificate and TLS Secret resources in
the k8s etcd database, with annotations for customizing the certificate
alarming behaviour. Another model could be to allow the user to pass config
parameters at runtime, which is not suitable for the platform.
Data model impact
-----------------
New alarm types for expiringSoon and expired certificates will be defined in
the fault management system to support this feature.
A couple of examples of alarm details are::

  Alarm raised after SSL certificate is expired will have
    Alarm ID: 255.001
    Reason Text: "Certificate 'system certificate-install -mode ssl' expired"
    Entity ID: system-certificate=ssl
    Severity: Critical
    Timestamp: <Timestamp value when alarm is raised>

  Alarm raised to warn about docker registry certificate expiring soon
    Alarm ID: 260.012
    Reason Text: "Certificate 'docker-registry' expiring soon in <X> days, on <date>"
    Entity ID: system-certificate=docker.registry
    Severity: Major
    Timestamp: <Timestamp value when alarm is raised>

REST API impact
---------------
None.
Security impact
---------------
None. The feature will only read certificates on the platform in order
to check their expiry dates.
Other end user impact
---------------------
The user will see alarms in the 'fm alarm-list' output as certificates
approach their expiry date. Expired certificates will raise a
higher-severity alarm to alert the user.
The user will need to update annotations in order to change the default
behavior. An example is shown below. The newly supported annotations are
marked with a comment in the following k8s resource.
.. code-block:: none

  Name:         system-restapi-gui-certificate
  Namespace:    deployment
  Kind:         Secret
  Type:         kubernetes.io/tls
  Annotations:  cert-manager.io/alt-names:
                cert-manager.io/certificate-name: system-restapi-gui-certificate
                cert-manager.io/common-name: 10.10.10.3
                cert-manager.io/ip-sans: 10.10.10.3
                cert-manager.io/issuer-kind: Issuer
                cert-manager.io/issuer-name: my-ica-cert-and-key-issuer
                cert-manager.io/uri-sans:
                starlingx.io/alarm: enabled             # New annotation
                starlingx.io/alarm-before: 30d          # New annotation
                starlingx.io/alarm-severity: critical   # New annotation
                starlingx.io/alarm-text: "foobar"       # New annotation

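
As an illustration of how these annotations might be turned into alarm
settings, a minimal sketch follows. ``AlarmSettings`` and
``parse_alarm_annotations`` are hypothetical names. The annotation keys and
defaults come from the example and the option list in this spec, while the
handling of values other than ``enabled`` and the ``<N>d`` format of
alarm-before are assumptions.

.. code-block:: python

  from dataclasses import dataclass
  from typing import Mapping, Optional

  ANNOTATION_PREFIX = "starlingx.io/"


  @dataclass
  class AlarmSettings:
      enabled: bool = True            # default=enabled
      alarm_before_days: int = 15     # default=15d
      severity: Optional[str] = None  # None means "depends on alarm type"
      text: Optional[str] = None      # default=None


  def parse_alarm_annotations(annotations: Mapping[str, str]) -> AlarmSettings:
      """Build alarm settings from a k8s resource's annotations (sketch)."""
      settings = AlarmSettings()

      value = annotations.get(ANNOTATION_PREFIX + "alarm")
      if value is not None:
          # Assumption: any value other than "enabled" opts out of alarming.
          settings.enabled = value.strip().lower() == "enabled"

      value = annotations.get(ANNOTATION_PREFIX + "alarm-before")
      if value is not None:
          # Assumption: the value has the "<N>d" form, e.g. "30d".
          settings.alarm_before_days = int(value.strip().rstrip("d"))

      value = annotations.get(ANNOTATION_PREFIX + "alarm-severity")
      if value is not None:
          settings.severity = value.strip().lower()

      value = annotations.get(ANNOTATION_PREFIX + "alarm-text")
      if value is not None:
          settings.text = value

      return settings

A resource without any ``starlingx.io/`` annotations would simply get the
defaults, which matches the intent that alarming is enabled unless the user
opts out.
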
Performance Impact
------------------
In large Distributed Cloud systems, there can be thousands of certificates
and TLS secrets (there is a unique DcAdminEpIntermediateCA for each subcloud).
In order to scale, the cert-alarm audit algorithm will skip the
DcAdminEpIntermediateCA Certificates/Secrets for the subclouds that are
present on the SystemController. Since these DcAdminEpIntermediateCA
secrets are also available on each subcloud, they will be audited and
alarmed on the subcloud.
The full certificate alarm audit runs once every 24 hours, and the optional
hourly certificate alarm audit only runs when a certificate alarm is active
and only audits alarmed certificates. The frequency of checks is thus low and
is not expected to have a performance impact.
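
The two audit frequencies can be pictured with the small sketch below, which
chooses the entities to audit on each hourly tick. The function name, the
hour-counter argument and the entity names used in the usage note are
hypothetical. A real service would more likely use timers than a counter.

.. code-block:: python

  from typing import Iterable, List, Set

  FULL_AUDIT_INTERVAL_HOURS = 24


  def select_audit_targets(hours_since_start: int,
                           all_entities: Iterable[str],
                           alarmed_entities: Set[str]) -> List[str]:
      """Return the certificate entities to audit on this hourly tick."""
      entities = list(all_entities)
      if hours_since_start % FULL_AUDIT_INTERVAL_HOURS == 0:
          # Service start (hour 0) and every 24 hours after it: full audit.
          return entities
      # Any other hour: only entities that currently have an active alarm.
      return [entity for entity in entities if entity in alarmed_entities]

For example, with the entities ``ssl``, ``docker_registry`` and ``app-secret``
and an active alarm only on ``ssl``, hour 5 would audit just ``ssl``, and hour
24 would audit all three again. With no active alarms, the hourly tick audits
nothing.
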
Other deployer impact
---------------------
None.
Developer impact
----------------
None.
Upgrade impact
--------------
If an alarm is marked as managementAffecting, it will impact upgrades.
The intention of this feature is to mark only expired platform certificates
as managementAffecting. In such a case, the user will be unable to perform
and complete upgrades until the certificate is updated.
Only platform certificates such as 'Kubernetes-RootCA', 'ssl' and
'docker_registry' will be managementAffecting (and will carry Critical
severity).
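
The rule above could be expressed along these lines. This is a sketch only:
``is_management_affecting`` is a hypothetical helper, and the set of names
contains just the three examples given above, so it may not be exhaustive.

.. code-block:: python

  # Example platform certificates named in this spec (possibly not exhaustive).
  PLATFORM_CERTIFICATES = frozenset(
      {"Kubernetes-RootCA", "ssl", "docker_registry"})


  def is_management_affecting(cert_name: str, expired: bool) -> bool:
      """Only an expired platform certificate blocks upgrades."""
      return expired and cert_name in PLATFORM_CERTIFICATES

Under this rule an expiring-soon alarm never blocks an upgrade, and neither
does an expired certificate that belongs to a user application.
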
Implementation
==============
Assignee(s)
-----------
Primary assignee:
* Sabeel Ansari
Repos Impacted
--------------
* config
* fault
* ha
Work Items
----------
* Create new service cert-alarm
* Framework code to support new alarms
* FM integration to discover existing alarms + publish alarms
* Unit tests
Dependencies
============
None
Testing
=======
* New code introduced must include unit test cases
* Alarms should be raised as certificates approach expiry dates
* Alarms should be raised when certificates are expired. The alarm
should have higher severity than ExpiringSoon
* Only an ExpiringSoon or Expired alarm should exist (never both)
for a certificate entity
* Testing should include all platform managed certificates
* Testing should include all certificate configurations: cert-manager
managed certificates, certificates in k8s TLS secrets, and manually
installed certificates via 'system certificate-install'.
* Testing should be performed on all configurations: AIO-SX, AIO-DX,
  Standard, Distributed Cloud, etc.
* Alarms should be persistent after reboots, upgrades, etc. In addition,
  cert-alarm should be able to monitor any newly introduced certificate
  entities in the N+1 release.
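
The requirement that a certificate entity never carries both an ExpiringSoon
and an Expired alarm can be checked against logic like the following sketch.
``reconcile_alarms`` and the alarm-kind strings are hypothetical, and the
actual raise/clear calls into FM are left out.

.. code-block:: python

  from typing import List, Optional, Set, Tuple

  EXPIRING_SOON = "expiring-soon"
  EXPIRED = "expired"


  def reconcile_alarms(active: Set[str],
                       desired: Optional[str]) -> List[Tuple[str, str]]:
      """Return ("clear" | "raise", alarm_kind) actions for one entity.

      active:  alarm kinds currently raised for the entity.
      desired: the single alarm kind the audit wants, or None for no alarm.
      """
      # Clear every active alarm that is not the desired one, so that at
      # most one alarm kind remains for the entity.
      actions = [("clear", kind) for kind in sorted(active) if kind != desired]
      if desired is not None and desired not in active:
          actions.append(("raise", desired))
      return actions

A unit test could then assert that moving an entity from ExpiringSoon to
Expired yields a clear of the old alarm followed by a raise of the new one,
and that renewing the certificate yields only a clear.
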
Documentation Impact
====================
End user documentation needs to be updated with all new alarm codes
and their respective details and impact. In addition, documentation
should also recommend corrective action that needs to be taken by user
to address the alarm.
The documentation will also capture details for customizing certificate
alarming behavior using k8s annotations.
References
==========
* https://cert-manager.io/docs/
* https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/
History
=======