Merge "Spec for alarming expiring certificates"
commit f738144690
@ -0,0 +1,288 @@
..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License. http://creativecommons.org/licenses/by/3.0/legalcode

=========================================================
Alarm ExpiringSoon and Expired Certificates on StarlingX
=========================================================

Storyboard:
https://storyboard.openstack.org/#!/story/2008946

This feature introduces alarms, using the existing Fault Management (FM)
framework, for certificates that have expired or are about to expire.

Problem Description
===================

Expired certificates prevent the proper operation of the platform. The
platform currently supports various certificates that are created manually
off-platform and then installed and updated by the user, while other
certificates are managed and auto-renewed by cert-manager.

With manual installation and management, the certificates have to be
closely monitored to avoid expiry. Certificates managed by cert-manager
will auto-renew, but renewal may fail because of errors or a failure to
communicate with external CAs. The user needs an appropriate warning
mechanism in both cases.

Example Use Cases
-----------------

* The Docker registry certificate is expiring soon and the user may be
  unaware that the expiry date is approaching.
* The ssl certificate has expired and did not auto-renew as expected. The
  user is unable to communicate securely via HTTPS.

In the use cases described above, the end user would have been unaware of
potential problems on the platform. With this proposed feature, the user
will be forewarned, so corrective action may be taken.

Proposed change
===============

A new service called 'cert-alarm' will be introduced on the platform to
audit certificate expiry dates and to communicate with the fault
management system to raise and clear alarms. The service will run as a
controller service (managed by service manager (sm) as active-standby)
on all active controllers.

Certificate management on the platform currently supports the following
three methods:

* Using cert-manager (a k8s resource)
* Using a k8s TLS secret that is not managed by cert-manager (a k8s resource)
* Using the 'system certificate-install' command, where the certificate
  resides as a PEM file on the filesystem and in the sysinv database (not a
  k8s resource)

By default, all certificate entities will be alarmed, and the user will be
able to opt out if the certificate is a k8s resource. This control will
be provided via Kubernetes annotations. Kubernetes annotations will also be
able to customize some of the alarm settings, as listed below.

The configurable options will include the ability to:

* Enable/disable alarm (default=enabled)
* Change alarm-before number of days (default=15d)
* Change alarm severity (default=depends on alarm type)
* Custom alarm text (default=None)

Customization of alarms for certificates managed by
'system certificate-install' (PEM file) will not be supported.
StarlingX intends to move all certificates to be configured via k8s resources
in future releases, so customization effort for non-k8s certificate
configuration will not be included as part of this feature.

Audit CertExpiry
----------------

A full audit of all certificate resources will be performed on service
startup, on restart, and periodically. The periodic timer will run every
24 hours.

* The auditing mechanism will iterate over the certificates managed by
  cert-manager (and their associated k8s TLS secrets), and will only raise
  an alarm if the user-configured renew-before time of the certificate has
  passed, i.e., cert-manager has attempted to renew the certificate but
  failed.
* cert-alarm will then iterate over all k8s TLS secrets that are not managed
  by cert-manager.
* cert-alarm will then process the remaining non-k8s resources that reside
  in the sysinv DB (stored as PEM files on the filesystem).

While any alarms are active, another audit will run every hour on only the
alarmed entities.

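The per-certificate decision made during an audit could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the ``classify`` function and its state names are hypothetical, and treating the effective warning window for a cert-manager certificate as the smaller of alarm-before and renew-before is one possible reading of the rule above, not something this spec defines.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_ALARM_BEFORE = timedelta(days=15)  # alarm-before default from this spec


def classify(not_after: datetime,
             now: datetime,
             alarm_before: timedelta = DEFAULT_ALARM_BEFORE,
             renew_before: Optional[timedelta] = None) -> Optional[str]:
    """Return "expired", "expiring-soon", or None when no alarm is needed.

    renew_before is only passed for certificates managed by cert-manager.
    """
    if now >= not_after:
        return "expired"
    window = alarm_before
    if renew_before is not None:
        # Assumption: stay quiet until cert-manager's renewal was due,
        # so the warning window is never wider than renew-before.
        window = min(alarm_before, renew_before)
    if now >= not_after - window:
        return "expiring-soon"
    return None


now = datetime(2021, 6, 1, tzinfo=timezone.utc)
state = classify(now + timedelta(days=10), now)
```

Returning a single state per certificate also keeps the ExpiringSoon and Expired alarms mutually exclusive for any one entity.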
Alternatives
------------

Part of the solution could be implemented with cronjobs and/or
Kubernetes CronJobs that audit the expiry dates, but this approach does not
provide enough control or options for custom logic.

Another alternative to introducing the cert-alarm service is to introduce a
new k8s application, as a k8s deployment, to perform the monitoring. This is
not being pursued here since containerization of the StarlingX flock is a
feature on its own.

For maintaining the list of certificates to monitor, updating the database
entries and extending the tables for customization was also considered.
Since the user can have new applications that are unknown to the platform,
and those certificates should also be monitored, the proposed solution
uses the existing Certificate resources and TLS Secrets in the k8s etcd
database, with annotations for customizing the certificate alarming
behaviour. Another model would be to allow the user to pass config parameters
at runtime, which is not suitable for the platform.

Data model impact
-----------------

New alarm types for ExpiringSoon and Expired certificates will be defined in
the fault management system to support this feature.

A couple of examples of alarm details are::

  Alarm raised after SSL certificate is expired will have
    Alarm ID: 255.001
    Reason Text: "Certificate 'system certificate-install -mode ssl' expired"
    Entity ID: system-certificate=ssl
    Severity: Critical
    Timestamp: <Timestamp value when alarm is raised>

  Alarm raised to warn about docker registry certificate expiring soon
    Alarm ID: 260.012
    Reason Text: "Certificate 'docker-registry' expiring soon in <X> days, on <date>"
    Entity ID: system-certificate=docker.registry
    Severity: Major
    Timestamp: <Timestamp value when alarm is raised>

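To make the ExpiringSoon example concrete, the following Python sketch fills in the ``<X>`` and ``<date>`` placeholders of the reason text. The ``CertAlarm`` class, the ``expiring_soon_alarm`` helper, and the ISO date format are assumptions for illustration only. The sketch does not use the real FM API, and the alarm ID is passed in by the caller rather than defined here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CertAlarm:
    """Hypothetical container mirroring the fields of the examples above."""
    alarm_id: str
    reason_text: str
    entity_id: str
    severity: str
    timestamp: datetime


def expiring_soon_alarm(alarm_id: str, cert_name: str, entity_id: str,
                        not_after: datetime, now: datetime,
                        severity: str = "major") -> CertAlarm:
    """Build an ExpiringSoon alarm record for one certificate."""
    days_left = (not_after - now).days  # whole days remaining, rounded down
    reason = "Certificate '%s' expiring soon in %d days, on %s" % (
        cert_name, days_left, not_after.date().isoformat())  # assumed format
    return CertAlarm(alarm_id, reason, entity_id, severity, now)


now = datetime(2021, 6, 1, tzinfo=timezone.utc)
alarm = expiring_soon_alarm("260.012", "docker-registry",
                            "system-certificate=docker.registry",
                            datetime(2021, 6, 11, tzinfo=timezone.utc), now)
```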
REST API impact
---------------

None.

Security impact
---------------

None. The feature will access certificates on the platform in order
to check their expiry dates.

Other end user impact
---------------------

The user will see alarms in the 'fm alarm-list' output as certificates
approach their expiry dates. Certificates that have expired will raise a
higher severity alarm to alert the user.

The user will need to update annotations in order to change the default
behavior. Examples of the annotations are shown below. The newly supported
annotations are marked with a comment in the following k8s resource.

.. code-block:: none

  Name:         system-restapi-gui-certificate
  Namespace:    deployment
  Kind:         Secret
  Type:         kubernetes.io/tls
  Annotations:  cert-manager.io/alt-names:
                cert-manager.io/certificate-name: system-restapi-gui-certificate
                cert-manager.io/common-name: 10.10.10.3
                cert-manager.io/ip-sans: 10.10.10.3
                cert-manager.io/issuer-kind: Issuer
                cert-manager.io/issuer-name: my-ica-cert-and-key-issuer
                cert-manager.io/uri-sans:
                starlingx.io/alarm: enabled             # New annotation
                starlingx.io/alarm-before: 30d          # New annotation
                starlingx.io/alarm-severity: critical   # New annotation
                starlingx.io/alarm-text: "foobar"       # New annotation

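As a rough illustration of how the annotations in the example above could map onto alarm settings, the following Python sketch parses an annotation map into a small settings object. The ``AlarmSettings`` and ``parse_alarm_settings`` names, the ``alarm-before`` key spelling, the accepted severity values, and the choice to keep the default when a value is malformed are all assumptions made for illustration, not behavior defined by this spec.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "starlingx.io/"
# Assumed set of accepted severity values.
SEVERITIES = ("critical", "major", "minor", "warning")


@dataclass
class AlarmSettings:
    enabled: bool = True            # spec default: enabled
    before_days: int = 15           # spec default: 15d
    severity: Optional[str] = None  # None: default depends on alarm type
    text: Optional[str] = None      # spec default: None


def parse_alarm_settings(annotations: Mapping[str, str]) -> AlarmSettings:
    """Build settings from k8s annotations, keeping defaults on bad input."""
    settings = AlarmSettings()

    alarm = annotations.get(PREFIX + "alarm", "").strip().lower()
    if alarm in ("enabled", "disabled"):
        settings.enabled = alarm == "enabled"

    before = annotations.get(PREFIX + "alarm-before", "").strip().lower()
    # Assumption: only the "<N>d" form shown in the example is accepted.
    if before.endswith("d") and before[:-1].isdecimal():
        settings.before_days = int(before[:-1])

    severity = annotations.get(PREFIX + "alarm-severity", "").strip().lower()
    if severity in SEVERITIES:
        settings.severity = severity

    text = annotations.get(PREFIX + "alarm-text")
    if text:
        settings.text = text

    return settings


example = parse_alarm_settings({
    "cert-manager.io/issuer-kind": "Issuer",
    "starlingx.io/alarm": "enabled",
    "starlingx.io/alarm-before": "30d",
    "starlingx.io/alarm-severity": "critical",
    "starlingx.io/alarm-text": "foobar",
})
```

Annotations outside the ``starlingx.io/`` prefix are ignored, so the cert-manager annotations on the same resource do not affect the result.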
Performance Impact
------------------

In large Distributed Cloud systems, there can be thousands of certificates
and TLS secrets (there is a unique DcAdminEpIntermediateCA for each subcloud).
In order to scale, the cert-alarm audit algorithm will skip the
DcAdminEpIntermediateCA Certificates/Secrets for the subclouds that are
present on the SystemController. Since these DcAdminEpIntermediateCA
secrets are also available on each subcloud, they will be audited and
alarmed on the subcloud.

The full certificate alarm audit runs once every 24 hours, and the optional
hourly certificate alarm audit runs only when a certificate alarm is active
and audits only the alarmed certificates. The frequency of checks is thus low
and is not expected to have a performance impact.

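The audit scoping described above can be sketched in Python as a function that selects which entities to check at each hourly tick. The ``CertEntity`` class, its ``subcloud_intermediate_ca`` flag, and the ``audit_scope`` function are hypothetical names. How cert-alarm actually recognizes a per-subcloud DcAdminEpIntermediateCA is not specified here, so the sketch assumes the caller has already flagged those entities.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

FULL_AUDIT_EVERY_HOURS = 24  # full audit period from this spec


@dataclass(frozen=True)
class CertEntity:
    name: str
    # Hypothetical flag marking a per-subcloud DcAdminEpIntermediateCA.
    subcloud_intermediate_ca: bool = False


def audit_scope(hour: int,
                entities: Iterable[CertEntity],
                alarmed: Set[str],
                on_system_controller: bool) -> List[CertEntity]:
    """Return the entities to audit at an hourly tick (hour 0 is startup)."""
    # On the SystemController, leave the per-subcloud CAs to the subclouds.
    candidates = [
        e for e in entities
        if not (on_system_controller and e.subcloud_intermediate_ca)
    ]
    if hour % FULL_AUDIT_EVERY_HOURS == 0:
        return candidates  # full audit
    # Hourly audit: only entities that currently have an active alarm.
    return [e for e in candidates if e.name in alarmed]


entities = [CertEntity("ssl"), CertEntity("docker_registry"),
            CertEntity("subcloud1-ca", subcloud_intermediate_ca=True)]
hourly = audit_scope(5, entities, {"ssl"}, on_system_controller=True)
```

With no active alarms, every tick other than the 24-hour one returns an empty list, which matches the expectation that the steady-state cost is low.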
Other deployer impact
---------------------

None.

Developer impact
----------------

None.

Upgrade impact
--------------

If an alarm is marked as managementAffecting, it will impact upgrades.
The intention of this feature is to mark only expired platform certificates
as managementAffecting. In such a case, the user will be unable to perform
and complete upgrades until the certificate is updated.

Only platform certificates such as 'Kubernetes-RootCA', 'ssl', and
'docker_registry' will be managementAffecting (and will carry a Critical
severity).

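The managementAffecting rule above amounts to a simple predicate, sketched here in Python. The ``PLATFORM_CERTS`` set contains only the three names the spec lists as examples, so the real set is assumed to be larger, and the ``is_management_affecting`` name is hypothetical.

```python
# Names taken from the "such as" list above; the full list is not given here.
PLATFORM_CERTS = frozenset({"Kubernetes-RootCA", "ssl", "docker_registry"})


def is_management_affecting(cert_name: str, expired: bool) -> bool:
    """True only for an expired certificate that is a platform certificate."""
    return expired and cert_name in PLATFORM_CERTS


blocks_upgrade = is_management_affecting("ssl", expired=True)
```

Under this reading, an ExpiringSoon alarm on a platform certificate and an Expired alarm on an application certificate would both leave upgrades unblocked.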
Implementation
==============

Assignee(s)
-----------

Primary assignee:

* Sabeel Ansari

Repos Impacted
--------------

* config
* fault
* ha

Work Items
----------

* Create new service cert-alarm
* Framework code to support new alarms
* FM integration to discover existing alarms + publish alarms
* Unit tests

Dependencies
============

None

Testing
=======

* New code introduced must include unit test cases
* Alarms should be raised as certificates approach expiry dates
* Alarms should be raised when certificates are expired. The alarm
  should have higher severity than ExpiringSoon
* Only one of the ExpiringSoon and Expired alarms should exist (never both)
  for a certificate entity
* Testing should include all platform managed certificates
* Testing should include all certificate configurations: cert-manager
  managed certificates, certificates in k8s TLS secrets, and manually
  installed certificates via 'system certificate-install'.
* Testing should be performed on all configurations: AIO-SX, AIO-DX,
  Standard, Distributed Cloud, etc.
* Alarms should persist across reboots, upgrades, etc. In addition,
  cert-alarm should be able to monitor any newly introduced certificate
  entities in the N+1 release.

Documentation Impact
====================

End user documentation needs to be updated with all new alarm codes
and their respective details and impact. In addition, the documentation
should recommend the corrective action that the user needs to take
to address each alarm.

The documentation will also capture details for customizing certificate
alarming behavior using k8s annotations.

References
==========

* https://cert-manager.io/docs/
* https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/

History
=======