Merge "Spec for alarming expiring certificates"
commit f738144690
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License. http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
|
||||
=========================================================
|
||||
Alarm ExpiringSoon and Expired Certificates on StarlingX
|
||||
=========================================================
|
||||
|
||||
Storyboard:
|
||||
https://storyboard.openstack.org/#!/story/2008946
|
||||
|
||||
This feature introduces alarms using the existing Fault Management (FM)
|
||||
framework for certificates that are expired and about-to-expire.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Expired certificates prevent the proper operation of the platform. The
|
||||
platform currently supports various certificates that are manually created
|
||||
off-platform, then installed and updated by the user, while other certificates are
|
||||
managed and auto-renewed by cert-manager.
|
||||
|
||||
In case of manual installation & management, the certificates have to be
|
||||
closely monitored to avoid expiry. The cert-manager managed certificates
|
||||
will auto-renew, but may fail to do so in case of errors or failure to
|
||||
communicate with external CAs. The user will need an appropriate warning
|
||||
mechanism in such use cases.
|
||||
|
||||
Example Use Cases
|
||||
-----------------
|
||||
|
||||
* The Docker registry certificate is expiring soon, and the user may be unaware
  of the approaching expiry date.
* The SSL certificate has expired and did not auto-renew as expected.
  The user is unable to communicate securely via HTTPS.
|
||||
|
||||
In the use cases described above, the end user would have been unaware of
|
||||
potential problems on the platform. With this proposed feature, the user
|
||||
will be forewarned, so corrective action may be taken.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
A new service called 'cert-alarm' will be introduced on the platform for
|
||||
auditing certificates' expiry dates and communicating with the fault
|
||||
management system to raise & clear alarms. The service will run as a
|
||||
controller service (managed by service manager (sm) as active-standby)
|
||||
on all active controllers.
|
||||
|
||||
Certificate management on the platform has the following three
|
||||
methods currently supported:
|
||||
* Using Cert-Manager (a k8s resource)
|
||||
* Using k8s TLS secret, but not managed by cert-manager (a k8s resource)
|
||||
* Using the 'system certificate-install' command, where certificates reside
  as PEM files on the filesystem and in the sysinv database (not a k8s resource)
|
||||
|
||||
The default will be to alarm on all certificate entities, with the user having
the ability to opt out if the certificate is a k8s resource. This control will
be provided via Kubernetes annotations, which can also be used to customize
some of the alarm settings, as listed below.
|
||||
|
||||
The configurable options will include the ability to:
|
||||
* Enable/disable alarm (default=enabled)
|
||||
* Change alarm-before number of days (default=15d)
|
||||
* Change alarm severity (default=depends on alarm type)
|
||||
* Custom alarm text (default=None)
|
||||
|
||||
Customization of alarms for certificates managed by
|
||||
'system certificate-install' (PEM file) will not be allowed/supported.
|
||||
StarlingX intends to move all certificates to be configured via k8s resources
|
||||
in future releases, so customization effort for non-k8s certificate
|
||||
configuration will not be included as part of this feature.
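
As a rough illustration of the annotation-driven options listed above, the following sketch turns a k8s resource's annotations into alarm settings. The annotation keys are the ones shown in the example Secret under "Other end user impact". The ``AlarmSettings`` class, the ``parse_alarm_annotations`` helper, the ``disabled`` value, the accepted severity names, and the policy of falling back to defaults on malformed values are all assumptions made for illustration, not part of this spec.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Annotation keys as used in the example Secret in this spec.
PREFIX = "starlingx.io/"
DEFAULT_ALARM_BEFORE_DAYS = 15  # default=15d per the spec
# Assumption: the accepted severity names are not defined by the spec.
VALID_SEVERITIES = {"critical", "major", "minor"}


@dataclass(frozen=True)
class AlarmSettings:
    """Hypothetical container for per-certificate alarm options."""
    enabled: bool = True                      # default=enabled
    alarm_before_days: int = DEFAULT_ALARM_BEFORE_DAYS
    severity: Optional[str] = None            # None: depends on alarm type
    text: Optional[str] = None                # default=None


def parse_alarm_annotations(annotations: Mapping[str, str]) -> AlarmSettings:
    """Build settings from annotations, using defaults when a value is
    missing or malformed (this fallback policy is an assumption)."""
    # Assumption: "disabled" is the opt-out value; the spec only shows "enabled".
    flag = annotations.get(PREFIX + "alarm", "enabled").strip().lower()
    enabled = flag != "disabled"

    # Expect values such as "30d"; anything else keeps the default.
    days = DEFAULT_ALARM_BEFORE_DAYS
    raw = annotations.get(PREFIX + "alarm-before", "").strip().lower()
    if raw.endswith("d") and raw[:-1].isdecimal():
        days = int(raw[:-1])

    severity: Optional[str] = annotations.get(
        PREFIX + "alarm-severity", "").strip().lower()
    if severity not in VALID_SEVERITIES:
        severity = None

    text = annotations.get(PREFIX + "alarm-text") or None
    return AlarmSettings(enabled, days, severity, text)
```

Certificates installed via 'system certificate-install' carry no annotations, so in this sketch they would simply be evaluated with ``AlarmSettings()``, i.e. the defaults.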
|
||||
|
||||
|
||||
Audit CertExpiry
|
||||
----------------
|
||||
|
||||
A full audit on all certificate resources will be performed on service
|
||||
startup, restart and periodically. The periodic timer will run every
|
||||
24 hours.
|
||||
|
||||
* The auditing mechanism will iterate over the cert-manager managed
|
||||
certificates (and their associated k8s TLS secrets), and will only raise
|
||||
  an alarm if the user-configured renew-before time of the certificate has passed,
  i.e., cert-manager has attempted to renew the certificate, but failed.
|
||||
* cert-alarm will then iterate over all k8s TLS secrets that are not managed
|
||||
by cert-manager.
|
||||
* cert-alarm will then process the rest of the non-k8s resources that reside
|
||||
in sysinv DB (stored as PEM files on filesystem).
|
||||
|
||||
In case of active alarms, another audit will run every hour only on those
|
||||
entities.
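
The audit rules above can be sketched roughly as follows. This is a minimal illustration, not the cert-alarm implementation: the function names, the state strings, and in particular the way ``renew_before`` is combined with the alarm-before window (the alarm is raised only once both thresholds have been crossed) are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set

EXPIRED = "expired"
EXPIRING_SOON = "expiring-soon"


def classify(not_after: datetime,
             now: datetime,
             alarm_before_days: int = 15,
             renew_before: Optional[timedelta] = None) -> Optional[str]:
    """Return the single alarm state for a certificate, or None.

    Returning one value guarantees that a certificate is never both
    'expired' and 'expiring-soon' at the same time.
    """
    if now >= not_after:
        return EXPIRED
    window = timedelta(days=alarm_before_days)
    if renew_before is not None:
        # cert-manager managed certificate: stay quiet until the
        # renew-before point has passed (assumed interpretation).
        window = min(window, renew_before)
    if now >= not_after - window:
        return EXPIRING_SOON
    return None


def entities_to_audit(all_entities: Iterable[str],
                      active_alarms: Set[str],
                      full_audit_due: bool) -> List[str]:
    """Full audit (startup, restart, every 24 hours) covers everything;
    the hourly audit only revisits entities with an active alarm."""
    if full_audit_due:
        return list(all_entities)
    return [e for e in all_entities if e in active_alarms]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify(now + timedelta(days=10), now))
```

Keeping the classification a pure function of the expiry date and the current time makes it straightforward to unit test without a running cluster.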
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Part of the solution could be implemented with cron jobs and/or
KubeCronJobs that audit the expiry dates, but these do not provide enough
control and customized coding options.
|
||||
|
||||
Another alternative to introducing the cert-alarm service is to introduce a
|
||||
new k8s application as a k8s deployment to perform the monitoring. This is
|
||||
not being pursued here since containerization of the StarlingX flock is a
feature on its own.
|
||||
|
||||
For maintaining the list of certificates to monitor, updating the database
entries and extending the tables for customization was also considered.
|
||||
Since the user can have new applications that are unknown to the platform,
|
||||
and those certificates should also be monitored, the proposed solution was
|
||||
chosen to use existing Certificate and TLS Secrets in the k8s etcd database,
|
||||
with annotations for customizing the certificate alarming behaviour. Another
|
||||
model could be to allow the user to pass config parameters at runtime, which
|
||||
is not suitable for the platform.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
New alarm types for expiringSoon and expired certificates will be defined in
|
||||
the fault management system to support this feature.
|
||||
|
||||
A couple of examples of alarm details are::

  Alarm raised after SSL certificate is expired will have
    Alarm ID: 255.001
    Reason Text: "Certificate 'system certificate-install -mode ssl' expired"
    Entity ID: system-certificate=ssl
    Severity: Critical
    Timestamp: <Timestamp value when alarm is raised>

  Alarm raised to warn about docker registry certificate expiring soon
    Alarm ID: 260.012
    Reason Text: "Certificate 'docker-registry' expiring soon in <X> days, on <date>"
    Entity ID: system-certificate=docker.registry
    Severity: Major
    Timestamp: <Timestamp value when alarm is raised>
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None. The feature will access a certificate on the platform in order
|
||||
to check expiry dates.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
The user will see alarms in the 'fm alarm-list' output as certificates approach
their expiry date. Certificates that have expired will raise a higher-severity
alarm to alert the user.
|
||||
|
||||
The user will need to update annotations in order to change the default behavior.
Examples of annotations are shown below. The newly supported annotations are
marked with a comment in the following k8s resource.
|
||||
|
||||
.. code-block:: none
|
||||
|
||||
   Name:         system-restapi-gui-certificate
   Namespace:    deployment
   Kind:         Secret
   Type:         kubernetes.io/tls
   Annotations:  cert-manager.io/alt-names:
                 cert-manager.io/certificate-name: system-restapi-gui-certificate
                 cert-manager.io/common-name: 10.10.10.3
                 cert-manager.io/ip-sans: 10.10.10.3
                 cert-manager.io/issuer-kind: Issuer
                 cert-manager.io/issuer-name: my-ica-cert-and-key-issuer
                 cert-manager.io/uri-sans:
                 starlingx.io/alarm: enabled            # New annotation
                 starlingx.io/alarm-before: 30d         # New annotation
                 starlingx.io/alarm-severity: critical  # New annotation
                 starlingx.io/alarm-text: "foobar"      # New annotation
|
||||
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
In large Distributed Cloud systems, there can be thousands of certificates
|
||||
and TLS secrets (there is a unique DcAdminEpIntermediateCA for each subcloud).
|
||||
In order to scale, the cert-alarm audit algorithm will skip the
|
||||
DcAdminEpIntermediateCA Certificates/Secrets for the subclouds that are
|
||||
present on the SystemController. Since these DcAdminEpIntermediateCA
|
||||
secrets are also available on each subcloud, they will be audited and
|
||||
alarmed on the subcloud.
|
||||
|
||||
The full certificate alarm audit is run once every 24 hours, and the optional
|
||||
hourly certificate alarm audit only runs when a certificate alarm is active
|
||||
and only audits alarmed certificates. The frequency of checks is thus low, and
|
||||
not expected to have a performance impact.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Upgrade impact
|
||||
--------------
|
||||
|
||||
If an alarm is indicated as managementAffecting, this will impact upgrades.
|
||||
The intention of this feature is to mark only expired platform certificates
|
||||
as managementAffecting. In such a case, user will be unable to perform
|
||||
and complete upgrades until the certificate is updated.
|
||||
|
||||
Only platform certificates such as 'Kubernetes-RootCA', 'ssl',
|
||||
'docker_registry' will be managementAffecting (& will carry a Critical
|
||||
severity).
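
One possible reading of the severity and managementAffecting rules is sketched below. The certificate names come from the list above, which is introduced with "such as" and so may not be exhaustive. The default severities (Critical for expired, Major for expiring soon) mirror the two alarm examples under "Data model impact". The function name, the state strings, and the choice to ignore a user-supplied severity for expired platform certificates are assumptions.

```python
from typing import Dict, Optional, Union

# Platform certificates named in this spec; possibly not the full set.
PLATFORM_CERTS = {"Kubernetes-RootCA", "ssl", "docker_registry"}


def alarm_attributes(cert_name: str,
                     state: str,
                     severity_override: Optional[str] = None
                     ) -> Dict[str, Union[str, bool]]:
    """Derive severity and the managementAffecting flag for an alarm.

    `state` is "expired" or "expiring-soon" (hypothetical labels).
    """
    expired = state == "expired"
    # Only expired platform certificates block management operations
    # such as upgrades.
    management_affecting = expired and cert_name in PLATFORM_CERTS
    if management_affecting:
        # Assumption: Critical is forced here, regardless of any override.
        severity = "critical"
    else:
        severity = severity_override or ("critical" if expired else "major")
    return {"severity": severity,
            "management_affecting": management_affecting}
```

With this reading, an expiring-soon alarm on a platform certificate never blocks an upgrade; only letting the certificate actually expire does.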
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
|
||||
* Sabeel Ansari
|
||||
|
||||
|
||||
Repos Impacted
|
||||
--------------
|
||||
|
||||
* config
|
||||
* fault
|
||||
* ha
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Create new service cert-alarm
|
||||
* Framework code to support new alarms
|
||||
* FM integration to discover existing alarms + publish alarms
|
||||
* Unit tests
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* New code introduced must include unit test cases
|
||||
* Alarms should be raised as certificates approach expiry dates
|
||||
* Alarms should be raised when certificates are expired. The alarm
|
||||
should have higher severity than ExpiringSoon
|
||||
* Only an ExpiringSoon or Expired alarm should exist (never both)
|
||||
for a certificate entity
|
||||
* Testing should include all platform managed certificates
|
||||
* Testing should include all certificate configurations: cert-manager
|
||||
managed certificates, certificates in k8s TLS secrets, and manually
|
||||
installed certificates via 'system certificate-install'.
|
||||
* Testing should be performed on all configurations: AIO-SX, AIO-DX, Standard,
  Distributed Cloud, etc.
|
||||
* Alarms should be persistent after reboots, upgrades, etc. In addition,
|
||||
cert-alarm should be able to monitor any newly introduced certificate
|
||||
entities in the N+1 release.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
End user documentation needs to be updated with all new alarm codes
|
||||
and their respective details and impact. In addition, documentation
|
||||
should also recommend corrective action that needs to be taken by user
|
||||
to address the alarm.
|
||||
|
||||
The documentation will also capture details for customizing certificate
|
||||
alarming behavior using k8s annotations.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* https://cert-manager.io/docs/
|
||||
* https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|