From 7cf4967fae9e10993a4ebd37588565152b031515 Mon Sep 17 00:00:00 2001
From: Sabeel Ansari
Date: Wed, 9 Jun 2021 08:57:55 -0400
Subject: [PATCH] Spec for alarming expiring certificates

Story: 2008946
Signed-off-by: Sabeel Ansari
Change-Id: I335bf344114d485e2a929db36ad10ec2f175508c
---
 .../security-2008946-alarm-expiring-certs.rst | 288 ++++++++++++++++++
 1 file changed, 288 insertions(+)
 create mode 100644 doc/source/specs/stx-6.0/approved/security-2008946-alarm-expiring-certs.rst

diff --git a/doc/source/specs/stx-6.0/approved/security-2008946-alarm-expiring-certs.rst b/doc/source/specs/stx-6.0/approved/security-2008946-alarm-expiring-certs.rst
new file mode 100644
index 0000000..e8a3c3b
--- /dev/null
+++ b/doc/source/specs/stx-6.0/approved/security-2008946-alarm-expiring-certs.rst
@@ -0,0 +1,288 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License. http://creativecommons.org/licenses/by/3.0/legalcode
+
+
+=========================================================
+Alarm ExpiringSoon and Expired Certificates on StarlingX
+=========================================================
+
+Storyboard:
+https://storyboard.openstack.org/#!/story/2008946
+
+This feature introduces alarms using the existing Fault Management (FM)
+framework for certificates that have expired or are about to expire.
+
+Problem Description
+===================
+
+Expired certificates prevent the proper operation of the platform. The
+platform currently supports various certificates that are manually created
+off-platform, then installed and updated by the user, while others are
+managed and auto-renewed by cert-manager.
+
+In the case of manual installation and management, certificates have to be
+closely monitored to avoid expiry. The cert-manager managed certificates
+will auto-renew, but may fail to do so in case of errors or failure to
+communicate with external CAs. The user will need an appropriate warning
+mechanism in such use cases.
+
+Example Use Cases
+-----------------
+
+* The Docker registry certificate is expiring soon and the user may be
+  unaware of the approaching expiry date.
+* The ssl certificate has expired and did not auto-renew as expected.
+  The user is unable to communicate securely via HTTPS.
+
+In the use cases described above, the end user would have been unaware of
+potential problems on the platform. With this proposed feature, the user
+will be forewarned, so corrective action may be taken.
+
+Proposed change
+===============
+
+A new service called 'cert-alarm' will be introduced on the platform for
+auditing certificates' expiry dates and communicating with the fault
+management system to raise and clear alarms. The service will run as a
+controller service (managed by service manager (sm) as active-standby)
+on all active controllers.
+
+Certificate management on the platform has the following three
+methods currently supported:
+* Using Cert-Manager (a k8s resource)
+* Using a k8s TLS secret not managed by cert-manager (a k8s resource)
+* Using the 'system certificate-install' command, where certificates reside
+  as PEM files on the filesystem and in the sysinv DB (not a k8s resource)
+
+The default will be to alarm all certificate entities, with the user having
+the ability to opt out if the certificate is a k8s resource. This control
+will be provided via Kubernetes annotations. Kubernetes annotations can also
+customize some of the alarm settings, as listed below.
+
+The configurable options will include the ability to:
+* Enable/disable the alarm (default=enabled)
+* Change the alarm-before number of days (default=15d)
+* Change the alarm severity (default=depends on alarm type)
+* Set custom alarm text (default=None)
+
+Customization of alarms for certificates managed by
+'system certificate-install' (PEM file) will not be supported.
+StarlingX intends to move all certificates to be configured via k8s resources
+in future releases, so customization effort for non-k8s certificate
+configuration will not be included as part of this feature.
+
+
+Audit CertExpiry
+----------------
+
+A full audit of all certificate resources will be performed on service
+startup, on restart, and periodically. The periodic timer will run every
+24 hours.
+
+* The auditing mechanism will iterate over the cert-manager managed
+  certificates (and their associated k8s TLS secrets), and will only raise
+  an alarm if the certificate's user-configured renew-before has passed,
+  i.e., cert-manager has attempted to renew the certificate but failed.
+* cert-alarm will then iterate over all k8s TLS secrets that are not managed
+  by cert-manager.
+* cert-alarm will then process the remaining non-k8s resources that reside
+  in the sysinv DB (stored as PEM files on the filesystem).
+
+While alarms are active, an additional audit will run every hour on the
+alarmed entities only.
+
+Alternatives
+------------
+
+Part of the solution could be implemented with cronjobs and/or
+KubeCronJobs that audit the expiry dates, but these do not provide enough
+control and customized coding options.
+
+Another alternative to introducing the cert-alarm service is to introduce a
+new k8s application as a k8s deployment to perform the monitoring. This is
+not being pursued here since containerization of the StarlingX flock is a
+feature on its own.
+
+For maintaining the list of certificates to monitor, updating the database
+entries and extending the tables for customization was also considered.
+Since the user can have new applications that are unknown to the platform,
+and their certificates should also be monitored, the proposed solution uses
+the existing Certificate and TLS Secret resources in the k8s etcd database,
+with annotations for customizing the certificate alarming behavior. Another
+model could be to allow the user to pass config parameters at runtime, which
+is not suitable for the platform.
+
+Data model impact
+-----------------
+
+New alarm types for expiringSoon and expired certificates will be defined in
+the fault management system to support this feature.
+
+A couple of examples of alarm details are::
+
+  Alarm raised after the SSL certificate has expired
+    Alarm ID: 255.001
+    Reason Text: "Certificate 'system certificate-install -mode ssl' expired"
+    Entity ID: system-certificate=ssl
+    Severity: Critical
+    Timestamp:
+
+  Alarm raised to warn about the docker registry certificate expiring soon
+    Alarm ID: 260.012
+    Reason Text: "Certificate 'docker-registry' expiring soon in days, on "
+    Entity ID: system-certificate=docker.registry
+    Severity: Major
+    Timestamp:
+
+
+REST API impact
+---------------
+
+None.
+
+Security impact
+---------------
+
+None. The feature will access certificates on the platform in order
+to check expiry dates.
+
+Other end user impact
+---------------------
+
+The user will see alarms in the 'fm alarm-list' output as certificates
+approach their expiry date. Expired certificates will raise a higher-severity
+alarm to alert the user.
+
+The user will need to update annotations in order to change the default
+behavior. Examples of annotations are shown below. The newly supported
+annotations are marked with a comment in the following k8s resource.
+
+.. code-block:: none
+
+  Name:         system-restapi-gui-certificate
+  Namespace:    deployment
+  Kind:         Secret
+  Type:         kubernetes.io/tls
+  Annotations:  cert-manager.io/alt-names
+                cert-manager.io/certificate-name: system-restapi-gui-certificate
+                cert-manager.io/common-name: 10.10.10.3
+                cert-manager.io/ip-sans: 10.10.10.3
+                cert-manager.io/issuer-kind: Issuer
+                cert-manager.io/issuer-name: my-ica-cert-and-key-issuer
+                cert-manager.io/uri-sans:
+                starlingx.io/alarm: enabled # New annotation
+                starlingx.io/alarm-before: 30d # New annotation
+                starlingx.io/alarm-severity: critical # New annotation
+                starlingx.io/alarm-text: "foobar" # New annotation
+
+
+Performance Impact
+------------------
+
+In large Distributed Cloud systems, there can be thousands of certificates
+and TLS secrets (there is a unique DcAdminEpIntermediateCA for each subcloud).
+In order to scale, the cert-alarm audit algorithm will skip the
+DcAdminEpIntermediateCA Certificates/Secrets for the subclouds that are
+present on the SystemController. Since these DcAdminEpIntermediateCA
+secrets are also available on each subcloud, they will be audited and
+alarmed on the subcloud.
+
+The full certificate alarm audit is run once every 24 hours, and the optional
+hourly certificate alarm audit only runs when a certificate alarm is active
+and only audits alarmed certificates. The frequency of checks is thus low and
+is not expected to have a performance impact.
+
+Other deployer impact
+---------------------
+
+None.
+
+Developer impact
+----------------
+
+None.
+
+Upgrade impact
+--------------
+
+If an alarm is indicated as managementAffecting, this will impact upgrades.
+The intention of this feature is to mark only expired platform certificates
+as managementAffecting. In such a case, the user will be unable to perform
+and complete upgrades until the certificate is updated.
+
+Only platform certificates such as 'Kubernetes-RootCA', 'ssl',
+'docker_registry' will be managementAffecting (and will carry a Critical
+severity).
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+
+* Sabeel Ansari
+
+
+Repos Impacted
+--------------
+
+* config
+* fault
+* ha
+
+Work Items
+----------
+
+* Create new service cert-alarm
+* Framework code to support new alarms
+* FM integration to discover existing alarms and publish alarms
+* Unit tests
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+* New code introduced must include unit test cases
+* Alarms should be raised as certificates approach expiry dates
+* Alarms should be raised when certificates are expired. The alarm
+  should have higher severity than ExpiringSoon
+* Only an ExpiringSoon or Expired alarm should exist (never both)
+  for a certificate entity
+* Testing should include all platform managed certificates
+* Testing should include all certificate configurations: cert-manager
+  managed certificates, certificates in k8s TLS secrets, and manually
+  installed certificates via 'system certificate-install'.
+* Testing should be performed on all configurations: AIO-SX, AIO-DX,
+  Standard, Distributed Cloud, etc.
+* Alarms should persist across reboots, upgrades, etc. In addition,
+  cert-alarm should be able to monitor any newly introduced certificate
+  entities in the N+1 release.
+
+Documentation Impact
+====================
+
+End user documentation needs to be updated with all new alarm codes
+and their respective details and impact. In addition, documentation
+should also recommend corrective action that needs to be taken by the user
+to address the alarm.
+
+The documentation will also capture details for customizing certificate
+alarming behavior using k8s annotations.
+
+References
+==========
+
+* https://cert-manager.io/docs/
+* https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/
+
+
+History
+=======
+