From d9005f6ff5729cac6111519ee7a29ce31cd709a8 Mon Sep 17 00:00:00 2001
From: Mingyuan Qi <mingyuan.qi@intel.com>
Date: Mon, 14 Sep 2020 11:02:26 +0800
Subject: [PATCH] EdgeWorker management phase one

Introduce edgeworker personality

Story: 2008129

Change-Id: If74fb3d3863b05df9875a13e414f02bbfae4842e
Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com>
---
 ...008129-edgeworker-management-phase-one.rst | 393 ++++++++++++++++++
 1 file changed, 393 insertions(+)
 create mode 100644 doc/source/specs/stx-5.0/approved/Management-2008129-edgeworker-management-phase-one.rst

diff --git a/doc/source/specs/stx-5.0/approved/Management-2008129-edgeworker-management-phase-one.rst b/doc/source/specs/stx-5.0/approved/Management-2008129-edgeworker-management-phase-one.rst
new file mode 100644
index 0000000..b95de10
--- /dev/null
+++ b/doc/source/specs/stx-5.0/approved/Management-2008129-edgeworker-management-phase-one.rst
@@ -0,0 +1,393 @@
+..
+  This work is licensed under a Creative Commons Attribution 3.0 Unported
+  License. http://creativecommons.org/licenses/by/3.0/legalcode
+
+===============================
+EdgeWorker Management Phase One
+===============================
+
+Storyboard:
+https://storyboard.openstack.org/#!/story/2008129
+
+This story will introduce a new node personality 'edgeworker' to StarlingX.
+
+The biggest difference between an 'edgeworker' node and a 'worker' node is
+that the OS of an 'edgeworker' node is not installed or configured by the
+StarlingX controller, and it may vary from case to case, for example Ubuntu,
+Debian or Fedora. The basic idea is to deploy the containerd and kubelet
+services to the 'edgeworker' nodes, so that the StarlingX Kubernetes platform
+is extended to these nodes.
+
+The second difference is that 'edgeworker' nodes are usually deployed close to
+edge devices, while 'worker' nodes are usually servers deployed in a server
+room. The 'edgeworker' personality suits nodes on which users want to install
+their own customized OS, and which may need to be deployed physically close to
+the devices that produce or consume data.
+
+To leverage StarlingX functionality on edgeworker nodes, most flock agents will
+be containerized and enabled on them. This is also aligned with the long-term
+strategy of flock service containerization.
+
+The whole topic is broken down into approximately four phases:
+
+* Phase One
+
+  * Add edgeworker personality
+  * Add an Ansible playbook to join edgeworker nodes to the STX K8S cluster
+  * Support Ubuntu and CentOS as target OS
+
+* Phase Two
+
+  * Containerize a set of flock agents to get edgeworker node inventoried
+  * Enhance multiple Ceph cluster operation
+
+* Phase Three
+
+  * Support OpenStack running on edgeworker nodes
+  * Support L3/Tunnel mgmt. network
+  * Containerize the rest of the flock agents
+
+* Phase Four
+
+  * Enable software management on edgeworker nodes
+  * Enable optional authentication for new nodes
+  * Extend target OS support
+
+This spec focuses on *Phase One*.
+
+Problem description
+===================
+
+In a typical IoT or industrial use case, StarlingX is used to facilitate the
+setup and management of the whole edge cluster. However, such clusters also
+contain other types of nodes that are outside the current StarlingX
+management scope. Various reasons, from software to hardware, prevent the
+administrator from deploying these nodes as 'worker' nodes.
+In particular, the common setbacks are:
+
+* The OS of the nodes cannot be, or the administrator does not want it to be,
+  installed by StarlingX.
+* The nodes are running a Type I hypervisor.
+* The hardware resources do not meet the minimum requirements of a StarlingX
+  worker node.
+* The nodes are connected to StarlingX controllers over an L3 network.
+
+In this story, these nodes are categorized into a new personality to
+distinguish them from 'worker' nodes. The new personality is called
+'edgeworker' since these nodes are usually deployed close to the edge device
+side. An edge device could be, for example, an I/O device, a camera, a servo
+motor or a sensor.
+
+The first three setbacks will be addressed in phase one, while the network
+requirement and manageability enhancements will be addressed in the following
+phases. Separate specs for the different phases will be submitted in
+different releases.
+
+Use Cases
+---------
+
+* Administrator wants to have all the 'edgeworker' nodes managed by StarlingX
+
+  * List 'edgeworker' nodes in the host list (Phase one)
+  * Check/Lock/Unlock 'edgeworker' node state (Phase two)
+  * Query 'edgeworker' hardware resources info (Phase two)
+  * Configure 'edgeworker' resources for specific usage (Phase two and later)
+  * Manage alarms generated by 'edgeworker' (Phase three)
+  * Update 'edgeworker' packages (Phase four)
+
+* Administrator does not want StarlingX to install OS on the 'edgeworker'
+  nodes
+* User wants to orchestrate container workloads to 'edgeworker' nodes
+* User wants to orchestrate VM workloads to 'edgeworker' nodes as an option
+
+
+Proposed change
+===============
+
+**Edgeworker personality**
+
+Adding a new personality will require changes in sysinv db, sysinv api and
+sysinv conductor, as well as cgts-client.
+
+#.  *sysinv db*
+
+    To get 'edgeworker' nodes into sysinv, the 'edgeworker' value will be
+    added to the enum type invPersonalityEnum in the sysinv db. Accordingly,
+    'edgeworker' must be added to the db models as well. After this change, a
+    host can be assigned the edgeworker personality from the sysinv db
+    perspective.
+
+#.  *sysinv api*
+
+    The changes mainly focus on the host API, adding checks for 'edgeworker'
+    hosts during host-add.
+    Possible checks:
+
+    * mgmt IP, if the mgmt network is not dynamic
+    * host name validation
+    * personality check
+
+#.  *sysinv conductor*
+
+    The sysinv conductor is responsible for mgmt IP allocation when the mgmt
+    network is of the dynamic type.
+
+#.  *cgts client*
+
+    Add an 'edgeworker' choice for the 'personality' argument of the host-add
+    and host-update commands.
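+
+The db and API changes above could be sketched as follows. This is a minimal,
+hypothetical illustration, not sysinv code. ``Personality`` stands in for an
+illustrative subset of the ``invPersonalityEnum`` values, and
+``validate_edgeworker_add`` is an assumed helper name. The hostname rule (a
+lower-case RFC 1123 label) is also an assumption and may differ from the rule
+sysinv actually enforces.
+
+.. code-block:: python
+
+   import enum
+   import ipaddress
+   import re
+
+
+   class Personality(str, enum.Enum):
+       """Illustrative subset of personality values (hypothetical)."""
+       CONTROLLER = "controller"
+       WORKER = "worker"
+       STORAGE = "storage"
+       EDGEWORKER = "edgeworker"  # value added by this spec
+
+
+   # Assumption: hostnames must be lower-case RFC 1123 labels.
+   _HOSTNAME_RE = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")
+
+
+   def validate_edgeworker_add(hostname, personality, mgmt_ip=None,
+                               mgmt_dynamic=True, mgmt_subnet=None):
+       """Return a list of errors; an empty list means the request passes."""
+       errors = []
+       try:
+           Personality(personality)
+       except ValueError:
+           errors.append("invalid personality: %s" % personality)
+       if not _HOSTNAME_RE.match(hostname):
+           errors.append("invalid hostname: %s" % hostname)
+       if not mgmt_dynamic:
+           # A static mgmt network needs a caller-supplied address that
+           # belongs to the mgmt subnet.
+           if mgmt_ip is None:
+               errors.append("mgmt_ip is required for a static mgmt network")
+           else:
+               try:
+                   ok = (ipaddress.ip_address(mgmt_ip)
+                         in ipaddress.ip_network(mgmt_subnet))
+               except ValueError:
+                   ok = False
+               if not ok:
+                   errors.append("mgmt_ip %s is not in %s"
+                                 % (mgmt_ip, mgmt_subnet))
+       return errors
+
+For example, ``validate_edgeworker_add("Edge_01", "edgeworker")`` returns an
+``invalid hostname`` error because of the upper-case letter and the
+underscore. ``validate_edgeworker_add("edge-01", "edgeworker")`` returns an
+empty list.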
+
+After the underlying changes are applied, the administrator can add an
+edgeworker node to the inventory with either of the following commands:
+
+::
+
+  # system host-add -n <hostname> -p edgeworker
+  # system host-update <id> hostname=<hostname> personality=edgeworker
+
+When an edgeworker node is added to the inventory, sysinv can provide the
+following services:
+
+* DHCP service (Phase one)
+* Host lock/unlock (Phase two)
+* Host interface modification and assignment (Phase two)
+* Host hardware resource query (Phase two)
+* Label assignment (Phase two)
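+
+For the DHCP service with a dynamic mgmt network, the sysinv conductor
+allocates the mgmt address of each new edgeworker host, and DHCP then serves
+that address. The sketch below shows one possible first-free allocation over
+the mgmt subnet. The function name and the ``reserved`` parameter (for
+example, for floating addresses) are hypothetical, and the real conductor
+logic may allocate addresses differently.
+
+.. code-block:: python
+
+   import ipaddress
+
+
+   def allocate_mgmt_ip(subnet, in_use, reserved=()):
+       """Return the first free host address in ``subnet`` as a string.
+
+       ``in_use`` holds addresses already assigned to hosts and
+       ``reserved`` holds addresses that must never be handed out.
+       Raises RuntimeError when the subnet has no free address left.
+       """
+       taken = {ipaddress.ip_address(a) for a in in_use}
+       taken.update(ipaddress.ip_address(a) for a in reserved)
+       for candidate in ipaddress.ip_network(subnet).hosts():
+           if candidate not in taken:
+               return str(candidate)
+       raise RuntimeError("no free address left in %s" % subnet)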
+
+The following functions will not be supported on edgeworker nodes:
+
+* host-upgrade
+* BMC integration
+
+  An edgeworker node is not a server but a regular PC, such as an industrial
+  PC, a NUC or a workstation, so a BMC is not a required feature for these
+  nodes. Node life cycle management is done in-band or manually by the
+  maintainer, and the use cases for edgeworker nodes do not expect
+  out-of-band node management.
+
+Additional semantic checks will be added for these functions.
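+
+A minimal sketch of such a semantic check follows. It assumes hypothetical
+operation names (``host-upgrade``, ``bmc-config``) and a hypothetical
+exception class. The actual sysinv checks may be structured differently, for
+example spread across the individual API handlers.
+
+.. code-block:: python
+
+   # Hypothetical names for the operations rejected on edgeworker hosts.
+   UNSUPPORTED_ON_EDGEWORKER = frozenset({"host-upgrade", "bmc-config"})
+
+
+   class SemanticCheckError(Exception):
+       """Raised when an operation is not allowed for a host personality."""
+
+
+   def check_host_operation(personality, operation):
+       """Raise SemanticCheckError for operations unsupported on edgeworker."""
+       if (personality == "edgeworker"
+               and operation in UNSUPPORTED_ON_EDGEWORKER):
+           raise SemanticCheckError(
+               "%s is not supported on edgeworker hosts" % operation)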
+
+Other functions will be described in detail in each phase's spec.
+
+**Ansible playbook for provisioning edgeworker nodes**
+
+The main steps for provisioning an edgeworker node are installing the kubelet,
+kubeadm and containerd packages on the node, in a way that depends on its
+Linux distribution, and joining the node to the StarlingX Kubernetes platform.
+Besides these steps, system configuration such as NTP setup, interface
+configuration and DNS setup is needed as well.
+
+The first two Linux distributions we propose to support for edgeworker are
+*Ubuntu* and *CentOS*.
+
+The versions of all the Kubernetes packages on edgeworker nodes must exactly
+match those of the packages on the controllers. If they do not, the playbook
+will reinstall the packages at the proper version.
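+
+One possible version check is sketched below. It compares the package
+versions reported by the controller with those on the node and builds a
+pinned install command for every mismatch. The command templates are
+assumptions: the exact version strings depend on the repository configured
+on the node, and downgrades may need extra package-manager options that are
+not shown. The real playbook would more likely use Ansible's ``apt`` and
+``yum`` modules than raw commands.
+
+.. code-block:: python
+
+   # Assumed pinned-install syntax per supported distribution.
+   PIN_TEMPLATES = {
+       "ubuntu": ["apt-get", "install", "-y", "{name}={version}"],
+       "centos": ["yum", "install", "-y", "{name}-{version}"],
+   }
+
+
+   def reinstall_commands(distro, controller_versions, node_versions):
+       """Return commands that align the node's packages with the controller.
+
+       Both arguments map package name -> version string; a package missing
+       from ``node_versions`` is treated as not installed.
+       """
+       try:
+           template = PIN_TEMPLATES[distro]
+       except KeyError:
+           raise ValueError("unsupported distribution: %s" % distro) from None
+       commands = []
+       for name in sorted(controller_versions):
+           wanted = controller_versions[name]
+           if node_versions.get(name) != wanted:
+               commands.append([part.format(name=name, version=wanted)
+                                for part in template])
+       return commands
+
+For instance, on an Ubuntu node where only ``kubeadm`` lags behind the
+controller, the function returns a single ``apt-get`` command that pins
+``kubeadm`` to the controller's version.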
+
+The playbook sequence to provision an edgeworker node:
+
+#.  Preparations on controller
+
+    * Send containerd config and cert to edgeworker
+    * Generate K8S bootstrap token and calculate certificate hash
+
+#.  Preparations on edgeworker
+
+    * Configure network (interface and DNS)
+    * Set up proxy if needed
+    * Install essential packages
+    * Set up NTP
+
+#.  Add edgeworker node to STX Kubernetes
+
+    * Install containerd, kubelet, kubeadm packages (based on OS)
+    * Configure sysctl and swap
+    * Join the K8S cluster
+    * Download images
+
+These steps will be implemented as a single playbook that includes different
+roles.
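+
+The bootstrap token and certificate hash generated on the controller are
+consumed when the edgeworker node joins the cluster. kubeadm expects
+``--discovery-token-ca-cert-hash`` in the form ``sha256:<hex>``, where the
+hex digest is the SHA-256 of the cluster CA certificate's DER-encoded Subject
+Public Key Info. The sketch below assumes those DER bytes have already been
+extracted from the CA certificate (for example with ``openssl``). The helper
+names are hypothetical. On the controller, the command
+``kubeadm token create --print-join-command`` can also produce an equivalent
+join command directly.
+
+.. code-block:: python
+
+   import hashlib
+   import re
+
+   # Bootstrap token format used by kubeadm: 6 + 16 lower-case alphanumerics.
+   _TOKEN_RE = re.compile(r"^[a-z0-9]{6}\.[a-z0-9]{16}$")
+
+
+   def ca_cert_hash(spki_der):
+       """Value for --discovery-token-ca-cert-hash from the CA's SPKI bytes."""
+       return "sha256:" + hashlib.sha256(spki_der).hexdigest()
+
+
+   def kubeadm_join_args(apiserver, token, spki_der):
+       """Argument list for `kubeadm join`, run on the edgeworker node."""
+       if not _TOKEN_RE.match(token):
+           raise ValueError("malformed bootstrap token")
+       return ["kubeadm", "join", apiserver,
+               "--token", token,
+               "--discovery-token-ca-cert-hash", ca_cert_hash(spki_der)]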
+
+Alternatives
+------------
+
+There are several open source projects that can provision a Kubernetes node.
+
+* Kubespray
+
+  Kubespray [1]_ is a composition of Ansible playbooks, inventory, provisioning
+  tools, and domain knowledge for generic OS/Kubernetes clusters configuration
+  management tasks. Kubespray performs generic OS configuration as well as
+  Kubernetes cluster bootstrapping.
+
+  Kubespray provides the full functionality of provisioning a Kubernetes node,
+  just as the edgeworker provisioning playbook does. However, Kubespray also
+  supports multiple container runtimes, multiple CNI plugins and control plane
+  bootstrap, which is far more functionality than provisioning an edgeworker
+  requires.
+
+  What edgeworker needs is a playbook that targets a specific container
+  runtime and specific CNI plugins, and that only provisions a Kubernetes node.
+
+* KubeEdge
+
+  KubeEdge [2]_ is an open source system for extending native containerized
+  application orchestration capabilities to hosts at Edge. KubeEdge can run on
+  top of an existing Kubernetes cluster and deploy a customized kubelet service
+  called 'edged' to the edge node. Between the apiserver and edged, the
+  EdgeController acts as the bridge that manages edge node and pod metadata so
+  that the data can be targeted to a specific edge node.
+
+  KubeEdge is able to provision edge nodes from the cloud. However, its kubelet
+  service is customized for the specific requirement of letting the
+  administrator manage the pods running on edge nodes from a public cloud
+  platform. The customized kubelet (edged) brings compatibility issues whenever
+  Kubernetes is upgraded to a newer release. Since edgeworker provisioning is a
+  key step in enabling these nodes, this would require extra effort to test and
+  upgrade KubeEdge during each Kubernetes upgrade.
+
+  In addition, KubeEdge includes complete edge device management logic that is
+  outside the current StarlingX platform scope.
+
+Data model impact
+-----------------
+
+The only data model change is to insert 'edgeworker' into 'invPersonalityEnum'
+in the sysinv db model.
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+The potential security threats and their mitigations are:
+
+* Malicious node
+
+  The administrator must guarantee that no unauthorized node can physically
+  connect to the management network. Authentication of edgeworker nodes during
+  onboarding will be introduced in later phases.
+
+* Malicious packages in edgeworker node
+
+  Since the OS is managed by the administrator, the administrator must
+  guarantee that the packages running on edgeworker nodes are secure.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+The deployer is required to run the edgeworker provisioning playbook after
+adding a node with, or updating a node to, the edgeworker personality.
+
+Developer impact
+----------------
+
+None
+
+Upgrade impact
+--------------
+
+The kubelet needs to be upgraded during the Kubernetes upgrade process. The
+upgrade process will trigger an additional script or playbook that checks the
+versions of the packages on edgeworker nodes and upgrades them according to
+each node's distribution.
+
+The distribution's repository may not provide the newest version of the
+corresponding packages. This is acceptable because, according to the
+Kubernetes version skew policy [3]_, kubelet and kube-proxy may be up to two
+minor versions older than the apiserver.
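+
+Before accepting the version offered by a distribution's repository, an
+upgrade script could apply a check like the one sketched below. The helper
+names are hypothetical. ``max_skew`` is a parameter so that the check can
+follow the version skew policy of the Kubernetes release in use.
+
+.. code-block:: python
+
+   def parse_minor(version):
+       """Return (major, minor) from strings like 'v1.18.1' or '1.18.1-00'."""
+       major, minor = version.lstrip("v").split(".")[:2]
+       return int(major), int(minor)
+
+
+   def kubelet_skew_ok(apiserver_version, kubelet_version, max_skew=2):
+       """True if the kubelet is not newer than the apiserver and at most
+       ``max_skew`` minor versions older than it."""
+       api_major, api_minor = parse_minor(apiserver_version)
+       node_major, node_minor = parse_minor(kubelet_version)
+       if api_major != node_major:
+           return False
+       return 0 <= api_minor - node_minor <= max_skew
+
+For example, with an apiserver at ``v1.18.1``, a kubelet at ``1.16.9`` passes
+the check. A kubelet at ``1.15.0`` (three minor versions older) or ``1.19.0``
+(newer than the apiserver) does not.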
+
+Software patching and updating will be addressed in phase four, either by a
+third-party solution or by plugins to the current software management. This
+is needed because the current software management can only patch or update
+RPM packages, while the OS of an edgeworker node may use other package
+formats.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Mingyuan Qi
+
+
+Repos Impacted
+--------------
+
+* config
+* ansible-playbook
+
+Work Items
+----------
+
+The work items are described in the `Proposed change`_ section above.
+
+
+Dependencies
+============
+
+None
+
+
+Testing
+=======
+
+* Sysinv unit test
+
+* Sysinv host operation test
+
+* Edgeworker node addition test in different deployment modes
+
+  * Simplex
+  * Duplex
+  * Standard
+
+* Ansible-playbook test for each target OS
+
+  * Host configuration
+  * Package installation
+  * Edgeworker node joining the Kubernetes cluster
+
+
+Documentation Impact
+====================
+
+* Add a new page describing the requirements, limitations and use cases of
+  edgeworker nodes.
+* Add a new page describing the following deployments:
+
+  * Duplex + edgeworker
+  * Standard + edgeworker
+
+* Modify all deployment docs to add an option for deploying edgeworker nodes,
+  with a link to the corresponding deployment with edgeworker nodes.
+
+
+References
+==========
+
+.. [1]  Kubespray https://github.com/kubernetes-sigs/kubespray
+.. [2]  KubeEdge https://kubeedge.io
+.. [3]  Kubernetes version skew policy https://kubernetes.io/docs/setup/release/version-skew-policy/
+
+
+History
+=======
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - stx.5.0
+     - Edgeworker management phase one introduced