integ

Commit Graph

Author	SHA1	Message	Date
Scott Little	3637d66ae4	Relocated some packages to repo 'monitoring' List of relocated subdirectories: monitoring/collectd-extensions monitoring/influxdb-extensions tools/monitor-tools tools/vm-topology Story: 2006166 Task: 35687 Depends-On: I6c62895f8dda5b8dc4ff56680c73c49f3f3d7935 Depends-On: I665dc7fabbfffc798ad57843eb74dca16e7647a3 Change-Id: Iffacd50340005320540cd9ba1495cde0b2231cd0 Signed-off-by: Scott Little <scott.little@windriver.com> Depends-On: I14e631137ff5658a54d62ad3d7aa2cd0ffaba6e0	2019-09-05 20:31:52 -04:00
Bin Qian	6218755c3d	Soften NTP alarm language for syncing with peer In an IPv6 setup, the NTP refid is a hash of the reference's IPv6 address. In that case, do not try to interpret the refid to tell if the peer has a reliable source. When the NTP service uses the peer controller as its reference, the alarm is a reminder to the admin user rather than a report of an issue. This is a minor alarm. Closes-Bug: 1834071 Change-Id: Ia2770ba7ed77640e58e8c35254a504b57487ff8f Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-08-16 11:34:14 -04:00
Zuul	3876f288e1	Merge "Use ntpq refid to tell if peer controller reaches reliable time source"	2019-08-09 19:35:27 +00:00
Zuul	9e05349216	Merge "Collapse the glance filesystem into platform"	2019-08-09 18:43:46 +00:00
Bin Qian	e5bf093cc8	Use ntpq refid to tell if peer controller reaches reliable time source This is a partial fix, for IPv4 only. ntpq.py now verifies whether the peer controller's reference is a valid source when the peer controller is selected as the time server. This change avoids raising a false alarm when a controller uses the peer controller as its time server while the peer uses a reliable time source (e.g., an external time server or an accurate time device). Partial-Bug: 1834071 Change-Id: I9140e14b79cb09088c8061a06fae22df97526a70 Signed-off-by: Bin Qian <bin.qian@windriver.com>	2019-08-09 16:00:58 +00:00
Kristine Bujold	daf33b135d	Collapse the glance filesystem into platform The filesystem /opt/cgcs is removed and its content moved under /opt/platform. Resources related to drbd-cgcs and /opt/cgcs are updated to drbd-platform and /opt/platform. Tested in AIO-SX, AIO-DX and Standard hardware labs. Depends-On: https://review.opendev.org/674360 Partial-Bug: 1830142 Change-Id: I6d0555f00ab269f7d9567fff365180b66adce8b3 Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>	2019-08-08 10:54:11 -04:00
Alex Kozyrev	d11ccac73f	Alarm is not raised in case PTP GM is lost The "No lock" PTP alarm is currently raised only when a GM is not, and never was, present in the network. The current logic raises this alarm only when the MAC address of the GM is the same as the local MAC address, which is the case only when no external GM ever appeared in the PTP setup. When a GM was present in the network and then lost, we need to check port status instead: the PTP MAC address still points to the external GM, but the port status changes from SLAVE to LISTENING. Change-Id: I30365685e6f44566702cc82534ab6ebf0613a731 Closes-bug: 1836884 Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>	2019-08-06 16:03:01 -04:00
Sun Austin	04df91698f	collectd: list of fs to monitor not up to date The following df monitor list in collectd will be updated Adding: /var/lib/docker /var/lib/docker-distribution /var/lib/kubelet /var/lib/nova/instances Removing: /etc/nova/instances Closes-Bug: 1837103 Change-Id: I9f0f4bf27968e0e1b85a0b5b314ab9c3c15fec2d Signed-off-by: Sun Austin <austin.sun@intel.com>	2019-07-19 14:49:21 +08:00
Eric MacDonald	2c47014484	Set PTP Monitor period to 5 minutes The collectd PTP monitor plugin development left the audit period at 1 minute. The PTP monitoring design called for a 5 minute audit period. Change-Id: I7eb6af6f88934028e2fb91c9655ee2db1143dc64 Closes-Bug: 1836392 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-07-17 06:37:42 -04:00
Eric Barrett	97f7cb4a58	Align PKG-INFO for Collectd & Influxdb Extensions Update the license section to ASL 2.0 Closes-Bug: 1836227 Change-Id: Ib84ad47af2d4e8ff02a5801dfde16fc6a440c63d Signed-off-by: Eric Barrett <eric.barrett@windriver.com>	2019-07-15 16:53:36 -04:00
Al Bailey	40e57e46a5	Fix the runtime requirements for collectd-extensions The python code in the collectd-extensions requires several python modules in order to run, but is missing the explicit dependency against those modules in the package. These include: fm-api httplib2 influxdb oslo-concurrency tsconfig Change-Id: I9ace889fdb7fac031792486c3e5ddf3bc2cae770 Story: 2004764 Task: 33630 Signed-off-by: Al Bailey <Al.Bailey@windriver.com>	2019-06-07 14:47:32 -05:00
Eric MacDonald	904da6755d	Reduce the collectd samples retention period Collectd creates a samples database within the InfluxDB database which is stored in the rootfs. The current 4 week retention period is too long for larger systems and could lead to the rootfs filling up. This update reduces that retention period to 1 week to protect the rootfs from being filled up with sample data until the samples database is moved to a more appropriate location. Change-Id: Ic59712849fa228f19d15919594d23edc43109a0b Closes-Bug: 1827301 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-05-09 15:46:58 -04:00
Erich Cordoba	2e293e0894	Change license on collectd and influxdb extensions to Apache 2.0 Story: 2005542 Task: 30714 Change-Id: If1573deac5f1a0f0cfe8bb8c5fba5b4bb2e1a7c3 Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>	2019-05-02 17:31:13 -05:00
Eric MacDonald	8841bceb80	Make collectd plugins use FM API V2 Using FM API V2 allows the collectd plugins to distinguish between FM connection failures and no existing alarm query requests on process startup as well as failure to clear or assert alarms during runtime so that such actions can be retried on the next audit interval. This makes the plugins more robust in their alarm management and avoids leaving stuck alarms, which fixes the following three reported stuck alarm bugs. Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1802535 Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1813974 Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1814944 Additional improvements were made to each plugin to handle failure paths better with the V2 API. Additional changes made by this update include: 1. fixed stale unmounted filesystems alarm handling 2. percent usage alarm actual readings are updated on change 3. fix of threshold values 4. add 2 decimal point resolution to % usage alarm text 5. added commented FIT code to mem, cpu and df plugins 6. reversed True/False return polarity in interface plugin functions Test Plan: Regression: PASS: normal alarm handling with FM V2 API ; process startup PASS: normal alarm handling with FM V2 API ; runtime alarm assert PASS: normal alarm handling with FM V2 API ; runtime alarm clear PASS: Verify alarms of unmounted fs get automatically cleared PASS: Verify interface alarm/clear operation Robustness: PASS: Verify general startup behavior of all plugins while FM is not running only to see it start at some later time. PASS: Verify alarm handling over process startup with existing cpu alarms while FM not running. PASS: Verify alarm handling over process startup with existing mem alarms while FM not running. PASS: Verify alarm handling over process startup with existing df alarms while FM not running. PASS: Verify runtime cpu plugin alarm assertion retry handling PASS: Verify runtime cpu plugin alarm clear retry handling PASS: Verify runtime cpu plugin handling over process restart PASS: Verify alarm handling over process startup with existing cpu alarms while FM initially not running and then started. PASS: Verify runtime mem plugin alarm assertion retry handling PASS: Verify runtime mem plugin alarm clear retry handling PASS: Verify runtime mem plugin handling over process restart PASS: Verify alarm handling over process startup with existing mem alarms while FM initially not running and then started. PASS: Verify runtime df plugin alarm assertion retry handling PASS: Verify runtime df plugin alarm clear retry handling PASS: Verify runtime df plugin handling over process restart PASS: Verify alarm handling over process startup with existing df alarms while FM initially not running and then started. PASS: Verify alarm set/clear threshold boundaries for cpu plugin PASS: Verify alarm set/clear threshold boundaries for memory plugin PASS: Verify alarm set/clear threshold boundaries for df plugin New Features: ... threshold exceeded ; threshold 80.00%, actual 80.33% PASS: Verify percent usage alarms are refreshed with current value PASS: Verify percent usage alarms show two decimal points Change-Id: Ibe173617d11c17bdc4b41115e25bd8c18b49807e Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-05-01 12:13:27 -04:00
Zuul	e263b4c484	Merge "Remove references to infra in collect"	2019-04-24 14:22:51 +00:00
Eric Barrett	cb0b2ffe1e	Enable Flake8 Docstring Errors Flake8 currently ignores the following errors: H401: docstring should not start with a space H404: multi line docstring should start without a leading new line H405: multi line docstring summary not separated with an empty line Enable them for more consistent formatting of docstrings Change-Id: I385e28e9c6eca3c02a3def51ff64b00b7a63a853 Story: 2004515 Task: 30076 Signed-off-by: Eric Barrett <eric.barrett@windriver.com>	2019-04-18 11:50:45 -04:00
Teresa Ho	1266d7bf4e	Remove references to infra in collect Updated the host interface monitor plugin for collectd. Replaced the reference of infra interface with cluster-host interface. Story: 2004273 Task: 30518 Depends-On: https://review.openstack.org/#/c/652713/ Change-Id: I33ca3c6a970e45bd278cd399149cdb2c985def7d Signed-off-by: Teresa Ho <teresa.ho@windriver.com>	2019-04-16 09:52:33 -04:00
Eric Barrett	27133180fe	Enable Flake8 300 Series Errors Flake8 currently ignores the following errors: E302: expected 2 blank lines, found 1 E303: too many blank lines E305: expected 2 blank lines after class or function definition, found 1 Change-Id: Idfb00e530967f1a345bc2e263ad77597f83cc5d3 Story: 2004515 Task: 30076 Signed-off-by: Eric Barrett <eric.barrett@windriver.com>	2019-04-03 10:40:01 -04:00
Zuul	be64ee567c	Merge "Add PTP monitoring to collectd"	2019-04-02 15:55:50 +00:00
Eric MacDonald	b4a23c57aa	Add PTP monitoring to collectd This update adds Precision Time Protocol (PTP) monitoring to the current list of in-house developed collectd plugins. Refer to the ptp.py header for a description of the monitoring service algorithm and inline comments for detailed behavior. Test Plan: Usability: ----------- PASS: Verify monitoring behavior around ptp service enable and disable PASS: Verify ptp monitoring behavior over lock and unlock PASS: Verify behavior with bonded interfaces (skew oot alarm) PASS: Verify no-lock hosts lock to remote grandmaster when available PASS: Verify AIO SX PTP Enable over Lock/Unlock System Level: ------------- PASS: Verify large system install PASS: Verify AIO SX system install Host Level: ----------- PASS: Verify controller monitoring PASS: Verify worker monitoring PASS: Verify storage monitoring PASS: Verify worker/storage behavior when the only controller is rebooted. PASS: Verify startup handling of fm calls while fm is not running PASS: Verify runtime handling of fm calls while fm is not running Config Level: ------------- PASS: Verify PTP Enable and auto start monitoring PASS: Verify PTP Disable and auto stop monitoring PASS: Verify audit interval is every 60 seconds PASS: Verify hardware timestamp monitoring PASS: Verify software timestamp monitoring PASS: Verify legacy timestamp monitoring PASS: Verify hardware to software config change PASS: Verify software to legacy config change PASS: Verify legacy to hardware config change PASS: Verify software to hardware config change Alarm Management: ----------------- PASS: Verify end-to-end handling of 'nolock' alarm management PASS: Verify end-to-end handling of 'out-of-tolerance' alarm management PASS: Verify end-to-end handling of 'process' alarm management PASS: Verify end-to-end handling of 'unsupported mode' alarm management PASS: Verify all ptp alarms get cleared on collectd process start PASS: Verify plugin startup behavior when FM is not running PASS: Verify plugin with FM V2 API PASS: Verify thresholded out-of-tolerance alarm handling PASS: Verify plugin logging is value added PASS: Verify alarm assert debounce of 2 PASS: Verify alarm clear with no debounce PASS: Verify only major out-of-tolerance alarm for software mode PASS: Verify only major out-of-tolerance alarm for legacy mode PASS: Verify minor/major out-of-tolerance alarm for hardware mode PASS: Verify no-lock alarm if compute GM ID is the same as its own PASS: Verify no-lock alarm is not raised on GM reboot PASS: Verify GM switches to alternate when GM host is rebooted Change-Id: If36aece94dd5511bf9deba0753f3863237e2a7fe Story: 2002823 Task: 29492 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-04-01 09:25:49 -04:00
Eric Barrett	11cc2a21bb	Enable Flake8 Whitespace Errors Flake8 currently ignores a number of whitespace related errors: E201: whitespace after '[' E202: whitespace before '}' E203: whitespace before ':' E211: whitespace before '(' E221: multiple spaces before operator E222: multiple spaces after operator E225: missing whitespace around operator E226: missing whitespace around arithmetic operator E231: missing whitespace after ',' E251: unexpected spaces around keyword / parameter equals E261: at least two spaces before inline comment Enable them for more thorough testing of code Change-Id: Id03f36070b8f16694a12f4d36858680b6e00d530 Story: 2004515 Task: 30076 Signed-off-by: Eric Barrett <eric.barrett@windriver.com>	2019-03-26 15:02:53 -04:00
Eric MacDonald	edf29d1b7a	Add Remote Logging Server connectivity monitoring to collectd This update adds titled support to the starlingX set of collectd monitoring plugins. This update excludes monitoring of IPV6 remote logging servers. Only IPV4 remote logging servers are supported. Story: 2002823 Task: 28636 Test Plan: PASS: Verify monitoring on controller nodes only PASS: Verify system install PASS: Verify plugin logging is value added PASS: Verify connectivity failure to success handling PASS: Verify connectivity success to failure handling PASS: Verify connected / not connected logging on service state change PASS: Verify connected / not connected logging on connectivity state change PASS: Verify service enabled to disabled state transition with alarm asserted PASS: Verify service enabled to disabled state transition while connected PASS: Verify service disabled to enabled state transition with connectivity PASS: Verify service disabled to enabled state transition without connectivity PASS: Verify plugin audit interval is every 60 seconds PASS: Verify plugin alarm assert debounce of 2 PASS: Verify plugin alarm clear with no debounce PASS: Verify plugin alarm assert over process start on TCP conn failure PASS: Verify plugin alarm severity as Minor PASS: Verify plugin alarm clear over process restart PASS: Verify plugin alarm is cleared on service disable transition PASS: Verify plugin sample data Change-Id: I73cd35170ed19abce17bb4f511f0c5e04bc101c6 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-03-13 21:35:34 -04:00
Zuul	5246aa4f28	Merge "Add numa node and huge page memory monitoring"	2019-02-13 21:41:36 +00:00
Eric MacDonald	4dadf61bea	Add numa node and huge page memory monitoring This update adds titled support to the existing Platform Memory monitor collectd plugin. Instance Mapping Plugin Refinements Instance Name ------------------------------------- ---------- Platform Memory platform Platform Memory Numa Node 0 node0 Platform Memory Numa Node 1 node1 Platform Memory Numa Node 0 Huge Pages node0_hugepages Platform Memory Numa Node 1 Huge Pages node1_hugepages New Alarm Entity IDs added to existing 100.103 alarm ID host=<hostname>.numa=node0 host=<hostname>.numa=node1 host=<hostname>.numa=node0_hugepages host=<hostname>.numa=node1_hugepages Modified memory plugin thresholds and added alarm notifier to support collectd requiring samples to be 'gt' rather than 'ge' the specified thresholds for a severity change. This update also corrects a few subtle pep8 warnings to a few of the existing python plugins. There is no need for an rmond update because numa and huge page monitoring was never enabled in rmond. Story: 2002823 Task: 29369 PASS: Verify logging of all memory instance types PASS: Verify monitoring of new numa node memory PASS: Verify monitoring of new numa node huge page memory PASS: Verify memory instance alarm handling in fm notifier PASS: Verify memory instance alarm load on startup PASS: Verify memory instance alarm clear ; runtime condition gone PASS: Verify memory instance alarm clear ; startup condition gone Regression: PASS: Verify End-To-End Sample Collection for all monitored resources. Corner Case: PASS: Verify alarm reporting with threshold of zero PROG: Verify memory alarm raised at threshold value PASS: Verify memory alarm cleared 1 below threshold value PASS: Verify above case for both major and critical thresholds Change-Id: I4e2612ac7b3d906be4b0a140286dbbb095ce7e1b Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-02-13 10:30:42 -05:00
Eric MacDonald	e8c9676d98	Add network interface monitoring plugin to collectd This update introduces interface monitoring for oam, mgmt and infra networks as a collectd plugin. The interface plugin runs and queries the new maintenance Link Monitor daemon for Link Model and Information every 10 seconds. The plugin then manages alarms based on the link model similar to how rmon did in the past ; port and interface alarms. Severity: Interface and Port levels Alarm Level Minor Major Critical ----------- ----- --------------------- ---------------------------- Interface N/A One of lag pair is Up All Interface ports are Down Port N/A Physical Link is Down N/A Degrade support for interface monitoring is added to the mtce degrade notifier. Any link down condition results in a host degrade condition, as it did in rmon. Sample Data: represented as % of total links Up for that network interface 100 or 100% percent used - all links of interface are up. 50 or 50% percent used - one of lag pair is Up and the other is Down 0 or 0% percent used - all ports for that network are Down The plugin documents all of this in its header. This update also 1. Adds the new lmond process to syslog-ng config file. 2. Adds the new lmond process to the mtce patch script. 3. Modifies the cpu, df and memory threshold settings by -1. rmon thresholds were precise whereas collectd requires that the samples cross the thresholds, not just meet them. So for example, in terms of a 90% usage action the threshold needs to be 89. Test Plan: (WIP but almost complete) PASS: Verify interface plugin startup PASS: Verify interface plugin logging PASS: Verify interface plugin Link Status Query and response handling PASS: Verify monitor, sample storage and grafana display PASS: Verify port and interface alarm matches what rmon produced PASS: Verify lmon port config from manifest configured plugin PASS: Verify lmon port config from lmon.conf PASS: Verify single interface failure handling and recovery PASS: Verify lagged interface failure handling and recovery PASS: Verify link loss of lagged interface shared between mgmt and oam (hp380) PASS: Verify network interface failure handling ; single port PASS: Verify network interface degrade handling ; lagged interface PEND: Verify network interface degrade handling ; vlan interface PASS: Verify HTTP request timeout period and handling PASS: Verify link status query failure handling - invalid uri (timeout) PASS: Verify link status query failure handling - missing uri (timeout) PASS: Verify link status query failure handling - status fail PASS: Verify link status query failure handling - bad json resp Change-Id: I2e2dfe6ddfa06a46770245540c7153d330bdf196 Story: 2002823 Task: 28635 Depends-On: https://review.openstack.org/#/c/633264 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-02-06 14:18:14 -05:00
Eric MacDonald	fab989b5bc	Add on-demand instance support to collectd alarm manager plugin Many plugins need support for on-demand instance sampling and alarming. The filesystem and memory monitoring plugins are perfect examples. The number of numa nodes or monitored file systems varies from host to host. This update adds on-demand instance support. Any plugin can now support multiple instances. As new plugin instances are learned, memory is allocated for them, linked to that plugin's base object, and managed as a separate instance within the scope of its parent. The following additional enhancements were made to the common alarm and degrade plugins. 1. added /opt/etcd as a new monitored filesystem. 2. added common support for vswitch alarm/degrade handling. 3. a few general cleanup changes for code maintainability. Change-Id: I05b4de78f30fc27362c63b6dbfc97268d6588e4f Story: 2002823 Task: 29297 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-02-05 12:39:11 -05:00
Eric MacDonald	abaff6b275	Remove alarm query before clear in NTP plugin The issue titled 'NTP 100.14 alarm is not cleared' exposed a case where the NTP plugin's alarm clear operation is skipped because its precursor fm_api.get_fault call returns None when the fm process is not running. From the caller's point of view, the None return suggests that the alarm to be cleared does not exist, so the code skips the call to clear. This update works around this by simply issuing the clear without the query. Change-Id: Idcc05bb0e7e1aa1082af1e8ecdcb1a5463b19440 Closes-Bug: 1812440 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-01-18 16:32:53 -05:00
Eric MacDonald	4d7c958711	Add NTP server monitoring as a collectd plugin This update replaces the currently existing but disabled ntpq.py plugin with one that does not rely on an external query_ntp_servers.sh. This new ntpq.py is an entirely new self contained implementation of what rmon and query_ntp_servers.sh was doing but now more efficiently all in one python plugin file. Story: 2002823 Task: 22859 Test Plan: PASS: Verify handling of one and two unreachable NTP servers. PASS: verify handling of pingable but not an NTP server. PASS: Verify NTP server re-provisioning from unreachable to reachable server. PASS: Verify NTP server re-provisioning from reachable to unreachable server. PASS: Verify NTP server alarms suppressed while controller is locked. PASS: Verify NTP asserted alarms show up on unlock until cleared. PASS: Verify NTP server monitoring occurs on controller only. PASS: Verify NTP unreachable server alarms are cleared over a collectd restart PASS: Verify NTP minor IP alarms are cleared on process startup PASS: Verify NTP minor IP alarm clear retries when FM call fails on process startup. PASS: Verify NTP alarm assertion retry while FM call fails at runtime. PASS: Verify NTP alarm clear retry while FM call fails at runtime. PASS: Verify NTP monitoring after controller Swact. PASS: Verify NTP monitoring cadence is every 10 minutes. PASS: Verify NTP plugin logs are useful and assist debug without flooding. Change-Id: I67c4c5518a6e5dec64b4e419ab7ee2ffcefb9bf3 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-01-09 12:47:00 -05:00
Eric MacDonald	c8f39de9a0	Implement collectd startup in manifest apply post stage Starting collectd too early in the manifest apply is seen to occasionally fail due to a dependency configuration on hostname resolution in FQDNLookup not being complete. Since influxdb is used by collectd and is a controller only service this update moves it to the manifest apply post stage as well and is filtered out from non controller load types. This issue is fixed by the following multi-git changes. stx-metal: Filter influxdb out of storage and compute only loads. No real inter git merge dependency stx-integ: This update. Add startup Before=pmond dependency stx-config: Move collectd config and startup to manifest apply post stage Move influxdb config and startup to manifest apply post stage Test Plan: PASS: Build iso PASS: verify install storage system and collectd startup PASS: Verify Storage system DOR PASS: Verify influxdb and extensions excluded in non-controller loads PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK) PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK) PASS: Verify collectd pmond process monitoring and recovery PASS: Verify influxdb pmond process monitoring and recovery PEND: Verify collectd statistics storage and fetch to/from influxdb PEND: Install AIO DX and verify collectd and influxdb startup Change-Id: I47d70b05bdbdd22f8fce2f56fcc287fac7371ace Closes-Bug: 1797909 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2019-01-02 10:21:06 -05:00
Zuul	204bd9ea0c	Merge "Change compute node to worker node personality"	2018-12-14 22:40:53 +00:00
Eric MacDonald	0ec1725371	Fix collectd Memory plugin Strict Mode learning Existing code sets overcommit strict mode to True if any non-zero value is returned from a read of /proc/sys/vm/overcommit_memory. This is incorrect. Strict mode should only be set when the returned value is 2. Change-Id: I2c5328624571bb3b2f478d5a79615650bb92cbd2 Closes-Bug: 1808225 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-12-13 09:31:03 -05:00
Tao Liu	d4fec24f6c	Change compute node to worker node personality The compute personality & subfunction has been changed to worker, and compute_reserved.conf has been renamed to worker_reserved.conf. Compute configuration flags have been updated to worker flags. This update changes misc dependencies on compute personality, compute_reserved.conf and configuration flag files. It also removed puppet-nova dependencies on compute_reserved.conf. Tests Performed: Non-containerized deployment AIO-SX: Sanity and Nightly automated test suite AIO-DX: Sanity and Nightly automated test suite 2+2 System: Sanity and Nightly automated test suite 2+2 System: Horizon Patch Orchestration Kubernetes deployment: AIO-SX: Create, delete, reboot and rebuild instances 2+2+2 System: worker nodes are unlock enable and no alarms Story: 2004022 Task: 27013 Depends-On: https://review.openstack.org/#/c/624452/ Change-Id: Iccf5584058a2154f1c4ffdb061938e76b9965861 Signed-off-by: Tao Liu <tao.liu@windriver.com>	2018-12-12 15:09:04 -05:00
Eric MacDonald	5142fac498	Make collectd alarm notifier retry alarm clear attempts that fail The Starling-X collectd alarm notification handler Fault Manager (FM) call to clear an alarm can lead to a stuck alarm if that FM request fails, say due to a concurrent swact operation, and the clear is not retried. The alarm will remain stuck until there is another same alarm assertion, followed by deassertion that leads to a successful clear. The fix is to execute a 'return' in the alarm clear failure path so that the alarm notifier's alarm manager control structure is not updated with the clear state so that the clear will be automatically retried on the next audit interval. Change-Id: Iddf4e0e7b99eab0bf0748230a25851419e7c06fa Closes-Bug: 1793314 Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>	2018-09-20 14:21:32 -04:00
melissaml	ff1ba812c0	Remove the duplicated word Change-Id: I68dc653708a33536b69ede4f032457ab951c24dd	2018-08-17 15:34:51 +08:00
Eric MacDonald	279e0d38e9	Recreate /var/run/influxdb dir upon recovery This update fixes an issue where the /var/run/influxdb directory is not being re-created over a DOR because the controller manifest that creates it is not being run in that recovery mode. The fix is to enhance the influxdb service file to ensure this directory is created whenever the service is started. Story: 2002823 Task: 22740 Change-Id: Iecd81969ae1611b963fae5595f60c3eb2d2da851 Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-20 10:53:56 -04:00
Eric MacDonald	892489acd7	Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1 This is the primary update that introduces collectd monitoring and sample storage into the influxdb database. Two new packages are introduced by this update - collectd-extensions package which includes - newly developed collectd platform memory, cpu and filesystem plugins - note that the example, ntpq and interface plugins are not complete and are not enabled by this update. - pmond process monitoring / recovery support for collectd - updated service file for pidfile management ; needed by pmond - influxdb-extensions package which includes - pmond process monitoring / recovery support for influxdb - updated service file for pidfile management ; needed by pmond - log rotate support for influxdb Change-Id: I06511fecb781781ed5491c926ad4b1273a1bc23b Signed-off-by: Jack Ding <jack.ding@windriver.com>	2018-07-03 11:06:24 -04:00
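
Commit 0ec1725371 above states the strict-mode rule for the memory plugin: only a value of 2 read from /proc/sys/vm/overcommit_memory means strict overcommit, and any other value does not. The following is a minimal sketch of that check, not the plugin's actual code. The function names and the choice to treat an unreadable or malformed value as non-strict are assumptions made for illustration.

```python
OVERCOMMIT_PATH = "/proc/sys/vm/overcommit_memory"


def is_strict_overcommit(raw_value):
    """Return True only when the kernel setting is exactly 2."""
    # Hypothetical helper. A plain "non-zero" test would wrongly treat
    # 1 (always overcommit) as strict, which is the bug the commit describes.
    try:
        return int(raw_value.strip()) == 2
    except ValueError:
        # Assumption: a malformed value is treated as non-strict.
        return False


def read_strict_mode(path=OVERCOMMIT_PATH):
    """Read the setting from procfs and apply the rule above."""
    try:
        with open(path) as f:
            return is_strict_overcommit(f.read())
    except OSError:
        # Assumption: an unreadable file is treated as non-strict.
        return False
```

Keeping the comparison in a small pure function makes the rule easy to test without touching procfs.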

36 Commits
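
Commits e8c9676d98 and 4dadf61bea describe two small rules: a sample must be greater than ('gt') a collectd threshold rather than merely equal to it, so a 90% usage action is configured with a threshold of 89, and the interface sample is the percentage of links that are Up. The sketch below illustrates both rules under stated assumptions. All function names are hypothetical, link states are modelled as plain booleans, and reporting 0 for an interface with no links is an assumption, not something the commit messages specify.

```python
def collectd_threshold(action_percent):
    """Threshold to configure so that a whole-number sample equal to
    action_percent crosses it (hypothetical helper)."""
    return action_percent - 1


def crosses(sample, threshold):
    """collectd-style test as described in the commits: strictly greater."""
    return sample > threshold


def links_up_percent(link_states):
    """Percent of links that are Up; link_states is a list of booleans."""
    if not link_states:
        # Assumption: an interface with no known links reports 0.
        return 0
    return 100 * sum(1 for up in link_states if up) // len(link_states)
```

For example, `crosses(90, collectd_threshold(90))` is true while `crosses(89, collectd_threshold(90))` is false, and a lag pair with one link down, `links_up_percent([True, False])`, gives 50. The minus-one adjustment matches a 'ge' test only for whole-number samples; a fractional sample such as 89.5 would also cross a threshold of 89.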