36 Commits

Author SHA1 Message Date
Scott Little 3637d66ae4 Relocated some packages to repo 'monitoring'
List of relocated subdirectories:

monitoring/collectd-extensions
monitoring/influxdb-extensions
tools/monitor-tools
tools/vm-topology

Story: 2006166
Task: 35687
Depends-On: I6c62895f8dda5b8dc4ff56680c73c49f3f3d7935
Depends-On: I665dc7fabbfffc798ad57843eb74dca16e7647a3
Change-Id: Iffacd50340005320540cd9ba1495cde0b2231cd0
Signed-off-by: Scott Little <scott.little@windriver.com>
Depends-On: I14e631137ff5658a54d62ad3d7aa2cd0ffaba6e0
2019-09-05 20:31:52 -04:00
Bin Qian 6218755c3d Soften NTP alarm language for syncing with peer
In an IPv6 setup, the NTP refid is a hash of the reference's IPv6 address.
In such a case, do not try to interpret the refid to tell whether the peer
has a reliable source.

When the NTP service uses peer controller as reference, the alarm
is a reminder to the admin user instead of reporting an issue. This
is a minor alarm.

Closes-Bug: 1834071
Change-Id: Ia2770ba7ed77640e58e8c35254a504b57487ff8f
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-08-16 11:34:14 -04:00
Zuul 3876f288e1 Merge "Use ntpq refid to tell if peer controller reaches reliable time source" 2019-08-09 19:35:27 +00:00
Zuul 9e05349216 Merge "Collapse the glance filesystem into platform" 2019-08-09 18:43:46 +00:00
Bin Qian e5bf093cc8 Use ntpq refid to tell if peer controller reaches reliable time source
This is a partial fix only for ipv4.

ntpq.py verifies that a valid source is the reference of the
peer controller when the peer controller is selected as the
time server.
This change avoids raising a false alarm when a
controller uses the peer controller as its time server while
the peer uses a reliable time source (e.g., an external time
server or an accurate time device).

Partial-Bug: 1834071

Change-Id: I9140e14b79cb09088c8061a06fae22df97526a70
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2019-08-09 16:00:58 +00:00
Kristine Bujold daf33b135d Collapse the glance filesystem into platform
The filesystem /opt/cgcs is removed and its content moved under
/opt/platform.

Resources related to drbd-cgcs and /opt/cgcs are updated to
drbd-platform and /opt/platform.

Tested in AIO-SX, AIO-DX and Standard hardware labs.

Depends-On: https://review.opendev.org/674360
Partial-Bug: 1830142

Change-Id: I6d0555f00ab269f7d9567fff365180b66adce8b3
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
2019-08-08 10:54:11 -04:00
Alex Kozyrev d11ccac73f Alarm is not raised in case PTP GM is lost
"No lock" PTP alarm is raised only when GM is and was not present
in a network. The current logic raises this alarm only when the MAC
address of the GM is the same as the local MAC address, but that is only
the case when no external GM has ever appeared in the PTP setup.
When a GM was present in the network and then lost, we need to check the
port status instead: the PTP MAC address still points to the external GM,
but the port status has changed from SLAVE to LISTENING.

Change-Id: I30365685e6f44566702cc82534ab6ebf0613a731
Closes-bug: 1836884
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
2019-08-06 16:03:01 -04:00
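Note on d11ccac73f: the two-case decision described above can be sketched as follows. This is a minimal illustration only, not the plugin's code; the function name and its three parameters are hypothetical, and how ptp.py actually obtains the GM MAC and port state is not shown in the commit message.

```python
def ptp_no_lock(gm_mac, local_mac, port_state):
    """Return True when the host should be treated as having no PTP lock.

    All three parameters are hypothetical inputs for this sketch.
    """
    # Case 1: no external GM ever appeared, so the GM MAC is our own.
    if gm_mac.lower() == local_mac.lower():
        return True
    # Case 2: an external GM was seen and then lost. The GM MAC still
    # points at it, but the port has dropped from SLAVE to LISTENING.
    return port_state.upper() == "LISTENING"
```

Checking the MAC first keeps the original behavior intact; the port-state check only adds the "GM was present and then lost" case.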
Sun Austin 04df91698f collectd: list of fs to monitor not up to date
The df monitor list in collectd is updated as follows.
Adding:
/var/lib/docker
/var/lib/docker-distribution
/var/lib/kubelet
/var/lib/nova/instances

Removing:
/etc/nova/instances

Closes-Bug: 1837103

Change-Id: I9f0f4bf27968e0e1b85a0b5b314ab9c3c15fec2d
Signed-off-by: Sun Austin <austin.sun@intel.com>
2019-07-19 14:49:21 +08:00
Eric MacDonald 2c47014484 Set PTP Monitor period to 5 minutes
The collectd PTP monitor plugin development left the
audit period at 1 minute.

The PTP monitoring design called for a 5 minute
audit period.

Change-Id: I7eb6af6f88934028e2fb91c9655ee2db1143dc64
Closes-Bug: 1836392
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-07-17 06:37:42 -04:00
Eric Barrett 97f7cb4a58 Align PKG-INFO for Collectd & Influxdb Extensions
Update the license section to ASL 2.0
Closes-Bug: 1836227

Change-Id: Ib84ad47af2d4e8ff02a5801dfde16fc6a440c63d
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
2019-07-15 16:53:36 -04:00
Al Bailey 40e57e46a5 Fix the runtime requirements for collectd-extensions
The python code in the collectd-extensions requires several
python modules in order to run, but the package is missing
explicit dependencies on those modules.

These include:
  fm-api
  httplib2
  influxdb
  oslo-concurrency
  tsconfig

Change-Id: I9ace889fdb7fac031792486c3e5ddf3bc2cae770
Story: 2004764
Task: 33630
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
2019-06-07 14:47:32 -05:00
Eric MacDonald 904da6755d Reduce the collectd samples retention period
Collectd creates a samples database within the
InfluxDB database which is stored in the rootfs.

The current 4 week retention period is too long
for larger systems and could lead to the rootfs
filling up.

This update reduces that retention period to 1 week
to protect the rootfs from being filled up with
sample data until the samples database is moved
to a more appropriate location.

Change-Id: Ic59712849fa228f19d15919594d23edc43109a0b
Closes-Bug: 1827301
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-09 15:46:58 -04:00
Erich Cordoba 2e293e0894 Change license on collectd and influxdb extensions to Apache 2.0
Story: 2005542
Task: 30714

Change-Id: If1573deac5f1a0f0cfe8bb8c5fba5b4bb2e1a7c3
Signed-off-by: Erich Cordoba <erich.cordoba.malibran@intel.com>
2019-05-02 17:31:13 -05:00
Eric MacDonald 8841bceb80 Make collectd plugins use FM API V2
Using FM API V2 allows the collectd plugins to distinguish
between FM connection failures and alarm queries that find no
existing alarm on process startup, as well as failures to clear
or assert alarms during runtime, so that such actions can be
retried on the next audit interval.

This makes the plugins more robust in their alarm
management and avoids leaving stuck alarms, which fixes
the following three reported stuck alarm bugs.

Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1802535
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1813974
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1814944

Additional improvements were made to each plugin to handle
failure paths better with the V2 API.

Additional changes made by this update include:

1. fixed stale unmounted filesystems alarm handling
2. percent usage alarm actual readings are updated on change
3. fix of threshold values
4. add 2 decimal point resolution to % usage alarm text
5. added commented FIT code to mem, cpu and df plugins
6. reversed True/False return polarity in interface plugin functions

Test Plan:

Regression:
PASS: normal alarm handling with FM V2 API ; process startup
PASS: normal alarm handling with FM V2 API ; runtime alarm assert
PASS: normal alarm handling with FM V2 API ; runtime alarm clear

PASS: Verify alarms of unmounted fs get automatically cleared
PASS: Verify interface alarm/clear operation

Robustness:
PASS: Verify general startup behavior of all plugins while FM
      is not running only to see it start at some later time.
PASS: Verify alarm handling over process startup with existing
      cpu alarms while FM not running.
PASS: Verify alarm handling over process startup with existing
      mem alarms while FM not running.
PASS: Verify alarm handling over process startup with existing
      df alarms while FM not running.

PASS: Verify runtime cpu plugin alarm assertion retry handling
PASS: Verify runtime cpu plugin alarm clear retry handling
PASS: Verify runtime cpu plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
      cpu alarms while FM initially not running and then started.

PASS: Verify runtime mem plugin alarm assertion retry handling
PASS: Verify runtime mem plugin alarm clear retry handling
PASS: Verify runtime mem plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
      mem alarms while FM initially not running and then started.

PASS: Verify runtime df plugin alarm assertion retry handling
PASS: Verify runtime df plugin alarm clear retry handling
PASS: Verify runtime df plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
      df alarms while FM initially not running and then started.

PASS: Verify alarm set/clear threshold boundaries for cpu plugin
PASS: Verify alarm set/clear threshold boundaries for memory plugin
PASS: Verify alarm set/clear threshold boundaries for df plugin

New Features: ... threshold exceeded ; threshold 80.00%, actual 80.33%
PASS: Verify percent usage alarms are refreshed with current value
PASS: Verify percent usage alarms show two decimal points

Change-Id: Ibe173617d11c17bdc4b41115e25bd8c18b49807e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-05-01 12:13:27 -04:00
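Note on 8841bceb80: items 2 and 4 in the list of additional changes (readings updated on change, two decimal points in the alarm text) can be sketched as below. The function names are hypothetical; only the text layout is taken from the "New Features" line in the test plan, and comparing the formatted values is an assumption about what "on change" means.

```python
def usage_alarm_text(threshold, actual):
    """Format the percent-usage alarm text with two decimal places."""
    return "threshold exceeded ; threshold {:.2f}%, actual {:.2f}%".format(
        threshold, actual)


def reading_changed(previous, current):
    """Return True when the displayed (two-decimal) reading differs.

    Assumption: the alarm only needs a refresh when the text would change.
    """
    return "{:.2f}".format(previous) != "{:.2f}".format(current)
```

With these helpers, a reading that moves from 80.331 to 80.334 would not trigger a refresh, while a move from 80.33 to 80.34 would.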
Zuul e263b4c484 Merge "Remove references to infra in collect" 2019-04-24 14:22:51 +00:00
Eric Barrett cb0b2ffe1e Enable Flake8 Docstring Errors
Flake8 currently ignores the following errors:
H401: docstring should not start with a space
H404: multi line docstring should start without a leading new line
H405: multi line docstring summary not separated with an empty line
Enable them for more consistent formatting of docstrings

Change-Id: I385e28e9c6eca3c02a3def51ff64b00b7a63a853
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
2019-04-18 11:50:45 -04:00
Teresa Ho 1266d7bf4e Remove references to infra in collect
Updated the host interface monitor plugin for collectd.
Replaced the reference of infra interface with
cluster-host interface.

Story: 2004273
Task: 30518
Depends-On: https://review.openstack.org/#/c/652713/

Change-Id: I33ca3c6a970e45bd278cd399149cdb2c985def7d
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2019-04-16 09:52:33 -04:00
Eric Barrett 27133180fe Enable Flake8 300 Series Errors
Flake8 currently ignores the following errors:
E302: expected 2 blank lines, found 1
E303: too many blank lines
E305: expected 2 blank lines after class or function definition, found 1

Change-Id: Idfb00e530967f1a345bc2e263ad77597f83cc5d3
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
2019-04-03 10:40:01 -04:00
Zuul be64ee567c Merge "Add PTP monitoring to collectd" 2019-04-02 15:55:50 +00:00
Eric MacDonald b4a23c57aa Add PTP monitoring to collectd
This update adds Precision Time Protocol (PTP) monitoring
to the current list of in-house developed collectd plugins.

Refer to the ptp.py header for a description of the monitoring
service algorithm and inline comments for detailed behavior.

Test Plan:

Usability:
----------
PASS: Verify monitoring behavior around ptp service enable and disable
PASS: Verify ptp monitoring behavior over lock and unlock
PASS: Verify behavior with bonded interfaces (skew oot alarm)
PASS: Verify no-lock hosts lock to remote grandmaster when available
PASS: Verify AIO SX PTP Enable over Lock/Unlock

System Level:
-------------
PASS: Verify large system install
PASS: Verify AIO SX system install

Host Level:
-----------
PASS: Verify controller monitoring
PASS: Verify worker monitoring
PASS: Verify storage monitoring
PASS: Verify worker/storage behavior when the only controller is rebooted.
PASS: Verify startup handling of fm calls while fm is not running
PASS: Verify runtime handling of fm calls while fm is not running

Config Level:
-------------
PASS: Verify PTP Enable and auto start monitoring
PASS: Verify PTP Disable and auto stop monitoring
PASS: Verify audit interval is every 60 seconds
PASS: Verify hardware timestamp monitoring
PASS: Verify software timestamp monitoring
PASS: verify legacy   timestamp monitoring
PASS: Verify hardware to software config change
PASS: Verify software to legacy   config change
PASS: Verify   legacy to hardware config change
PASS: Verify software to hardware config change

Alarm Management:
-----------------
PASS: Verify end-to-end handling of 'nolock' alarm management
PASS: Verify end-to-end handling of 'out-of-tolerance' alarm management
PASS: Verify end-to-end handling of 'process' alarm management
PASS: Verify end-to-end handling of 'unsupported mode' alarm management
PASS: Verify all ptp alarms get cleared on collectd process start
PASS: Verify plugin startup behavior when FM is not running
PASS: Verify plugin with FM V2 API
PASS: Verify thresholded out-of-tolerance alarm handling
PASS: Verify plugin logging is value added
PASS: Verify alarm assert debounce of 2
PASS: Verify alarm clear with no debounce
PASS: Verify only major out-of-tolerance alarm for software mode
PASS: Verify only major out-of-tolerance alarm for legacy mode
PASS: Verify minor/major out-of-tolerance alarm for hardware mode
PASS: Verify no-lock alarm if compute GM ID is the same as its own
PASS: Verify no-lock alarm is not raised on GM reboot
PASS: Verify GM switches to alternate when GM host is rebooted

Change-Id: If36aece94dd5511bf9deba0753f3863237e2a7fe
Story: 2002823
Task: 29492
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-04-01 09:25:49 -04:00
Eric Barrett 11cc2a21bb Enable Flake8 Whitespace Errors
Flake8 currently ignores a number of whitespace related errors:
E201: whitespace after '['
E202: whitespace before '}'
E203: whitespace before ':'
E211: whitespace before '('
E221: multiple spaces before operator
E222: multiple spaces after operator
E225: missing whitespace around operator
E226: missing whitespace around arithmetic operator
E231: missing whitespace after ','
E251: unexpected spaces around keyword / parameter equals
E261: at least two spaces before inline comment
Enable them for more thorough testing of code

Change-Id: Id03f36070b8f16694a12f4d36858680b6e00d530
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
2019-03-26 15:02:53 -04:00
Eric MacDonald edf29d1b7a Add Remote Logging Server connectivity monitoring to collectd
This update adds titled support to the StarlingX set
of collectd monitoring plugins.

This update excludes monitoring of IPV6 remote logging servers.
Only IPV4 remote logging servers are supported.

Story: 2002823
Task: 28636

Test Plan:
PASS: Verify monitoring on controller nodes only
PASS: Verify system install
PASS: Verify plugin logging is value added
PASS: Verify connectivity failure to success handling
PASS: Verify connectivity success to failure handling
PASS: Verify connected / not connected logging on service state change
PASS: Verify connected / not connected logging on connectivity state change
PASS: Verify service enabled to disabled state transition with alarm asserted
PASS: Verify service enabled to disabled state transition while connected
PASS: Verify service disabled to enabled state transition with connectivity
PASS: Verify service disabled to enabled state transition without connectivity
PASS: Verify plugin audit interval is every 60 seconds
PASS: Verify plugin alarm assert debounce of 2
PASS: Verify plugin alarm clear with no debounce
PASS: Verify plugin alarm assert over process start on TCP conn failure
PASS: Verify plugin alarm severity as Minor
PASS: Verify plugin alarm clear over process restart
PASS: Verify plugin alarm is cleared on service disable transition
PASS: Verify plugin sample data

Change-Id: I73cd35170ed19abce17bb4f511f0c5e04bc101c6
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-03-13 21:35:34 -04:00
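Note on edf29d1b7a: the test plan above mentions an alarm assert debounce of 2 and a clear with no debounce. A minimal sketch of that behavior follows. The class name is hypothetical, and reading "debounce of 2" as "two consecutive failed audits" is an assumption, not something the commit message states.

```python
class DebouncedAlarm:
    """Assert after N consecutive failed audits; clear on the first success."""

    def __init__(self, assert_debounce=2):
        self.assert_debounce = assert_debounce
        self.failures = 0      # consecutive failed audits seen so far
        self.asserted = False  # current alarm state

    def audit(self, connected):
        """Feed one audit result and return the resulting alarm state."""
        if connected:
            # No debounce on clear: one good audit resets everything.
            self.failures = 0
            self.asserted = False
        else:
            self.failures += 1
            if self.failures >= self.assert_debounce:
                self.asserted = True
        return self.asserted
```

A single failed audit followed by a success never raises the alarm, which is the point of debouncing the assert path only.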
Zuul 5246aa4f28 Merge "Add numa node and huge page memory monitoring" 2019-02-13 21:41:36 +00:00
Eric MacDonald 4dadf61bea Add numa node and huge page memory monitoring
This update adds titled support to the existing
Platform Memory monitor collectd plugin.

Instance Mapping

Plugin Refinements                      Instance Name
-------------------------------------   ----------
Platform Memory                         platform
Platform Memory Numa Node 0             node0
Platform Memory Numa Node 1             node1
Platform Memory Numa Node 0 Huge Pages  node0_hugepages
Platform Memory Numa Node 1 Huge Pages  node1_hugepages

New Alarm Entity IDs added to existing 100.103 alarm ID

host=<hostname>.numa=node0
host=<hostname>.numa=node1
host=<hostname>.numa=node0_hugepages
host=<hostname>.numa=node1_hugepages

Modified memory plugin thresholds and added alarm notifier
to support collectd requiring samples to be 'gt' rather
than 'ge' the specified thresholds for a severity change.

This update also corrects a few subtle pep8 warnings in
a few of the existing python plugins.

There is no need for an rmond update because numa and
huge page monitoring was never enabled in rmond.

Story: 2002823
Task: 29369

PASS: Verify logging of all memory instance types
PASS: Verify monitoring of new numa node memory
PASS: Verify monitoring of new numa node huge page memory
PASS: Verify memory instance alarm handling in fm notifier
PASS: Verify memory instance alarm load on startup
PASS: Verify memory instance alarm clear ; runtime condition gone
PASS: Verify memory instance alarm clear ; startup condition gone

Regression:
PASS: Verify End-To-End Sample Collection for all monitored resources.
Corner Case:
PASS: Verify alarm reporting with threshold of zero
PROG: Verify memory alarm raised at threshold value
PASS: Verify memory alarm cleared 1 below threshold value
PASS: Verify above case for both major and critical thresholds

Change-Id: I4e2612ac7b3d906be4b0a140286dbbb095ce7e1b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-13 10:30:42 -05:00
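Note on 4dadf61bea: the 'gt' versus 'ge' distinction mentioned above is the reason a sample sitting exactly on a threshold does not change severity. The sketch below illustrates the difference; the function names and the major/critical values in the usage note are hypothetical and not taken from the plugin.

```python
def exceeds_ge(sample, threshold):
    """'ge' semantics: a sample equal to the threshold counts."""
    return sample >= threshold


def exceeds_gt(sample, threshold):
    """'gt' semantics: the sample must cross the threshold."""
    return sample > threshold


def severity_gt(sample, major, critical):
    """Pick a severity using 'gt' comparisons (hypothetical helper)."""
    if exceeds_gt(sample, critical):
        return "critical"
    if exceeds_gt(sample, major):
        return "major"
    return None
```

For example, with a major threshold of 80 and a critical threshold of 90, a sample of exactly 80 yields no alarm under 'gt' semantics, while 80.01 yields major.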
Eric MacDonald e8c9676d98 Add network interface monitoring plugin to collectd
This update introduces interface monitoring for oam,
mgmt and infra networks as a collectd plugin.

The interface plugin runs and queries the new maintenance
Link Monitor daemon for Link Model and Information every
10 seconds.

The plugin then manages alarms based on the link model similar
to how rmon did in the past ; port and interface alarms.

Severity: Interface and Port levels

Alarm Level  Minor        Major              Critical
-----------  -----  ---------------------    ----------------------------
Interface     N/A   One of lag pair is Up    All Interface ports are Down
     Port     N/A   Physical Link is Down    N/A

Degrade support for interface monitoring is added to the mtce
degrade notifier. Any link down condition results in a host
degrade condition, as it did in rmon.

Sample Data: represented as % of total links Up for that network interface
100 or 100% percent used - all links of interface are up.
 50 or  50% percent used - one of lag pair is Up and the other is Down
  0 or   0% percent used - all ports for that network are Down

The plugin documents all of this in its header.

This update also

1. Adds the new lmond process to syslog-ng config file.
2. Adds the new lmond process to the mtce patch script.
3. Modifies the cpu, df and memory threshold settings by -1.
   rmon thresholds were precise whereas collectd requires
   that the samples cross the thresholds, not just meet them.
   So for example, in terms of a 90% usage action the
   threshold needs to be 89.

Test Plan: (WIP but almost complete)

PASS: Verify interface plugin startup
PASS: Verify interface plugin logging
PASS: Verify interface plugin Link Status Query and response handling
PASS: Verify monitor, sample storage and grafana display
PASS: verify port and interface alarm matches what rmon produced
PASS: Verify lmon port config from manifest configured plugin
PASS: Verify lmon port config from lmon.conf
PASS: Verify single interface failure handling and recovery
PASS: Verify lagged interface failure handling and recovery
PASS: Verify link loss of lagged interface shared between mgmt and oam (hp380)
PASS: Verify network interface failure handling ; single port
PASS: Verify network interface degrade handling ; lagged interface
PEND: Verify network interface degrade handling ; vlan interface
PASS: Verify HTTP request timeout period and handling
PASS: Verify link status query failure handling - invalid uri (timeout)
PASS: Verify link status query failure handling - missing uri (timeout)
PASS: Verify link status query failure handling - status fail
PASS: Verify link status query failure handling - bad json resp

Change-Id: I2e2dfe6ddfa06a46770245540c7153d330bdf196
Story: 2002823
Task: 28635
Depends-On: https://review.openstack.org/#/c/633264
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-06 14:18:14 -05:00
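Note on e8c9676d98: the sample data and interface-level severity table above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical and each interface is modeled simply as a list of per-port up/down booleans, which is not necessarily how the plugin represents the link model.

```python
def link_sample(ports_up):
    """Return the percentage of an interface's ports that are up."""
    if not ports_up:
        raise ValueError("interface has no ports")
    return 100.0 * sum(1 for up in ports_up if up) / len(ports_up)


def interface_severity(ports_up):
    """Map port states to the interface alarm level from the table."""
    up_count = sum(1 for up in ports_up if up)
    if up_count == 0:
        return "critical"   # all interface ports are down
    if up_count < len(ports_up):
        return "major"      # e.g. only one of a lag pair is up
    return None             # all ports up, no interface alarm
```

A lag pair with one port down gives a sample of 50.0 and a major alarm; both ports down gives 0.0 and a critical alarm.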
Eric MacDonald fab989b5bc Add on-demand instance support to collectd alarm manager plugin
Many plugins need support for on-demand instance sampling
and alarming. The filesystem and memory monitoring plugins
are perfect examples. The number of numa nodes or monitored
file systems vary from host to host.

This update adds on-demand instance support. Any plugin
can now support multiple instances. As new plugin
instances are learned, memory is allocated for them
and linked to that plugin's base object and managed as a
separate instance but within the scope of its parent.

The following additional enhancements were made to the common
alarm and degrade plugins.

1. added /opt/etcd as a new monitored filesystem.
2. added common support for vswitch alarm/degrade handling.
3. a few general cleanup changes for code maintainability.

Change-Id: I05b4de78f30fc27362c63b6dbfc97268d6588e4f
Story: 2002823
Task: 29297
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-02-05 12:39:11 -05:00
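Note on fab989b5bc: the on-demand instance idea described above (instances created when first seen and kept under their parent plugin object) can be sketched like this. The class and attribute names are hypothetical and do not reflect the actual alarm manager plugin.

```python
class PluginInstance:
    """Per-instance state, e.g. one filesystem or one numa node."""

    def __init__(self, name):
        self.name = name
        self.alarmed = False
        self.last_value = None


class PluginBase:
    """Parent plugin object that owns its instances."""

    def __init__(self, name):
        self.name = name
        self.instances = {}  # instance name -> PluginInstance

    def get_instance(self, instance_name):
        """Return the named instance, creating it the first time it is seen."""
        inst = self.instances.get(instance_name)
        if inst is None:
            inst = PluginInstance(instance_name)
            self.instances[instance_name] = inst
        return inst
```

Because instances are created lazily, a host with two monitored filesystems and a host with ten can share the same plugin code without any per-host instance list.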
Eric MacDonald abaff6b275 Remove alarm query before clear in NTP plugin
Issue titled 'NTP 100.14 alarm is not cleared' exposed
an issue where the NTP plugin alarm clear operation is
circumvented when its precursor fm_api.get_fault call
returns None if the fm process is not running.
From the caller's point of view the None return suggests
that the alarm to be cleared does not exist so the code
skips the call to clear.

This update works around this by simply issuing the
clear without the query.

Change-Id: Idcc05bb0e7e1aa1082af1e8ecdcb1a5463b19440
Closes-Bug: 1812440
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-18 16:32:53 -05:00
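Note on abaff6b275: the ambiguity described above, where a None return can mean either "no such alarm" or "FM is not running", can be reproduced with a small stand-in. StubFm below is a hypothetical stub written for this sketch; it is not the real fm_api and its return values are assumptions.

```python
class StubFm:
    """Hypothetical stand-in for the FM client; not the real fm_api."""

    def __init__(self, running, alarms):
        self.running = running
        self.alarms = set(alarms)

    def get_fault(self, alarm_id):
        if not self.running:
            return None  # same answer as "alarm does not exist"
        return alarm_id if alarm_id in self.alarms else None

    def clear_fault(self, alarm_id):
        if not self.running:
            return False  # the failure is visible to the caller
        self.alarms.discard(alarm_id)
        return True


def clear_with_query(fm, alarm_id):
    """Old flow: query first, skip the clear when the query returns None."""
    if fm.get_fault(alarm_id) is None:
        return True  # wrongly treated as "nothing to clear"
    return fm.clear_fault(alarm_id)


def clear_without_query(fm, alarm_id):
    """New flow: issue the clear directly so a failure can be seen."""
    return fm.clear_fault(alarm_id)
```

With FM stopped, the query-first flow reports success while the alarm is still present; the direct clear reports the failure instead.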
Eric MacDonald 4d7c958711 Add NTP server monitoring as a collectd plugin
This update replaces the currently existing but disabled
ntpq.py plugin with one that does not rely on an external
query_ntp_servers.sh.

This new ntpq.py is an entirely new self-contained
implementation of what rmon and query_ntp_servers.sh
were doing but now more efficiently all in one python
plugin file.

Story: 2002823
Task: 22859

Test Plan:
PASS: Verify handling of one and two unreachable NTP servers.
PASS: verify handling of pingable but not an NTP server.
PASS: Verify NTP server re-provisioning from unreachable to reachable server.
PASS: Verify NTP server re-provisioning from reachable to unreachable server.
PASS: Verify NTP server alarms suppressed while controller is locked.
PASS: Verify NTP asserted alarms show up on unlock until cleared.
PASS: Verify NTP server monitoring occurs on controller only.
PASS: Verify NTP unreachable server alarms are cleared over a collectd restart
PASS: Verify NTP minor IP alarms are cleared on process startup
PASS: Verify NTP minor IP alarm clear retries when FM call fails on process startup.
PASS: Verify NTP alarm assertion retry while FM call fails at runtime.
PASS: Verify NTP alarm clear retry while FM call fails at runtime.
PASS: Verify NTP monitoring after controller Swact.
PASS: Verify NTP monitoring cadence is every 10 minutes.
PASS: Verify NTP plugin logs are useful and assist debug without flooding.

Change-Id: I67c4c5518a6e5dec64b4e419ab7ee2ffcefb9bf3
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-09 12:47:00 -05:00
Eric MacDonald c8f39de9a0 Implement collectd startup in manifest apply post stage
Starting collectd too early in the manifest apply is seen
to occasionally fail because the hostname resolution
configuration that FQDNLookup depends on is not yet complete.

Since influxdb is used by collectd and is a controller-only
service, this update moves it to the manifest apply
post stage as well and filters it out of non-controller
load types.

This issue is fixed by the following multi-git changes.

stx-metal:
   Filter influxdb out of storage and compute only loads.
   No real inter git merge dependency

stx-integ: This update.
   Add startup Before=pmond dependency

stx-config:
   Move collectd config and startup to manifest apply post stage
   Move influxdb config and startup to manifest apply post stage

Test Plan:
PASS: Build iso
PASS: verify install storage system and collectd startup
PASS: Verify Storage system DOR
PASS: Verify influxdb and extensions excluded in non-controller loads
PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
PASS: Verify collectd pmond process monitoring and recovery
PASS: Verify influxdb pmond process monitoring and recovery

PEND: Verify collectd statistics storage and fetch to/from influxdb
PEND: Install AIO DX and verify collectd and influxdb startup

Change-Id: I47d70b05bdbdd22f8fce2f56fcc287fac7371ace
Closes-Bug: 1797909
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2019-01-02 10:21:06 -05:00
Zuul 204bd9ea0c Merge "Change compute node to worker node personality" 2018-12-14 22:40:53 +00:00
Eric MacDonald 0ec1725371 Fix collectd Memory plugin Strict Mode learning
Existing code sets overcommit strict mode to True
if any non-zero value is returned from a read
of /proc/sys/vm/overcommit_memory.

This is incorrect.

Strict mode should only be set when the returned
value is 2.

Change-Id: I2c5328624571bb3b2f478d5a79615650bb92cbd2
Closes-Bug: 1808225
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-12-13 09:31:03 -05:00
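Note on 0ec1725371: the corrected check can be sketched as below. The function names are hypothetical; the only facts relied on are the /proc path given in the commit message and that the value 2 selects strict overcommit accounting on Linux (0 is heuristic, 1 is always overcommit).

```python
def is_strict_overcommit(raw):
    """Return True only when the overcommit_memory value is 2."""
    # The buggy version treated any non-zero value as strict,
    # which wrongly included mode 1 (always overcommit).
    return int(raw.strip()) == 2


def read_strict_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the kernel setting and report whether strict mode is on."""
    with open(path) as f:
        return is_strict_overcommit(f.read())
```

Separating the parsing from the file read keeps the comparison testable on systems where the /proc file is not available.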
Tao Liu d4fec24f6c Change compute node to worker node personality
The compute personality & subfunction has been changed to
worker, and compute_reserved.conf has been renamed to
worker_reserved.conf. Compute configuration flags have
been updated to worker flags.

This update changes misc dependencies to compute
personality, compute_reserved.conf and configuration
flag files.

It also removes puppet-nova dependencies on
compute_reserved.conf.

Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration

Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked and enabled with no alarms

Story: 2004022
Task: 27013

Depends-On: https://review.openstack.org/#/c/624452/

Change-Id: Iccf5584058a2154f1c4ffdb061938e76b9965861
Signed-off-by: Tao Liu <tao.liu@windriver.com>
2018-12-12 15:09:04 -05:00
Eric MacDonald 5142fac498 Make collectd alarm notifier retry alarm clear attempts that fail
The Fault Manager (FM) call that the StarlingX collectd alarm notification
handler makes to clear an alarm can lead to a stuck alarm if that FM request
fails, say due to a concurrent swact operation, and the clear is not
retried.

The alarm will remain stuck until there is another assertion of the same
alarm, followed by a deassertion that leads to a successful clear.

The fix is to execute a 'return' in the alarm clear failure path so
that the alarm notifier's alarm manager control structure is not
updated with the clear state so that the clear will be automatically
retried on the next audit interval.

Change-Id: Iddf4e0e7b99eab0bf0748230a25851419e7c06fa
Closes-Bug: 1793314
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2018-09-20 14:21:32 -04:00
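Note on 5142fac498: the early-return fix described above can be sketched as follows. The names AlarmState and audit_clear are hypothetical, and clear_fn stands in for whatever FM call the notifier actually makes; the sketch only shows why returning before the state update causes a retry on the next audit.

```python
class AlarmState:
    """Hypothetical control structure tracking one alarm."""

    def __init__(self):
        self.asserted = True


def audit_clear(state, clear_fn):
    """Try to clear the alarm; keep it marked asserted if the clear fails."""
    if not state.asserted:
        return
    if not clear_fn():
        # Return before touching the state so the next audit
        # still sees the alarm as asserted and retries the clear.
        return
    state.asserted = False
```

Calling audit_clear once per audit interval is enough: a failed FM request leaves the state untouched, so no separate retry queue is needed.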
melissaml ff1ba812c0 Remove the duplicated word
Change-Id: I68dc653708a33536b69ede4f032457ab951c24dd
2018-08-17 15:34:51 +08:00
Eric MacDonald 279e0d38e9 Recreate /var/run/influxdb dir upon recovery
This update fixes an issue where the /var/run/influxdb directory
is not being re-created over a DOR because the controller manifest that
creates it is not being run in that recovery mode.

The fix is to enhance the influxdb service file to ensure this directory
is created whenever the service is started.

Story: 2002823
Task: 22740

Change-Id: Iecd81969ae1611b963fae5595f60c3eb2d2da851
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-20 10:53:56 -04:00
Eric MacDonald 892489acd7 Collectd+InfluxDb-RMON Replacement(ALL METRICS) P1
This is the primary update that introduces collectd monitoring and
sample storage into the influxdb database.
Two new packages are introduced by this update
 - collectd-extensions package which includes
   - newly developed collectd platform memory, cpu and filesystem
     plugins
     - note that the example, ntpq and interface plugins are not
       complete and are not enabled by this update.
   - pmond process monitoring / recovery support for collectd
   - updated service file for pidfile management ; needed by pmond
 - influxdb-extensions package which includes
   - pmond process monitoring / recovery support for influxdb
   - updated service file for pidfile management ; needed by pmond
   - log rotate support for influxdb

Change-Id: I06511fecb781781ed5491c926ad4b1273a1bc23b
Signed-off-by: Jack Ding <jack.ding@windriver.com>
2018-07-03 11:06:24 -04:00