In an IPv6 setup, the NTP refid is a hash of the reference's IPv6 address.
In that case, do not try to interpret the refid to tell whether the peer
has a reliable source.
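As a rough illustration of why the refid cannot be read back as an address
under IPv6: RFC 5905 defines the refid of an IPv6 reference as the first
four octets of the MD5 hash of the address. The helper name below is
hypothetical and is not taken from ntpq.py.

```python
import hashlib
import ipaddress


def ipv6_refid(address):
    """Return the dotted-quad refid derived from an IPv6 address.

    Hypothetical helper for illustration only; not part of ntpq.py.
    """
    # Hash the 16 raw address bytes and keep the first four octets.
    packed = ipaddress.IPv6Address(address).packed
    digest = hashlib.md5(packed).digest()
    return ".".join(str(octet) for octet in digest[:4])


# The result looks like an IPv4 address but carries no routing meaning.
print(ipv6_refid("fd00::1"))
```

Because the value is a truncated hash, it cannot be mapped back to the
reference's address, which is why the refid is not interpreted here.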
When the NTP service uses the peer controller as its reference, the alarm
is a reminder to the admin user rather than a report of an issue. It is
raised as a minor alarm.
Closes-Bug: 1834071
Change-Id: Ia2770ba7ed77640e58e8c35254a504b57487ff8f
Signed-off-by: Bin Qian <bin.qian@windriver.com>
This is a partial fix, for IPv4 only.
ntpq.py now verifies that the peer controller's reference is a
valid source when the peer controller is selected as the
time server.
This change avoids raising a false alarm when a
controller uses the peer controller as its time server while
the peer uses a reliable time source (e.g., an external time
server or an accurate time device).
Partial-Bug: 1834071
Change-Id: I9140e14b79cb09088c8061a06fae22df97526a70
Signed-off-by: Bin Qian <bin.qian@windriver.com>
The /opt/cgcs filesystem is removed and its content is moved under
/opt/platform.
Resources related to drbd-cgcs and /opt/cgcs are updated to
drbd-platform and /opt/platform.
Tested in AIO-SX, AIO-DX and Standard hardware labs.
Depends-On: https://review.opendev.org/674360
Partial-Bug: 1830142
Change-Id: I6d0555f00ab269f7d9567fff365180b66adce8b3
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
"No lock" PTP alarm is raised only when GM is and was not present
in a network. The current logic raises this alarm only when the MAC
address of the GM is the same as the local MAC address, which is
the case only when no external GM has ever appeared in the PTP setup.
When a GM was present in the network and is then lost, the port status
must be checked instead: the PTP MAC address still points to the
external GM, but the port status changes from SLAVE to LISTENING.
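The two conditions described above can be sketched as follows. This is
a minimal sketch with hypothetical names, not the code in ptp.py, and
treating every non-SLAVE port state as "not locked" is an assumption.

```python
def ptp_nolock(gm_identity, local_identity, port_state):
    """Return True when the host should be treated as not locked to a GM.

    All parameter names are hypothetical; the real logic lives in ptp.py.
    """
    # Case 1: no external GM ever appeared, so the host lists itself as GM.
    if gm_identity == local_identity:
        return True
    # Case 2: a GM was seen and then lost. The GM identity still points
    # at the external GM, so only the port state reveals the loss.
    # Assumption: any state other than SLAVE counts as not locked.
    return port_state != "SLAVE"


print(ptp_nolock("aa:bb", "aa:bb", "MASTER"))     # self is GM
print(ptp_nolock("aa:bb", "cc:dd", "LISTENING"))  # GM lost
print(ptp_nolock("aa:bb", "cc:dd", "SLAVE"))      # locked
```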
Change-Id: I30365685e6f44566702cc82534ab6ebf0613a731
Closes-bug: 1836884
Signed-off-by: Alex Kozyrev <alex.kozyrev@windriver.com>
The df monitor list in collectd is updated as follows.
Adding:
/var/lib/docker
/var/lib/docker-distribution
/var/lib/kubelet
/var/lib/nova/instances
Removing:
/etc/nova/instances
Closes-Bug: 1837103
Change-Id: I9f0f4bf27968e0e1b85a0b5b314ab9c3c15fec2d
Signed-off-by: Sun Austin <austin.sun@intel.com>
The collectd PTP monitor plugin development left the
audit period at 1 minute.
The PTP monitoring design called for a 5 minute
audit period.
Change-Id: I7eb6af6f88934028e2fb91c9655ee2db1143dc64
Closes-Bug: 1836392
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Update the license section to ASL 2.0
Closes-Bug: 1836227
Change-Id: Ib84ad47af2d4e8ff02a5801dfde16fc6a440c63d
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
The Python code in collectd-extensions requires several
Python modules in order to run, but the package does not
declare explicit dependencies on those modules.
These include:
fm-api
httplib2
influxdb
oslo-concurrency
tsconfig
Change-Id: I9ace889fdb7fac031792486c3e5ddf3bc2cae770
Story: 2004764
Task: 33630
Signed-off-by: Al Bailey <Al.Bailey@windriver.com>
Collectd creates a samples database within the
InfluxDB database which is stored in the rootfs.
The current 4 week retention period is too long
for larger systems and could lead to the rootfs
filling up.
This update reduces that retention period to 1 week
to protect the rootfs from being filled up with
sample data until the samples database is moved
to a more appropriate location.
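For illustration only, a retention change of this kind can be expressed
as an InfluxQL ALTER RETENTION POLICY statement. The policy and database
names below are placeholders, since the actual names are not stated in
this message, and the helper itself is hypothetical.

```python
def retention_statement(policy, database, duration):
    """Build an InfluxQL statement that changes a retention duration.

    Hypothetical helper; it only formats the statement text.
    """
    return 'ALTER RETENTION POLICY "{}" ON "{}" DURATION {}'.format(
        policy, database, duration)


# Placeholder names; "1w" is the InfluxQL literal for one week.
print(retention_statement("example_policy", "example_db", "1w"))
```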
Change-Id: Ic59712849fa228f19d15919594d23edc43109a0b
Closes-Bug: 1827301
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Using FM API V2 allows the collectd plugins to distinguish an FM
connection failure from a query that finds no existing alarms on
process startup, and to detect a failure to clear or assert an
alarm at runtime, so that such actions can be retried on the
next audit interval.
This makes the plugins more robust in their alarm
management and avoids leaving stuck alarms, which fixes
the following three reported stuck alarm bugs.
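The startup distinction described above might look roughly like the
sketch below. FaultQueryError and load_existing_alarms are stand-ins
invented for this example; they are not the FM API V2 names.

```python
class FaultQueryError(Exception):
    """Stand-in for an FM connection or request failure (hypothetical)."""


def load_existing_alarms(query, state):
    """Try to load existing alarms; leave state untouched on failure.

    query is any callable that returns a list of alarms (possibly
    empty) or raises FaultQueryError when FM cannot be reached.
    """
    try:
        alarms = query()
    except FaultQueryError:
        # FM unreachable is not the same as "no alarms": retry later.
        return False
    # An empty list is a valid answer meaning no alarms exist.
    state["alarms"] = list(alarms)
    state["loaded"] = True
    return True
```

A caller would invoke this on each audit until it returns True, so a
slow-starting FM process does not cause existing alarms to be missed.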
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1802535
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1813974
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1814944
Additional improvements were made to each plugin to handle
failure paths better with the V2 API.
Additional changes made by this update include:
1. fixed stale unmounted filesystems alarm handling
2. percent usage alarm actual readings are updated on change
3. fix of threshold values
4. add 2 decimal point resolution to % usage alarm text
5. added commented FIT code to mem, cpu and df plugins
6. reversed True/False return polarity in interface plugin functions
Test Plan:
Regression:
PASS: normal alarm handling with FM V2 API ; process startup
PASS: normal alarm handling with FM V2 API ; runtime alarm assert
PASS: normal alarm handling with FM V2 API ; runtime alarm clear
PASS: Verify alarms of unmounted fs gets automatically cleared
PASS: Verify interface alarm/clear operation
Robustness:
PASS: Verify general startup behavior of all plugins while FM
is not running only to see it start at some later time.
PASS: Verify alarm handling over process startup with existing
cpu alarms while FM not running.
PASS: Verify alarm handling over process startup with existing
mem alarms while FM not running.
PASS: Verify alarm handling over process startup with existing
df alarms while FM not running.
PASS: Verify runtime cpu plugin alarm assertion retry handling
PASS: Verify runtime cpu plugin alarm clear retry handling
PASS: Verify runtime cpu plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
cpu alarms while FM initially not running and then started.
PASS: Verify runtime mem plugin alarm assertion retry handling
PASS: Verify runtime mem plugin alarm clear retry handling
PASS: Verify runtime mem plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
mem alarms while FM initially not running and then started.
PASS: Verify runtime df plugin alarm assertion retry handling
PASS: Verify runtime df plugin alarm clear retry handling
PASS: Verify runtime df plugin handling over process restart
PASS: Verify alarm handling over process startup with existing
df alarms while FM initially not running and then started.
PASS: Verify alarm set/clear threshold boundaries for cpu plugin
PASS: Verify alarm set/clear threshold boundaries for memory plugin
PASS: Verify alarm set/clear threshold boundaries for df plugin
New Features: ... threshold exceeded ; threshold 80.00%, actual 80.33%
PASS: Verify percent usage alarms are refreshed with current value
PASS: Verify percent usage alarms show two decimal points
Change-Id: Ibe173617d11c17bdc4b41115e25bd8c18b49807e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Flake8 currently ignores the following errors:
H401: docstring should not start with a space
H404: multi line docstring should start without a leading new line
H405: multi line docstring summary not separated with an empty line
Enable them for more consistent formatting of docstrings
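As a small example, a docstring that satisfies all three checks could
look like this. The function itself is made up for the illustration.

```python
def percent_used(used, total):
    """Return usage as a percentage of total.

    The summary starts on the opening line with no leading space
    (H401, H404) and is followed by an empty line (H405).
    """
    return 100.0 * used / total


print(percent_used(1, 4))
```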
Change-Id: I385e28e9c6eca3c02a3def51ff64b00b7a63a853
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
Updated the host interface monitor plugin for collectd.
Replaced the reference of infra interface with
cluster-host interface.
Story: 2004273
Task: 30518
Depends-On: https://review.openstack.org/#/c/652713/
Change-Id: I33ca3c6a970e45bd278cd399149cdb2c985def7d
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Flake8 currently ignores the following errors:
E302: expected 2 blank lines, found 1
E303: too many blank lines
E305: expected 2 blank lines after class or function definition, found 1
Change-Id: Idfb00e530967f1a345bc2e263ad77597f83cc5d3
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
This update adds Precision Time Protocol (PTP) monitoring
to the current list of in-house developed collectd plugins.
Refer to the ptp.py header for a description of the monitoring
service algorithm and inline comments for detailed behavior.
Test Plan:
Usability:
-----------
PASS: Verify monitoring behavior around ptp service enable and disable
PASS: Verify ptp monitoring behavior over lock and unlock
PASS: Verify behavior with bonded interfaces (skew oot alarm)
PASS: Verify no-lock hosts lock to remote grandmaster when available
PASS: Verify AIO SX PTP Enable over Lock/Unlock
System Level:
-------------
PASS: Verify large system install
PASS: Verify AIO SX system install
Host Level:
-----------
PASS: Verify controller monitoring
PASS: Verify worker monitoring
PASS: Verify storage monitoring
PASS: Verify worker/storage behavior when the only controller is rebooted.
PASS: Verify startup handling of fm calls while fm is not running
PASS: Verify runtime handling of fm calls while fm is not running
Config Level:
-------------
PASS: Verify PTP Enable and auto start monitoring
PASS: Verify PTP Disable and auto stop monitoring
PASS: Verify audit interval is every 60 seconds
PASS: Verify hardware timestamp monitoring
PASS: Verify software timestamp monitoring
PASS: Verify legacy timestamp monitoring
PASS: Verify hardware to software config change
PASS: Verify software to legacy config change
PASS: Verify legacy to hardware config change
PASS: Verify software to hardware config change
Alarm Management:
-----------------
PASS: Verify end-to-end handling of 'nolock' alarm management
PASS: Verify end-to-end handling of 'out-of-tolerance' alarm management
PASS: Verify end-to-end handling of 'process' alarm management
PASS: Verify end-to-end handling of 'unsupported mode' alarm management
PASS: Verify all ptp alarms get cleared on collectd process start
PASS: Verify plugin startup behavior when FM is not running
PASS: Verify plugin with FM V2 API
PASS: Verify thresholded out-of-tolerance alarm handling
PASS: Verify plugin logging is value added
PASS: Verify alarm assert debounce of 2
PASS: Verify alarm clear with no debounce
PASS: Verify only major out-of-tolerance alarm for software mode
PASS: Verify only major out-of-tolerance alarm for legacy mode
PASS: Verify minor/major out-of-tolerance alarm for hardware mode
PASS: Verify no-lock alarm if compute GM ID is the same as its own
PASS: Verify no-lock alarm is not raised on GM reboot
PASS: Verify GM switches to alternate when GM host is rebooted
Change-Id: If36aece94dd5511bf9deba0753f3863237e2a7fe
Story: 2002823
Task: 29492
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Flake8 currently ignores a number of whitespace related errors:
E201: whitespace after '['
E202: whitespace before '}'
E203: whitespace before ':'
E211: whitespace before '('
E221: multiple spaces before operator
E222: multiple spaces after operator
E225: missing whitespace around operator
E226: missing whitespace around arithmetic operator
E231: missing whitespace after ','
E251: unexpected spaces around keyword / parameter equals
E261: at least two spaces before inline comment
Enable them for more thorough testing of code
Change-Id: Id03f36070b8f16694a12f4d36858680b6e00d530
Story: 2004515
Task: 30076
Signed-off-by: Eric Barrett <eric.barrett@windriver.com>
This update adds remote logging server monitoring to the
StarlingX set of collectd monitoring plugins.
This update excludes monitoring of IPV6 remote logging servers.
Only IPV4 remote logging servers are supported.
Story: 2002823
Task: 28636
Test Plan:
PASS: Verify monitoring on controller nodes only
PASS: Verify system install
PASS: Verify plugin logging is value added
PASS: Verify connectivity failure to success handling
PASS: Verify connectivity success to failure handling
PASS: Verify connected / not connected logging on service state change
PASS: Verify connected / not connected logging on connectivity state change
PASS: Verify service enabled to disabled state transition with alarm asserted
PASS: Verify service enabled to disabled state transition while connected
PASS: Verify service disabled to enabled state transition with connectivity
PASS: Verify service disabled to enabled state transition without connectivity
PASS: Verify plugin audit interval is every 60 seconds
PASS: Verify plugin alarm assert debounce of 2
PASS: Verify plugin alarm clear with no debounce
PASS: Verify plugin alarm assert over process start on TCP conn failure
PASS: Verify plugin alarm severity as Minor
PASS: Verify plugin alarm clear over process restart
PASS: Verify plugin alarm is cleared on service disable transition
PASS: Verify plugin sample data
Change-Id: I73cd35170ed19abce17bb4f511f0c5e04bc101c6
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update adds NUMA node and huge page memory monitoring
to the existing Platform Memory monitor collectd plugin.
Instance Mapping
Plugin Refinements                      Instance Name
--------------------------------------  ---------------
Platform Memory                         platform
Platform Memory Numa Node 0             node0
Platform Memory Numa Node 1             node1
Platform Memory Numa Node 0 Huge Pages  node0_hugepages
Platform Memory Numa Node 1 Huge Pages  node1_hugepages
New Alarm Entity IDs added to existing 100.103 alarm ID
host=<hostname>.numa=node0
host=<hostname>.numa=node1
host=<hostname>.numa=node0_hugepages
host=<hostname>.numa=node1_hugepages
Modified memory plugin thresholds and added alarm notifier
to support collectd requiring samples to be 'gt' rather
than 'ge' the specified thresholds for a severity change.
This update also corrects a few subtle pep8 warnings in
a few of the existing Python plugins.
There is no need for an rmond update because numa and
huge page monitoring was never enabled in rmond.
Story: 2002823
Task: 29369
PASS: Verify logging of all memory instance types
PASS: Verify monitoring of new numa node memory
PASS: Verify monitoring of new numa node huge page memory
PASS: Verify memory instance alarm handling in fm notifier
PASS: Verify memory instance alarm load on startup
PASS: Verify memory instance alarm clear ; runtime condition gone
PASS: Verify memory instance alarm clear ; startup condition gone
Regression:
PASS: Verify End-To-End Sample Collection for all monitored resources.
Corner Case:
PASS: Verify alarm reporting with threshold of zero
PROG: Verify memory alarm raised at threshold value
PASS: Verify memory alarm cleared 1 below threshold value
PASS: Verify above case for both major and critical thresholds
Change-Id: I4e2612ac7b3d906be4b0a140286dbbb095ce7e1b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update introduces interface monitoring for oam,
mgmt and infra networks as a collectd plugin.
The interface plugin runs and queries the new maintenance
Link Monitor daemon for Link Model and Information every
10 seconds.
The plugin then manages alarms based on the link model similar
to how rmon did in the past ; port and interface alarms.
Severity: Interface and Port levels
Alarm Level  Minor  Major                  Critical
-----------  -----  ---------------------  ----------------------------
Interface    N/A    One of lag pair is Up  All Interface ports are Down
Port         N/A    Physical Link is Down  N/A
Degrade support for interface monitoring is added to the mtce
degrade notifier. Any link down condition results in a host
degrade condition, as it did in rmon.
Sample Data: represented as % of total links Up for that network interface
100 or 100% percent used - all links of the interface are Up
 50 or  50% percent used - one of lag pair is Up and the other is Down
  0 or   0% percent used - all ports for that network are Down
The plugin documents all of this in its header.
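The sample value and the interface-level severities in the table above
can be sketched as below. This is a minimal sketch with hypothetical
function names, not the plugin's code; treating an interface with no
known ports as 0% is an assumption.

```python
def links_up_percent(port_states):
    """Return the percentage of ports that are up for one network.

    port_states is a list of booleans, True meaning the link is up.
    """
    if not port_states:
        return 0.0  # assumption: no known ports is reported as 0%
    up = sum(1 for is_up in port_states if is_up)
    return 100.0 * up / len(port_states)


def interface_severity(port_states):
    """Map port states to the interface-level alarm severity."""
    percent = links_up_percent(port_states)
    if percent == 0.0:
        return "critical"  # all interface ports are down
    if percent < 100.0:
        return "major"     # e.g. one port of a lag pair is up
    return None            # all links up, no interface alarm


print(links_up_percent([True, False]), interface_severity([True, False]))
```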
This update also
1. Adds the new lmond process to syslog-ng config file.
2. Adds the new lmond process to the mtce patch script.
3. Modifies the cpu, df and memory threshold settings by -1.
rmon thresholds were precise whereas collectd requires
that the samples cross the thresholds, not just meet them.
So for example, in terms of a 90% usage action the
threshold needs to be 89.
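The threshold adjustment in item 3 comes down to the difference between
a strict and a non-strict comparison. The sketch below uses hypothetical
function names and whole-number samples to show the effect.

```python
def crosses(sample, threshold):
    """collectd-style check: the sample must exceed the threshold."""
    return sample > threshold


def meets(sample, threshold):
    """rmon-style check: reaching the threshold is enough."""
    return sample >= threshold


# A 90% sample trips the rmon-style check at a threshold of 90 ...
print(meets(90, 90))
# ... but not the strict check, unless the threshold is lowered to 89.
print(crosses(90, 90))
print(crosses(90, 89))
```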
Test Plan: (WIP but almost complete)
PASS: Verify interface plugin startup
PASS: Verify interface plugin logging
PASS: Verify interface plugin Link Status Query and response handling
PASS: Verify monitor, sample storage and grafana display
PASS: Verify port and interface alarm matches what rmon produced
PASS: Verify lmon port config from manifest configured plugin
PASS: Verify lmon port config from lmon.conf
PASS: Verify single interface failure handling and recovery
PASS: Verify lagged interface failure handling and recovery
PASS: Verify link loss of lagged interface shared between mgmt and oam (hp380)
PASS: Verify network interface failure handling ; single port
PASS: Verify network interface degrade handling ; lagged interface
PEND: Verify network interface degrade handling ; vlan interface
PASS: Verify HTTP request timeout period and handling
PASS: Verify link status query failure handling - invalid uri (timeout)
PASS: Verify link status query failure handling - missing uri (timeout)
PASS: Verify link status query failure handling - status fail
PASS: Verify link status query failure handling - bad json resp
Change-Id: I2e2dfe6ddfa06a46770245540c7153d330bdf196
Story: 2002823
Task: 28635
Depends-On: https://review.openstack.org/#/c/633264
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Many plugins need support for on-demand instance sampling
and alarming. The filesystem and memory monitoring plugins
are perfect examples. The number of numa nodes or monitored
file systems vary from host to host.
This update adds on-demand instance support. Any plugin
can now support multiple instances. As new plugin
instances are learned, memory is allocated for them
and linked to that plugin's base object; each is managed
as a separate instance within the scope of its parent.
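One way to picture on-demand instances is a base object that creates
per-instance state the first time an instance name is seen. The class
and attribute names below are hypothetical and do not mirror the actual
plugin code.

```python
class PluginInstance(object):
    """Per-instance state, e.g. one filesystem or one numa node."""

    def __init__(self, name):
        self.name = name
        self.alarmed = False
        self.last_value = None


class PluginObject(object):
    """Base object for a plugin; owns its learned instances."""

    def __init__(self, plugin):
        self.plugin = plugin
        self.instances = {}

    def get_instance(self, name):
        """Return the named instance, creating it on first sight."""
        if name not in self.instances:
            self.instances[name] = PluginInstance(name)
        return self.instances[name]


df = PluginObject("df")
df.get_instance("/opt/etcd").last_value = 12.5
print(sorted(df.instances))
```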
The following additional enhancements were made to the common
alarm and degrade plugins.
1. added /opt/etcd as a new monitored filesystem.
2. added common support for vswitch alarm/degrade handling.
3. a few general cleanup changes for code maintainability.
Change-Id: I05b4de78f30fc27362c63b6dbfc97268d6588e4f
Story: 2002823
Task: 29297
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The issue titled 'NTP 100.14 alarm is not cleared' exposed
a problem where the NTP plugin alarm clear operation is
circumvented when its precursor fm_api.get_fault call
returns None because the fm process is not running.
From the caller's point of view the None return suggests
that the alarm to be cleared does not exist, so the code
skips the call to clear.
This update works around this by simply issuing the
clear without the query.
Change-Id: Idcc05bb0e7e1aa1082af1e8ecdcb1a5463b19440
Closes-Bug: 1812440
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update replaces the currently existing but disabled
ntpq.py plugin with one that does not rely on an external
query_ntp_servers.sh.
This new ntpq.py is an entirely new self-contained
implementation of what rmon and query_ntp_servers.sh
were doing, but now more efficiently, all in one Python
plugin file.
Story: 2002823
Task: 22859
Test Plan:
PASS: Verify handling of one and two unreachable NTP servers.
PASS: Verify handling of pingable but not an NTP server.
PASS: Verify NTP server re-provisioning from unreachable to reachable server.
PASS: Verify NTP server re-provisioning from reachable to unreachable server.
PASS: Verify NTP server alarms suppressed while controller is locked.
PASS: Verify NTP asserted alarms show up on unlock until cleared.
PASS: Verify NTP server monitoring occurs on controller only.
PASS: Verify NTP unreachable server alarms are cleared over a collectd restart
PASS: Verify NTP minor IP alarms are cleared on process startup
PASS: Verify NTP minor IP alarm clear retries when FM call fails on process startup.
PASS: Verify NTP alarm assertion retry while FM call fails at runtime.
PASS: Verify NTP alarm clear retry while FM call fails at runtime.
PASS: Verify NTP monitoring after controller Swact.
PASS: Verify NTP monitoring cadence is every 10 minutes.
PASS: Verify NTP plugin logs are useful and assist debug without flooding.
Change-Id: I67c4c5518a6e5dec64b4e419ab7ee2ffcefb9bf3
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Starting collectd too early in the manifest apply is seen
to occasionally fail due to a dependency configuration on
hostname resolution in FQDNLookup not being complete.
Since influxdb is used by collectd and is a controller-only
service, this update also moves it to the manifest apply
post stage and filters it out of non-controller
load types.
This issue is fixed by the following multi-git changes.
stx-metal:
Filter influxdb out of storage and compute only loads.
No real inter git merge dependency
stx-integ: This update.
Add startup Before=pmond dependency
stx-config:
Move collectd config and startup to manifest apply post stage
Move influxdb config and startup to manifest apply post stage
Test Plan:
PASS: Build iso
PASS: Verify install storage system and collectd startup
PASS: Verify Storage system DOR
PASS: Verify influxdb and extensions excluded in non-controller loads
PASS: Verify collectd starts properly on all nodes (CC,DOR,UNLOCK)
PASS: Verify influxdb starts properly on controller nodes (CC,DOR,UNLOCK)
PASS: Verify collectd pmond process monitoring and recovery
PASS: Verify influxdb pmond process monitoring and recovery
PEND: Verify collectd statistics storage and fetch to/from influxdb
PEND: Install AIO DX and verify collectd and influxdb startup
Change-Id: I47d70b05bdbdd22f8fce2f56fcc287fac7371ace
Closes-Bug: 1797909
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Existing code sets overcommit strict mode to True
if any non-zero value is returned from a read
of /proc/sys/vm/overcommit_memory.
This is incorrect.
Strict mode should only be set when the returned
value is 2.
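A minimal sketch of the corrected check, assuming the value is read as
text from the proc file. The function names are hypothetical and are
not those used in the plugin.

```python
OVERCOMMIT_PATH = "/proc/sys/vm/overcommit_memory"


def is_strict_overcommit(raw_value):
    """Return True only when the overcommit mode is 2.

    Modes 0 (heuristic) and 1 (always overcommit) are not strict.
    """
    return int(raw_value.strip()) == 2


def read_strict_overcommit(path=OVERCOMMIT_PATH):
    """Read the kernel setting and report whether strict mode is on."""
    with open(path) as f:
        return is_strict_overcommit(f.read())


# The old behaviour treated any non-zero value, including 1, as strict.
print(is_strict_overcommit("1\n"), is_strict_overcommit("2\n"))
```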
Change-Id: I2c5328624571bb3b2f478d5a79615650bb92cbd2
Closes-Bug: 1808225
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The compute personality and subfunction have been changed to
worker, and compute_reserved.conf has been renamed to
worker_reserved.conf. Compute configuration flags have
been updated to worker flags.
This update changes miscellaneous dependencies on the compute
personality, compute_reserved.conf and the configuration
flag files.
It also removes the puppet-nova dependencies on
compute_reserved.conf.
Tests Performed:
Non-containerized deployment
AIO-SX: Sanity and Nightly automated test suite
AIO-DX: Sanity and Nightly automated test suite
2+2 System: Sanity and Nightly automated test suite
2+2 System: Horizon Patch Orchestration
Kubernetes deployment:
AIO-SX: Create, delete, reboot and rebuild instances
2+2+2 System: worker nodes are unlocked and enabled with no alarms
Story: 2004022
Task: 27013
Depends-On: https://review.openstack.org/#/c/624452/
Change-Id: Iccf5584058a2154f1c4ffdb061938e76b9965861
Signed-off-by: Tao Liu <tao.liu@windriver.com>
In the StarlingX collectd alarm notification handler, the Fault Manager
(FM) call to clear an alarm can lead to a stuck alarm if that FM request
fails, say due to a concurrent swact operation, and the clear is not
retried.
The alarm will remain stuck until there is another same alarm assertion,
followed by deassertion that leads to a successful clear.
The fix is to execute a 'return' in the alarm clear failure path so
that the alarm notifier's alarm manager control structure is not
updated with the clear state; the clear is then automatically
retried on the next audit interval.
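The early-return pattern described above can be sketched as follows.
The dictionary layout and the clear_fn callable are assumptions made
for this example and do not reflect the notifier's real data structures.

```python
def audit_clear(alarm, clear_fn):
    """Attempt to clear an asserted alarm; keep state on failure.

    alarm is a dict with 'id' and 'asserted' keys (hypothetical layout).
    clear_fn returns True when the FM clear request succeeded.
    """
    if not alarm["asserted"]:
        return
    if not clear_fn(alarm["id"]):
        # Early return: 'asserted' stays set, so the next audit retries.
        return
    alarm["asserted"] = False
```

Because the recorded state changes only after a successful clear, a
failed request during a swact simply leaves the work for the next audit.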
Change-Id: Iddf4e0e7b99eab0bf0748230a25851419e7c06fa
Closes-Bug: 1793314
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
This update fixes an issue where the /var/run/influxdb directory
is not being re-created over a DOR because the controller manifest that
creates it is not being run in that recovery mode.
The fix is to enhance the influxdb service file to ensure this directory
is created whenever the service is started.
Story: 2002823
Task: 22740
Change-Id: Iecd81969ae1611b963fae5595f60c3eb2d2da851
Signed-off-by: Jack Ding <jack.ding@windriver.com>
This is the primary update that introduces collectd monitoring and
sample storage into the influxdb database.
Two new packages are introduced by this update
- collectd-extensions package which includes
- newly developed collectd platform memory, cpu and filesystem
plugins
- note that the example, ntpq and interface plugins are not
complete and are not enabled by this update.
- pmond process monitoring / recovery support for collectd
- updated service file for pidfile management ; needed by pmond
- influxdb-extensions package which includes
- pmond process monitoring / recovery support for influxdb
- updated service file for pidfile management ; needed by pmond
- log rotate support for influxdb
Change-Id: I06511fecb781781ed5491c926ad4b1273a1bc23b
Signed-off-by: Jack Ding <jack.ding@windriver.com>