Make collectd alarm notifier retry alarm clear attempts that fail

In the Starling-X collectd alarm notification handler, the Fault
Manager (FM) call that clears an alarm can leave the alarm stuck if
the FM request fails, for example due to a concurrent swact operation,
because the clear is not retried.

The alarm remains stuck until the same alarm is asserted again and
then deasserted, leading to a successful clear.

The fix is to execute a 'return' in the alarm clear failure path so
that the alarm notifier's alarm manager control structure is not
updated with the clear state. The clear is then automatically retried
on the next audit interval.

Change-Id: Iddf4e0e7b99eab0bf0748230a25851419e7c06fa
Closes-Bug: 1793314
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Eric MacDonald 2018-09-20 14:21:32 -04:00
parent 551b098a87
commit 5142fac498
2 changed files with 2 additions and 1 deletions


@@ -16,4 +16,4 @@ COPY_LIST="$PKG_BASE/src/LICENSE \
            $PKG_BASE/src/example.py \
            $PKG_BASE/src/example.conf"
-TIS_PATCH_VER=1
+TIS_PATCH_VER=2


@@ -1143,6 +1143,7 @@ def notifier_func(nObject):
             if api.clear_fault(base_obj.id, obj.entity_id) is False:
                 collectd.error("%s %s:%s clear_fault failed" %
                                (PLUGIN, base_obj.id, obj.entity_id))
+                return 0
         else:
             reason = obj.resource_name
             reason += " threshold exceeded"