On AIO-DX, the Ceph monitor was not started after an uncontrolled
swact caused by a sudden power off or reboot of the active controller,
which broke system high availability. This happened because a flag
recorded the controller on which the Ceph monitor was last active, so
that the monitor would not start without drbd-cephmon data in sync,
which could corrupt Ceph data. The same flag also prevented the data
corruption that occurred when the mgmt network was down and both
controllers were set to active, starting the Ceph monitor without
drbd-cephmon in sync.
To prevent data corruption while maintaining system high availability,
this fix checks the mgmt network carrier instead of managing flags.
If no carrier is detected on the mgmt network interface, ceph mon and
osd are stopped and are only allowed to start again once the mgmt
network has carrier.
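
As an illustration only, the carrier check described above can be
sketched in Python. This is not the code in this change: the function
name has_carrier and the interface name used in the usage note are
hypothetical. The only real interface assumed is the Linux sysfs
attribute /sys/class/net/<iface>/carrier, which reads "1" when the
link has carrier.

```python
from pathlib import Path

# Standard Linux sysfs location for network interface attributes.
SYSFS_NET = Path("/sys/class/net")


def has_carrier(iface: str, sysfs_root: Path = SYSFS_NET) -> bool:
    """Return True only if the kernel reports link carrier on iface.

    Hypothetical helper: a missing interface, or one that is
    administratively down (reading the attribute then fails), is
    treated as having no carrier.
    """
    try:
        return (sysfs_root / iface / "carrier").read_text().strip() == "1"
    except OSError:
        return False
```

With such a helper, ceph mon and osd would be allowed to run only
while has_carrier("<mgmt interface>") is true. The sysfs_root
parameter exists only so the sketch can be exercised against a
temporary directory.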
For AIO-DX Direct, all networks are also checked. If no network has
carrier, the other controller is considered down, which lets the
working controller stay in the active state even though the mgmt
network has no carrier.
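
The decision described above can be summarized as a small truth
function. This is a sketch of one reading of the text, not the logic
shipped in this change: the name ceph_allowed, its parameters, and the
interface names in the usage note are all hypothetical.

```python
from typing import Mapping


def ceph_allowed(mgmt_iface: str,
                 carrier: Mapping[str, bool],
                 direct: bool) -> bool:
    """Decide whether ceph mon and osd may run on this controller.

    carrier maps each checked interface name to its carrier state.
    direct is True for an AIO-DX Direct configuration.
    """
    # Normal case: the mgmt network has carrier.
    if carrier.get(mgmt_iface, False):
        return True
    # AIO-DX Direct only: if every checked network lost carrier, the
    # peer controller is assumed to be down, so this one may stay up.
    if direct and carrier and not any(carrier.values()):
        return True
    # Otherwise mgmt is down while the peer may still be alive.
    return False
```

For example, ceph_allowed("mgmt0", {"mgmt0": False, "oam0": False},
direct=True) is true, while the same call with "oam0" set to True is
false because the peer may still be running.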
Test-Plan:
PASS: Run system host-swact on AIO-DX and verify ceph is running
with status HEALTH_OK
PASS: Force an uncontrolled swact on AIO-DX by killing a critical
process and verify ceph is running with status HEALTH_OK
PASS: Disconnect OAM and MGMT networks for both controllers on
AIO-DX and verify ceph mon and osd stop on both controllers.
Reconnect OAM and MGMT networks and verify if ceph is running
and status is HEALTH_OK
PASS: Reboot or power off the active controller and verify on the
other controller that ceph is running with status HEALTH_WARN
because one host is down. Power on the controller and wait until
it is online/available. Verify ceph is HEALTH_OK after all OSDs
are up and data is recovered.
Closes-bug: 2020889
Signed-off-by: Felipe Sanches Zanoni <Felipe.SanchesZanoni@windriver.com>
Change-Id: I38470f43eba86f88fb9cfe47869d2393cacbd365