integ/ceph/ceph
Felipe Sanches Zanoni 655ab05b71 Fix Ceph mon and osd processes start/stop conditions
On AIO-DX, the Ceph monitor was not being started after an uncontrolled
swact caused by a sudden power off/reboot of the active controller,
breaking system high availability. This happened because a flag
recorded on which controller the last active ceph monitor was running,
to prevent starting ceph monitor without drbd-cephmon data in sync,
which could cause Ceph data corruption. That flag also prevented the
data corruption that occurs when the mgmt network is down and both
controllers are set to active, starting ceph monitor without
drbd-cephmon in sync.

To prevent data corruption and to maintain system high availability,
this fix checks the mgmt network carrier instead of managing flags.
If no carrier is detected on mgmt network interface, then ceph mon and
osd are stopped and only allowed to start again after mgmt network has
carrier.
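
As an illustration only, the carrier check described above can be
sketched in Python. This is not the code from this change. The function
names and the interface name are hypothetical, and the sketch assumes
the Linux sysfs attribute /sys/class/net/<iface>/carrier, which reads
"1" when the link has carrier.

```python
from pathlib import Path

# Standard Linux sysfs location for network interfaces.
SYSFS_NET = Path("/sys/class/net")


def has_carrier(iface: str, sysfs_net: Path = SYSFS_NET) -> bool:
    """Return True only if the kernel reports carrier on iface."""
    try:
        return (sysfs_net / iface / "carrier").read_text().strip() == "1"
    except OSError:
        # The interface does not exist, or it is administratively down
        # (reading the attribute fails in that case). Treat both as
        # "no carrier".
        return False


def ceph_may_run(mgmt_iface: str, sysfs_net: Path = SYSFS_NET) -> bool:
    """Hypothetical gate: ceph mon and osd may run only with mgmt carrier."""
    return has_carrier(mgmt_iface, sysfs_net)
```

A caller would stop ceph mon and osd while ceph_may_run() returns False
and allow them to start again once it returns True.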

For AIO-DX Direct, all networks are also checked. If none of the
networks has carrier, then the other controller is considered down,
letting the working controller be in active state even if the mgmt
network has no carrier.
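
The decision described above can be summarized as a small Python
sketch. It is an assumption-laden illustration, not the logic as
implemented in this change: the function name, the network names and
the dictionary input are all hypothetical, and the carrier state of
each network is taken as already known.

```python
from typing import Mapping


def ceph_allowed(carrier: Mapping[str, bool], mgmt: str, direct: bool) -> bool:
    """Decide whether ceph mon and osd may run on this controller.

    carrier maps each network name to its carrier state, mgmt is the
    key of the mgmt network, and direct is True for AIO-DX Direct.
    """
    if carrier.get(mgmt, False):
        # mgmt network has carrier: ceph is allowed to run.
        return True
    if direct and len(carrier) > 0 and not any(carrier.values()):
        # AIO-DX Direct with no carrier on any network: the other
        # controller is considered down, so this one stays active.
        return True
    # mgmt has no carrier and the peer may still be up: keep ceph stopped.
    return False
```

For example, ceph_allowed({"mgmt": False, "oam": True}, "mgmt", True)
keeps ceph stopped because the peer may still be running, while the
same call with every network down allows ceph to run on AIO-DX Direct.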

Test-Plan:
  PASS: Run system host-swact on AIO-DX and verify ceph is running
        with status HEALTH_OK
  PASS: Force an uncontrolled swact on AIO-DX by killing a critical
        process and verify if ceph is running with status HEALTH_OK
  PASS: Disconnect OAM and MGMT networks for both controllers on
        AIO-DX and verify ceph mon and osd stop on both controllers.
        Reconnect OAM and MGMT networks and verify if ceph is running
        and status is HEALTH_OK
  PASS: Reboot or power off the active controller and verify on the
        other controller that ceph is running with status HEALTH_WARN
        because one host is down. Power on the controller and wait
        until it is online/available. Verify that ceph is HEALTH_OK
        after all OSDs are up and data is recovered.

Closes-bug: 2020889

Signed-off-by: Felipe Sanches Zanoni <Felipe.SanchesZanoni@windriver.com>
Change-Id: I38470f43eba86f88fb9cfe47869d2393cacbd365
2023-05-31 13:38:02 -03:00
..
centos Merge "Enable generation of Ceph's Python 3 packages" 2022-01-21 00:26:05 +00:00
debian Update Ceph debian package versionsing 2023-03-27 10:36:47 -07:00
files Fix Ceph mon and osd processes start/stop conditions 2023-05-31 13:38:02 -03:00