

High availability

Titanium Cloud's Service Management (SM) and Maintenance (Mtce) systems handle transient and persistent networking failures between controllers and service hosts (storage/compute). For instance, a transient loss of Management Network carrier on the active controller currently triggers an immediate fail-over to the standby controller, even though the very same failure may exist for that controller as well; i.e. it may be no healthier than the current active controller. A persistent loss of heartbeat messaging to several or all nodes in the system results in the forced failure and reboot of all affected nodes once connectivity has been re-established. In most of these cases the network event that triggered fault handling is external to the system, e.g. the reboot of a common messaging switch, and truly beyond the control of Titanium Cloud HA (High Availability) services. In such cases it is better to be fault tolerant and forgiving than overactive.

Overall Requirements

This test will require access to the following configurations:

- Regular system
- Storage system
- AIO-DX systems

Test Cases

HA_Cloud_Recovery_improvements_01

Test ID

HA_Cloud_Recovery_improvements_01

Test Title

test_split_brain_avd_active_or_standby_based_on_only_storage_and_standby_controller_blocked_on_active_controller

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

The purpose of this test is to verify the split-brain scenario in which a swact away from the active controller is triggered by blocking both the standby controller and storage traffic on the active controller.

Test Pre-Conditions

System should be a storage system

Test Steps

  1. Using the CLI below, disconnect the management network between storage-0, controller-1, and the active controller (controller-0). Execute the command to block both storage and controller traffic: storage should be blocked first and the controller immediately afterwards (see the sketch after this list). code:: sudo iptables -I INPUT 1 -s 192.168.222.204 -j DROP ...
  2. Verify the connection failure alarm is raised.
  3. Verify controller-1 becomes active. Verify system host-list from controller-1.
  4. Reboot the new standby controller (controller-0). Once the reboot completes, verify system host-list from the active controller.
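
A minimal sketch of the blocking sequence on the active controller is shown below; the first address is the example from step 1 and the second is a hypothetical placeholder for controller-1's management IP, so both must be replaced with the values of the system under test.

code::

  # Hypothetical example addresses; replace with the management IPs of
  # storage-0 and controller-1 in the system under test.
  STORAGE0_MGMT_IP=192.168.222.204
  CONTROLLER1_MGMT_IP=192.168.222.4

  # Block storage-0 first, then controller-1 immediately afterwards.
  sudo iptables -I INPUT 1 -s "$STORAGE0_MGMT_IP" -j DROP && \
  sudo iptables -I INPUT 1 -s "$CONTROLLER1_MGMT_IP" -j DROP

  # Confirm the DROP rules are at the top of the INPUT chain.
  sudo iptables -L INPUT -n --line-numbers | head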

Expected Behavior

controller-1 becomes active. system host-list shows the right states.

HA_Cloud_Recovery_improvements_02

Test ID

HA_Cloud_Recovery_improvements_02

Test Title

test_aio_dx_direct_active_controller_lost_connection_to_standby_ip

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

The purpose of this test is to verify that the standby controller is rebooted when it loses connectivity to the active controller.

Test Pre-Conditions

The system should be an AIO-DX direct-connect system, connected to a BMC module with the BMC provisioned. If the BMC is not provisioned, the expected behavior differs: there will be no reboot of the standby controller.

Test Steps

  1. From the active controller, block the standby controller's management IP (see the sketch after this step). code:: iptables -I INPUT 1 -s 192.168.222.204 ...
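
A minimal sketch of blocking and later restoring the standby management address from the active controller, assuming the example address from the step above:

code::

  # 192.168.222.204 is assumed to be the standby controller's management
  # IP, as in the step above; replace with the address of your system.
  sudo iptables -I INPUT 1 -s 192.168.222.204 -j DROP

  # Verify the rule was inserted at the top of the INPUT chain.
  sudo iptables -L INPUT -n --line-numbers | head -n 5

  # After the test, remove the rule to restore connectivity.
  sudo iptables -D INPUT -s 192.168.222.204 -j DROP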

Expected Behavior

The standby controller (controller-1) becomes active. system host-list shows the right states. controller-0 reboots if the BMC is provisioned.

HA_Cloud_Recovery_improvements_03

Test ID

HA_Cloud_Recovery_improvements_03

Test Title

test_split_brain-avd_aio_dx_direct_active_controller_lost_connection_to_standby_ip_table_drop_on_mgt_infra_and_oam

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

To verify the split-brain scenario by triggering connection failures on the MGT, infra, and OAM networks from the standby controller of an AIO-DX-Direct system.

Test Pre-Conditions

The system should be an AIO-DX-Direct connected system.

Test Steps

  1. Provision the BMC and verify that it is provisioned. (If the BMC is not available there will be no reboot on loss of connection; the expected behavior at the time of connection loss is different.)
  2. From the standby controller, drop traffic from the active controller on the MGT, infra, and OAM networks. Example as below (see also the sketch after this list). code:: sudo iptables -I INPUT 1 -s 192.168.204.4 -j DROP && sudo iptables -I INPUT 1 -s 192.168.205.3 -j DROP && sudo iptables -I INPUT 1 -s 128.150.150.96 -j DROP ...
  3. Verify the loss of connectivity and the alarm on the active controller.
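
A sketch of the combined drop on the standby controller, using the example addresses from step 2 (assumed to be the active controller's MGT, infra and OAM addresses):

code::

  # Addresses are the examples from step 2 (active controller's MGT,
  # infra and OAM); replace with the addresses of the system under test.
  sudo iptables -I INPUT 1 -s 192.168.204.4 -j DROP && \
  sudo iptables -I INPUT 1 -s 192.168.205.3 -j DROP && \
  sudo iptables -I INPUT 1 -s 128.150.150.96 -j DROP

  # Optionally confirm loss of connectivity toward the active controller.
  ping -c 3 -W 2 192.168.204.4 || echo "MGT path to active controller is down"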

Expected Behavior

Loss of connectivity and the corresponding alarm are observed on the active controller. system host-list shows the right states.

HA_Cloud_Recovery_improvements_04

Test ID

HA_Cloud_Recovery_improvements_04

Test Title

test_split-brain-avd_active/standby_number_of_the_nodes_reachable_changes_couple_of_times

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

The purpose of this test is to verify that, in a split-brain scenario, active/standby controller selection is based on which controller is healthier. The scenario is repeated after the active/standby selection by triggering another connection failure on a compute.

Test Pre-Conditions

The system should have at least three computes and two controllers.

Test Steps

  1. From the active controller (controller-0), block communication with compute-0 (if both the management and infra networks are provisioned, both need to be blocked); see the sketch after this list. code:: sudo iptables -I INPUT 1 -s 192.168.223.57 -j DROP && sudo iptables -I INPUT 1 -s 192.168.222.156 -j DROP && sudo iptables -I INPUT 1 -s 192.168.222.4 -j DROP && sudo iptables -I INPUT 1 -s 128.224.150.57 -j DROP ...
  2. Verify the connection failure alarm.
  3. Verify a swact occurs.
  4. Unblock compute-0 traffic on controller-0 using the iptables command. code:: sudo iptables -D INPUT -s 192.168.223.57 -j DROP && sudo iptables -D INPUT -s 192.168.222.156 -j DROP && sudo iptables -D INPUT -s 192.168.222.4 -j DROP && sudo iptables -D INPUT -s 192.168.223.4 -j DROP ...
  5. Repeat the above steps on the current active controller (controller-1), blocking traffic from controller-1 to compute-0.
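
A sketch of the block/verify/unblock cycle for steps 1 to 4, assuming the four example addresses from step 1 belong to compute-0 as seen from controller-0:

code::

  # Hypothetical compute-0 addresses (management/infra), taken from the
  # example in step 1; adjust to the lab under test.
  COMPUTE0_IPS="192.168.223.57 192.168.222.156 192.168.222.4 128.224.150.57"

  # Step 1: block compute-0 traffic on the active controller.
  for ip in $COMPUTE0_IPS; do
      sudo iptables -I INPUT 1 -s "$ip" -j DROP
  done

  # Step 2: check for the connection failure alarm.
  fm alarm-list

  # Step 4: unblock compute-0 once the swact has completed.
  for ip in $COMPUTE0_IPS; do
      sudo iptables -D INPUT -s "$ip" -j DROP
  done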

Expected Behavior

controller-1 becomes active. system host-list shows the right states.

HA_Cloud_Recovery_improvements_05

Test ID

HA_Cloud_Recovery_improvements_05

Test Title

test_MNFA_timeouts_2mins_1_hour

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

The purpose of this test is to validate that MNFA (Multi Node Failure Avoidance) mode is triggered, and the corresponding alarms raised, with different mnfa_timeout values (2 minutes or 1 hour).

Test Pre-Conditions

The system should have at least three computes and two controllers.

Test Steps

  1. From the active controller, set mnfa_timeout (2 minutes or 1 hour), which controls how long MNFA can stay active before graceful recovery of the affected hosts. Use the commands below; a sketch of the full sequence follows this list. Eg: code:: system service-parameter-list system service-parameter-modify service=platform section=maintenance mnfa_timeout=<value> system service-parameter-apply platform ...

  2. Apply the change with system service-parameter-apply platform and verify that alarm 250.001 (controller-0 Configuration is out-of-date) is cleared.

  3. Trigger a heartbeat failure by powering off nodes other than the active controller.

  4. Verify event-list --log to see the MNFA enter and exit events below. If mnfa_timeout is set to 120 seconds, the time difference between the mnfa enter and exit log entries will be 120 seconds; if it is set to 1 hour, it will be 1 hour. The strings below will be seen in the event log.

    host=controller-1.event=mnfa_enter
    host=controller-1.event=mnfa_exit
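
A sketch of the timeout change and verification, assuming a 2-minute (120 second) timeout; the service-parameter syntax mirrors step 1 above and may vary between loads:

code::

  # Inspect the current maintenance service parameters.
  system service-parameter-list | grep mnfa

  # Set mnfa_timeout to 120 seconds (2 minutes) and apply the change.
  system service-parameter-modify service=platform section=maintenance mnfa_timeout=120
  system service-parameter-apply platform

  # After the heartbeat failure is triggered, look for the MNFA enter and
  # exit events; the time between them should match the configured timeout.
  # (The --logs flag is assumed; the test text refers to "event-list --log".)
  fm event-list --logs | grep -E "mnfa_(enter|exit)"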

Expected Behavior

In the above test, MNFA enter and exit events appear in the event-list log.

HA_Cloud_Recovery_improvements_06

Test ID

HA_Cloud_Recovery_improvements_06

Test Title

test_MNFA_timeouts_default

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

The purpose of this test is to validate that MNFA mode is triggered with the default timeout values.

Test Pre-Conditions

The system should have at least three computes and two controllers.

Test Steps

  1. From the active controller, ensure mnfa_timeout is at its default value (rather than 2 minutes or 1 hour); the parameter controls how long MNFA can stay active before graceful recovery of the affected hosts. Eg: to check the current value of mnfa_timeout, use system service-parameter-list (see also the sketch after this list). code:: system service-parameter-modify service=platform section=maintenance mnfa_timeout=<value> system service-parameter-apply platform
  2. Apply any change with system service-parameter-apply platform and verify that alarm 250.001 (controller-0 Configuration is out-of-date) is cleared.
  3. Trigger a heartbeat failure by powering off nodes other than the active controller.
  4. Verify system event-list --log to see the MNFA enter and exit events below.
  5. Verify system host-list. Hosts that are off-line between the MNFA enter and exit events are shown as degraded.

    host=controller-1.event=mnfa_enter
    host=controller-1.event=mnfa_exit
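
Since this variant validates the default timeout, a short sketch for confirming the current (default) mnfa settings before triggering the failure; the commented restore lines are hypothetical and only needed if an earlier test changed the value:

code::

  # List the maintenance service parameters and note the default
  # mnfa_timeout value before triggering the heartbeat failure.
  system service-parameter-list | grep -i mnfa

  # If mnfa_timeout was changed by an earlier test, restore the default
  # and re-apply before running this test case (value is site-specific).
  # system service-parameter-modify service=platform section=maintenance mnfa_timeout=<default>
  # system service-parameter-apply platform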

Expected Behavior

In the above test, MNFA enter and exit events appear in the event-list log.

HA_Cloud_Recovery_improvements_07

Test ID

HA_Cloud_Recovery_improvements_07

Test Title

test_pull_management_and_OAM_cable_on_active_controller

Tags

P2,HA,Recovery improvement,regression

Testcase Objective

This test verifies the alarms and the swact caused by pulling the OAM and MGT cables on the active controller.

Test Pre-Conditions

Any 2+2 system installed with the latest load.

Test Steps

  1. Verify there are no alarms using fm alarm-list (a sketch of the verification commands follows this list).
  2. Physically remove the OAM and MGT cables on the active controller (controller-0).
  3. Verify alarm IDs 400.005 and 200.005 are raised.
  4. Verify a swact to the standby controller occurred, e.g. with sudo sm-dump (controller-0 becomes standby).
  5. Verify with system host-list on the new active controller that all hosts are available and the standby controller is off-line.
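
A sketch of the verification commands for steps 1 and 3 to 5, using only the CLIs named in the steps above:

code::

  # Run with the platform credentials sourced, e.g. source /etc/platform/openrc

  # Step 1: confirm the system is alarm free before pulling the cables.
  fm alarm-list

  # Step 3: after the cable pull, look for the expected alarm IDs.
  fm alarm-list | grep -E "400.005|200.005"

  # Step 4: confirm the swact by checking the service state on a controller.
  sudo sm-dump

  # Step 5: on the new active controller, check host availability states.
  system host-list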

Expected Behavior

The system swacts, with alarms raised for the OAM and MGT cable pulls.

HA_Cloud_Recovery_improvements_08

Test ID

HA_Cloud_Recovery_improvements_08

Test Title

test_pull_management_cable_on_standby_controller

Tags

P2,HA,Recovery improvement,regression


Testcase Objective

Pull the management cable on the standby controller and verify the alarm.

Test Pre-Conditions

Any 2+2 system installed with the latest load.

Test Steps

  1. Verify there are no alarms using fm alarm-list.
  2. Physically remove the MGT cable on the standby controller (controller-0).
  3. Verify the current alarm list with fm alarm-list; alarm IDs 400.005 and 200.005 should be raised.
  4. Verify there is no change in the active controller and other host states; the standby host will be off-line. code:: system host-list ...

Expected Behavior

The management failure alarms (400.005, 200.005) are raised. Host states are correct in system host-list.

References:

https://wiki.openstack.org/wiki/StarlingX/Containers/Installation