From c74f21cef679e2a2e9efa9eb20e99b077c124db1 Mon Sep 17 00:00:00 2001
From: Bart Wensley
Date: Wed, 10 Apr 2019 07:47:49 -0500
Subject: [PATCH] Further increase tolerance for declaring neutron agents down

The neutron server listens for heartbeats from the various neutron
agents running on worker nodes. The agents send this heartbeat every
30s, but use a synchronous RPC, which can take up to 60s to time out
if the rabbitmq server disappears (e.g. when a controller host is
powered down unexpectedly). The default timeout (agent_down_time) is
75s, so if two of these RPC messages time out in a row (due to
rabbitmq server issues related to a controller power down or swact),
the neutron agent will be declared down incorrectly. This causes the
VIM to migrate instances away from the worker node, which we want to
avoid.

Commit 2fcb4f15 increased the timeout (agent_down_time) to 150s.
However, further testing has shown that 150s is still not enough in
some rare cases (e.g. when rebooting the active controller host), so
this change increases the timeout (agent_down_time) to 180s.

Change-Id: Ic0cedf8f20eaf1c1a33defbabcae13fbfb727ec9
Closes-Bug: 1817935
Signed-off-by: Bart Wensley
---
 .../stx-openstack-helm/manifests/manifest.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kubernetes/applications/stx-openstack/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml b/kubernetes/applications/stx-openstack/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml
index 17a3046e22..86d5fd7d9a 100644
--- a/kubernetes/applications/stx-openstack/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml
+++ b/kubernetes/applications/stx-openstack/stx-openstack-helm/stx-openstack-helm/manifests/manifest.yaml
@@ -1101,7 +1101,7 @@ data:
           allow_automatic_l3agent_failover: true
           # Increase from default of 75 seconds to avoid agents being declared
           # down during controller swacts, reboots, etc...
-          agent_down_time: 150
+          agent_down_time: 180
           # Set to false so as to remove conflict with newly introduced
           # network rebalancing/rescheduling that will move routers off
           # down l3 agents, rebalance routers to newly up l3 agents.
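
For context, a minimal sketch of the structure this hunk sits in, assuming the
stx-openstack neutron overrides follow the usual openstack-helm mapping of
conf.neutron.DEFAULT into neutron.conf's [DEFAULT] section; the enclosing key
names (conf, neutron, DEFAULT) are not visible in the hunk above and are shown
here only for orientation, not as part of the patch:

    conf:
      neutron:
        DEFAULT:
          # The neutron server declares an agent down after this many seconds
          # without a heartbeat. 180s covers two back-to-back ~60s RPC
          # timeouts plus a 30s heartbeat interval, with margin beyond the
          # 150s that testing showed was still occasionally too short.
          agent_down_time: 180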