[Flaky e2e test] TestConnectivity/testOVSRestartSameNode #6338
Comments
@tnqn I wanted your opinion on this. I found this old comment of yours (#625 (comment)):
This doesn't exactly match what I have observed while troubleshooting this test. I have observed that when we remove the flow-restore-wait configuration, the existing datapath flows are flushed:
Notice that the second dump shows no datapath flows, and that in the dump immediately after that, the counters have been reset. When I match the timestamp to the timestamps in the Agent logs, it matches Agent initialization and the removal of flow-restore-wait. I believe that this explains why the test is failing. I looked at the ovs-vswitchd code, and the observation seems consistent with it: the code that flushes datapath flows runs when flow-restore-wait is removed.
What do you think? I wanted to make sure that I am not missing something. I am not sure what the best way to fix the test is based on this observation. If I change the test to tolerate one packet loss event, then I believe it will pass consistently.
I think the observation doesn't conflict with the previous comment, which was about preventing datapath flows from being flushed when ovs-vswitchd is started, at which time no userspace flows are installed. My previous assumption was that, after installing userspace flows, it should be graceful to flush datapath flows, as new packets should consult the userspace flows and be forwarded in the desired way. To be more specific, there are several time points: stopping OVS, starting OVS, installing userspace flows, resetting flow-restore-wait (flushing datapath flows). We introduced "flow-restore-wait" to avoid disruption between starting OVS and installing userspace flows. However, I didn't expect the disruption after resetting flow-restore-wait. Maybe we could check why the packets didn't get forwarded based on the userspace flows?
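To make those time points concrete, here is a minimal Go sketch of the intended ordering; the interfaces and function names are hypothetical stand-ins, not Antrea's actual code:

```go
package sketch

// Hypothetical narrow interfaces, for illustration only.
type ofClient interface {
	// InstallInitialFlows stands in for installing all userspace (OpenFlow) flows.
	InstallInitialFlows() error
}

type ovsConfigClient interface {
	// DeleteOVSOtherConfig stands in for clearing keys in Open_vSwitch other_config.
	DeleteOVSOtherConfig(config map[string]interface{}) error
}

// restartSequence sketches the intended ordering: ovs-vswitchd is started with
// other_config:flow-restore-wait=true, so the pre-restart datapath flows keep
// forwarding traffic; the userspace flows are installed; only then is the flag
// removed, which lets ovs-vswitchd flush the stale datapath flows and handle new
// packets with the freshly installed userspace flows.
func restartSequence(of ofClient, ovs ovsConfigClient) error {
	if err := of.InstallInitialFlows(); err != nil {
		return err
	}
	return ovs.DeleteOVSOtherConfig(map[string]interface{}{"flow-restore-wait": "true"})
}
```

The disruption discussed in the following comments happens when the last step runs before all of the flows from the previous step are actually in place.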
@tnqn You were right. The datapath flows being flushed is not what is causing the traffic interruption. After more investigation, it happens because by the time we remove the flow-restore-wait configuration, the required flows (in particular the Pod forwarding flows reconciled by the CNIServer) have not all been installed yet.
This matches the traffic interruption I have observed, which is around 300ms on average:
Prior to #5777, the issue may not even have been observable / reproducible (except maybe with a large number of local Pods, if CNIServer reconciliation was taking a while?). I will see if we can add some synchronization, so that we avoid removing the flag until CNIServer "reconciliation" is complete.
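As an illustration of that synchronization idea, here is a minimal sketch using a plain sync.WaitGroup rather than Antrea's actual wrapper; all function parameters are hypothetical stand-ins:

```go
package sketch

import "sync"

// flowRestoreCompleteWaitSketch shows the general shape of the fix: each component
// that installs "essential" flows signals Done when it is finished, and
// flow-restore-wait is only removed once all of them have completed.
func flowRestoreCompleteWaitSketch(reconcilePodFlows, installNodeFlows, removeFlowRestoreWait func()) {
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // CNIServer: Pod forwarding flows for existing local Pods
		defer wg.Done()
		reconcilePodFlows()
	}()

	wg.Add(1)
	go func() { // NodeRouteController: routing flows for the initial Node list
		defer wg.Done()
		installNodeFlows()
	}()

	wg.Wait()
	// Only now is it safe to remove flow-restore-wait (and flush datapath flows).
	removeFlowRestoreWait()
}
```

The commit message below describes the actual approach (flowRestoreCompleteWait), which also covers NetworkPolicy flows via podNetworkWait.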
Until a set of "essential" flows has been installed. At the moment, we include NetworkPolicy flows (using podNetworkWait as the signal), Pod forwarding flows (reconciled by the CNIServer), and Node routing flows (installed by the NodeRouteController). This set can be extended in the future if desired.

We leverage the wrapper around sync.WaitGroup which was introduced previously in #5777. It simplifies unit testing, and we can achieve some symmetry with podNetworkWait.

We can also start leveraging this new wait group (flowRestoreCompleteWait) as the signal to delete flows from previous rounds. However, at the moment this is incomplete, as we don't wait for all controllers to signal that they have installed initial flows.

Because the NodeRouteController does not have an initial "reconcile" operation (like the CNIServer) to install flows for the initial Node list, we instead rely on a different mechanism provided by upstream K8s for controllers. When registering event handlers, we can request for the ADD handler to include a boolean flag indicating whether the object is part of the initial list retrieved by the informer. Using this mechanism, we can reliably signal through flowRestoreCompleteWait when this initial list of Nodes has been synced at least once.

This change is possible because of #6361, which removed the dependency on the proxy (kube-proxy or AntreaProxy) to access the Antrea Controller. Prior to #6361, there would have been a circular dependency in the case where kube-proxy was removed: flow-restore-wait will not be removed until the Pod network is "ready", which will not happen until the NetworkPolicy controller has started its watchers, and that depends on Antrea Service reachability, which depends on flow-restore-wait being removed.

Fixes #6338

Signed-off-by: Antonin Bas <[email protected]>
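For the NodeRouteController part, the upstream mechanism referred to above is client-go's detailed event handler, whose ADD callback carries an isInInitialList flag (available in recent client-go versions). Below is a hedged sketch of one way to use it; installNodeFlows and onInitialNodesSynced are hypothetical stand-ins, and Antrea's actual bookkeeping may differ:

```go
package sketch

import (
	"k8s.io/client-go/tools/cache"
)

// registerNodeHandler registers an ADD handler that receives isInInitialList, and
// reports, via the hypothetical onInitialNodesSynced callback, when every Node in
// the informer's initial list has been delivered to the handler.
func registerNodeHandler(
	informer cache.SharedIndexInformer,
	stopCh <-chan struct{},
	installNodeFlows func(obj interface{}),
	onInitialNodesSynced func(),
) error {
	reg, err := informer.AddEventHandler(cache.ResourceEventHandlerDetailedFuncs{
		AddFunc: func(obj interface{}, isInInitialList bool) {
			// isInInitialList is true for Nodes that are part of the initial
			// list retrieved by the informer.
			installNodeFlows(obj)
		},
	})
	if err != nil {
		return err
	}
	go func() {
		// The registration's HasSynced returns true once the handler has been
		// called for all objects in the initial list.
		if cache.WaitForCacheSync(stopCh, reg.HasSynced) {
			onInitialNodesSynced()
		}
	}()
	return nil
}
```

The registration returned by AddEventHandler reports HasSynced only once the handler has been called for every object in the informer's initial list, which is what makes this signal reliable.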
While working on #6090, we realized that TestConnectivity/testOVSRestartSameNode was failing frequently. Originally, we thought that the failure was caused by a change in the PR, but I was able to reproduce the failure on the main branch (at the time), using the normal Antrea images (i.e., not coverage-enabled).
The following can be observed in the test logs when the test fails:
As a result, and to avoid blocking the PR, the test case has been temporarily disabled:
antrea/test/e2e/connectivity_test.go
Lines 60 to 65 in 8d9c455
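For illustration only, temporarily disabling a subtest in Go typically looks like the sketch below; this is not a copy of the exact lines at the permalink above, and the names are hypothetical:

```go
package e2e_sketch

import "testing"

// TestConnectivitySketch shows the general pattern: the subtest body is kept, but
// t.Skip prevents it from running until the flake is resolved.
func TestConnectivitySketch(t *testing.T) {
	t.Run("testOVSRestartSameNode", func(t *testing.T) {
		t.Skip("temporarily disabled, see antrea-io/antrea#6338")
		// testOVSRestartSameNode(t, ...) would run here once re-enabled.
	})
}
```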
The failure now needs to be investigated so that the test case can be restored.
The failure above seems to indicate that the OVS restart indeed leads to a disruption of the datapath. I believe that the start / stop delay in the old coverage-enabled images was somehow hiding this issue (and of course we only run e2e tests with coverage enabled...). See:
antrea/build/charts/antrea/templates/agent/daemonset.yaml
Line 134 in 33e6da2
To reproduce
kind create cluster --config ci/kind/config-2nodes.yml
kubectl apply -f build/yamls/antrea.yml
go test -timeout=10m -v -count=10 -run=TestConnectivity/testOVSRestartSameNode antrea.io/antrea/test/e2e -provider=kind -coverage=false -deploy-antrea=false
This command will run the test case 10 times, and I have observed a failure rate of around 25%.