
Flaky OVS related e2e tests on the dedicated flexible IPAM testbed #6458

Closed
luolanzone opened this issue Jun 18, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@luolanzone
Contributor

luolanzone commented Jun 18, 2024

The following test cases on the dedicated flexible IPAM testbed failed for patch release 2.0.1, in two different builds:

    --- FAIL: TestConnectivity/testOVSFlowReplay (13.23s)
    --- FAIL: TestAntreaIPAM/testAntreaIPAMOVSRestartSameNode (57.44s)

The output is shown below; we need to check whether there is a way to improve the robustness of these e2e tests.
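For reference, a sketch of how the flaky cases might be rerun in isolation to gauge how often they fail (the package path and flags are assumptions; Antrea's e2e harness also takes cluster-specific options not shown here):

```shell
# Hypothetical invocation; the e2e suite needs a reachable test cluster,
# and the exact flags depend on the repository's Makefile and harness.
# -count=5 repeats the selected tests to surface flaky behavior.
go test -v ./test/e2e \
    -run 'TestConnectivity/testOVSFlowReplay|TestAntreaIPAM/testAntreaIPAMOVSRestartSameNode' \
    -count=5
```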

=== RUN   TestConnectivity/testOVSFlowReplay
    connectivity_test.go:401: Creating 2 toolbox test Pods on 'antrea-ipam-ds-0-1'
    connectivity_test.go:76: Waiting for Pods to be ready and retrieving IPs
    connectivity_test.go:90: Retrieved all Pod IPs: map[test-pod-0-xtevxzsj:IPv4(192.168.249.214),IPstrings(192.168.249.214) test-pod-1-8hl08o1o:IPv4(192.168.249.215),IPstrings(192.168.249.215)]
    connectivity_test.go:101: Ping mesh test between all Pods
    connectivity_test.go:118: Ping 'testconnectivity-vxj2py28/test-pod-0-xtevxzsj' -> 'testconnectivity-vxj2py28/test-pod-1-8hl08o1o': OK
    connectivity_test.go:118: Ping 'testconnectivity-vxj2py28/test-pod-1-8hl08o1o' -> 'testconnectivity-vxj2py28/test-pod-0-xtevxzsj': OK
    connectivity_test.go:417: The Antrea Pod for Node 'antrea-ipam-ds-0-1' is 'antrea-agent-49ccx'
    connectivity_test.go:434: Counted 153 flow in OVS bridge 'br-int' for Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:453: Counted 13 group in OVS bridge 'br-int' for Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:462: Deleting flows / groups and restarting OVS daemons on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:484: Error when restarting OVS with ovs-ctl: command terminated with exit code 137 - stdout:  * Saving flows
         * Exiting ovsdb-server (46)
         - stderr: 
    fixtures.go:531: Deleting Pod 'test-pod-1-8hl08o1o'
    fixtures.go:531: Deleting Pod 'test-pod-0-xtevxzsj'
=== RUN   TestAntreaIPAM/testAntreaIPAMOVSRestartSameNode
    connectivity_test.go:337: Creating two toolbox test Pods on 'antrea-ipam-ds-0-1'
    fixtures.go:579: Creating a test Pod 'test-pod-cbshe5lp' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-tx1u5na4' and waiting for IP
    connectivity_test.go:378: Restarting antrea-agent on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:359: Arping loss rate: 12.000000%
    connectivity_test.go:366: ARPING 192.168.240.101
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=0 time=435.896 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=1 time=5.171 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=2 time=5.139 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=3 time=4.372 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=4 time=11.172 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=5 time=5.443 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=6 time=5.019 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=7 time=5.078 usec
        Timeout
        Timeout
        Timeout
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=8 time=246.665 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=9 time=4.870 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=10 time=5.219 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=11 time=5.511 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=12 time=4.585 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=13 time=4.923 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=14 time=4.654 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=15 time=4.562 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=16 time=4.628 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=17 time=4.466 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=18 time=5.012 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=19 time=5.201 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=20 time=5.826 usec
        42 bytes from fe:3d:79:b0:98:c6 (192.168.240.101): index=21 time=4.985 usec
        
        --- 192.168.240.101 statistics ---
        25 packets transmitted, 22 packets received, 12% unanswered (0 extra)
        rtt min/avg/max/std-dev = 0.004/0.036/0.436/0.101 ms
    connectivity_test.go:384: Arping test failed: arping loss rate is 12.000000%
    fixtures.go:531: Deleting Pod 'test-pod-tx1u5na4'
    fixtures.go:531: Deleting Pod 'test-pod-cbshe5lp'
    connectivity_test.go:337: Creating two toolbox test Pods on 'antrea-ipam-ds-0-1'
    fixtures.go:579: Creating a test Pod 'test-pod-pqk2v325' and waiting for IP
    fixtures.go:579: Creating a test Pod 'test-pod-rm7xh2oj' and waiting for IP
    connectivity_test.go:378: Restarting antrea-agent on Node 'antrea-ipam-ds-0-1'
    connectivity_test.go:359: Arping loss rate: 8.000000%
    fixtures.go:531: Deleting Pod 'test-pod-pqk2v325'
    fixtures.go:531: Deleting Pod 'test-pod-rm7xh2oj'
=== RUN   TestAntreaIPAM/testAntreaIPAMOVSFlowReplay
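A side note on the `exit code 137` seen in the flow replay failure above: 137 = 128 + 9, the shell's convention for a process terminated by SIGKILL, which suggests the `ovs-ctl restart` was forcibly killed (for example by a timeout or the container runtime) rather than failing on its own. A minimal sketch of how that exit code arises:

```shell
# A process terminated by a signal is reported by the shell as
# exit code 128 + signal number, so SIGKILL (9) yields 137.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"            # wait's status reflects how the job ended
echo "exit code: $?"   # prints "exit code: 137"
```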
luolanzone added the kind/bug label on Jun 18, 2024
@luolanzone
Contributor Author

@gran-vmv @antoninbas let me know if I missed any fix that I should backport to v2.0. I have a vague impression that there was a fix related to OVS restart, but I couldn't find any clue.

@antoninbas
Contributor

@luolanzone I think you can ignore these failures for the patch release. AFAIK, we didn't backport anything related to this to the release-2.0 branch.
We did have some similar failures for the main branch (not specific to FlexibleIPAM), with the following related changes:

  1. Enable Pod network after realizing initial NetworkPolicies #5777 delayed realization of the Pod network on Agent start, without changing the logic for removing flow-restore-wait. At the time we didn't observe any failure because all e2e testing was using the coverage image, which was flawed in a way that was hiding the issue. This change is part of release-2.0.
  2. Replace bincover with built-in Go coverage profiling tool #6090 improved code coverage collection and the flaw that existed in the coverage image was removed. After merging this change, we started observing failures for testOVSRestartSameNode.
  3. Delay removal of flow-restore-wait #6342 resolved the issue by correctly delaying the removal of flow-restore-wait in the bridge.
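For background (not stated in the original comment): `flow-restore-wait` is a documented `other_config` key on the OVS `Open_vSwitch` table; while it is set to `true`, ovs-vswitchd does not set up flows or forward traffic, so a controller can replay its flows before the datapath comes back. A sketch of the sequence the changes above revolve around, assuming a node with a live OVS installation (not runnable as-is):

```shell
# 1. Before restarting ovs-vswitchd, ask it to hold off on forwarding:
ovs-vsctl set Open_vSwitch . other_config:flow-restore-wait="true"

# 2. Restart the daemons (ovs-ctl restart saves and restores flows):
/usr/share/openvswitch/scripts/ovs-ctl restart

# 3. The agent replays its flows (e.g. via ovs-ofctl add-flows).

# 4. Only once flows (and, per #6342, initial NetworkPolicies) are
#    realized, remove the flag so the datapath starts forwarding:
ovs-vsctl remove Open_vSwitch . other_config flow-restore-wait
```

Removing the flag too early reopens the window in which Pods have connectivity but stale or missing flows, which is consistent with the transient ARP timeouts seen in the logs above.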

While the issue caused by #5777 should affect testing for the release-2.0 branch, in practice we should not observe test failures as long as we are using a coverage-enabled image built with bincover. So maybe you could double-check that the FlexibleIPAM e2e tests are using the correct image - or rather images, since we have one for the Agent and one for the Controller?

We cannot backport #6090 and #6342 to release-2.0 as these are pretty significant changes, which should not go into a patch release.
If you do observe similar test failures for the main branch, then we would have to look into it. BTW, these failures should also exist for release-1.15 and the upcoming 1.15.2 patch release, because #5777 is also part of the release-1.15 branch.
