Replace bincover with built-in Go coverage profiling tool #6090

Merged: 3 commits merged into antrea-io:main from remove-bincover on May 16, 2024

Conversation

shikharish
Contributor

This PR shifts from the deprecated bincover package to Go's built-in coverage profiling tool.

@shikharish shikharish force-pushed the remove-bincover branch 2 times, most recently from 2bd4862 to 6de5db9, on March 11, 2024 13:21
@shikharish shikharish closed this Mar 11, 2024
@shikharish shikharish reopened this Mar 11, 2024
@shikharish shikharish force-pushed the remove-bincover branch 2 times, most recently from 30772ec to 4851993, on March 11, 2024 13:50
@shikharish shikharish changed the title from "wip: Remove bincover" to "wip: Replace bincover with built-in Go coverage profiling tool" on Mar 11, 2024
@shikharish shikharish marked this pull request as ready for review March 11, 2024 14:48
.github/workflows/kind.yml (outdated review thread, resolved)
Makefile (outdated review thread, resolved)
build/charts/antrea/templates/agent/daemonset.yaml (outdated review thread, resolved)
build/charts/antrea/templates/agent/daemonset.yaml (outdated review thread, resolved)
build/charts/antrea/templates/agent/daemonset.yaml (outdated review thread, resolved)
@@ -131,7 +133,7 @@ spec:
imagePullPolicy: {{ include "antreaAgentImagePullPolicy" . }}
{{- if ((.Values.testing).coverage) }}
command: ["/bin/sh"]
args: ["-c", "sleep 2; antrea-agent-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-agent.cov.out -args-file=/agent-arg-file; while true; do sleep 5 & wait $!; done"]
args: ["-c", "sleep 2; antrea-agent-coverage $(cat /agent-arg-file | tr '\n' ' '); while true; do sleep 5 & wait $!; done"]
Contributor

It would be good to find out why the sleep was necessary in the first place, and whether we still think it is necessary today. Maybe you could look for the original PR that added it and check for any relevant discussion there.

Contributor Author

@shikharish shikharish Mar 11, 2024

I am not sure why, but we are explicitly killing the processes in killProcessAndCollectCovFiles(). So it waits for SIGINT and then the process exits.

But why is killing the process required here? We are collecting the coverage files only after the tests have run.

Contributor

We won't get coverage data until the process stops. As a reminder, coverage is reported by the instrumented binary (antrea-agent-coverage), not by the tests themselves. The Pod also cannot stop until after we have successfully downloaded the coverage files, or the files will be lost. This is probably why we sleep after the process exits. I am not sure why we need to sleep before running the antrea-agent-coverage program though.
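(For context, a minimal sketch of how Go's built-in binary coverage behaves, assuming Go 1.20+; the package path, flags, and directory below are illustrative, not necessarily the exact ones used in this PR:)

# Build an instrumented binary with the built-in coverage support (Go 1.20+).
go build -cover -coverpkg=antrea.io/antrea/... -o antrea-agent-coverage ./cmd/antrea-agent
# Run it with GOCOVERDIR pointing at a writable directory. The covmeta/covcounter
# files are only flushed when the process exits, which is why the process must be
# stopped before the coverage files can be collected.
mkdir -p /tmp/coverage
GOCOVERDIR=/tmp/coverage ./antrea-agent-coverage --config=/etc/antrea/antrea-agent.conf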

Contributor Author

@shikharish shikharish Mar 12, 2024

But the process shouldn't be running after the tests have run successfully. I am able to collect coverage files locally without the sleep.

Contributor

I am not sure I follow. Once the Pod is stopped / deleted, the coverage files are gone. When executing e2e tests, there are several instances where we update the Antrea configuration and need to restart the Antrea components (which means recreating the Pods). If we don't proactively kill the processes and retrieve the coverage files, we would miss some data.

@shikharish shikharish force-pushed the remove-bincover branch 4 times, most recently from a08e4a3 to a2d2db9, on March 13, 2024 07:37
build/images/Dockerfile.build.agent.coverage (outdated review thread, resolved)
build/images/Dockerfile.build.controller.coverage (outdated review thread, resolved)
ci/kind/test-e2e-kind.sh (review thread, resolved)
build/charts/antrea/templates/agent/daemonset.yaml (outdated review thread, resolved)
@shikharish
Contributor Author

@antoninbas Is it fine if I add more commits to a PR or should one PR be only one commit?

@antoninbas
Contributor

@antoninbas Is it fine if I add more commits to a PR or should one PR be only one commit?

it's better if you add a new commit for each review cycle, then squash all commits at the very end

@shikharish
Contributor Author

@antoninbas The flow-aggregator image builds are failing: https://github.com/antrea-io/antrea/actions/runs/8301256864/job/22720993922?pr=6090#step:3:30

I don't understand why pull access would be denied.

@antoninbas
Contributor

@antoninbas The flow-aggregator image builds are failing: https://github.com/antrea-io/antrea/actions/runs/8301256864/job/22720993922?pr=6090#step:3:30

I don't understand why pull access would be denied.

antrea-build is an intermediate build image that doesn't exist outside of the build process. I think you meant to use flow-aggregator-build there.

.gitignore (outdated review thread, resolved)
build/images/Dockerfile.build.agent.coverage (outdated review thread, resolved)
build/images/Dockerfile.build.coverage (outdated review thread, resolved)
build/images/Dockerfile.build.coverage (outdated review thread, resolved)
hack/git_client_side_hooks/pre-push (outdated review thread, resolved)
test/e2e/utils/run_cov_binary.sh (review thread, resolved)
test/e2e/framework.go (outdated review thread, resolved)
@shikharish
Contributor Author

shikharish commented Mar 18, 2024

What is the use of this function? I don't think any coverage files are being written on the control plane Node.

@shikharish shikharish requested a review from luolanzone March 18, 2024 21:46
@shikharish shikharish changed the title from "wip: Replace bincover with built-in Go coverage profiling tool" to "Replace bincover with built-in Go coverage profiling tool" on Mar 22, 2024
@shikharish shikharish force-pushed the remove-bincover branch 2 times, most recently from 6203bbe to ab65fde, on March 22, 2024 09:11
build/images/flow-aggregator/Dockerfile.coverage (outdated review thread, resolved)
@@ -119,7 +119,7 @@ const (
antreaControllerCovFile = "antrea-controller.cov.out"
antreaAgentCovFile = "antrea-agent.cov.out"
flowAggregatorCovFile = "flow-aggregator.cov.out"
cpNodeCoverageDir = "/tmp/antrea-e2e-coverage"
cpNodeCoverageDir = "/tmp/coverage"
Contributor

I am not sure why we need to change this directory name.
I feel the old name is clearer.

Contributor Author

I decided to use the same directory name throughout (Pods, Nodes, etc.). I will revert if it creates ambiguity.

Contributor

Yes, it may create some ambiguity, as the name is a bit too "generic" and this is the Node's filesystem.

@@ -82,7 +82,7 @@ bin: fmt vet ## Build manager binary.

.PHONY: antrea-mc-instr-binary
antrea-mc-instr-binary:
- CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go test -tags testbincover -covermode count -coverpkg=antrea.io/antrea/multicluster/... -c -o bin/antrea-mc-controller-coverage antrea.io/antrea/multicluster/cmd/...
+ CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -cover -coverpkg=antrea.io/antrea/multicluster/...,antrea.io/antrea/multicluster/cmd/... -o bin/antrea-mc-controller-coverage1 antrea.io/antrea/multicluster/cmd/...
Contributor

why is it antrea-mc-controller-coverage1, typo?

Contributor Author

Yes🙃

Contributor

@shikharish I don't think you resolved this. It's possible that you forgot to commit / push.
Be careful when resolving conversations on GitHub, as it can make follow-up reviews harder if a conversation is resolved by mistake.

args: ["-c", "/antrea-mc-controller-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-mc-controller.cov.out leader --config=/controller_manager_config.yaml; while true; do sleep 5 & wait $!; done"]
name: antrea-mc-controller
- args:
- antrea-mc-controller-coverage
Contributor

What's the value of -test.coverprofile after this change?

Contributor Author

We don't use that flag anymore. The instrumented binary will generate covmeta and covcounter files. They will be moved to our local coverage directory and then converted to cov.out files using go tool covdata.
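For illustration, the conversion step could look like the following sketch (the input directory and output file names are placeholders, not necessarily the ones used in this PR):

# Merge the raw covmeta/covcounter files and emit a legacy-format profile
# that go tool cover and codecov can consume.
go tool covdata textfmt -i=/path/to/coverage/dir -o=antrea-mc-controller.cov.out
# Optionally, print a per-package coverage summary directly:
go tool covdata percent -i=/path/to/coverage/dir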

Contributor

I think this will impact the way the coverage is uploaded to codecov; please check here:

kubectl cp ${namespace}/$mc_controller_pod_name:antrea-mc-controller.cov.out ${COVERAGE_DIR}/$mc_controller_pod_name-$timestamp ${kubeconfig}
and correct it correspondingly.
Please double-check and verify that all newly generated coverage files can be uploaded to the codecov site correctly.

@shikharish shikharish force-pushed the remove-bincover branch 2 times, most recently from 2a4870a to 6907a19, on April 1, 2024 18:36
Comment on lines 26 to 29
while true; do
# Sleep in the background, to allow exiting as soon as a signal is received
sleep 5 & wait $!
done
Contributor

I think I figured out the root cause for #6318

This code is not handling the SIGTERM signal correctly, which means that the antrea-agent Pod takes a full 30s to terminate when it is deleted. This means that when we delete the antrea-agent Pod in a test case, it has the potential of impacting subsequent test cases, depending on how these test cases are written.

This is my bad since I came up with the code for the script. Can you replace it with the following code? It should solve the issue.

set -euo pipefail

TEST_PID=

function quit {
    echo "Received signal, exiting"
    # While not strictly required, it is better to try to terminate the test process gracefully if it is still running.
    if [ "$TEST_PID" != "" ]; then
        echo "Sending signal to test process"
        kill -SIGTERM $TEST_PID > /dev/null 2>&1 || true
        wait $TEST_PID
        echo "Test process exited gracefully"
    fi
    exit 0
}

# This is necessary: without an explicit signal handler, the SIGTERM signal will be discarded when running
# as a process with PID 1 in a container.
trap 'quit' SIGTERM

# Run the test process in the background, to allow exiting when a signal is received.
"$@" &
TEST_PID=$!
wait $TEST_PID
TEST_PID=

# Sleep in the background, to allow exiting as soon as a signal is received.
sleep infinity & wait
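For reference, a usage sketch, assuming the script above is installed as the entrypoint wrapper in the coverage image (the path /run_cov_binary.sh and the flags are illustrative):

# Run as PID 1 in the container; the trap above ensures SIGTERM is forwarded to
# the instrumented binary so it can flush its coverage data before the Pod stops.
/run_cov_binary.sh antrea-agent-coverage --config=/etc/antrea/antrea-agent.conf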

Contributor

Looks like this change will impact the coverage collection?

 2024/05/11 15:31:41 Error when gracefully exit antrea agent: error when gracefully exiting Antrea Agent: error 
when copying antctl coverage files: error when running this find command '[bash -c find /tmp/coverage  -mindepth 
1]' on Pod 'antrea-agent-cp7c4', stderr: <>, err: <Internal error occurred: error executing command in container: 
failed to exec in container: failed to start exec 
"1d044df8621e3a24943111287da3529bb53e25c04a2e8c8da3eecb1625916618": OCI runtime exec failed: exec 
failed: cannot exec in a stopped container: unknown>

Contributor Author

@luolanzone Where are these logs from?

Contributor

I have also seen these errors in the logs for one of the GitHub CI jobs:

2024-05-11T15:06:39.0693719Z I0511 15:06:39.068915   23828 framework.go:2651] Sending SIGINT to 'antrea-agent-coverage'
2024-05-11T15:06:39.1196604Z I0511 15:06:39.119191   23828 framework.go:2657] Copying coverage files from Pod 'antrea-agent-j92js'
2024-05-11T15:06:42.3927758Z I0511 15:06:42.391304   23828 framework.go:2797] Copying file: covmeta.ede75673d2ae4194072b9dd9af8339de
2024-05-11T15:06:42.4480081Z I0511 15:06:42.447436   23828 framework.go:2737] Coverage file not available yet for copy: cannot retrieve content of file '/tmp/coverage/covmeta.ede75673d2ae4194072b9dd9af8339de' from Pod 'antrea-agent-j92js', stderr: <>, err: <Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "55149056a7d640b206c4e2daf48e4c25985b2df698c89c225e0731372cd38449": task a4707bcd87f8f641de784cabd92eeec4adf3932f00874264ae59fd138ab4dc67 not found: not found>
2024-05-11T15:06:43.3923117Z I0511 15:06:43.391975   23828 framework.go:2797] Copying file: covmeta.ede75673d2ae4194072b9dd9af8339de
2024-05-11T15:31:41.3860016Z I0511 15:31:41.385498   23828 framework.go:2657] Copying coverage files from Pod 'antrea-agent-cp7c4'
2024-05-11T15:31:41.4442434Z 2024/05/11 15:31:41 Error when gracefully exit antrea agent: error when gracefully exiting Antrea Agent: error when copying antctl coverage files: error when running this find command '[bash -c find /tmp/coverage  -mindepth 1]' on Pod 'antrea-agent-cp7c4', stderr: <>, err: <Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "1d044df8621e3a24943111287da3529bb53e25c04a2e8c8da3eecb1625916618": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown>

This is the job: https://github.com/antrea-io/antrea/actions/runs/9044365104/job/24853065716?pr=6090

These errors are confusing as they seem to indicate that the container doesn't exist anymore, which should not be the case since at that point we have killed the process, but not the Pod / container. The first one in particular is confusing for me, because it seems that we succeed in copying the file on the second attempt. Any idea @shikharish?

Contributor

We can run the test again after @shikharish disables the TestConnectivity/testOVSRestartSameNode test case, but it's making me a bit uneasy that we are even seeing these errors now.

Contributor Author

I found that the failure is in TestWireGuard and only due to the flag --feature-gates AllAlpha=true.

Contributor

This is indeed the only case in which we see it happen in CI, but I don't think it has anything to do with TestWireGuard. TestWireGuard is actually skipped when all features are enabled, as it is not compatible with Multicast. The errors happen when we collect coverage data at the end of the test suite:

defer gracefulExitAntrea(testData)

We need to figure out what's going on, and it seems to happen consistently in CI. I wasn't able to reproduce locally by running the following command: go test -timeout=10m -v -count=1 -run=TestWireGuard antrea.io/antrea/test/e2e -provider=kind -coverage=true -deploy-antrea=false.

This means that we will need to find a way to extract more information from the CI job, in order to troubleshoot.
It looks to me that we are always able to 1) retrieve the PID for the Agent process, and 2) kill the process. However, something is not working when we retrieve coverage data: one of the containers seems to have been stopped. If you can find a way to do a kubectl describe on that specific Pod (refer to waitForAntreaDaemonSetPods), this may provide us with enough information. I think the right place to do it is in killProcessAndCollectCovFiles. If collectCovFiles returns an error, run kubectl describe for that Pod, and print the contents of stdout.
Please make these changes in a separate commit; they may not need to be temporary. If done well, we could keep them for future troubleshooting.

Contributor Author

@shikharish shikharish May 14, 2024

@antoninbas I reproduced the error locally, here is the kubectl describe output for that pod:

I0515 03:27:30.333084 1309413 framework.go:2660] Name:                 antrea-agent-5dxnn
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      antrea-agent
Node:                 kind-worker2/172.18.0.2
Start Time:           Tue, 14 May 2024 21:57:06 +0000
Labels:               app=antrea
                      component=antrea-agent
                      controller-revision-hash=75f74c6b6d
                      pod-template-generation=1
Annotations:          checksum/config: d0b905f11f84008ec883b25acf2106020200fc226e51ca647d0083956e532ae0
                      kubectl.kubernetes.io/default-container: antrea-agent
Status:               Running
IP:                   172.18.0.2
IPs:
  IP:           172.18.0.2
Controlled By:  DaemonSet/antrea-agent
Init Containers:
  install-cni:
    Container ID:  containerd://654efbe0b9208d7127c4b2f1d2fdfdb2fc6b12f78515d84915cf3b133c295cfd
    Image:         antrea/antrea-agent-ubuntu-coverage:latest
    Image ID:      docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
    Port:          <none>
    Host Port:     <none>
    Command:
      install_cni
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 May 2024 21:57:06 +0000
      Finished:     Tue, 14 May 2024 21:57:06 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      SKIP_CNI_BINARIES:  
    Mounts:
      /etc/antrea/antrea-cni.conflist from antrea-config (ro,path="antrea-cni.conflist")
      /host/etc/cni/net.d from host-cni-conf (rw)
      /host/opt/cni/bin from host-cni-bin (rw)
      /lib/modules from host-lib-modules (ro)
      /var/run/antrea from host-var-run-antrea (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
Containers:
  antrea-agent:
    Container ID:  containerd://e514acc78e315282fde3253a7f24d061006af88c1ac3ff4b60cefcdd46c1c0b6
    Image:         antrea/antrea-agent-ubuntu-coverage:latest
    Image ID:      docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
    Port:          10350/TCP
    Host Port:     10350/TCP
    Args:
      antrea-agent-coverage
      --config=/etc/antrea/antrea-agent.conf
      --logtostderr=false
      --log_dir=/var/log/antrea
      --alsologtostderr
      --log_file_max_size=100
      --log_file_max_num=4
      --v=4
    State:          Running
      Started:      Tue, 14 May 2024 21:57:07 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      200m
    Liveness:   http-get https://localhost:api/livez delay=10s timeout=5s period=10s #success=1 #failure=5
    Readiness:  http-get https://localhost:api/readyz delay=5s timeout=5s period=10s #success=1 #failure=8
    Environment:
      POD_NAME:       antrea-agent-5dxnn (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
      NODE_NAME:       (v1:spec.nodeName)
    Mounts:
      /etc/antrea/antrea-agent.conf from antrea-config (ro,path="antrea-agent.conf")
      /host/proc from host-proc (ro)
      /host/var/run/netns from host-var-run-netns (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/cni from host-var-run-antrea (rw,path="cni")
      /var/log/antrea from host-var-log-antrea (rw)
      /var/run/antrea from host-var-run-antrea (rw)
      /var/run/openvswitch from host-var-run-antrea (rw,path="openvswitch")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
  antrea-ovs:
    Container ID:  containerd://999590d3b1960cc24c9412bbe7455c19ba66d6eeeca9972e9e719c3d162954a3
    Image:         antrea/antrea-agent-ubuntu-coverage:latest
    Image ID:      docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
    Port:          <none>
    Host Port:     <none>
    Command:
      start_ovs
    Args:
      --log_file_max_size=100
      --log_file_max_num=4
    State:          Running
      Started:      Tue, 14 May 2024 21:57:07 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        200m
    Liveness:     exec [/bin/sh -c timeout 10 container_liveness_probe ovs] delay=5s timeout=10s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /var/log/openvswitch from host-var-log-antrea (rw,path="openvswitch")
      /var/run/openvswitch from host-var-run-antrea (rw,path="openvswitch")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  antrea-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      antrea-config
    Optional:  false
  host-cni-conf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  host-proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  
  host-var-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/netns
    HostPathType:  
  host-var-run-antrea:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/antrea
    HostPathType:  DirectoryOrCreate
  host-var-log-antrea:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/antrea
    HostPathType:  DirectoryOrCreate
  host-lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  kube-api-access-qgqfg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  24s   default-scheduler  Successfully assigned kube-system/antrea-agent-5dxnn to kind-worker2
  Normal   Pulled     24s   kubelet            Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
  Normal   Created    24s   kubelet            Created container install-cni
  Normal   Started    24s   kubelet            Started container install-cni
  Normal   Pulled     23s   kubelet            Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
  Normal   Created    23s   kubelet            Created container antrea-agent
  Normal   Started    23s   kubelet            Started container antrea-agent
  Normal   Pulled     23s   kubelet            Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
  Normal   Created    23s   kubelet            Created container antrea-ovs
  Normal   Started    23s   kubelet            Started container antrea-ovs
  Warning  Unhealthy  14s   kubelet            Readiness probe failed: Get "https://localhost:10350/readyz": dial tcp [::1]:10350: connect: connection refused

@shikharish
Contributor Author

@luolanzone The coverage on my branch is around 65%, significantly less than that of the main branch. https://app.codecov.io/gh/antrea-io/antrea/tree/remove-bincover

@shikharish
Contributor Author

Oh it increased to 72% with the latest commit.

@shikharish
Contributor Author

@antoninbas The tests still failed but this time due to TestConnectivity/testOVSRestartSameNode.

@antoninbas
Contributor

@antoninbas The tests still failed but this time due to TestConnectivity/testOVSRestartSameNode.

So I would not worry about this failure. I don't think it was introduced by this PR. As a matter of fact I was able to reproduce with the latest Antrea release (v2.0). I tried several combinations and I was able to reproduce in each case:

  • using your PR, with -coverage=true
  • using your PR, with -coverage=false
  • using Antrea v2.0, with -coverage=true

I think that the previous version of the coverage image had some issue, and the Pod was probably taking a few seconds to terminate. It seems that it may have had an impact on the outcome of the test, but I have to dig deeper. Given that in CI we only run the test with coverage enabled, we could have missed the issue. My suggestion would be to skip the test for now, so that it doesn't block this PR. You can add the following at the beginning of the sub-test: t.Skip("Skipping test for now as it fails consistently"), and I will investigate more later.

test/e2e/framework.go (outdated review thread, resolved)
@shikharish shikharish requested a review from antoninbas May 14, 2024 20:04
@antoninbas
Contributor

/test-all
/test-multicluster-e2e

@antoninbas
Contributor

/test-windows-containerd-e2e

@antoninbas antoninbas merged commit 8d9c455 into antrea-io:main May 16, 2024
56 of 59 checks passed
@antoninbas
Contributor

Thanks for your hard work @shikharish and congrats on getting this PR merged!

@antoninbas antoninbas added the action/release-note label (Indicates a PR that should be included in release notes.) on May 16, 2024
@shikharish shikharish deleted the remove-bincover branch May 17, 2024 20:34
antoninbas pushed a commit that referenced this pull request May 29, 2024
The Dockerfile was reintroduced by mistake in #6090

Signed-off-by: Shikhar Soni <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request May 31, 2024
There should no longer be a need for a custom timeout value in
testAntreaGracefulExit when coverage is enabled. The increased timeout
was not actually required for code coverage collection (which is quite
fast), but because of issues with the coverage-enabled image, which was
not handling signals correctly. After antrea-io#6090, this should have been
fixed.

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this pull request Jun 3, 2024
There should no longer be a need for a custom timeout value in
testAntreaGracefulExit when coverage is enabled. The increased timeout
was not actually required for code coverage collection (which is quite
fast), but because of issues with the coverage-enabled image, which was
not handling signals correctly. After #6090, this should have been
fixed.

Signed-off-by: Antonin Bas <[email protected]>
Labels
action/release-note Indicates a PR that should be included in release notes.
3 participants