-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace bincover with built-in Go coverage profiling tool #6090
Conversation
2bd4862
to
6de5db9
Compare
6de5db9
to
daed98b
Compare
30772ec
to
4851993
Compare
@@ -131,7 +133,7 @@ spec: | |||
imagePullPolicy: {{ include "antreaAgentImagePullPolicy" . }} | |||
{{- if ((.Values.testing).coverage) }} | |||
command: ["/bin/sh"] | |||
args: ["-c", "sleep 2; antrea-agent-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-agent.cov.out -args-file=/agent-arg-file; while true; do sleep 5 & wait $!; done"] | |||
args: ["-c", "sleep 2; antrea-agent-coverage $(cat /agent-arg-file | tr '\n' ' '); while true; do sleep 5 & wait $!; done"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to find out why the sleep was necessary in the first place, and whether we still think it is necessary today. Maybe you could for the original PR that added it and check for any relevant discussion there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why but we are explicity killing the processes in killProcessAndCollectCovFiles(). So it waits for SIGINT and then the process exits.
But why is killing the process required here? We are collecting the coverage files only after the tests have ran.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't get coverage data until the process stops. As a reminder, coverage is reported by the instrumented binary (antrea-agent-coverage
), not by the tests themselves. The Pod also cannot stop until after we have successfully downloaded the coverage files, or the files will be lost. This is probably why we sleep after the process exits. I am not sure why we need to sleep before running the antrea-agent-coverage
program though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the process shouldn't be running after the tests have ran successfully. I am able to collect coverage files locally without sleep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow. Once the Pod is stopped / deleted, the coverage files are gone. When executing e2e tests, there are several instances where we update the Antrea configuration and need to restart the Antrea components (which means recreate the Pods). If we don't proactively kill the processes manually and retrieve the coverage files, we would miss some data.
a08e4a3
to
a2d2db9
Compare
@antoninbas Is it fine if I add more commits to a PR or should one PR be only one commit? |
it's better if you add a new commit for each review cycle, then squash all commits at the very end |
multicluster/config/overlays/leader-ns/coverage/manager_command_patch_coverage.yaml
Outdated
Show resolved
Hide resolved
8672450
to
baae5b5
Compare
@antoninbas flow-aggregator images build are failing: https://github.com/antrea-io/antrea/actions/runs/8301256864/job/22720993922?pr=6090#step:3:30 I don't understand why pull access would be denied. |
|
multicluster/config/overlays/leader-ns/coverage/manager_command_patch_coverage.yaml
Outdated
Show resolved
Hide resolved
What is the use of this function? I don't think any coverage files are being written in the control plane Node. |
6203bbe
to
ab65fde
Compare
test/e2e/framework.go
Outdated
@@ -119,7 +119,7 @@ const ( | |||
antreaControllerCovFile = "antrea-controller.cov.out" | |||
antreaAgentCovFile = "antrea-agent.cov.out" | |||
flowAggregatorCovFile = "flow-aggregator.cov.out" | |||
cpNodeCoverageDir = "/tmp/antrea-e2e-coverage" | |||
cpNodeCoverageDir = "/tmp/coverage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why do we need to change this directory name?
I feel the old name is more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to use the same directory name throughout(pods, nodes, etc.). Will revert if it creates ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it may create some ambiguity as the name is a bit too "generic", and this is the Node's filesystem
multicluster/Makefile
Outdated
@@ -82,7 +82,7 @@ bin: fmt vet ## Build manager binary. | |||
|
|||
.PHONY: antrea-mc-instr-binary | |||
antrea-mc-instr-binary: | |||
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go test -tags testbincover -covermode count -coverpkg=antrea.io/antrea/multicluster/... -c -o bin/antrea-mc-controller-coverage antrea.io/antrea/multicluster/cmd/... | |||
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -cover -coverpkg=antrea.io/antrea/multicluster/...,antrea.io/antrea/multicluster/cmd/... -o bin/antrea-mc-controller-coverage1 antrea.io/antrea/multicluster/cmd/... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it antrea-mc-controller-coverage1
, typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shikharish I don't think you resolved this. It's possible that you forgot to commit / push.
Be careful when resolving conversations on Github, as it can make it harder when doing follow-up reviews if a conversation is resolved by mistake.
args: ["-c", "/antrea-mc-controller-coverage -test.run=TestBincoverRunMain -test.coverprofile=antrea-mc-controller.cov.out leader --config=/controller_manager_config.yaml; while true; do sleep 5 & wait $!; done"] | ||
name: antrea-mc-controller | ||
- args: | ||
- antrea-mc-controller-coverage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value of -test.coverprofile
after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use that flag anymore. The instrumented binary will generate covmeta and covcounter files. They will be moved to our local coverage directory and then converted to cov.out files using go tool covdata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will impact the way to upload the coverage to codecov, please check here
Line 469 in a8d78bc
kubectl cp ${namespace}/$mc_controller_pod_name:antrea-mc-controller.cov.out ${COVERAGE_DIR}/$mc_controller_pod_name-$timestamp ${kubeconfig} |
Please double check and verify that all new generated coverage files can be uploaded to codecov site correctly.
2a4870a
to
6907a19
Compare
test/e2e/utils/run_cov_binary.sh
Outdated
while true; do | ||
# Sleep in the background, to allow exiting as soon as a signal is received | ||
sleep 5 & wait $! | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I figured out the root cause for #6318
This code is not handling the SIGTERM signal correctly, which means that the antrea-agent Pod takes a full 30s to terminate when it is deleted. This means that when we delete the antrea-agent Pod in a test case, it has the potential of impacting subsequent test cases, depending on how these test cases are written.
This is my bad since I came up with the code for the script. Can you replace it with the following code? It should solve the issue.
set -euo pipefail
TEST_PID=
function quit {
echo "Received signal, exiting"
# While not strictly required, it is better to try to terminate the test process gracefully if it is still running.
if [ "$TEST_PID" != "" ]; then
echo "Sending signal to test process"
kill -SIGTERM $TEST_PID > /dev/null 2>&1 || true
wait $TEST_PID
echo "Test process exited gracefully"
fi
exit 0
}
# This is necessary: without an explicit signal handler, the SIGTERM signal will be discarded when running
# as a process with PID 1 in a container.
trap 'quit' SIGTERM
# Run the test process in the background, to allow exiting when a signal is received.
"$@" &
TEST_PID=$!
wait $TEST_PID
TEST_PID=
# Sleep in the background, to allow exiting as soon as a signal is received.
sleep infinity & wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this way will impact the coverage collection?
2024/05/11 15:31:41 Error when gracefully exit antrea agent: error when gracefully exiting Antrea Agent: error
when copying antctl coverage files: error when running this find command '[bash -c find /tmp/coverage -mindepth
1]' on Pod 'antrea-agent-cp7c4', stderr: <>, err: <Internal error occurred: error executing command in container:
failed to exec in container: failed to start exec
"1d044df8621e3a24943111287da3529bb53e25c04a2e8c8da3eecb1625916618": OCI runtime exec failed: exec
failed: cannot exec in a stopped container: unknown>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luolanzone Where are these logs from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also seen these errors in the logs for one of the Github CI jobs:
2024-05-11T15:06:39.0693719Z I0511 15:06:39.068915 23828 framework.go:2651] Sending SIGINT to 'antrea-agent-coverage'
2024-05-11T15:06:39.1196604Z I0511 15:06:39.119191 23828 framework.go:2657] Copying coverage files from Pod 'antrea-agent-j92js'
2024-05-11T15:06:42.3927758Z I0511 15:06:42.391304 23828 framework.go:2797] Copying file: covmeta.ede75673d2ae4194072b9dd9af8339de
2024-05-11T15:06:42.4480081Z I0511 15:06:42.447436 23828 framework.go:2737] Coverage file not available yet for copy: cannot retrieve content of file '/tmp/coverage/covmeta.ede75673d2ae4194072b9dd9af8339de' from Pod 'antrea-agent-j92js', stderr: <>, err: <Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "55149056a7d640b206c4e2daf48e4c25985b2df698c89c225e0731372cd38449": task a4707bcd87f8f641de784cabd92eeec4adf3932f00874264ae59fd138ab4dc67 not found: not found>
2024-05-11T15:06:43.3923117Z I0511 15:06:43.391975 23828 framework.go:2797] Copying file: covmeta.ede75673d2ae4194072b9dd9af8339de
2024-05-11T15:31:41.3860016Z I0511 15:31:41.385498 23828 framework.go:2657] Copying coverage files from Pod 'antrea-agent-cp7c4'
2024-05-11T15:31:41.4442434Z 2024/05/11 15:31:41 Error when gracefully exit antrea agent: error when gracefully exiting Antrea Agent: error when copying antctl coverage files: error when running this find command '[bash -c find /tmp/coverage -mindepth 1]' on Pod 'antrea-agent-cp7c4', stderr: <>, err: <Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "1d044df8621e3a24943111287da3529bb53e25c04a2e8c8da3eecb1625916618": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown>
This is the job: https://github.com/antrea-io/antrea/actions/runs/9044365104/job/24853065716?pr=6090
These errors are confusing as they seem to indicate that the container doesn't exist anymore. Which should not be the case since at that point we have killed the process, but not the Pod / container. The first one in particular is confusing for me, because it seems that we succeed in copying the file on the second attempt. Any idea @shikharish ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can run the test again after @shikharish disables the TestConnectivity/testOVSRestartSameNode
test case, but it's making me a bit uneasy that we are even seeing these errors now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that the failure is in TestWireGuard
and only due to the flag --feature-gates AllAlpha=true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed the only case in which we see it happen in CI, but I don't think it has anything to do with TestWireGuard
. TestWireGuard
is actually skipped when all features are enabled, as it is not compatible with Multicast. The errors happen when we collect coverage data at the end of the test suite:
Line 750 in 6a7c606
defer gracefulExitAntrea(testData) |
We need to figure out what's going on, and it seems to happen consistently in CI. I wasn't able to reproduce locally by running the following command: go test -timeout=10m -v -count=1 -run=TestWireGuard antrea.io/antrea/test/e2e -provider=kind -coverage=true -deploy-antrea=false
.
This means that we will need to find a way to extract more information from the CI job, in order to troubleshoot.
It looks to me that we are always able to 1) retrieve the PID for the Agent process, and 2) kill the process. However, something is not working when we retrieve coverage data: one of the container seems to have been stopped. If you can find a way to do a kubectl describe
on that specific Pod (refer to waitForAntreaDaemonSetPods
), this may provide us with enough information. I think the right place to do it is in killProcessAndCollectCovFiles
. If collectCovFiles
returns an error, run kubectl describe
for that Pod, and print the contents of stdout.
Please make these changes in a separate commit; these changes may not need to be temporary. If done well we could keep them for future troubleshooting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas I reproduced the error locally, here is the kubectl describe
output for that pod:
I0515 03:27:30.333084 1309413 framework.go:2660] Name: antrea-agent-5dxnn
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: antrea-agent
Node: kind-worker2/172.18.0.2
Start Time: Tue, 14 May 2024 21:57:06 +0000
Labels: app=antrea
component=antrea-agent
controller-revision-hash=75f74c6b6d
pod-template-generation=1
Annotations: checksum/config: d0b905f11f84008ec883b25acf2106020200fc226e51ca647d0083956e532ae0
kubectl.kubernetes.io/default-container: antrea-agent
Status: Running
IP: 172.18.0.2
IPs:
IP: 172.18.0.2
Controlled By: DaemonSet/antrea-agent
Init Containers:
install-cni:
Container ID: containerd://654efbe0b9208d7127c4b2f1d2fdfdb2fc6b12f78515d84915cf3b133c295cfd
Image: antrea/antrea-agent-ubuntu-coverage:latest
Image ID: docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
Port: <none>
Host Port: <none>
Command:
install_cni
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 14 May 2024 21:57:06 +0000
Finished: Tue, 14 May 2024 21:57:06 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment:
SKIP_CNI_BINARIES:
Mounts:
/etc/antrea/antrea-cni.conflist from antrea-config (ro,path="antrea-cni.conflist")
/host/etc/cni/net.d from host-cni-conf (rw)
/host/opt/cni/bin from host-cni-bin (rw)
/lib/modules from host-lib-modules (ro)
/var/run/antrea from host-var-run-antrea (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
Containers:
antrea-agent:
Container ID: containerd://e514acc78e315282fde3253a7f24d061006af88c1ac3ff4b60cefcdd46c1c0b6
Image: antrea/antrea-agent-ubuntu-coverage:latest
Image ID: docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
Port: 10350/TCP
Host Port: 10350/TCP
Args:
antrea-agent-coverage
--config=/etc/antrea/antrea-agent.conf
--logtostderr=false
--log_dir=/var/log/antrea
--alsologtostderr
--log_file_max_size=100
--log_file_max_num=4
--v=4
State: Running
Started: Tue, 14 May 2024 21:57:07 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 200m
Liveness: http-get https://localhost:api/livez delay=10s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get https://localhost:api/readyz delay=5s timeout=5s period=10s #success=1 #failure=8
Environment:
POD_NAME: antrea-agent-5dxnn (v1:metadata.name)
POD_NAMESPACE: kube-system (v1:metadata.namespace)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/antrea/antrea-agent.conf from antrea-config (ro,path="antrea-agent.conf")
/host/proc from host-proc (ro)
/host/var/run/netns from host-var-run-netns (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/cni from host-var-run-antrea (rw,path="cni")
/var/log/antrea from host-var-log-antrea (rw)
/var/run/antrea from host-var-run-antrea (rw)
/var/run/openvswitch from host-var-run-antrea (rw,path="openvswitch")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
antrea-ovs:
Container ID: containerd://999590d3b1960cc24c9412bbe7455c19ba66d6eeeca9972e9e719c3d162954a3
Image: antrea/antrea-agent-ubuntu-coverage:latest
Image ID: docker.io/library/import-2024-05-14@sha256:ab5b4129a4ee158f3ddeae628c1bf6c161a7b51d429cdcebc3a5b408ff05e2e2
Port: <none>
Host Port: <none>
Command:
start_ovs
Args:
--log_file_max_size=100
--log_file_max_num=4
State: Running
Started: Tue, 14 May 2024 21:57:07 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 200m
Liveness: exec [/bin/sh -c timeout 10 container_liveness_probe ovs] delay=5s timeout=10s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/var/log/openvswitch from host-var-log-antrea (rw,path="openvswitch")
/var/run/openvswitch from host-var-run-antrea (rw,path="openvswitch")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qgqfg (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
antrea-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: antrea-config
Optional: false
host-cni-conf:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-cni-bin:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
host-proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
host-var-run-netns:
Type: HostPath (bare host directory volume)
Path: /var/run/netns
HostPathType:
host-var-run-antrea:
Type: HostPath (bare host directory volume)
Path: /var/run/antrea
HostPathType: DirectoryOrCreate
host-var-log-antrea:
Type: HostPath (bare host directory volume)
Path: /var/log/antrea
HostPathType: DirectoryOrCreate
host-lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
kube-api-access-qgqfg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24s default-scheduler Successfully assigned kube-system/antrea-agent-5dxnn to kind-worker2
Normal Pulled 24s kubelet Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
Normal Created 24s kubelet Created container install-cni
Normal Started 24s kubelet Started container install-cni
Normal Pulled 23s kubelet Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
Normal Created 23s kubelet Created container antrea-agent
Normal Started 23s kubelet Started container antrea-agent
Normal Pulled 23s kubelet Container image "antrea/antrea-agent-ubuntu-coverage:latest" already present on machine
Normal Created 23s kubelet Created container antrea-ovs
Normal Started 23s kubelet Started container antrea-ovs
Warning Unhealthy 14s kubelet Readiness probe failed: Get "https://localhost:10350/readyz": dial tcp [::1]:10350: connect: connection refused
30b4b95
to
d060a26
Compare
@luolanzone The coverage on my branch is around 65%, significantly less that that of the main branch. https://app.codecov.io/gh/antrea-io/antrea/tree/remove-bincover |
Oh it increased to 72% with the latest commit. |
@antoninbas The tests still failed but this time due to |
So I would not worry about this failure. I don't think it was introduced by this PR. As a matter of fact I was able to reproduce with the latest Antrea release (v2.0). I tried several combinations and I was able to reproduce in each case:
I think that the previous version of the coverage image had some issue, and the Pod was probably taking a few seconds to terminate. It seems that it may have had an impact on the outcome of the test, but I have to dig deeper. Given that in CI we only run the test with coverage enabled, we could have missed the issue. My suggestion would be to skip the test for now, so that it doesn't block this PR. You can add the following at the beginning of the sub-test: |
Fixes antrea-io#4962 Signed-off-by: Shikhar Soni <[email protected]>
Signed-off-by: Shikhar Soni <[email protected]>
Signed-off-by: Shikhar Soni <[email protected]>
7750090
to
4ec1b09
Compare
/test-all |
/test-windows-containerd-e2e |
Thanks for your hard work @shikharish and congrats on getting this PR merged! |
The Dockerfile was reintroduced by mistake in #6090 Signed-off-by: Shikhar Soni <[email protected]>
There should no longer be a need for a custom timeout value in testAntreaGracefulExit when coverage is enabled. The increased timeout was not actually required for code coverage collection (which is qite fast), but because of issues with the coverage-enabled image, which was not handling signals correctly. After antrea-io#6090, this should have been fixed. Signed-off-by: Antonin Bas <[email protected]>
There should no longer be a need for a custom timeout value in testAntreaGracefulExit when coverage is enabled. The increased timeout was not actually required for code coverage collection (which is quite fast), but because of issues with the coverage-enabled image, which was not handling signals correctly. After #6090, this should have been fixed. Signed-off-by: Antonin Bas <[email protected]>
This PR makes the shift from the deprecated bincover package to Golang built-in coverage profiling tool.