flaky e2e jobs after EKS migration #2898

Closed
chendave opened this issue Jun 19, 2023 · 27 comments
Labels: area/test, kind/flake, priority/important-soon

@chendave (Member) commented Jun 19, 2023

https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-25
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-27

Both are flaky, but I cannot see anything suspicious in the logs.

This e2e test is flaky in 1-25 (not always):

[sig-node] Variable Expansion should fail substituting values in a volume subpath with absolute path [Slow] [Conformance]

but the other e2e tests run successfully in the end.

@pacoxu (Member) commented Jun 19, 2023

The test failed at 18:32:07; the pod itself only failed at 18:33:23.

Jun 18 18:32:07.284: FAIL: while waiting for the pod container to fail: timed out while waiting for pod var-expansion-2973/var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce to be container 0 failed with reason CreateContainerConfigError

Jun 18 18:33:23.299: INFO: At 2023-06-18 18:33:20 +0000 UTC - event for var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce: {kubelet kinder-regular-worker-2} Failed: Error: error SubPath /tmp must not be an absolute path

According to the log, the pod failed just after the 2-minute timeout (PodStartShortTimeout) expired.
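For illustration only (this is not the e2e framework's actual code), a minimal kubectl sketch of the condition the test keeps polling during that 2-minute window, using the namespace and pod name from the failure above:

```sh
# poll the first container's waiting reason until it becomes
# CreateContainerConfigError or roughly 2 minutes have passed
NS=var-expansion-2973
POD=var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce
for i in $(seq 1 60); do
  reason=$(kubectl -n "$NS" get pod "$POD" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}' 2>/dev/null)
  if [ "$reason" = "CreateContainerConfigError" ]; then
    echo "container failed with the expected reason"
    exit 0
  fi
  sleep 2
done
echo "timed out waiting for CreateContainerConfigError" >&2
exit 1
```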

While waiting for the timeout, the kubelet logged errors like the following:

Jun 18 18:29:14 kinder-regular-worker-2 kubelet[394]: E0618 18:29:14.871370     394 kubelet.go:2190] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=116.755183604
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:30:23.129947     394 trace.go:205] Trace[704497890]: "iptables ChainExists" (18-Jun-2023 18:30:18.120) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[704497890]: [5.008922585s] [5.008922585s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:30:23.130301     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:30:23.199649     394 trace.go:205] Trace[147984487]: "iptables ChainExists" (18-Jun-2023 18:30:18.191) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[147984487]: [5.008338221s] [5.008338221s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:30:23.199674     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:31:23.128331     394 trace.go:205] Trace[816972421]: "iptables ChainExists" (18-Jun-2023 18:31:18.119) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[816972421]: [5.008329778s] [5.008329778s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:31:23.128354     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:31:23.199197     394 trace.go:205] Trace[2033996605]: "iptables ChainExists" (18-Jun-2023 18:31:18.190) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[2033996605]: [5.008492559s] [5.008492559s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:31:23.199218     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:32:23.128717     394 trace.go:205] Trace[1249408050]: "iptables ChainExists" (18-Jun-2023 18:32:18.120) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[1249408050]: [5.008274162s] [5.008274162s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:32:23.128739     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:32:23.199624     394 trace.go:205] Trace[929886770]: "iptables ChainExists" (18-Jun-2023 18:32:18.191) (total time: 5008ms):
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: Trace[929886770]: [5.008394903s] [5.008394903s] END
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: W0618 18:32:23.199650     394 iptables.go:587] Could not check for iptables canary mangle/KUBE-KUBELET-CANARY: exit status 4
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: E0618 18:33:05.741530     394 kubelet.go:2190] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=230.870079681
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:33:05.741578     394 topology_manager.go:205] "Topology Admit Handler"
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:33:05.858901     394 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"workdir1\" (UniqueName: \"kubernetes.io/empty-dir/f536f460-121f-44ff-bdca-e9ceddd50f2b-workdir1\") pod \"var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce\" (UID: \"f536f460-121f-44ff-bdca-e9ceddd50f2b\") " pod="var-expansion-2973/var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce"
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:33:05.858969     394 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-8njzp\" (UniqueName: \"kubernetes.io/projected/f536f460-121f-44ff-bdca-e9ceddd50f2b-kube-api-access-8njzp\") pod \"var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce\" (UID: \"f536f460-121f-44ff-bdca-e9ceddd50f2b\") " pod="var-expansion-2973/var-expansion-2cf003b9-6f5e-4734-b2dc-d3e5acee3bce"
Jun 18 18:33:14 kinder-regular-worker-2 kubelet[394]: I0618 18:33:05.956382     394 topology_manager.go:205] "Topology Admit Handler"

@neolit123 added labels: priority/important-soon, kind/flake, area/test on Jun 19, 2023
@neolit123 added this to the v1.28 milestone on Jun 19, 2023
@chendave changed the title from "flaky e2e job kubeadm-kinder-1-27 && kubeadm-kinder-1-27" to "flaky e2e job kubeadm-kinder-1-25 && kubeadm-kinder-1-27" on Jun 19, 2023
@neolit123 changed the title from "flaky e2e job kubeadm-kinder-1-25 && kubeadm-kinder-1-27" to "flaky e2e jobs after EKS migration" on Jun 19, 2023
@neolit123 (Member) commented

Renamed the issue to be more general, as there seem to be a lot more flaky jobs.
Oddly, some are passing 100% of the time, but I think it's just random:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm

My suspicion right now is resource constraints: VM CPU / resource sharing limits at EKS.
We used to get the same general problem on the prow GCE / Google cluster from time to time.

@dims @ameukam Unfortunately, I don't think this is a clean signal for kubeadm.
We might have to revert to the k8s-infra prow cluster unless there is a way to bump resources for these EKS VMs.

This of course depends on my assumption above about resource constraints; if someone finds another reason, let me know.

@ameukam (Member) commented Jun 19, 2023

> Renamed the issue to be more general, as there seem to be a lot more flaky jobs. Oddly, some are passing 100% of the time, but I think it's just random: https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm
>
> My suspicion right now is resource constraints: VM CPU / resource sharing limits at EKS. We used to get the same general problem on the prow GCE / Google cluster from time to time.
>
> @dims @ameukam Unfortunately, I don't think this is a clean signal for kubeadm. We might have to revert to the k8s-infra prow cluster unless there is a way to bump resources for these EKS VMs.
>
> This of course depends on my assumption above about resource constraints; if someone finds another reason, let me know.

@neolit123 Feel free to revert to the k8s-infra cluster if the current signal is not good enough for you.

@dims (Member) commented Jun 19, 2023

@neolit123 My personal preference, as you know, would be to work through these issues, but I totally understand if you wish to just revert. I'd hate for us not to be able to use the resources we have in AWS (we are probably going to consume 1/3 of the budget this year and let the rest go unused, as it does not roll over to next year).

We could try bumping the CPU/memory limits if you think that will help?

Looking at https://prow.k8s.io/?job=*kubeadm-kinder*&state=failure I think we were still tweaking things yesterday for the inotify stuff, so I'd only focus on things that failed after those changes landed as well. Please do compare it to the jobs that succeeded as well ( https://prow.k8s.io/?job=*kubeadm-kinder*&state=success ).
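(For context, the "inotify stuff" refers to raising the host inotify limits that nested kind/kinder clusters tend to exhaust; the exact values applied on the EKS nodes aren't shown here, so the ones below are only illustrative.)

```sh
# typical inotify bumps for hosts running many nested node containers
sudo sysctl fs.inotify.max_user_watches=1048576
sudo sysctl fs.inotify.max_user_instances=8192
```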

But in the end it's your call. Thanks for considering the above arguments.

@neolit123 (Member) commented Jun 19, 2023

> @neolit123 My personal preference, as you know, would be to work through these issues, but I totally understand if you wish to just revert. I'd hate for us not to be able to use the resources we have in AWS (we are probably going to consume 1/3 of the budget this year and let the rest go unused, as it does not roll over to next year).

One concern about clean signal is that we plan to start working on a new API for kubeadm in 1.28 and later in 1.29. We could leave the latest jobs (1.28) in the k8s-infra cluster and the rest (the majority in numbers) running on EKS until we resolve the issues.

But as mentioned above, to me this does seem like a VM resource sharing/limitation problem, based on the randomness of the flakes. While on the Google prow, we never found a solution for that; it just happened. If the assumption is true, we need to ensure EKS can guarantee the needed resources. What can be done as an experiment, e.g. to bump VM resources, and who can help us debug?

@neolit123 (Member) commented

> We could try bumping the CPU/memory limits if you think that will help?

These values were migrated from the k8s-infra prow to EKS. It can be done as a test, but IIRC if the cloud provider infra is busy, we won't get the requested values.

@dims (Member) commented Jun 19, 2023

cc @xmudrii @pkprzekwas

@xmudrii (Member) commented Jun 19, 2023

Thanks @dims for pinging me on this! 👋 Let me try to unpack this issue.

> @dims @ameukam Unfortunately, I don't think this is a clean signal for kubeadm.
> We might have to revert to the k8s-infra prow cluster unless there is a way to bump resources for these EKS VMs.

Those EKS VMs are beefy and heavily underutilized. To put some concrete numbers on it: we have 20 nodes in the cluster, and the node with the highest memory usage is at 6%. CPU usage is in a similar state, ranging from a few percent up to 30% on some nodes. Also, those machines are r5a.4xlarge, which have 16 vCPUs, 128 GB RAM, and dedicated NVMe SSDs.

> @neolit123 My personal preference, as you know, would be to work through these issues, but I totally understand if you wish to just revert. I'd hate for us not to be able to use the resources we have in AWS (we are probably going to consume 1/3 of the budget this year and let the rest go unused, as it does not roll over to next year).

I strongly agree with @dims here, but after all, this is your call. I think we should work to solve those issues; using the EKS cluster should be the default if a job doesn't depend on GCP resources. We managed to solve most of the failures and flakes within a reasonable timeframe, and I think we can do the same here, but we need someone from SCL to help us.

If you decide to revert, please leave some of those jobs as canaries, so that we can work on fixing those issues and track whether anything is getting better.

> But as mentioned above, to me this does seem like a VM resource sharing/limitation problem, based on the randomness of the flakes. While on the Google prow, we never found a solution for that; it just happened. If the assumption is true, we need to ensure EKS can guarantee the needed resources. What can be done as an experiment, e.g. to bump VM resources, and who can help us debug?

Can you please provide some context on how those jobs work? That will help us figure out which direction to go.

@neolit123 (Member) commented Jun 19, 2023

> Can you please provide some context on how those jobs work? That will help us figure out which direction to go.

These jobs create a containerd-in-Docker kubeadm cluster (similar to kind), usually with 3 control-plane nodes and 2 worker nodes. They then perform some kubeadm-specific tests and optionally run the official k8s e2e test suite with tests that support parallel mode.
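For context, a rough sketch of that workflow with kinder (the flags, image name, and task order here are illustrative, not the jobs' exact commands):

```sh
# create a containerd-in-docker cluster: 3 control-plane nodes, 2 workers
kinder create cluster --name=kinder-regular --image=kindest/node:latest \
  --control-plane-nodes=3 --worker-nodes=2

# bring the cluster up with kubeadm
kinder do kubeadm-init --name=kinder-regular
kinder do kubeadm-join --name=kinder-regular

# ...kubeadm-specific tasks and, optionally, the parallel k8s e2e suite run here...

# tear everything down
kinder delete cluster --name=kinder-regular --loglevel=debug
```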

> I strongly agree with @dims here, but after all, this is your call. I think we should work to solve those issues; using the EKS cluster should be the default if a job doesn't depend on GCP resources. We managed to solve most of the failures and flakes within a reasonable timeframe, and I think we can do the same here, but we need someone from SCL to help us.

The inotify change yesterday did seem to improve the flakes, and given that some of these jobs run only twice a day, we can give it more time, I guess.

> Those EKS VMs are beefy and heavily underutilized. To put some concrete numbers on it: we have 20 nodes in the cluster, and the node with the highest memory usage is at 6%. CPU usage is in a similar state, ranging from a few percent up to 30% on some nodes. Also, those machines are r5a.4xlarge, which have 16 vCPUs, 128 GB RAM, and dedicated NVMe SSDs.

Looking at https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm, some of the flakes seem to be related to running the k8s e2e suite. Those are likely due to timeouts:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-learner-mode-latest/1670623392035770368
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-learner-mode-latest

There are a good number of cases where the prow pod gets deleted, like here:
Job execution failed: Pod got deleted unexpectedly
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-25/1670492909419368448
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-25
and here:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-latest


A note to kubeadm maintainers, and something that is not EKS-related IMO (a docker exec issue?): there are also some cases where kubeadm init fails:

time="02:40:15" level=info msg="Running action kubeadm-init..."
time="02:40:15" level=debug msg="Running: docker exec kinder-xony-control-plane-1 mkdir -p /kinder/patches"
Error: failed to exec action kubeadm-init: failed to create /kinder/patches folder: exit status 1
 exit status 1

We have seen this before, but it was never understood; the fix was to retry all dir/file creations in our clustering tool.
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-1-25-on-1-24/1670258734548389888/build-log.txt
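A retry wrapper along these lines (hypothetical, not kinder's actual code) is the kind of workaround that was applied:

```sh
# retry the directory creation a few times, since docker exec
# occasionally fails transiently when the node is under load
for attempt in 1 2 3 4 5; do
  if docker exec kinder-xony-control-plane-1 mkdir -p /kinder/patches; then
    break
  fi
  echo "mkdir attempt ${attempt} failed, retrying..." >&2
  sleep 2
done
```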

@pacoxu (Member) commented Jun 20, 2023

> The inotify change yesterday did seem to improve the flakes, and given that some of these jobs run only twice a day, we can give it more time, I guess.

Agreed. Most of the tests passed in the last run, and kubeadm-kinder-no-addons-latest failed while deleting a kinder container.
This seems to be a general flake that I have run into before on GCE.

# task-07-delete
kinder delete cluster --name=kinder-upgrade --loglevel=debug

Deleting cluster "kinder-upgrade" ...
Error: failed to delete cluster "kinder-upgrade": failed to delete nodes: command "docker rm -f -v kinder-upgrade-worker-1 kinder-upgrade-control-plane-1" failed with error: exit status 1
 exit status 1

BTW, there is a warning message from the AWS runs:

[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1030-aws\n", err: exit status 1

in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kinder-no-addons-latest/1670858696168574976

@neolit123 (Member) commented Jun 20, 2023

> BTW, there is a warning message from the AWS runs:

@xmudrii @pkprzekwas @dims

Can we solve this missing kernel config problem?

Here is how system-validators searches for it; usually users don't see this problem on normal distros:
https://github.com/kubernetes/system-validators/blob/master/validators/kernel_validator.go#L177

Without the kernel config, kubeadm cannot tell whether the node is k8s/kubelet compliant (kernel features).

@dims (Member) commented Jun 20, 2023

@xmudrii @pkprzekwas I can confirm that /boot/config-5.15.0-1030-aws exists on the prow nodes, but it is not available to the jobs, which causes kubeadm to complain.

@BenTheElder @michelle192837 @xmcqueen How do we handle this in existing prow instances?

@xmudrii (Member) commented Jun 20, 2023

@dims @neolit123 Here's my understanding of the situation. system-validators:

- try to load the kernel config from a known file path
  - this doesn't work in a container because we don't mount, let's say, /boot/config-*
- if loading the config from a known file path doesn't work (like in our case), fall back to modprobe configs to get the kernel config

The issue is that modprobe configs doesn't work on Ubuntu. When I run this command directly on the node, I get exactly the same error:

modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1030-aws

That is correct, because Ubuntu doesn't store the kernel config there, but in /boot. The only option that I see for now is to mount the kernel config in read-only mode, so that system-validators can read the file from a known path. We should probably research whether there's a better way.
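To illustrate the two lookup paths on an Ubuntu node (run on the host; the exact behaviour of system-validators may differ slightly from this sketch):

```sh
# known file path: present on Ubuntu
ls /boot/config-$(uname -r)        # e.g. /boot/config-5.15.0-1030-aws

# modprobe fallback: the "configs" module is not shipped on Ubuntu,
# so /proc/config.gz never appears
sudo modprobe configs              # FATAL: Module configs not found ...
zcat /proc/config.gz | head        # only works if the module loaded
```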

> @BenTheElder @michelle192837 @xmcqueen How do we handle this in existing prow instances?

To my knowledge, this is an Ubuntu-specific thing, and we don't use Ubuntu on existing Prow instances.

@neolit123 (Member) commented

> The only option that I see for now is to mount the kernel config in read-only mode, so that system-validators can read the file from a known path. We should probably research whether there's a better way.

I think that's what kind is doing. One alternative on the kubeadm side is to just ignore the warning, but I don't think that's a good idea.
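For reference, a hypothetical kind-style bind mount of the host kernel config into a node container (the image name and paths are placeholders, not what kinder actually runs):

```sh
# expose the host kernel config read-only at a path system-validators checks
docker run -d --privileged --name demo-node \
  -v /boot/config-$(uname -r):/boot/config-$(uname -r):ro \
  -v /lib/modules:/lib/modules:ro \
  kindest/node:latest
```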

@xmudrii (Member) commented Jun 21, 2023

@dims @ameukam Are we okay with giving read-only access to /boot/config-* to the jobs that need it, so that we can solve the issue described above? ^^

@ameukam (Member) commented Jun 21, 2023

> To my knowledge, this is an Ubuntu-specific thing, and we don't use Ubuntu on existing Prow instances.

We in fact use Ubuntu instances, at least for the community-owned clusters. I think it's OK to have those kernel modules exposed to our prowjobs read-only; the security risk is very low, but IMHO it's probably not worth it. Our ultimate goal is to have all the prowjobs running on k8s-infra (the community infrastructure), which was already the case before the migration to EKS.

@xmudrii (Member) commented Jun 21, 2023

> I think it's OK to have those kernel modules exposed to our prowjobs read-only; the security risk is very low, but IMHO it's probably not worth it.

I think not exposing those kernel modules can be a problem for kubeadm; we should probably have some coverage on that side too, but @neolit123 can weigh in on this.

@neolit123 (Member) commented

Yes, please try to mount the config at one of the known paths. It will be used to pass validation of required/optional kernel features.

@xmudrii (Member) commented Jun 22, 2023

@ameukam What do you think about creating a new preset (e.g. preset-kernel-config-ro) that includes the needed volume and volume mount? Then jobs that need this can simply include the preset in their spec. I can take care of this if that's okay and we want to go that way.

@pacoxu (Member) commented Jun 25, 2023

Recent flakes are job timeouts like the one below:

Test started yesterday at 9:09 PM error after 46m2s. (more info)

Job execution failed: Pod got deleted unexpectedly

These flakes do not trigger an alert to https://groups.google.com/a/kubernetes.io/g/sig-cluster-lifecycle-kubeadm-alerts/ because the next attempt passes.

@neolit123 (Member) commented

> Job execution failed: Pod got deleted unexpectedly

This is also what I see as the most common failure.
AFAIK, only the cluster owners can debug it.

@xmudrii (Member) commented Jun 26, 2023

We're investigating some stability issues that might cause this error; see kubernetes/k8s.io#5473 for more details.

@ameukam (Member) commented Jun 27, 2023

> @ameukam What do you think about creating a new preset (e.g. preset-kernel-config-ro) that includes the needed volume and volume mount? Then jobs that need this can simply include the preset in their spec. I can take care of this if that's okay and we want to go that way.

+1 for this preset.

@chendave (Member, Author) commented Aug 3, 2023

I checked the current status and found just one flake:

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-28-on-1-27

Deleting cluster "kinder-xony" ...
Error: failed to delete cluster "kinder-xony": failed to delete nodes: command "docker rm -f -v kinder-xony-lb kinder-xony-worker-1 kinder-xony-control-plane-2 kinder-xony-worker-2 kinder-xony-control-plane-1 kinder-xony-control-plane-3" failed with error: exit status 1

Before closing this issue, I want to confirm with @neolit123 @SataQiu @pacoxu: is that a known issue?

@pacoxu (Member) commented Aug 3, 2023

> Error: failed to delete cluster "kinder-xony": failed to delete nodes: command "docker rm -f "

I remember this being a very common failure when docker hangs while deleting a container.

@neolit123 modified the milestone from v1.28 to v1.29 on Aug 18, 2023
@pacoxu (Member) commented Aug 29, 2023

/close
per #2911 (we have now moved the kubeadm jobs back to the GKE cluster)

@k8s-ci-robot (Contributor) commented

@pacoxu: Closing this issue.

In response to this:

> /close
> per #2911 (we have now moved the kubeadm jobs back to the GKE cluster)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
