EKS general pod failures (caused by kubelet crash?) #1548
Comments
@djmcgreal-cc check this cross-reference: golang/go#49438
I'm reading the issue you linked and it looks like it's been fixed in Go 1.20. So the question is: which version of Go was kubelet compiled with? I'm not sure of the relevance of containerd.
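For reference, a quick sketch of how to check which Go toolchain a kubelet binary was built with (assumes the binary lives at `/usr/bin/kubelet` on the node, or that you copy it somewhere a `go` toolchain is available):

```bash
# Print the Go version (and build info) embedded in the binary.
go version -m /usr/bin/kubelet

# Without a go toolchain on the node, grep the version string out directly.
strings /usr/bin/kubelet | grep -o 'go1\.[0-9.]*' | sort -u
```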
Does that mean it's not related to golang/go#49438 (and golang/go@14018c8)? If so I'm sad, because you guys would've really pulled it out of the bag there!
Then again, the errno is definitely 11, so either the
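For what it's worth, errno 11 on Linux is EAGAIN ("Resource temporarily unavailable"), which is what clone()/pthread_create() return when a thread or PID limit is hit. A quick way to confirm the mapping (assumes kernel headers are installed at the usual path):

```bash
# EAGAIN is defined as 11 in the generic Linux errno header.
grep -n 'EAGAIN' /usr/include/asm-generic/errno-base.h
```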
@-ing @cartermckinnon due to #899 (comment).
Are you still having this issue? |
What happened:
We've had several instances where multiple nodes (1.24.16-20230825) have become unable to reliably launch Pods. Here's an example of the kinds of errors we see:
What you expected to happen:
Nodes resilient under load.
How to reproduce it (as minimally and precisely as possible):
Nodes are under high CPU load, but there are other nodes under the same load that don't have this problem. We've only started seeing it in the last week, and we haven't changed the AMI since August, or any other configuration, like networking.
Clues about what I can look at next are very welcome!
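A couple of checks that might narrow this down (a sketch; `<node-name>` is a placeholder):

```bash
# Node conditions reported by the kubelet; PIDPressure in particular is
# interesting given the thread/ulimit errors discussed above.
kubectl describe node <node-name> | grep -A 8 'Conditions:'

# Pods on that node that are not Running.
kubectl get pods -A --field-selector spec.nodeName=<node-name> | grep -v Running
```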
Anything else we need to know?:
Nodes in this state have restarting kubelets, which might be this.
I've also seen this ulimit error in Pod events. When I run `ulimit -u` on the node using nsenter, it tells me it's `unlimited`. journald is also suppressing a lot of kubelet messages, so it's going quite nuts. Nodes that get into this state do not recover reliability (some Pods do get scheduled) even after every workload is drained. See the sketch below for the limit checks I've been running.
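Since `ulimit -u` in an nsenter shell doesn't necessarily reflect what the kubelet process itself is allowed, here's a sketch of how to inspect the running kubelet's limits, the kernel-wide ceilings, and recent restarts (assumes the kubelet runs as a systemd unit named `kubelet` and that `pgrep` is available, as on the EKS AMIs):

```bash
# Limits that apply to the running kubelet process (not the shell).
grep -i processes /proc/$(pgrep -o kubelet)/limits

# How many threads the kubelet is using right now.
grep Threads /proc/$(pgrep -o kubelet)/status

# Kernel-wide ceilings that can also produce EAGAIN on thread creation.
sysctl kernel.pid_max kernel.threads-max

# Limits configured on the systemd unit itself.
systemctl show kubelet -p LimitNPROC -p LimitNOFILE

# Recent kubelet crashes/restarts.
systemctl status kubelet --no-pager
journalctl -u kubelet --no-pager | grep -iE 'panic|fatal' | tail -n 20
```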
Environment:
- `aws eks describe-cluster --name <name> --query cluster.platformVersion`: eks.14
- `aws eks describe-cluster --name <name> --query cluster.version`: 1.24
- `uname -a`: Linux ip-10-0-0-159.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- `cat /etc/eks/release` (on a node):