
EKS general pod failures (caused by kubelet crash?) #1548

Open
djmcgreal-cc opened this issue Dec 19, 2023 · 6 comments

djmcgreal-cc commented Dec 19, 2023

What happened:

We've had several instances where multiple nodes (1.24.16-20230825) have become unable to reliably launch Pods. Here's an example of the kinds of errors we see:

 Events:
   Type     Reason       Age                From               Message
   ----     ------       ----               ----               -------
   Warning  FailedMount  27s (x2 over 27s)  kubelet            MountVolume.SetUp failed for volume "lakefs" : mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
 Mounting command: mount
 Mounting arguments: -t tmpfs -o size=63510704128 tmpfs /var/lib/kubelet/pods/1cbf573e-2101-468e-b6df-9251edd3ada2/volumes/kubernetes.io~secret/lakefs
 Output:
   Warning  FailedMount  27s (x2 over 27s)  kubelet  MountVolume.SetUp failed for volume "kube-api-access-w866f" : mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
 Mounting command: mount
 Mounting arguments: -t tmpfs -o size=63510704128 tmpfs /var/lib/kubelet/pods/1cbf573e-2101-468e-b6df-9251edd3ada2/volumes/kubernetes.io~projected/kube-api-access-w866f
 Output:
   Warning  Failed   19s                kubelet  Error: failed to prepare subPath for volumeMount "input-artifacts" of container "wait"
  Warning  Failed   19s                kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to start init: fork/exec /proc/self/exe: resource temporarily unavailable: unknown
   Warning  Failed   18s                kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown
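
"resource temporarily unavailable" from fork/exec is the textual form of EAGAIN, so my working assumption is thread/PID exhaustion somewhere on the node. A rough sketch of the checks I'm running (the cgroup v1 paths are assumptions for this AL2 AMI and may not exist on other images):

# total threads on the node vs. the kernel-wide ceilings
ps -eT | wc -l
sysctl kernel.pid_max kernel.threads-max

# pids-cgroup accounting for the node daemons, if the pids controller is mounted at these paths
cat /sys/fs/cgroup/pids/system.slice/kubelet.service/pids.current
cat /sys/fs/cgroup/pids/system.slice/containerd.service/pids.current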

What you expected to happen:

Nodes to remain resilient under load.

How to reproduce it (as minimally and precisely as possible):

The affected nodes are under high CPU load, but other nodes under the same load don't have this problem. We only started seeing it in the last week, but haven't changed the AMI since August, or any other configuration such as networking.

Clues about what I can look at next are very welcome!

Anything else we need to know?:

Nodes in this state have repeatedly restarting kubelets, which might be because of this:

Dec 19 16:22:16 ip-10-0-0-159 kubelet: runtime: failed to create new OS thread (have 12 already; errno=11)
Dec 19 16:22:16 ip-10-0-0-159 kubelet: runtime: may need to increase max user processes (ulimit -u)
Dec 19 16:22:16 ip-10-0-0-159 kubelet: fatal error: newosproc

I've also seen this ulimit error in Pod events. When I run ulimit -u on the node using nsenter, it tells me it's unlimited. journald is also suppressing a lot of kubelet messages, so the kubelet is going quite nuts.
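
That said, ulimit -u inside an nsenter shell only reports that shell's own RLIMIT_NPROC, not necessarily what the kubelet process is running with, so here's a sketch of a more direct check (assuming the standard kubelet.service so that pidof kubelet resolves):

# RLIMIT_NPROC and thread count as seen by the running kubelet itself
grep 'Max processes' /proc/$(pidof kubelet)/limits
grep Threads /proc/$(pidof kubelet)/status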

Nodes that get into this state do not recover their reliability (some Pods do get scheduled), even after every workload is drained.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): m5.4xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.14
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.24
  • AMI Version: 1.24.16-20230825
  • Kernel (e.g. uname -a): Linux ip-10-0-0-159.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0f2b325398f933a81"
BUILD_TIME="Fri Aug 25 20:04:37 UTC 2023"
BUILD_KERNEL="5.10.186-179.751.amzn2.x86_64"
ARCH="x86_64"

dims (Member) commented Dec 19, 2023

@djmcgreal-cc can you check what LimitNPROC is set to in your /etc/systemd/system/containerd.service file?

xref: golang/go#49438
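
A quick way to confirm the effective value on the running units (assuming the stock containerd and kubelet systemd services on that AMI):

systemctl show -p LimitNPROC containerd kubelet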

djmcgreal-cc (Author) commented Dec 19, 2023

[root@ip-10-0-0-159 /]# cat /etc/systemd/system/containerd.service
cat: /etc/systemd/system/containerd.service: No such file or directory

[root@ip-10-0-0-159 /]# grep -r LimitNPROC /etc
/etc/systemd/system.conf:#DefaultLimitNPROC=
/etc/systemd/user.conf:#DefaultLimitNPROC=
/etc/systemd/system.conf.d/50-limits.conf:DefaultLimitNPROC=infinity:infinity

[root@ip-10-0-0-159 /]# find /etc -name containerd.service
/etc/systemd/system/multi-user.target.wants/containerd.service

[root@ip-10-0-0-159 /]# grep LimitNPROC /etc/systemd/system/multi-user.target.wants/containerd.service
LimitNPROC=infinity

I'm reading the issue you linked and it looks like it was fixed in Go 1.20. So the question is which version of Go kubelet was compiled with? I'm not sure what the relevance of containerd is.

djmcgreal-cc (Author) commented

[root@ip-10-0-0-159 /]# go version  /usr/bin/containerd
/usr/bin/containerd: go1.20.7
[root@ip-10-0-0-159 /]# go version /usr/bin/kubelet
/usr/bin/kubelet: go1.20.6

Does that mean it's not related to golang/go#49438 (and golang/go@14018c8)? If so I'm sad, because you guys would really have pulled it out of the bag there!

djmcgreal-cc (Author) commented

Then again, the errno is definitely 11, so either retryOnEAGAIN isn't being called, or it's failing 20 times in a row?
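
One way to gauge how often the kubelet is actually dying on this, as a lower bound given the journald suppression mentioned above (a sketch, assuming kubelet logs go to the kubelet journald unit as on this AMI):

journalctl -u kubelet --since "1 hour ago" | grep -c 'failed to create new OS thread'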

djmcgreal-cc (Author) commented

@-ing @cartermckinnon due to #899 (comment).

cartermckinnon (Member) commented

Are you still having this issue?
