
EKS general pod failures (caused by kubelet crash?) #1548

Open
djmcgreal-cc opened this issue Dec 19, 2023 · 6 comments

djmcgreal-cc commented Dec 19, 2023

What happened:

We've had several instances where multiple nodes (1.24.16-20230825) have become unable to reliably launch Pods. Here's an example of the kinds of errors we see:

 Events:
   Type     Reason       Age                From               Message
   ----     ------       ----               ----               -------
   Warning  FailedMount  27s (x2 over 27s)  kubelet            MountVolume.SetUp failed for volume "lakefs" : mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
 Mounting command: mount
 Mounting arguments: -t tmpfs -o size=63510704128 tmpfs /var/lib/kubelet/pods/1cbf573e-2101-468e-b6df-9251edd3ada2/volumes/kubernetes.io~secret/lakefs
 Output:
   Warning  FailedMount  27s (x2 over 27s)  kubelet  MountVolume.SetUp failed for volume "kube-api-access-w866f" : mount failed: fork/exec /usr/bin/mount: resource temporarily unavailable
 Mounting command: mount
 Mounting arguments: -t tmpfs -o size=63510704128 tmpfs /var/lib/kubelet/pods/1cbf573e-2101-468e-b6df-9251edd3ada2/volumes/kubernetes.io~projected/kube-api-access-w866f
 Output:
   Warning  Failed   19s                kubelet  Error: failed to prepare subPath for volumeMount "input-artifacts" of container "wait"
  Warning  Failed   19s                kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to start init: fork/exec /proc/self/exe: resource temporarily unavailable: unknown
   Warning  Failed   18s                kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: can't get final child's PID from pipe: EOF: unknown
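
"resource temporarily unavailable" from fork/exec is the textual form of EAGAIN, so my working assumption is thread/PID exhaustion somewhere on the node. A rough sketch of the checks I'm running (the cgroup v1 paths are assumptions for this AL2 AMI and may not exist on other images):

# total threads on the node vs. the kernel-wide ceilings
ps -eT | wc -l
sysctl kernel.pid_max kernel.threads-max

# pids-cgroup accounting for the node daemons, if the pids controller is mounted at these paths
cat /sys/fs/cgroup/pids/system.slice/kubelet.service/pids.current
cat /sys/fs/cgroup/pids/system.slice/containerd.service/pids.current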

What you expected to happen:

Nodes to remain resilient under load.

How to reproduce it (as minimally and precisely as possible):

The affected nodes are under high CPU load, but other nodes under the same load don't have this problem. We only started seeing it in the last week, but haven't changed the AMI since August, or any other configuration such as networking.

Clues about what I can look at next are very welcome!

Anything else we need to know?:

Nodes in this state have repeatedly restarting kubelets, which might be because of this:

Dec 19 16:22:16 ip-10-0-0-159 kubelet: runtime: failed to create new OS thread (have 12 already; errno=11)
Dec 19 16:22:16 ip-10-0-0-159 kubelet: runtime: may need to increase max user processes (ulimit -u)
Dec 19 16:22:16 ip-10-0-0-159 kubelet: fatal error: newosproc

I've also seen this ulimit error in Pod events. When I run ulimit -u on the node using nsenter, it tells me it's unlimited. journald is also suppressing a lot of kubelet messages, so the kubelet is going quite nuts.
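
That said, ulimit -u inside an nsenter shell only reports that shell's own RLIMIT_NPROC, not necessarily what the kubelet process is running with, so here's a sketch of a more direct check (assuming the standard kubelet.service so that pidof kubelet resolves):

# RLIMIT_NPROC and thread count as seen by the running kubelet itself
grep 'Max processes' /proc/$(pidof kubelet)/limits
grep Threads /proc/$(pidof kubelet)/status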

Nodes that get into this state do not recover their reliability (some Pods do get scheduled), even after every workload is drained.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): m5.4xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.14
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.24
  • AMI Version: 1.24.16-20230825
  • Kernel (e.g. uname -a): Linux ip-10-0-0-159.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0f2b325398f933a81"
BUILD_TIME="Fri Aug 25 20:04:37 UTC 2023"
BUILD_KERNEL="5.10.186-179.751.amzn2.x86_64"
ARCH="x86_64"

dims (Member) commented Dec 19, 2023

@djmcgreal-cc can you check what LimitNPROC is set to in your /etc/systemd/system/containerd.service file?

xref: golang/go#49438
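
A quick way to confirm the effective value on the running units (assuming the stock containerd and kubelet systemd services on that AMI):

systemctl show -p LimitNPROC containerd kubelet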

djmcgreal-cc (Author) commented Dec 19, 2023

[root@ip-10-0-0-159 /]# cat /etc/systemd/system/containerd.service
cat: /etc/systemd/system/containerd.service: No such file or directory

[root@ip-10-0-0-159 /]# grep -r LimitNPROC /etc
/etc/systemd/system.conf:#DefaultLimitNPROC=
/etc/systemd/user.conf:#DefaultLimitNPROC=
/etc/systemd/system.conf.d/50-limits.conf:DefaultLimitNPROC=infinity:infinity

[root@ip-10-0-0-159 /]# find /etc -name containerd.service
/etc/systemd/system/multi-user.target.wants/containerd.service

[root@ip-10-0-0-159 /]# grep LimitNPROC /etc/systemd/system/multi-user.target.wants/containerd.service
LimitNPROC=infinity

I'm reading the issue you linked and it looks like it was fixed in Go 1.20. So the question is which version of Go kubelet was compiled with? I'm not sure what the relevance of containerd is.

djmcgreal-cc (Author) commented

[root@ip-10-0-0-159 /]# go version  /usr/bin/containerd
/usr/bin/containerd: go1.20.7
[root@ip-10-0-0-159 /]# go version /usr/bin/kubelet
/usr/bin/kubelet: go1.20.6

Does that mean it's not related to golang/go#49438 (and golang/go@14018c8)? If so I'm sad, because you guys would really have pulled it out of the bag there!

djmcgreal-cc (Author) commented

Then again, the errno is definitely 11, so either retryOnEAGAIN isn't being called, or it's failing 20 times in a row?
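
One way to gauge how often the kubelet is actually dying on this, as a lower bound given the journald suppression mentioned above (a sketch, assuming kubelet logs go to the kubelet journald unit as on this AMI):

journalctl -u kubelet --since "1 hour ago" | grep -c 'failed to create new OS thread'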

djmcgreal-cc (Author) commented

@-ing @cartermckinnon due to #899 (comment).

cartermckinnon (Member) commented

Are you still having this issue?
