Occasional retrieving IMDS metadata failed on AL2023 #2262
Comments
Hi @brianrowlett, I wonder if there's a race where imds.GetInstanceIdentityDocument times out before pod networking is fully set up on the node. Will try to reproduce and bring this up with the team. Perhaps there's a more robust way to attempt IMDS metadata retrieval. If not, we can consider exposing a parameter to NOT fall back to K8s_metadata. With this parameter, Thanks for raising the issue!
Hi @AndrewSirenko, thank you for the quick response. My intuition was that this might be a race condition, but I'm not familiar enough with the codebase to say for sure. It's reassuring that you might be thinking the same thing. To clarify, manually restarting the pods is not required, and falling back to Kubernetes metadata is likely acceptable for us (we just didn't like seeing IMDS fail without knowing why), so I don't think an Please let me know if there is anything I can do to help you reproduce the issue or test a fix.
@brianrowlett 3 more questions for you to help us reproduce:
Thank you!
Thanks @brianrowlett, we'll dive into the current IMDS SDK retry logic and see if there's an improvement we can make in our EC2MetadataInstanceInfo path. Final question: how often does this happen on your cluster? 1 in how many node startups? Appreciate you spotting this; I will also mention this AL2 vs AL23 behavior to the IMDSv2 team. /priority important-longterm
Thank you @AndrewSirenko, I was seeing it relatively frequently, maybe 1 in 3 or so (but unfortunately, I didn't actually keep a record).
We are seeing this issue on node creation as well; it caused us to receive a false alert the pod
@AndrewSirenko
My machine type can handle 28 volumes (including disks and ENIs), and because of the way the EBS CSI driver fetches information from the Kubernetes metadata, it reports 26 free slots. However, that is not true because my CNI configuration is running with
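To make the arithmetic in the comment above concrete, here is a minimal sketch of how a free-slot count could drift when the ENI count is assumed rather than read from IMDS. Everything in it is hypothetical: `freeVolumeSlots` is not a function from the driver, the "1 ENI, 1 root volume" defaults are only an assumption chosen because they reproduce the 26 figure reported above, and the 4-ENI node is an invented example.

```go
package main

import "fmt"

// freeVolumeSlots returns how many more volumes could attach, assuming disks
// and ENIs draw from one shared attachment budget (as described in the thread).
// Hypothetical helper for illustration; not the driver's actual logic.
func freeVolumeSlots(totalSlots, attachedENIs, attachedVolumes int) int {
	free := totalSlots - attachedENIs - attachedVolumes
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// Assumed fallback defaults: 1 ENI and 1 root volume on a 28-slot instance.
	fmt.Println(freeVolumeSlots(28, 1, 1)) // 26

	// Invented example: the same instance when the CNI has really attached 4 ENIs.
	fmt.Println(freeVolumeSlots(28, 4, 1)) // 23
}
```

If the real ENI count is higher than the assumed one, the reported limit overstates capacity, which matches the kind of impact to stateful workloads described here.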
@asher-lab @radirobi, thank you for your +1s and noting that there is impact to your stateful workloads. I'll escalate the priority of this issue internally. Worst case I will add a short-term workaround of retrying IMDS one additional time if my team does not have bandwidth. /priority important-soon
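For readers following along, the "retry IMDS one additional time, then fall back" idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the driver's implementation: `Metadata`, `fetchMetadata`, and `errIMDSNotReady` are hypothetical names, the two sources are plain function values rather than real IMDS or Kubernetes API clients, and the delay is arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Metadata is a hypothetical stand-in for the instance details a node plugin needs.
type Metadata struct {
	InstanceID string
	Source     string // "imds" or "kubernetes"
}

// errIMDSNotReady simulates IMDS being unreachable early in node startup.
var errIMDSNotReady = errors.New("imds: not reachable yet")

// fetchMetadata calls primary up to attempts times, sleeping delay between
// tries, and calls fallback only after every primary attempt has failed.
func fetchMetadata(attempts int, delay time.Duration, primary, fallback func() (Metadata, error)) (Metadata, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		md, err := primary()
		if err == nil {
			return md, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // give node networking time to settle
		}
	}
	md, err := fallback()
	if err != nil {
		return Metadata{}, fmt.Errorf("primary failed (%v); fallback failed: %w", lastErr, err)
	}
	return md, nil
}

func main() {
	calls := 0
	// Simulated IMDS source that fails on the first call only.
	imds := func() (Metadata, error) {
		calls++
		if calls < 2 {
			return Metadata{}, errIMDSNotReady
		}
		return Metadata{InstanceID: "i-example", Source: "imds"}, nil
	}
	// Simulated Kubernetes metadata source used as the fallback.
	k8s := func() (Metadata, error) {
		return Metadata{InstanceID: "i-example", Source: "kubernetes"}, nil
	}

	md, err := fetchMetadata(2, 10*time.Millisecond, imds, k8s)
	fmt.Println(md.Source, err)
}
```

With two attempts, a source that is only unavailable for the first call is still served from IMDS, and the Kubernetes fallback is reached only when both attempts fail.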
/kind bug
We currently have AL2 nodes and have never had a problem with this.
After switching to AL2023 nodes, the ebs-csi-node occasionally fails to retrieve metadata from IMDS. This only appears to happen at node startup time; if we restart the ebs-csi-node daemonset, it is able to retrieve metadata from IMDS reliably.
It does appear to successfully fall back to getting metadata from Kubernetes, but we think IMDS should not be failing like this.
What happened?
What you expected to happen?
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
Our launch template looks like:
And our managed nodegroup looks like:
Environment
Kubernetes version (use kubectl version): v1.30.6-eks-7f9249a