ThrottlingException and failed to get rate limit token issues #1504
Comments
It looks like you are hitting an issue that we fixed in the latest version, v2.1.0. Can you please update to the latest version and check whether you still see the problem?
@mskanth972 We are running this version of the image: amazon/aws-efs-csi-driver:v2.1.0
Some debug logs: efs-plugin-debug.log
I can't confirm that it no longer leaks undeletable PVs, as I can't force this to happen. From the fix, though, it looks like nothing cleans up old undeletable resources (leftover finalizers), so this is what worked for me to clean them up.
P.S.A.: Double-check all commands and outputs, as this can destroy your data.
for pv in $(kubectl get pv -A -o json | jq -r '.items[] | select(.spec.storageClassName == "efs-etcd" and .status.phase == "Released") | .metadata.name'); do
  kubectl patch pv "$pv" --type=json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
done
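Since the loop above removes finalizers irreversibly, it may help to preview which PVs it would touch first. The following is a minimal read-only sketch, not part of the original comment: the function name `list_released_pvs` is hypothetical, and it uses `kubectl`'s `custom-columns` output with `awk` in place of `jq`. It assumes the same `efs-etcd` storage class as the comment.

```shell
# Dry run: print the names of Released PVs for one storage class.
# Nothing is modified; this only reads from the cluster.
# list_released_pvs is a hypothetical helper name, not part of any tool.
list_released_pvs() {
  # First argument: storage class to match (defaults to "efs-etcd").
  storage_class="${1:-efs-etcd}"
  kubectl get pv --no-headers \
    -o custom-columns='NAME:.metadata.name,CLASS:.spec.storageClassName,PHASE:.status.phase' |
    awk -v sc="$storage_class" '$2 == sc && $3 == "Released" { print $1 }'
}
```

Running `list_released_pvs efs-etcd` and reviewing the names before running the patch loop keeps the destructive step limited to PVs you have actually checked.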
I can confirm that I am also seeing this error on 2.1.2, the latest version of efs-csi-controller.
Hi, we have raised a PR to switch to AdaptiveRetry mode. It is already merged, and we plan to include it in the next release. The PR has more info.
/kind bug
What happened?
We have a very volatile integration EKS cluster which creates and deletes resources regularly, including PVCs that use EFS as the storage class.
We recently had an issue provisioning volumes (it was happening very slowly) and then found we had thousands of what seem to be orphaned PVs in the cluster, with no matching EFS access point in the account. They were all "stuck" in a Released state, with the finalizer preventing removal.
On checking the controller logs we found thousands of these errors:
Along with a handful of these errors:
Full logfile - efs-plugin.log
The Access Denied error seems to be related to the controller constantly trying to clean up non-existent access points; if you look at the CloudTrail event, it has no request parameters. We know the controller has the correct permissions, as it is creating and deleting file systems OK too (the policy we are using is attached).
efs-pol.json
What seems to be happening is -
What you expected to happen?
The driver to handle the rate limiting with some sort of back-off and ensure successful removal allowing the PV to then be deleted too.
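As an illustration of the kind of back-off described here, the following is a minimal shell sketch of retrying a command with exponentially growing delays. It is not the driver's implementation (the driver is not written in shell, and the maintainers mention an AdaptiveRetry mode instead); the function name `retry_with_backoff` and its parameters are hypothetical. It retries on any non-zero exit status, not only on throttling, and omits the random jitter that production retry logic usually adds.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
# Hypothetical helper; variables are global, as POSIX sh has no "local".
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying immediately.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))       # exponential growth: d, 2d, 4d, ...
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 1 aws efs delete-access-point --access-point-id fsap-EXAMPLE` would wait 1, 2, 4 and 8 seconds between its five attempts (the access point ID is a placeholder).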
How to reproduce it (as minimally and precisely as possible)?
This is going to be difficult. As mentioned above, this is a cluster where dozens of testing pipelines create namespaces (which include EFS-based PVs), run automated tests, and then delete the namespaces again. We think the sheer volume of this activity is triggering the rate limiting.
Anything else we need to know?: No
Environment
Kubernetes version (kubectl version): v1.30.4-eks-a737599
Please also attach debug logs to help us better diagnose
This is difficult because of the way the pods using EFS are created and deleted. We have enabled logLevel 5 on the controller to see whether it reveals anything, and will upload the logs when we have something.