Drop Node event when EC2 instance does not exist #753
Conversation
This issue is currently awaiting triage. If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 1eef0b0 to 1ccb35f.
Something seems wonky with the CI, I'll look into it.
/retest
All failures are related to persistent volumes, doesn't seem related to this change.
It looks ok to me, but can we add a test for this new behavior in tagging_controller_test.go?
/retest
The CI is definitely hosed, same cases have been red in k/k since ~11/24: https://testgrid.k8s.io/presubmits-ec2#pull-kubernetes-e2e-ec2
Force-pushed from 1ccb35f to da5beab.
Force-pushed from 02a76b9 to e80c120.
Force-pushed from e80c120 to 1108878.
@tzneal added a couple unit test cases 👍
Seems ok, any idea on the CI problem?
Haven't had a chance to go down the rabbit hole. Looks like things broke when the … @dims do you have a guess?
@cartermckinnon no, i have not looked at this yet ..
@dims I'll try to get a fix up 👍
CI should be fixed by this: kubernetes-sigs/provider-aws-test-infra#232
/retest
LGTM
Thanks for picking this up! Did we also want to make the queue size visible with some logging around here? (Since dequeue latency isn't the most direct metric.)
@ndbaker1: changing LGTM is restricted to collaborators.
I plan to add a metric for this in a separate PR, because it'd be helpful for debugging in the future; but I think dequeue latency is still the more important metric to track and alarm on. There can be many events in the queue that are no-ops, and that doesn't necessarily have an impact on, e.g., how quickly a new Node is tagged.
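For illustration, a minimal sketch of what surfacing the work queue depth could look like, assuming the controller uses a standard client-go workqueue; the function name, log message, and interval are placeholders, not code from this PR:

```go
package tagging

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

// logQueueDepth periodically logs how many items are sitting in the
// controller's work queue, so a growing backlog is visible even before
// dequeue latency starts to climb.
func logQueueDepth(queue workqueue.RateLimitingInterface, interval time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			klog.Infof("tagging controller work queue depth: %d", queue.Len())
		}
	}
}
```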
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dims, mmerkes, ndbaker1. The full list of commands accepted by this bot can be found here, and the pull request process is described here.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This avoids unnecessary retries when the `ec2:CreateTags` call fails with an `InvalidInstanceId.NotFound` error. Excessive retries for each event can lead to a growing work queue that may increase dequeue latency dramatically.

If the Node is newly created, we requeue the event and retry, to handle the eventual consistency of this API. If the Node is not newly created, we drop the event.
This PR is only concerned with retries for a single event; all nodes still have the implicit "retry" that results from each update event (every ~5m).
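As a rough sketch of the decision described above (not the actual controller code; the string-based error matching, grace period, and function names are assumptions for illustration):

```go
package tagging

import (
	"strings"
	"time"
)

// Illustrative values only; the real controller relies on the AWS SDK's
// typed errors and its own notion of a "new" Node.
const (
	errCodeInstanceNotFound = "InvalidInstanceId.NotFound"
	newNodeGracePeriod      = time.Minute
)

// shouldRequeue decides whether a failed CreateTags event is worth
// retrying: if the instance doesn't exist and the Node is no longer new,
// retrying can't succeed, so the event is dropped instead of clogging the
// work queue.
func shouldRequeue(err error, nodeCreated, now time.Time) bool {
	if err == nil {
		return false // tagging succeeded; nothing to retry
	}
	if !strings.Contains(err.Error(), errCodeInstanceNotFound) {
		return true // other errors keep the usual retry behavior
	}
	// Instance not found: requeue only while the Node is new enough that
	// EC2's eventual consistency could explain the miss.
	return now.Sub(nodeCreated) < newNodeGracePeriod
}
```

A caller would drop the event when shouldRequeue returns false, relying on the periodic Node update events (roughly every 5 minutes) as the implicit retry.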
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: