feat: expose nodeclaim disruption through new disruption condition, improves pod eviction event message #1370
Nice work!
Nice work, think we're getting close!
		if errors.IsConflict(err) {
			return reconcile.Result{Requeue: true}, nil
		}
		return reconcile.Result{}, fmt.Errorf("removing taint %s from nodes, %w", pretty.Taint(v1.DisruptedNoScheduleTaint), err)
	}
	if err, requeue := state.ClearNodeClaimsCondition(ctx, c.kubeClient, v1.ConditionTypeDisruptionReason, outdatedNodes...); requeue {
We probably want to check that the node doesn't have a deletion timestamp set, or else we may be fighting with another controller that adds this status condition.
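A minimal sketch of the guard suggested here, assuming the caller can see each node's deletion timestamp. The `nodeMeta` type and the `withoutDeleting` and `deletingNode` helpers are hypothetical stand-ins for illustration, not Karpenter types or functions.

```go
package main

import "time"

// nodeMeta is a hypothetical stand-in for the metadata of a Node or NodeClaim.
type nodeMeta struct {
	Name              string
	DeletionTimestamp *time.Time // nil means the object is not being deleted
}

// deletingNode builds a nodeMeta whose deletion timestamp is set to now.
func deletingNode(name string) nodeMeta {
	now := time.Now()
	return nodeMeta{Name: name, DeletionTimestamp: &now}
}

// withoutDeleting returns only the nodes that have no deletion timestamp,
// so a caller that clears a condition skips nodes already being terminated.
func withoutDeleting(nodes []nodeMeta) []nodeMeta {
	kept := make([]nodeMeta, 0, len(nodes))
	for _, n := range nodes {
		if n.DeletionTimestamp == nil {
			kept = append(kept, n)
		}
	}
	return kept
}
```

Filtering before the clear call keeps the condition that the terminating controller set on nodes that are already on their way out.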
// ClearNodeClaimsCondition will remove the conditionType from the NodeClaim status of the provided statenodes
func ClearNodeClaimsCondition(ctx context.Context, kubeClient client.Client, conditionType string, nodes ...*StateNode) (err error, requeue bool) {
	return multierr.Combine(lo.Map(nodes, func(s *StateNode, _ int) error {
		if !s.Initialized() || s.NodeClaim == nil {
Initialized() should already capture the case where the nodeclaim isn't set:
- if !s.Initialized() || s.NodeClaim == nil {
+ if !s.Initialized() {
https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/statenode.go#L320
Removing this check results in test panics, so I don't think your assertion is correct. Here is Initialized():
func (in *StateNode) Initialized() bool {
	// Node is managed by Karpenter, so we can check for the Initialized label
	if in.Managed() {
		return in.Node != nil && in.Node.Labels[v1.NodeInitializedLabelKey] == "true"
	}
	// Nodes not managed by Karpenter are always considered Initialized
	return true
}
in.Managed() returns false because in.NodeClaim is nil. So Initialized() actually returns true when the nodeclaim is nil.
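The behaviour described above can be reproduced with a small self-contained sketch. It uses simplified stand-in types, a placeholder label key, and assumes (as the comment states) that Managed() is false whenever the NodeClaim is nil; it is not the real StateNode implementation.

```go
package main

// Simplified stand-ins for the real types; only the fields needed here.
type node struct{ Labels map[string]string }
type nodeClaim struct{ Name string }

type stateNode struct {
	Node      *node
	NodeClaim *nodeClaim
}

// initializedLabelKey is a placeholder, not the real label key.
const initializedLabelKey = "example.com/initialized"

// Managed is assumed to report whether a NodeClaim is attached.
func (in *stateNode) Managed() bool { return in.NodeClaim != nil }

// Initialized mirrors the logic quoted in the comment above.
func (in *stateNode) Initialized() bool {
	if in.Managed() {
		return in.Node != nil && in.Node.Labels[initializedLabelKey] == "true"
	}
	// Unmanaged nodes are always considered initialized.
	return true
}

// hasUsableNodeClaim is the combined guard: Initialized() alone does not
// rule out a nil NodeClaim, so the explicit nil check is still needed.
func hasUsableNodeClaim(in *stateNode) bool {
	return in.Initialized() && in.NodeClaim != nil
}
```

With a zero-value stateNode, Initialized() is true while hasUsableNodeClaim is false, which is why dropping the nil check lets a nil NodeClaim through to code that dereferences it.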
@@ -171,6 +175,10 @@ func (q *Queue) Reconcile(ctx context.Context) (reconcile.Result, error) {
// Evict returns true if successful eviction call, and false if there was an eviction-related error
func (q *Queue) Evict(ctx context.Context, key QueueKey) bool {
	ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Pod", klog.KRef(key.Namespace, key.Name)))
	evictionMessage, err := evictionReason(ctx, key, q.kubeClient)
	if err != nil {
I personally think we should just silently fail here. I'm worried about super noisy logs in the case where a nodeclaim accidentally goes away, or if there are other issues with the apiserver
I dislike silently failing, because when debugging I often end up having to re-add all the logs for suppressed errors. So I compromised and moved it to V(1).
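A rough sketch of that compromise, using a hypothetical verbosity-gated logger rather than the logging library the project actually uses. `vLogger`, `messageOrDefault`, and `errLookupFailed` are illustrative names only: the lookup error is reported at verbosity 1, and eviction still proceeds with a fallback message.

```go
package main

import "errors"

// errLookupFailed is a hypothetical error standing in for a failed reason lookup.
var errLookupFailed = errors.New("nodeclaim not found")

// vLogger is a minimal stand-in for a leveled logger: a message is emitted
// only when its level is at or below the configured verbosity.
type vLogger struct {
	verbosity int
	sink      func(string)
}

// V emits msg to the sink if level is enabled.
func (l vLogger) V(level int, msg string) {
	if level <= l.verbosity {
		l.sink(msg)
	}
}

// messageOrDefault returns msg, or fallback when the lookup failed.
// The failure is logged at V(1) instead of being dropped or logged loudly.
func messageOrDefault(l vLogger, msg string, err error, fallback string) string {
	if err != nil {
		l.V(1, "could not determine eviction reason: "+err.Error())
		return fallback
	}
	return msg
}
```

At the default verbosity of 0 nothing is printed, which addresses the concern about noisy logs; raising verbosity to 1 surfaces the suppressed errors without a code change.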
…dd eviction message from condition Signed-off-by: Cameron McAvoy <[email protected]>
Fixes #N/A
Description
Adds a new nodeclaim condition, DisruptionCandidate, which is set when a nodeclaim is being disrupted and is applied after the disruption taint is set. The DisruptionCandidate nodeclaim condition contains the reason why the nodeclaim is being terminated (e.g. node worker-mgn6n/ip-10-115-200-242.us-east-2.compute.internal was single node consolidated).
The motivation for this new nodeclaim condition is so that when evicting pods, we can look up this condition and use the condition's message in the pod event.
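As an illustration of the lookup described above, here is a minimal sketch that picks the eviction message from a list of conditions. The `condition` struct and `evictionMessage` function are hypothetical simplifications, not the types or helpers used in this PR.

```go
package main

// condition is a simplified stand-in for a status condition on a nodeclaim.
type condition struct {
	Type    string
	Status  string
	Message string
}

// disruptionConditionType is the condition name used in this description.
const disruptionConditionType = "DisruptionCandidate"

// evictionMessage returns the message of the disruption condition when it is
// present, true, and non-empty; otherwise it returns the fallback message.
func evictionMessage(conds []condition, fallback string) string {
	for _, c := range conds {
		if c.Type == disruptionConditionType && c.Status == "True" && c.Message != "" {
			return c.Message
		}
	}
	return fallback
}
```

The fallback keeps pod events meaningful even when the condition has not been written yet or the nodeclaim could not be read.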
Example of what the pod events look like now from testing in our clusters:
How was this change tested?
Built Karpenter with this change locally and tested in our clusters. Also ran make presubmit.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.