-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 bootstrap: fix useExperimentalRetryJoin for kubernetes v1.31 #10983
🐛 bootstrap: fix useExperimentalRetryJoin for kubernetes v1.31 #10983
Conversation
18bb31c
to
755db50
Compare
/test help |
@chrischdi: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test pull-cluster-api-e2e-upgrade-1-30-1-31-main |
@@ -0,0 +1,137 @@ | |||
#!/bin/bash | |||
# Copyright 2020 The Kubernetes Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2024
(because it's a new file, even if we are just copying code to it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
github is misdisplaying it a bit. This is the old file, and just the same :-)
But thanks for the pointer, have to bump the year on the other file!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
github is misdisplaying it a bit.
hm, not sure how so.
seems like bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script-pre-k8s-1-31.sh is a new file that has the old contents. it should be with the 2024 year. old file can have the new changes but be with the old year.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's basically a git mv bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script.sh bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script-pre-k8s-1-31.sh
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it ends up as a new file in the tree, so the correct action is to apply 2024.
i don't think the license police will mind, so up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a license police? :)
test/e2e/data/infrastructure-docker/main/clusterclass-quick-start.yaml
Outdated
Show resolved
Hide resolved
try-or-die-command kubeadm join phase etcd-join | ||
# {{ end }} | ||
# Note: below phase is a no-op if ControlPlaneKubeletLocalMode feature gate is false. | ||
retry-command kubelet-wait-boostrap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, phase order was relevant to the FG enablement, do we need to consider it here?
(non trivial i would assume)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phase order is correct here, this is the file for >= v1.31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a typo boostrap
.
this is the 1.31 CLI output
preflight Run join pre-flight checks
control-plane-prepare Prepare the machine for serving a control plane
/download-certs Download certificates shared among control-plane nodes from the kubeadm-certs Secret
/certs Generate the certificates for the new control plane components
/kubeconfig Generate the kubeconfig for the new control plane components
/control-plane Generate the manifests for the new control plane components
kubelet-start Write kubelet settings, certificates and (re)start the kubelet
control-plane-join Join a machine as a control plane instance
/etcd Add a new local etcd member
/mark-control-plane Mark a node as a control-plane
wait-control-plane EXPERIMENTAL: Wait for the control plane to start
but these etcd-join and kubelet-wait-bootstrap phases are hidden for now:
https://github.com/kubernetes/kubernetes/blob/f8d5b2074c9b3a1629ef085b9e17a5370f393732/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go#L93
so there has to be another iteration here. or you can use the fact that calling an unknown or hidden kubeadm join phase non-existent
will return exit code 0, which is a bit of a mistake on the kubeadm side. just found it out by doing kubeadm join phase foo > /dev/null && echo ok
and prepopulate these hidden phases in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, looks like I even cannot run these hidden phases manually :-(
/test pull-cluster-api-e2e-upgrade-1-30-1-31-main |
if [[ "${CONTROLPLANE_KUBELET_LOCAL_MODE}" == "true" ]]; then | ||
# Run kubeadm join and skip all already executed phases | ||
# kubeadm join relies on hidden phases when CONTROLPLANE_KUBELET_LOCAL_MODE. These cannot be run | ||
# externally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can even think about not doing this if/else.
We would only loose that we don't retry the kubeadm join phase control-plane-join mark-control-plane
.
/test pull-cluster-api-e2e-upgrade-1-30-1-31-main |
try-or-die-command kubeadm join phase control-plane-join etcd | ||
retry-command kubeadm join phase control-plane-join mark-control-plane | ||
# {{ end }} | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: there is another phase at the end:
wait-control-plane EXPERIMENTAL: Wait for the control plane to start
as i listed here:
#10983 (comment)
https://github.com/kubernetes/kubernetes/blob/f8d5b2074c9b3a1629ef085b9e17a5370f393732/cmd/kubeadm/app/cmd/phases/join/waitcontrolplane.go#L63
but it comes from another alpha FG that can be skipped in the CAPI exp-join-retry IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but i guess it depends on the timeline of exp-retry-join vs features.WaitForAllControlPlaneComponents graudation.
i would assume that features.WaitForAllControlPlaneComponents will graduate faster than the removal of exp-retry-join, so might be worth adding it here after line 148 without retries.
it's a no-op if features.WaitForAllControlPlaneComponents is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jep. I simplified it for now to just run kubeadm join
for all leftover phases, so no retries on mark-control-plane
, but with the Pro of running all hidden and potentially enabled and new phases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
the change SGTM. like you pointed out we get it similar to the way it was at 80% so i think that's fine and give the feature is deprecated.
bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script.sh
Outdated
Show resolved
Hide resolved
/test pull-cluster-api-e2e-upgrade-1-30-1-31-main |
1 similar comment
/test pull-cluster-api-e2e-upgrade-1-30-1-31-main |
|
||
# Run kubeadm join and skip all already executed phases. | ||
try-or-die-command kubeadm join --skip-phases preflight,control-plane-prepare,kubelet-start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two options here:
Current implementation
Just run kubeadm join
without any retries at this point.
The disadvantage is that we do not run control-plane-join mark-control-plane
with retries anymore.
The advantage here (and requirement when ControlPlaneKubeletLocalMode
is enabled): we also take advantage of hidden and new phases to be run when they are enabled.
The alternative would be:
- Keep the current implementation here for worker nodes.
- For ControlPlane nodes: use the
/etc/kubernetes/admin.conf
to check if theControlPlaneKubeletLocalMode
is enabled- Only if it is not enabled: run the
control-plane-join mark-control-plane
with retries as before
- Only if it is not enabled: run the
I think the advantages of the current implementation are better then doing the alternative here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think given all recent versions of kubeadm in support by CAPI have integrated retries. breaking down the join in phases and/or retrying any of them is pointless.
so i'm +1 to make experimental-retry-join be just a regular single kubeadm cmd join too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
At this point we need to fix this script for 1.31 as a stop gap before removal of the entire experimentalRetryMachines, but it does not add any value on top of retry/timeout management embedded in recent versions of kubeadm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This adds a new script for useExperimentalRetryJoin for kubernetes >= v1.31 due to the following changes: * control-plane update-status phase got removed * new phases got added for the ControlPlaneKubeletLocalMode feature gate
52ec37e
to
6f1543f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased and squashed. Also tested it locally for:
- CP + workers
- experimentalretryjoin on and off
- v1.30 and v1.31
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the updates.
after yesterday's discussion i understood that we want to still keep the external retries and phase breakdown when the retry feature is enabled.
so
/lgtm
but best to have another reviewer.
/hold
LGTM label has been added. Git tree hash: 27adcc190a4324df413d24bb5d77e2e9c747f047
|
/retest Github flake |
One question #10983 (comment) (but should be fine based on what you tested) /lgtm Deferring to @fabriziopandini Let's get this merged before the weekend |
/cherry-pick release-1.8 |
@chrischdi: once the present PR merges, I will cherry-pick it on top of release-1.8 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
thanks @chrischdi for taking care of this fix looking forward to when we can get finally get rid of useExperimentalRetryJoin /lgtm /hold cancel |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fabriziopandini, neolit123 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@chrischdi: new pull request created: #11000 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What this PR does / why we need it:
This adds a new script for useExperimentalRetryJoin for kubernetes >= v1.31 due to the following changes:
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #