Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 bootstrap: fix useExperimentalRetryJoin for kubernetes v1.31 #10983

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 21 additions & 3 deletions bootstrap/kubeadm/internal/cloudinit/cloudinit.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,9 +22,11 @@ import (
"fmt"
"text/template"

"github.com/blang/semver/v4"
"github.com/pkg/errors"

bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
"sigs.k8s.io/cluster-api/util/version"
)

const (
Expand Down Expand Up @@ -56,6 +58,7 @@ type BaseUserData struct {
KubeadmCommand string
KubeadmVerbosity string
SentinelFileCommand string
KubernetesVersion semver.Version
}

func (input *BaseUserData) prepare() error {
Expand All @@ -64,7 +67,7 @@ func (input *BaseUserData) prepare() error {
input.KubeadmCommand = fmt.Sprintf(standardJoinCommand, input.KubeadmVerbosity)
if input.UseExperimentalRetry {
input.KubeadmCommand = retriableJoinScriptName
joinScriptFile, err := generateBootstrapScript(input)
joinScriptFile, err := generateBootstrapScript(input, input.KubernetesVersion)
if err != nil {
return errors.Wrap(err, "failed to generate user data for machine joining control plane")
}
Expand Down Expand Up @@ -118,12 +121,23 @@ func generate(kind string, tpl string, data interface{}) ([]byte, error) {
}

var (
//go:embed kubeadm-bootstrap-script-pre-k8s-1-31.sh
kubeadmBootstrapScriptPre1_31 string
//go:embed kubeadm-bootstrap-script.sh
kubeadmBootstrapScript string

// kubernetesVersion1_31 is the version where kubeadm removed the update-status phase
// and introduced new phases with the ControlPlaneKubeletLocalMode feature gate.
kubernetesVersion1_31 = semver.MustParse("1.31.0")
)

func generateBootstrapScript(input interface{}) (*bootstrapv1.File, error) {
joinScript, err := generate("JoinScript", kubeadmBootstrapScript, input)
func generateBootstrapScript(input interface{}, parsedversion semver.Version) (*bootstrapv1.File, error) {
bootstrapScript := kubeadmBootstrapScript
if useKubeadmBootstrapScriptPre1_31(parsedversion) {
bootstrapScript = kubeadmBootstrapScriptPre1_31
}

joinScript, err := generate("JoinScript", bootstrapScript, input)
if err != nil {
return nil, errors.Wrap(err, "failed to bootstrap script for machine joins")
}
Expand All @@ -134,3 +148,7 @@ func generateBootstrapScript(input interface{}) (*bootstrapv1.File, error) {
Content: string(joinScript),
}, nil
}

func useKubeadmBootstrapScriptPre1_31(parsedversion semver.Version) bool {
return version.Compare(parsedversion, kubernetesVersion1_31, version.WithoutPreReleases()) < 0
}
37 changes: 37 additions & 0 deletions bootstrap/kubeadm/internal/cloudinit/cloudinit_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ package cloudinit
import (
"testing"

"github.com/blang/semver/v4"
. "github.com/onsi/gomega"
"k8s.io/utils/ptr"

Expand Down Expand Up @@ -278,3 +279,39 @@ func TestNewJoinControlPlaneExperimentalRetry(t *testing.T) {
g.Expect(out).To(ContainSubstring(f))
}
}

func Test_useKubeadmBootstrapScriptPre1_31(t *testing.T) {
tests := []struct {
name string
parsedversion semver.Version
want bool
}{
{
name: "true for version for v1.30",
parsedversion: semver.MustParse("1.30.99"),
want: true,
},
{
name: "true for version for v1.28",
parsedversion: semver.MustParse("1.28.0"),
want: true,
},
{
name: "false for v1.31.0",
parsedversion: semver.MustParse("1.31.0"),
want: false,
},
{
name: "false for v1.31.0-beta.0",
parsedversion: semver.MustParse("1.31.0-beta.0"),
want: false,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := useKubeadmBootstrapScriptPre1_31(tt.parsedversion); got != tt.want {
t.Errorf("useKubeadmBootstrapScriptPre1_31() = %v, want %v", got, tt.want)
}
})
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
#!/bin/bash
# Copyright 2020 The Kubernetes Authors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2024
(because it's a new file, even if we are just copying code to it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github is misdisplaying it a bit. This is the old file, and just the same :-)

But thanks for the pointer, have to bump the year on the other file!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github is misdisplaying it a bit.

hm, not sure how so.

seems like bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script-pre-k8s-1-31.sh is a new file that has the old contents. it should be with the 2024 year. old file can have the new changes but be with the old year.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically a git mv bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script.sh bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script-pre-k8s-1-31.sh 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it ends up as a new file in the tree, so the correct action is to apply 2024.
i don't think the license police will mind, so up to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a license police? :)

#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Log an error and exit.
# Args:
# $1 Message to log with the error
# $2 The error code to return
log::error_exit() {
local message="${1}"
local code="${2}"

log::error "${message}"
# {{ if .ControlPlane }}
log::info "Removing member from cluster status"
kubeadm reset -f update-cluster-status || true
log::info "Removing etcd member"
kubeadm reset -f remove-etcd-member || true
# {{ end }}
log::info "Resetting kubeadm"
kubeadm reset -f || true
log::error "cluster.x-k8s.io kubeadm bootstrap script $0 exiting with status ${code}"
exit "${code}"
}

log::success_exit() {
log::info "cluster.x-k8s.io kubeadm bootstrap script $0 finished"
exit 0
}

# Log an error but keep going.
log::error() {
local message="${1}"
timestamp=$(date --iso-8601=seconds)
echo "!!! [${timestamp}] ${1}" >&2
shift
for message; do
echo " ${message}" >&2
done
}

# Print a status line. Formatted to show up in a stream of output.
log::info() {
timestamp=$(date --iso-8601=seconds)
echo "+++ [${timestamp}] ${1}"
shift
for message; do
echo " ${message}"
done
}

check_kubeadm_command() {
local command="${1}"
local code="${2}"
case ${code} in
"0")
log::info "kubeadm reported successful execution for ${command}"
;;
"1")
log::error "kubeadm reported failed action(s) for ${command}"
;;
"2")
log::error "kubeadm reported preflight check error during ${command}"
;;
"3")
log::error_exit "kubeadm reported validation error for ${command}" "${code}"
;;
*)
log::error "kubeadm reported unknown error ${code} for ${command}"
;;
esac
}

function retry-command() {
n=0
local kubeadm_return
until [ $n -ge 5 ]; do
log::info "running '$*'"
# shellcheck disable=SC1083
"$@" --config=/run/kubeadm/kubeadm-join-config.yaml {{.KubeadmVerbosity}}
kubeadm_return=$?
check_kubeadm_command "'$*'" "${kubeadm_return}"
if [ ${kubeadm_return} -eq 0 ]; then
break
fi
# We allow preflight errors to pass
if [ ${kubeadm_return} -eq 2 ]; then
break
fi
n=$((n + 1))
sleep 15
done
if [ ${kubeadm_return} -ne 0 ]; then
log::error_exit "too many errors, exiting" "${kubeadm_return}"
fi
}

# {{ if .ControlPlane }}
function try-or-die-command() {
local kubeadm_return
log::info "running '$*'"
# shellcheck disable=SC1083
"$@" --config=/run/kubeadm/kubeadm-join-config.yaml {{.KubeadmVerbosity}}
kubeadm_return=$?
check_kubeadm_command "'$*'" "${kubeadm_return}"
if [ ${kubeadm_return} -ne 0 ]; then
log::error_exit "fatal error, exiting" "${kubeadm_return}"
fi
}
# {{ end }}

retry-command kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests
# {{ if .ControlPlane }}
retry-command kubeadm join phase control-plane-prepare download-certs
retry-command kubeadm join phase control-plane-prepare certs
retry-command kubeadm join phase control-plane-prepare kubeconfig
retry-command kubeadm join phase control-plane-prepare control-plane
# {{ end }}
retry-command kubeadm join phase kubelet-start
# {{ if .ControlPlane }}
try-or-die-command kubeadm join phase control-plane-join etcd
retry-command kubeadm join phase control-plane-join update-status
retry-command kubeadm join phase control-plane-join mark-control-plane
# {{ end }}

log::success_exit
10 changes: 4 additions & 6 deletions bootstrap/kubeadm/internal/cloudinit/kubeadm-bootstrap-script.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright 2020 The Kubernetes Authors.
# Copyright 2024 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -128,10 +128,8 @@ retry-command kubeadm join phase control-plane-prepare kubeconfig
retry-command kubeadm join phase control-plane-prepare control-plane
# {{ end }}
retry-command kubeadm join phase kubelet-start
# {{ if .ControlPlane }}
try-or-die-command kubeadm join phase control-plane-join etcd
retry-command kubeadm join phase control-plane-join update-status
retry-command kubeadm join phase control-plane-join mark-control-plane
# {{ end }}

# Run kubeadm join and skip all already executed phases.
chrischdi marked this conversation as resolved.
Show resolved Hide resolved
try-or-die-command kubeadm join --skip-phases preflight,control-plane-prepare,kubelet-start
Comment on lines +131 to +133
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have two options here:

Current implementation

Just run kubeadm join without any retries at this point.
The disadvantage is that we do not run control-plane-join mark-control-plane with retries anymore.
The advantage here (and requirement when ControlPlaneKubeletLocalMode is enabled): we also take advantage of hidden and new phases to be run when they are enabled.

The alternative would be:

  • Keep the current implementation here for worker nodes.
  • For ControlPlane nodes: use the /etc/kubernetes/admin.conf to check if the ControlPlaneKubeletLocalMode is enabled
    • Only if it is not enabled: run the control-plane-join mark-control-plane with retries as before

I think the advantages of the current implementation are better then doing the alternative here.

Copy link
Member

@neolit123 neolit123 Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think given all recent versions of kubeadm in support by CAPI have integrated retries. breaking down the join in phases and/or retrying any of them is pointless.

so i'm +1 to make experimental-retry-join be just a regular single kubeadm cmd join too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.
At this point we need to fix this script for 1.31 as a stop gap before removal of the entire experimentalRetryMachines, but it does not add any value on top of retry/timeout management embedded in recent versions of kubeadm


log::success_exit
Original file line number Diff line number Diff line change
Expand Up @@ -535,6 +535,7 @@ func (r *KubeadmConfigReconciler) handleClusterNotInitialized(ctx context.Contex
Mounts: scope.Config.Spec.Mounts,
DiskSetup: scope.Config.Spec.DiskSetup,
KubeadmVerbosity: verbosityFlag,
KubernetesVersion: parsedVersion,
},
InitConfiguration: initdata,
ClusterConfiguration: clusterdata,
Expand Down Expand Up @@ -656,6 +657,7 @@ func (r *KubeadmConfigReconciler) joinWorker(ctx context.Context, scope *Scope)
DiskSetup: scope.Config.Spec.DiskSetup,
KubeadmVerbosity: verbosityFlag,
UseExperimentalRetry: scope.Config.Spec.UseExperimentalRetryJoin,
KubernetesVersion: parsedVersion,
},
JoinConfiguration: joinData,
}
Expand Down Expand Up @@ -774,6 +776,7 @@ func (r *KubeadmConfigReconciler) joinControlplane(ctx context.Context, scope *S
DiskSetup: scope.Config.Spec.DiskSetup,
KubeadmVerbosity: verbosityFlag,
UseExperimentalRetry: scope.Config.Spec.UseExperimentalRetryJoin,
KubernetesVersion: parsedVersion,
},
}

Expand Down
Loading