Race condition causes "Reached target .*Multi-User System.*" #3758

Open
DerekTBrown opened this issue Oct 15, 2024 · 22 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@DerekTBrown

I am unable to create a kind cluster using kind create cluster. I get an error:


ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*\|detected cgroup v1"
--
  | Stack Trace:
  | sigs.k8s.io/kind/pkg/errors.Errorf
  | sigs.k8s.io/kind/pkg/errors/errors.go:41
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/common.WaitUntilLogRegexpMatches
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/common/cgroups.go:84
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.createContainerWithWaitUntilSystemdReachesMultiUserSystem
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:417
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.planCreation.func3
  | sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:123
  | sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
  | sigs.k8s.io/kind/pkg/errors/concurrent.go:30
  | runtime.goexit
  | runtime/asm_amd64.s:1695
  | creating kind cluster: exit status 1

When I run kind export logs, I plainly see this line in the journal.log:

Oct 15 00:54:00 <my-cluster-name> systemd[1]: Reached target multi-user.target - Multi-User System.

I don't see anything else in the log bundle to indicate an error:

kind_logs.tar.gz


Appendix: things I eliminated as possible root causes from similar issues

  • I am on the latest version of kind.
  • It doesn't appear to be related to max_user_instances or max_user_watches. I have increased these to extremely large limits.
  • It doesn't appear to be related to the kernel version; I am on a relatively new kernel.
  • It doesn't appear to be related to OS Architecture. I don't see anything in the logs to indicate a Docker image/machine arch mismatch, and the Docker image manifests seem to align with the underlying infrastructure (AMD64).
@DerekTBrown added the kind/support label on Oct 15, 2024
@DerekTBrown
Author

Cluster creation fails after just 3s, though the logs eventually show the correct string:

$ time kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

real	0m3.388s
user	0m0.188s
sys	0m0.188s

$ docker logs -f kind-control-plane | grep -E "Reached target .*Multi-User System.*|detected cgroup v1"
[  OK  ] Reached target multi-user.target - Multi-User System.

@DerekTBrown
Author

Something funny is going on here. The error stack indicates that we are hitting this line, which seems like it is only expected to be invoked when the 30s log timeout is hit.
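
For reference, here is a rough sketch (an illustrative approximation, not the actual kind source) of how a log matcher like common.WaitUntilLogRegexpMatches could behave. The point is that the same generic "could not find a log line" error is produced both when the 30s timeout fires and when the docker logs -f stream ends without ever emitting the expected line, which would explain a failure after only ~3 seconds:

package main

import (
	"bufio"
	"context"
	"fmt"
	"os/exec"
	"regexp"
	"time"
)

// waitUntilLogRegexpMatches scans the command's stdout for a line matching re.
// If the stream closes (or the context kills the command) before a match is
// seen, it returns the generic "could not find a log line" error, whether the
// timeout fired or `docker logs` exited immediately.
func waitUntilLogRegexpMatches(ctx context.Context, cmd *exec.Cmd, re *regexp.Regexp) error {
	pipe, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	scanner := bufio.NewScanner(pipe)
	for scanner.Scan() {
		if re.MatchString(scanner.Text()) {
			return nil
		}
	}
	return fmt.Errorf("could not find a log line that matches %q", re.String())
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "docker", "logs", "-f", "kind-control-plane")
	re := regexp.MustCompile(`Reached target .*Multi-User System.*|detected cgroup v1`)
	fmt.Println(waitUntilLogRegexpMatches(ctx, cmd, re))
}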

@DerekTBrown
Author

DerekTBrown commented Oct 15, 2024

Adding to the exceptional weirdness: this fails 100% of the time when invoked directly, but only ~25% of the time when invoked within strace:

$ strace -f -o trace_output_3.txt kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋

@DerekTBrown
Author

DerekTBrown commented Oct 15, 2024

To validate my hypothesis, I wrote a wrapper script for docker:

#!/bin/bash

if [[ "$1" == "logs" ]]; then
  sleep 1s
fi

/usr/local/bin/docker "$@"

With docker logs delayed by one second, cluster creation magically succeeds.

@DerekTBrown changed the title from Cluster creation fails with "Reached target .*Multi-User System.*" to Race condition causes "Reached target .*Multi-User System.*" on Oct 15, 2024
@aojea
Contributor

aojea commented Oct 15, 2024

The attached log has 8 worker nodes. Can you paste the exact configuration and steps that you are running,
and a kind create cluster without any additional config files too?
Are you using cgroupsv1?

@DerekTBrown
Author

DerekTBrown commented Oct 15, 2024

a kind create cluster without any additional config files too?

This reproduces with just kind create cluster as well. Will try to get a log export when I have a second.

@DerekTBrown
Author

are you using cgroupsv1?

No, this system has cgroupsv2.

@stmcginnis
Contributor

This is missing a lot of important details about your environment that may be useful for troubleshooting what is happening. Can you update the description with the requested output from the issue template?

Environment:

  • kind version: (use kind version):
  • Runtime info: (use docker info, podman info or nerdctl info):
  • OS (e.g. from /etc/os-release):
  • Kubernetes version: (use kubectl version):
  • Any proxies or other special environment settings?:

Also, since you can get this to repro with just a basic kind create cluster, please clean things up (docker system prune -a might not be a bad idea too) and get the logs from that simpler scenario to cut down on the noise: kind create cluster --retain and kind export logs.

@DerekTBrown
Author

Can you update the description from the requested output from the issue template?

FYI: I think these are missing from the "Question" template:

Screenshot 2024-10-15 at 7 55 36 AM

They are present in the "Bug" template, but I wasn't sure this was a bug at the time of filing.

@DerekTBrown
Author

kind version: (use kind version):

kind v0.24.0 go1.22.6 linux/amd64

Runtime info: (use docker info, podman info or nerdctl info):

Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 8
  Running: 8
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1015-aws
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.82GiB
 Name: ip-10-107-37-51
 ID: 28b51eae-72ea-4480-8302-b9ba9b4dc122
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 110
  Goroutines: 188
  System Time: 2024-10-15T15:16:17.221998998Z
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

OS (e.g. from /etc/os-release):

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Kubernetes version: (use kubectl version)

Client Version: v1.31.1
Kustomize Version: v5.4.2

Any proxies or other special environment settings?:
Not that I am aware of. Standard Ubuntu OS.

@stmcginnis
Contributor

Thanks for the extra details! I didn't realize it wasn't on the questions template either. :]

@DerekTBrown
Author

Logs from a standalone kind create cluster --retain:

$ ./kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

kind-logs.tar.gz

@DerekTBrown
Author

Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?

https://github.com/kubernetes-sigs/kind/blob/main/pkg/cluster/internal/providers/docker/provision.go#L410-L417

@BenTheElder
Member

Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?

Switched link to permalink so we get the exact current code here:

if err := exec.Command("docker", append([]string{"run", "--name", name}, args...)...).Run(); err != nil {
	return err
}
logCtx, logCancel := context.WithTimeout(context.Background(), 30*time.Second)
logCmd := exec.CommandContext(logCtx, "docker", "logs", "-f", name)
defer logCancel()
return common.WaitUntilLogRegexpMatches(logCtx, logCmd, common.NodeReachedCgroupsReadyRegexp())

Run() waits for process completion. So the container is running, unless there's a bug in docker. (docker run returns the container ID after starting it, not to be confused with Run() on the process).

@BenTheElder
Member

Maybe docker run started returning the container ID earlier, before the process is actually started?

@BenTheElder
Member

Something funny is going on here. The error stack indicates that we are hitting this line, which seems like it is only expected to be invoked when the 30s log timeout is hit.

Sounds like docker logs is failing entirely. Is the container getting restarted once, or do we just have a race now (maybe #3758 (comment))?

If the latter, then we may have to add some check that the container is running before fetching logs ... but that's been assumed to be true from docker run for a long time now (which is supposed to be like docker container create && docker container start; after start it should be running and it should be possible to start streaming logs ...)
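
For illustration, a minimal sketch of the kind of pre-check being suggested: poll docker inspect until the container reports Running before streaming logs. This is not part of kind today; the function name and polling interval are made up for the example.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForContainerRunning polls `docker inspect` until the named container
// reports State.Running == true or the context expires.
func waitForContainerRunning(ctx context.Context, name string) error {
	for {
		out, err := exec.CommandContext(ctx, "docker", "inspect", "-f", "{{.State.Running}}", name).Output()
		if err == nil && strings.TrimSpace(string(out)) == "true" {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("container %q never reached the running state: %w", name, ctx.Err())
		case <-time.After(200 * time.Millisecond):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	fmt.Println(waitForContainerRunning(ctx, "kind-control-plane"))
}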

@fjammes

fjammes commented Oct 22, 2024

FYI, I am also hitting this bug with the following use case:

 fjammes@clrinfopo18  ~  cat /etc/os-release                            
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
 fjammes@clrinfopo18  ~  docker version
Client:
 Version:           24.0.6-rd
 API version:       1.43
 Go version:        go1.20.7
 Git commit:        da4c87c
 Built:             Wed Sep  6 16:41:53 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:08:01 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.26
  GitCommit:        3dd1e886e55dd695541fdcd67420c2888645a495
 runc:
  Version:          1.1.10
  GitCommit:        v1.1.10-0-g18a0cb0
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 fjammes@clrinfopo18  ~  kind version
kind v0.24.0 go1.22.6 linux/amd64
 fjammes@clrinfopo18  ~  cat kind-config.yaml 
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker

And the command line which causes the error:

kind create cluster --config  kind-config.yaml --name first
Creating cluster "first" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining worker nodes 🚜 
Set kubectl context to "kind-first"
You can now use your cluster with:

kubectl cluster-info --context kind-first

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/

kind create cluster --config  kind-config.yaml --name second
Creating cluster "second" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦 📦 📦 📦  
Deleted nodes: ["second-worker2" "second-control-plane" "second-worker3" "second-worker"]
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

This problem does not occur when the kind cluster only has one node (i.e. just the control plane).

I hope this will help.

@aojea
Contributor

aojea commented Oct 22, 2024

Let me ask @AkihiroSuda if this rings a bell for him; this is a really weird problem.

@BenTheElder
Member

BenTheElder commented Oct 22, 2024

I think we need to capture the docker logs error (which we seem to be getting); we don't know yet what is actually failing, since we just return this fixed "could not find a log line" error.

We aren't capturing the error when docker logs itself fails, as opposed to simply not seeing the line.
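
As an illustration of that idea, a minimal sketch that captures the stderr and exit status of the docker logs call so the underlying failure is surfaced instead of only the generic "could not find a log line" message (the container name and error formatting are placeholders, not the actual kind code):

package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, "docker", "logs", "-f", "kind-control-plane")
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr // keep whatever `docker logs` itself prints on failure

	// In kind the output is streamed and matched against a regexp; here we
	// simply run the command to show how its own error could be reported.
	if err := cmd.Run(); err != nil {
		fmt.Printf("docker logs failed: %v\nstderr: %s\n", err, stderr.String())
		return
	}
	fmt.Printf("docker logs produced %d bytes of output\n", stdout.Len())
}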

@fjammes

fjammes commented Oct 23, 2024

Maybe this will help?

# I have a running cluster
 docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED       STATUS       PORTS                       NAMES
6a75e33dd66a   kindest/node:v1.31.0   "/usr/local/bin/entr…"   2 hours ago   Up 2 hours   127.0.0.1:32939->6443/tcp   fjammes-main-7bdbcb-control-plane

# I try to create a new one
kind create cluster --config  kind-config.yaml --name first -v10

Creating cluster "first" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.31.0@sha256:53df588e04085fd41ae12de0c3fe4c72f7013bba32a20e7325357a1ac94ba865 present locally
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦 📦 📦 📦  
Deleted nodes: ["first-worker" "first-worker2" "first-worker3" "first-control-plane"]
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
Stack Trace: 
sigs.k8s.io/kind/pkg/errors.Errorf
	sigs.k8s.io/kind/pkg/errors/errors.go:41
sigs.k8s.io/kind/pkg/cluster/internal/providers/common.WaitUntilLogRegexpMatches
	sigs.k8s.io/kind/pkg/cluster/internal/providers/common/cgroups.go:84
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.createContainerWithWaitUntilSystemdReachesMultiUserSystem
	sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:417
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.planCreation.func2
	sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:115
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
	sigs.k8s.io/kind/pkg/errors/concurrent.go:30
runtime.goexit
	runtime/asm_amd64.s:1695

And the docker daemon logs

sudo journalctl -fu docker.service
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.683242937+02:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.683269407+02:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.754501999+02:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.754533431+02:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.856004824+02:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.856036429+02:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.966520233+02:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Oct 23 11:53:51 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:51.966539809+02:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.407574081+02:00" level=info msg="ignoring event" container=b2951ceeafaddc6f5b4433feed422970b232f5c05d997344d30b2cfabe357e6a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.514401992+02:00" level=info msg="ignoring event" container=f2a4c91121ff3f7da9b08059f02ea61c680f430685eeb19a2c99fc2cca788bb0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.519705680+02:00" level=info msg="ignoring event" container=096df3e7dd5fd215b6e10608e8a3481a29ae3398e149b85e67fca8c49be9719e module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.522163561+02:00" level=info msg="ignoring event" container=1f606f6a6a53c1f31be3c5526ad50a3dcab7837406931ca7b702629ffb056ecc module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.527215263+02:00" level=warning msg="ShouldRestart failed, container will not be restarted" container=f2a4c91121ff3f7da9b08059f02ea61c680f430685eeb19a2c99fc2cca788bb0 daemonShuttingDown=false error="restart canceled" execDuration=297.56881ms exitStatus="{137 2024-10-23 09:53:52.488308099 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.530169714+02:00" level=warning msg="ShouldRestart failed, container will not be restarted" container=096df3e7dd5fd215b6e10608e8a3481a29ae3398e149b85e67fca8c49be9719e daemonShuttingDown=false error="restart canceled" execDuration=433.673569ms exitStatus="{137 2024-10-23 09:53:52.497934074 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.532995375+02:00" level=warning msg="ShouldRestart failed, container will not be restarted" container=1f606f6a6a53c1f31be3c5526ad50a3dcab7837406931ca7b702629ffb056ecc daemonShuttingDown=false error="restart canceled" execDuration=440.453011ms exitStatus="{137 2024-10-23 09:53:52.502121269 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Oct 23 11:53:52 clrinfopo18 dockerd[3209]: time="2024-10-23T11:53:52.540629304+02:00" level=error msg="restartmanger wait error: container is marked for removal and cannot be started"

@BenTheElder
Member

Maybe this will help?

Thanks, but I don't think that tells us what happened, unfortunately. Most likely we'll have to patch the code in an environment that can reproduce this, to log the output of the docker logs call.

@fjammes

fjammes commented Nov 5, 2024

On my side, I have found my error: it was an issue in my kind configuration file that prevented the api-server from starting. It seems that if the api-server fails to start (i.e. kubeadm init fails), kind deletes the control-plane node, which makes debugging difficult. I should have used the --retain option to keep the nodes for debugging (i.e. kind create cluster --name k8s --retain --config kind-config.yaml). Thanks to the kind team for helping me on this topic!
