
Bottlerocket 1.25 is preventing nvidia driver to start #4250

Open
nikoul opened this issue Oct 18, 2024 · 20 comments · Fixed by bottlerocket-os/bottlerocket-core-kit#228
Labels
status/in-progress (This issue is currently being worked on), type/bug (Something isn't working)

Comments

@nikoul

nikoul commented Oct 18, 2024

Description

We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (g5.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The affected image is bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.25.0-388e1050.

Issue

When the driver fails to initialize, the node does not advertise its nvidia.com/gpu capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting.
The kubelet log on the admin container of an affected node contains the following error:

Oct 17 12:05:33 ip-xx-xx-xx-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"

Additional Information

  • We are using Karpenter to provision our nodes, but I don't think the issue is related to it. The NodeClaim remains in an Unknown state indefinitely and shows the following condition message:
    Message:               Resource "nvidia.com/gpu" was requested but not registered
  • The issue occurs on approximately 1 out of every 6 nodes, suggesting a potential race condition during node startup that interferes with NVIDIA driver initialization.
  • This problem was not observed with Bottlerocket 1.24.

Expected Behavior

The NVIDIA driver should initialize correctly on all GPU nodes, and the node should advertise its GPU capacity to ensure proper scheduling.

Steps to Reproduce

  1. Deploy Bottlerocket 1.25 on 6 (or more) g5.xlarge instances in a Kubernetes 1.30 cluster.
  2. Observe the node initialization process and monitor whether the NVIDIA driver is initialized correctly.
  3. Check the node’s GPU capacity and NodeClaim status.
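
For step 3, a quick check along these lines can be used (a sketch; it assumes kubectl access to the cluster, that Karpenter's NodeClaim CRD is installed, and <node-name> is a placeholder for the affected node):

# Dump the node's registered capacity (nvidia.com/gpu should appear when the driver initialized correctly)
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
# Check the NodeClaim status conditions
kubectl get nodeclaims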
@nikoul added the status/needs-triage and type/bug labels on Oct 18, 2024
@nikoul changed the title from "Bottlerocket 1.25 is preventing nvidia driver at startup" to "Bottlerocket 1.25 is preventing nvidia driver to start" on Oct 18, 2024
@larvacea
Member

Thank you for the report. I will try your reproduction recipe and let you know what I see.

@larvacea
Member

My first attempt to reproduce this failed (all instances had a functioning NVIDIA driver), which doesn't say very much if (as seems very likely) this is a race condition between driver load and kubelet start. I will keep looking. One mildly troubling indication is that I do not see any systemd scheduling that ensures that driver load occurs before kubelet start.

If you would like to dig deeper yourself, you might look at this on a failing node:

bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 18 17:40:27 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 18 17:40:28 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:28 [INFO] Updated modules dependencies
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal driverdog[1356]: 17:40:29 [INFO] Loaded kernel modules
Oct 18 17:40:29 ip-192-168-68-128.us-west-2.compute.internal systemd[1]: Finished Load Tesla kernel modules.

This is the systemd service that loads the NVIDIA driver kmod on a G5 (or similar) instance. If this happens before kubelet start on successful boots and after kubelet start on failed boots, we will have found the race condition you suspected.
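
As a quick check (a minimal sketch; it assumes the same host root shell used for journalctl above), the start timestamps of the two units can be compared directly:

systemctl show -p ActiveEnterTimestamp load-tesla-kernel-modules.service kubelet.service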

@larvacea
Member

I used systemd-analyze dump to look at the constraints it used during boot on a g5.xlarge with the 1.25.0 AMI, and I believe we can rule out a race between kmod and kubelet:

bash-5.1# systemd-analyze dump kubelet.service
...
        After: configured.target (origin-file)
...
bash-5.1# systemd-analyze dump configured.target
...
        After: preconfigured.target (origin-file origin-default)
...
bash-5.1# systemd-analyze dump preconfigured.target
...
        After: load-tesla-kernel-modules.service (origin-default)

This should ensure that the driver is present before kubelet starts, so at least for now I would rule out a race between these two systemd services.
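
For a more compact view of the same ordering, the blocking chain for kubelet can also be printed (a sketch; output omitted here):

systemd-analyze critical-chain kubelet.service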

@larvacea
Member

Any configuration details you could share would be helpful in reproducing this problem. It would be interesting to see the device plugin logging, since that's the source of the kubelet error message you provide. This is what I see on my instances:

bash-5.1# journalctl | grep nvidia-device
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372837    1558 main.go:199] Starting FS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.372929    1558 main.go:206] Starting OS watcher.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373179    1558 main.go:221] Starting Plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.373193    1558 main.go:278] Loading configuration.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386371    1558 main.go:303] Updating config with default resource matching patterns.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386452    1558 main.go:314]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: Running with config:
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "version": "v1",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "flags": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "migStrategy": "none",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "failOnInitError": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "mpsRoot": "",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "nvidiaDriverRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "nvidiaDevRoot": "/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "gdsEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "mofedEnabled": false,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "useNodeFeatureAPI": null,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "deviceDiscoveryStrategy": "auto",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "plugin": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "passDeviceSpecs": true,
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "deviceListStrategy": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "volume-mounts"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       ],
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "deviceIDStrategy": "index",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       "containerDriverRoot": "/driver-root"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "resources": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "gpus": [
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "pattern": "*",
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:         "name": "nvidia.com/gpu"
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:       }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     ]
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   },
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   "sharing": {
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:     "timeSlicing": {}
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]:   }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: }
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.386468    1558 main.go:317] Retrieving plugins.
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.541469    1558 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.546032    1558 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 18 19:40:34 ip-192-168-9-203.us-west-2.compute.internal nvidia-device-plugin[1558]: I1018 19:40:34.553200    1558 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

There is one possibly-relevant change in 1.25.0. We added support for GPU time-slicing (as seen in #4230). Normally I would expect the nvidia-device-plugin process to run until node shutdown. If the plugin crashes or shuts down for any reason, kubelet will see an unexpected eof on the plugin socket and print that ListAndWatch ended unexpectedly error message. If you can, please share the nvidia-device-plugin logging from one of your failed instances.
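
If it helps, the plugin's journal entries can be pulled directly by their syslog identifier, which is how the lines above are tagged:

journalctl -t nvidia-device-plugin -o short-precise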

@nikoul
Author

nikoul commented Oct 19, 2024

Hello again, and thanks for investigating this!
I've retrieved some logs from a faulty node, here they are:

bash-5.1# journalctl -u load-tesla-kernel-modules
Oct 17 12:05:19 localhost systemd[1]: Starting Load Tesla kernel modules...
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Updated modules dependencies
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal driverdog[1375]: 12:05:21 [INFO] Loaded kernel modules
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal systemd[1]: Finished Load Tesla kernel modules.
bash-5.1# journalctl -u kubelet
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal systemd[1]: Starting Kubelet...
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1639]: unpacking localhost/kubernetes/pause:0.1.0 (sha256:e287062022dab2bcd00be3aa5b65a10318c93c3426cc7b72126a2108d28c47aa)...done
Oct 17 12:05:32 ip-xx-xx-x-xx.ec2.internal ctr[1648]: io.cri-containerd.image=managed,io.cri-containerd.pinned=pinned
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --container-runtime-endpoint has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --register-with-taints has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: Flag --pod-infra-container-image has been deprecated, will be removed in a future release. Image garbage collector will get sandbox image information from CRI.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.004911    1655 server.go:205] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012604    1655 server.go:484] "Kubelet version" kubeletVersion="v1.30.1-eks-e564799"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.012638    1655 server.go:486] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.030358    1655 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062024    1655 container_manager_linux.go:265] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.062066    1655 container_manager_linux.go:270] "Creating Container Manager object based on Node Config" nodeConfig={"NodeName":"ip-xx-xx-x-xx.ec2.internal","RuntimeCgroupsName":"/runtime.slice/containerd.service","SystemCgroupsName":"","KubeletCgroupsName":"","KubeletOOMScoreAdj":-999,"ContainerRuntime":"","CgroupsPerQOS":true,"CgroupRoot":"/","CgroupDriver":"systemd","KubeletRootDir":"/var/lib/kubelet","ProtectKernelDefaults":true,"KubeReservedCgroupName":"/runtime","SystemReservedCgroupName":"","ReservedSystemCPUs":{},"EnforceNodeAllocatable":{"pods":{}},"KubeReserved":{"cpu":"80m","ephemeral-storage":"1Gi","memory":"893Mi"},"SystemReserved":null,"HardEvictionThresholds":[{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.1},"GracePeriod":0,"MinReclaim":null},{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.15},"GracePeriod":0,"MinReclaim":null},{"Signal":"imagefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05},"GracePeriod":0,"MinReclaim":null}],"QOSReserved":{},"CPUManagerPolicy":"none","CPUManagerPolicyOptions":null,"TopologyManagerScope":"container","CPUManagerReconcilePeriod":10000000000,"ExperimentalMemoryManagerPolicy":"None","ExperimentalMemoryManagerReservedMemory":null,"PodPidsLimit":1048576,"EnforceCPULimits":true,"CPUCFSQuotaPeriod":100000000,"TopologyManagerPolicy":"none","TopologyManagerPolicyOptions":null}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063928    1655 topology_manager.go:138] "Creating topology manager with none policy"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.063947    1655 container_manager_linux.go:301] "Creating device plugin manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.066091    1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069168    1655 kubelet.go:400] "Attempting to sync node with API server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069195    1655 kubelet.go:301] "Adding static pod path" path="/etc/kubernetes/static-pods/"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.069248    1655 kubelet.go:312] "Adding apiserver pod source"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.070036    1655 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.073362    1655 kuberuntime_manager.go:261] "Container runtime initialized" containerRuntime="containerd" version="1.7.22+bottlerocket" apiVersion="v1"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.079193    1655 kubelet.go:815] "Not starting ClusterTrustBundle informer because we are in static kubelet mode"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: W1017 12:05:33.081030    1655 probe.go:272] Flexvolume plugin directory at /var/lib/kubelet/plugins/volume/exec does not exist. Recreating.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.082417    1655 server.go:1264] "Started kubelet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal systemd[1]: Started Kubelet.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087003    1655 ratelimit.go:55] "Setting rate limiting for endpoint" service="podresources" qps=100 burstTokens=10
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087388    1655 server.go:227] "Starting to serve the podresources API" endpoint="unix:/var/lib/kubelet/pod-resources/kubelet.sock"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.087484    1655 server.go:163] "Starting to listen" address="0.0.0.0" port=10250
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.088330    1655 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.089641    1655 server.go:455] "Adding debug handlers to kubelet server"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.090462    1655 kubelet.go:1467] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090469    1655 volume_manager.go:291] "Starting Kubelet Volume Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090500    1655 desired_state_of_world_populator.go:149] "Desired state populator starts to run"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.090824    1655 reconciler.go:26] "Reconciler: start to sync state"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093738    1655 factory.go:221] Registration of the containerd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093765    1655 factory.go:221] Registration of the systemd container factory successfully
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.093862    1655 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110556    1655 cpu_manager.go:214] "Starting CPU manager" policy="none"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110585    1655 cpu_manager.go:215] "Reconciling" reconcilePeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.110606    1655 state_mem.go:36] "Initialized new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.136374    1655 policy_none.go:49] "None policy: Start"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137607    1655 memory_manager.go:170] "Starting memorymanager" policy="None"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.137642    1655 state_mem.go:35] "Initializing new in-memory state store"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.138216    1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv4"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144592    1655 kubelet_network_linux.go:50] "Initialized iptables rules." protocol="IPv6"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144778    1655 status_manager.go:217] "Starting to sync pod status with apiserver"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.144850    1655 kubelet.go:2337] "Starting kubelet main sync loop"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.144903    1655 kubelet.go:2361] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.191233    1655 kubelet_node_status.go:73] "Attempting to register node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.246414    1655 kubelet.go:2361] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250055    1655 manager.go:479] "Failed to read data from checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250238    1655 container_log_manager.go:186] "Initializing container log rotate workers" workers=1 monitorPeriod="10s"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347    1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.254400    1655 eviction_manager.go:282] "Eviction manager: failed to get summary stats" err="failed to get node info: node \"ip-xx-xx-x-xx.ec2.internal\" not found"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.734727    1655 kubelet_node_status.go:76] "Successfully registered node" node="ip-xx-xx-x-xx.ec2.internal"
Oct 17 12:05:34 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:34.070697    1655 apiserver.go:52] "Watching apiserver"
bash-5.1# journalctl | grep nvidia-device
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706    1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.231641    1667 main.go:206] Starting OS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232173    1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.232186    1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236537    1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236586    1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.236597    1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.299855    1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491    1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501    1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507    1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313088    1667 main.go:221] Starting Plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313097    1667 main.go:278] Loading configuration.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313542    1667 main.go:303] Updating config with default resource matching patterns.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313587    1667 main.go:314]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: Running with config:
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "version": "v1",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "flags": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "migStrategy": "none",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "failOnInitError": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mpsRoot": "",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDriverRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "nvidiaDevRoot": "/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gdsEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "mofedEnabled": false,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "useNodeFeatureAPI": null,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "deviceDiscoveryStrategy": "auto",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "plugin": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "passDeviceSpecs": true,
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceListStrategy": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "volume-mounts"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       ],
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "deviceIDStrategy": "index",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "cdiAnnotationPrefix": "cdi.k8s.io/",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       "containerDriverRoot": "/driver-root"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "resources": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "gpus": [
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "pattern": "*",
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:         "name": "nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:       }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     ]
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   },
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   "sharing": {
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:     "timeSlicing": {}
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]:   }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: }
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.313597    1667 main.go:317] Retrieving plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.323855    1667 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

I'm keeping that node around in case you have other ideas for logs that could help. Thanks again!

@arnaldo2792
Contributor

@larvacea, we also added nvidia-persistenced (bottlerocket-os/bottlerocket-core-kit#122), and time-slicing wouldn't play a role here unless users enabled it, which it seems they didn't. I wonder if nvidia-persistenced is interfering with the device plugin?

@nikoul, on the failing instance, could you please check the journal for any other errors:

journalctl -p4

@nikoul
Author

nikoul commented Oct 21, 2024

@arnaldo2792 Of course, here is the error log from journalctl -p4:

bash-5.1# journalctl -p4
Oct 17 12:05:14 localhost kernel:  #3
Oct 17 12:05:14 localhost kernel: GPT:Primary header thinks Alt. header is not at the end of the disk.
Oct 17 12:05:14 localhost kernel: GPT:2097151 != 125829119
Oct 17 12:05:14 localhost kernel: GPT:Alternate GPT header not at the end of the disk.
Oct 17 12:05:14 localhost kernel: GPT:2097151 != 125829119
Oct 17 12:05:14 localhost kernel: GPT: Use GNU Parted to correct GPT errors.
Oct 17 12:05:14 localhost kernel: Module i8042 is blacklisted
Oct 17 12:05:14 localhost kernel: Module i8042 is blacklisted
Oct 17 12:05:15 localhost kernel: XFS (nvme1n1p1): EXPERIMENTAL Large extent counts feature in use. Use at your own risk!
Oct 17 12:05:17 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:17 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:19 localhost systemd-networkd[1192]: eth0: found matching network '/etc/systemd/network/10-eth0.network', based on potentially unpredictable interface name.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia: loading out-of-tree module taints kernel.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia: module license 'NVIDIA' taints kernel.
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: Disabling lock debugging due to kernel taint
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel:
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.06  Wed Jun 26 06:46:07 UTC 2024
Oct 17 12:05:21 ip-xx-xx-x-xx.ec2.internal kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Oct 17 12:05:50 ip-xx-xx-x-xx.ec2.internal systemd-run[12]: Failed to create bus connection: No such file or directory
Oct 17 12:06:08 ip-xx-xx-x-xx.ec2.internal kernel: process 'linkerd2-proxy-' launched '/usr/lib/linkerd/linkerd2-proxy' with NULL argv: empty string added

@awoimbee

I thought I had a Karpenter issue following upgrade to v1 but I might be having exactly the same issue: aws/karpenter-provider-aws#7046 (comment)

@nikoul
Author

nikoul commented Oct 25, 2024

I thought I had a Karpenter issue following upgrade to v1 but I might be having exactly the same issue: aws/karpenter-provider-aws#7046 (comment)

Oh, thanks for sharing. Yeah, if you don't have the issue with 1.24, then it's very likely to be that same issue.

@bcressey
Contributor

From the logs, this seems like the problematic sequence:

# plugin manager starts
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347    1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"

# k8s device plugin registers with kubelet
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.305466    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.309607    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312453    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

# kubelet.sock re-created? plugin restarts
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312491    1667 main.go:246] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312501    1667 main.go:353] Stopping plugins.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.312507    1667 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
...

# k8s device plugin registers with kubelet again
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.324374    1667 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.325133    1655 server.go:144] "Got registration request from device plugin with resource" resourceName="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.326374    1667 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet

# kubelet errors out
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.335816    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Unavailable desc = error reading from server: EOF" resource="nvidia.com/gpu"
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: E1017 12:05:33.378294    1655 client.go:90] "ListAndWatch ended unexpectedly for device plugin" err="rpc error: code = Canceled desc = grpc: the client connection is closing" resource="nvidia.com/gpu"

For affected nodes, journalctl -t kubelet -t nvidia-device-plugin -o short-precise should line up logs from the components involved with the higher precision timestamps on the left column.

Specifically, this inotify event firing appears to cause the trouble:

inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.

@bcressey
Contributor

Also of interest, kubelet's plugin manager starts after the FS watcher starts:

Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal nvidia-device-plugin[1667]: I1017 12:05:33.230706    1667 main.go:199] Starting FS watcher.
Oct 17 12:05:33 ip-xx-xx-x-xx.ec2.internal kubelet[1655]: I1017 12:05:33.250347    1655 plugin_manager.go:118] "Starting Kubelet Plugin Manager"

Since the FS watcher watches the directory, it will see the initial kubelet.sock creation and treat it as a kubelet restart.
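
As a rough illustration of the mechanism (a sketch only; it assumes inotify-tools on a generic Linux host, since Bottlerocket itself does not ship inotifywait), watching the device-plugins directory shows the create events that the plugin's FS watcher reacts to:

inotifywait -m -e create /var/lib/kubelet/device-plugins/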

@cbgbt
Contributor

cbgbt commented Oct 29, 2024

Thanks for the notes @bcressey. I'm working on changes to ensure nvidia-device-plugin starts after the kubelet socket becomes available.

@awoimbee

awoimbee commented Oct 30, 2024

I just had the same issue on Bottlerocket OS 1.24.1 (aws-k8s-1.31-nvidia); it happens far less often than on 1.25, but it did just happen.
I never had this issue until recently (≃ 2024-10, which matches when I switched to EKS 1.31).
Karpenter tells me Resource "nvidia.com/gpu" was requested but not registered, and the node has capacity nvidia.com/gpu: 0.
Sadly this happened in prod, where SSH access and the admin container are disabled, so I'm not sure how I can get logs out.

@cbgbt added the status/in-progress label and removed the status/needs-triage label on Oct 30, 2024
@cbgbt
Contributor

cbgbt commented Oct 30, 2024

My understanding of the bug is that there's nothing particular about 1.25.0 that causes this race condition; I suspect it has been possible since the device plugin was introduced, though perhaps the timing now makes it somewhat more likely.

As an update, I'm in the process of testing my changes right now. Hoping to have a PR out with the changes soon.

@cbgbt
Contributor

cbgbt commented Nov 1, 2024

Sorry for the delay here; the race condition was so rare in my testing environment that it was challenging to prove that we had actually resolved the issue.

After more thorough investigation, there are several unfortunate behaviors that contribute to this bug.

In this case:

  • The kubelet systemd service is Type=notify, which means it signals to systemd when it has finished starting (here in the kubelet sources)
  • The device-plugin socket is started in a goroutine in kubelet. There's no synchronization between when the socket is finally created and the systemd notification mentioned above, so kubelet can mark itself as "started" before the socket is created.
  • The nvidia-k8s-device-plugin is written to restart its k8s plugins if it notices a "create" event for the kubelet socket.
  • If the nvidia-k8s-device-plugin has started its device plugin gRPC server and kubelet has already registered it, the plugin's restart can put kubelet in an error state: kubelet appears to keep trying to communicate with the "old" socket and then removes the new registration.

In practice, we can always avoid the bug by refraining from starting nvidia-k8s-device-plugin until we know that kubelet.sock is in place. That's the approach taken in bottlerocket-os/bottlerocket-core-kit#228.
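
The general shape of that approach (a minimal sketch, not the actual core-kit change) is to block until kubelet's device-plugin socket exists before launching the plugin:

# Wait for kubelet to create its device-plugin registration socket...
until [ -S /var/lib/kubelet/device-plugins/kubelet.sock ]; do
  sleep 1
done
# ...then exec the device plugin (the binary path depends on the image)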

The issues caused by the device plugin restarts definitely require further investigation though.

@cbgbt
Contributor

cbgbt commented Nov 1, 2024

bottlerocket-os/bottlerocket-core-kit#228 is merged, which should resolve this in an upcoming Bottlerocket release!


@bcressey has done some great work looking into why nvidia-k8s-device-plugin restarts can sometimes lead to GPU resources failing to be exposed. I'll summarize what he's mentioned to me.

It seems that you can reliably "break" the exposure of GPUs via the device plugin by doing something like this:

for i in {1..100} ; do
  echo $i
  kill -sHUP $(pgrep -f nvidia-device-plugin)
  sleep 0.2
  (journalctl -t kubelet -xe|tail -n1|grep 'client connection') && break
done

This works because SIGHUP also causes the device plugin to restart, which is what can lead to the broken state we've been discussing.

For some background, you can find more on how device plugins are designed here.

Ok, so how do restarts work?

  • The kubelet plugin watcher claims that it's used for device plugins, and that removing the socket will trigger a deregister action. This doesn't appear to be implemented for device plugins.
  • What kubelet actually does after connecting to a plugin socket is just wait for ListAndWatch responses and bail out if there's an error.
  • Per this comment on a grpc-go issue, it seems like there isn't an API for a grpc server to close a connection, which means that the device-plugin doesn't have a good way to signal to kubelet that it would like to end the session.

So what is likely happening here is that kubelet tries to continue the "old" session after nvidia-k8s-device-plugin restarts and forms a new connection. When the "old" session eventually fails its ListAndWatch call, we speculate that it de-registers the "new" registration from the device plugin.

I'll keep this issue open while:

  • We work to include my aforementioned patch in a release of Bottlerocket
  • We find the right set of issues to file with upstream projects for the retry behavior I mentioned.

@nikoul
Author

nikoul commented Nov 1, 2024

Thank you very much for the investigation and the upcoming fix!

@apjneeraj

apjneeraj commented Nov 6, 2024

I get this same error too when I describe the NodeClaim. It happens every time time-slicing is enabled: the NodeClaim state is False but the node is in Ready state.

Also, while the node is Ready, a pod that requested "nvidia.com/gpu: 1" remains in Pending state and is not scheduled. Once I delete the pod and quickly re-apply it, the pod gets scheduled on that node and works, but the NodeClaim stays in False state with the same error: Resource "nvidia.com/gpu" was requested but not registered.

And if I delete the pod and wait for Karpenter to consolidate that node, it never happens, and I need to manually delete the NodeClaim to clean up that node.

If time-slicing is not enabled, I haven't seen this issue.

Bottlerocket version: Bottlerocket OS 1.26.1 (aws-k8s-1.30-nvidia)

@apjneeraj

apjneeraj commented Nov 13, 2024

Thanks for the new Bottlerocket version.

I tried again with the latest Bottlerocket, Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia), and the behavior persists.

I am using Karpenter to launch nodes based on the resource requirement below in a pod, which triggers Karpenter to launch a GPU node. The node becomes Ready quickly, but the NodeClaim always stays in False state with the same error, and the pod remains Pending and never gets scheduled.

Note: We have time-slicing configured.

NodeClaim Error:
Resource "nvidia.com/gpu" was requested but not registered

Resource:

    resources:
      limits:
        # nvidia.com/gpu.shared: 1
        nvidia.com/gpu: 1
        memory: "4Gi"
        cpu: "1"

My assumption is that Karpenter is working as expected: it launches the instance for the nvidia.com/gpu resource, which is what was requested, but due to the time-slicing configuration the node now only advertises nvidia.com/gpu.shared while the pod requested nvidia.com/gpu: 1, which can be validated in the pod event logs. In the events, Karpenter nominated the newly created node, while the default-scheduler failed to schedule the pod because it is looking for nvidia.com/gpu capacity, which is no longer advertised or available.

Events:

Warning  FailedScheduling  13s   default-scheduler  0/29 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had untolerated taint {workload/xxxxx: }, 1 node(s) had untolerated taint {workload/yyyyy: }, 18 node(s) had untolerated taint {workload/default: }, 2 node(s) had untolerated taint {workload/zzzzz: }, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {workload/aaaaaa: }. preemption: 0/29 nodes are available: 1 No preemption victims found for incoming pod, 28 Preemption is not helpful for scheduling.
  Normal   Nominated         12s   karpenter          Pod should schedule on: nodeclaim/gpu-node-pool-mbxrg, node/ip-172-xx-xx-xxx.us-west-2.compute.internal

New pods with the resource requirements below schedule just fine on the same node, since the shared capacity is advertised and available.

Resource:

    resources:
      limits:
        nvidia.com/gpu.shared: 1
        memory: "4Gi"
        cpu: "1"

I had to delete the original pod and re-create it with the shared requirement, and it works fine after that. The remaining issue is that the NodeClaim is in a False state and will not be reclaimed by Karpenter when there is no workload running.

EDIT: I was able to fix my issue by not renaming the resource from nvidia.com/gpu to nvidia.com/gpu.shared. In the Bottlerocket configuration, the rename is enabled by default.

Ref: https://bottlerocket.dev/en/os/1.25.x/api/settings/kubelet-device-plugin/#nvidia-time-slicing-rename-by-default
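
A quick way to confirm which resource name a node ends up advertising (a sketch; <node-name> is a placeholder) is to grep the node description for both names:

kubectl describe node <node-name> | grep -E 'nvidia.com/gpu'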

@awoimbee

awoimbee commented Nov 15, 2024

I got a NodeClaim stuck in Unknown state on v1.26.2 (aws-k8s-1.31-nvidia), again causing pods not to be scheduled.
I do not use time-slicing, and I don't have any special configuration for the GPUs.

Karpenter and event log:

    {"level":"INFO","time":"2024-11-15T09:30:18.466Z","logger":"controller","message":"deleted nodeclaim","commit":"6174c75","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"5f6ddc41-3b7a-4051-be4b-fdfc8cc3fe3e","Node":{"name":"i-0bb15c620a9cf7aba.eu-west-1.compute.internal"},"provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba"}
43x Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-034e7963-5809-4383-a466-a6399fb2e763-8477f46859-z5h8j.180819b84ef9dc25)
    FailedScheduling: Failed to schedule pod, would schedule against uninitialized nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-ac1e4646-ac55-4d4b-a254-4f438f9ab53e-5fdfb9f9b4-4d65w.1807dbd7b70f6f8c)
17x Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-ac1e4646-ac55-4d4b-a254-4f438f9ab53e-5fdfb9f9b4-4d65w.1807dbcbc6fa446d)
    DisruptionBlocked: Cannot disrupt NodeClaim: state node isn't initialized (mynodepool-btjcg.1807cda0f184a647)
    Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg, node/i-0bb15c620a9cf7aba.eu-west-1.compute.internal (myapp-b9a89a1f-8621-4b1c-beca-0bdf13f28f41-6f6bc894bc-nxk4z.1807cd9fd93d0299)
    Registered: Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered (mynodepool-btjcg.1807cd88e828352a)
    {"level":"INFO","time":"2024-11-14T10:02:49.560Z","logger":"controller","message":"registered nodeclaim","commit":"6174c75","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"ac2e9b5d-a471-4db3-9cfe-a04b47e89515","provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba","Node":{"name":"i-0bb15c620a9cf7aba.eu-west-1.compute.internal"}}
    DisruptionBlocked: Cannot disrupt NodeClaim: state node doesn't contain both a node and a nodeclaim (mynodepool-btjcg.1807cd84c0cc2da1)
    Launched: Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched (mynodepool-btjcg.1807cd839b9f2162)
    {"level":"INFO","time":"2024-11-14T10:02:26.850Z","logger":"controller","message":"launched nodeclaim","commit":"6174c75","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"mynodepool-btjcg"},"namespace":"","name":"mynodepool-btjcg","reconcileID":"70de79d5-e652-4920-a3ec-aea20d0baf03","provider-id":"aws:///eu-west-1a/i-0bb15c620a9cf7aba","instance-type":"g5.4xlarge","zone":"eu-west-1a","capacity-type":"on-demand","allocatable":{"cpu":"15890m","ephemeral-storage":"538926258176","memory":"57691Mi","nvidia.com/gpu":"1","pods":"234","vpc.amazonaws.com/pod-eni":"34"}}
    Nominated: Pod should schedule on: nodeclaim/mynodepool-btjcg (myapp-b9a89a1f-8621-4b1c-beca-0bdf13f28f41-6f6bc894bc-nxk4z.1807cd82fb7efa63)
    {"level":"INFO","time":"2024-11-14T10:02:24.196Z","logger":"controller","message":"created nodeclaim","commit":"6174c75","controller":"provisioner","namespace":"","name":"","reconcileID":"862b75c6-8352-4a5f-8222-0069cc615e4a","NodePool":{"name":"mynodepool"},"NodeClaim":{"name":"mynodepool-btjcg"},"requests":{"cpu":"210m","memory":"240Mi","nvidia.com/gpu":"1","pods":"9"},"instance-types":"g5.4xlarge"}
nodeclaim.yaml (right after I requested its deletion)
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
    karpenter.k8s.aws/ec2nodeclass-hash: "6440581379273964080"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "13389053327402833262"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-11-14T10:02:24Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-11-15T09:25:36Z"
  finalizers:
  - karpenter.sh/termination
  generateName: mynodepool-
  generation: 2
  labels:
    karpenter.k8s.aws/instance-category: g
    karpenter.k8s.aws/instance-cpu: "16"
    karpenter.k8s.aws/instance-cpu-manufacturer: amd
    karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
    karpenter.k8s.aws/instance-family: g5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-gpu-count: "1"
    karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
    karpenter.k8s.aws/instance-gpu-memory: "24576"
    karpenter.k8s.aws/instance-gpu-name: a10g
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-local-nvme: "600"
    karpenter.k8s.aws/instance-memory: "65536"
    karpenter.k8s.aws/instance-network-bandwidth: "10000"
    karpenter.k8s.aws/instance-size: 4xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: mynodepool
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node-role: training
    node.kubernetes.io/instance-type: g5.4xlarge
    nvidia.com/gpu: A10G
    topology.k8s.aws/zone-id: euw1-az1
    topology.kubernetes.io/region: eu-west-1
    topology.kubernetes.io/zone: eu-west-1a
  name: mynodepool-btjcg
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: mynodepool
    uid: 6bd5acad-c9b4-4350-a369-b151eb571089
  resourceVersion: "460997815"
  uid: 8a2001a9-af0d-4006-8741-1d515f993d70
spec:
  expireAfter: Never
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: large-disk
  requirements:
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values:
    - "60000"
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - g5.4xlarge
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - mynodepool
  - key: nvidia.com/gpu
    operator: In
    values:
    - A10G
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - g5
  resources:
    requests:
      cpu: 210m
      memory: 240Mi
      nvidia.com/gpu: "1"
      pods: "9"
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "true"
status:
  allocatable:
    cpu: 15890m
    ephemeral-storage: "538926258176"
    memory: 57691Mi
    nvidia.com/gpu: "1"
    pods: "234"
    vpc.amazonaws.com/pod-eni: "34"
  capacity:
    cpu: "16"
    ephemeral-storage: 600G
    memory: 60620Mi
    nvidia.com/gpu: "1"
    pods: "234"
    vpc.amazonaws.com/pod-eni: "34"
  conditions:
  - lastTransitionTime: "2024-11-14T10:12:26Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-11-14T10:03:07Z"
    message: Resource "nvidia.com/gpu" was requested but not registered
    reason: ResourceNotRegistered
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-11-15T09:25:37Z"
    message: ""
    reason: InstanceTerminating
    status: "True"
    type: InstanceTerminating
  - lastTransitionTime: "2024-11-14T10:02:26Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-11-14T10:02:49Z"
    message: Initialized=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-11-14T10:02:49Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-06455fd9d0e2a0590
  nodeName: i-0bb15c620a9cf7aba.eu-west-1.compute.internal
  providerID: aws:///eu-west-1a/i-0bb15c620a9cf7aba
