DCGM will not run on GPU nodes with Bottlerocket OS #3992
Hey @vitaly-dt, thanks for reporting this. We do have the NVIDIA containerd runtime configured: see packages/containerd/containerd-config-toml_k8s_nvidia_containerd_sock (line 38, commit c4b17dd) in the bottlerocket repo.
@ytsssun, let's try to reproduce this case by following https://github.com/NVIDIA/dcgm-exporter.
@ytsssun, I think I know why DCGM might not work as expected; it is related to the change in how unprivileged pods get access to the GPUs.

@vitaly-dt, we are working on an API to allow opting in to the previous behavior, where unprivileged pods (aka normal workloads) have access to all the GPUs in an instance if the pod is configured with NVIDIA_VISIBLE_DEVICES=all.
Thank you @arnaldo2792 for your quick response.
I can confirm that overall, none of the NVIDIA operator-required services are running on Bottlerocket in Kubernetes except for the nvidia-device-plugin. Here are the logs, and also the nvidia-smi output.

nvidia-smi output:
nvidia-device-plugin:
There is a PR for the implementation, but we are still planning when this feature will land in a release.
With at least CAP_SYS_ADMIN, DCGM should be able to run. FWIW, in Bottlerocket, we run the NVIDIA device plugin as a system service, so you don't have to deploy it in your cluster. Running a second copy of it could cause unexpected behavior at runtime.
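A minimal sketch of what granting that capability to the DCGM exporter container could look like (the pod name and image tag here are placeholders, not taken from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dcgm-exporter-test              # hypothetical name, for illustration
spec:
  containers:
    - name: dcgm-exporter
      image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # replace with a pinned tag
      ports:
        - containerPort: 9400           # dcgm-exporter's default metrics port
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]            # CAP_SYS_ADMIN, as discussed above
```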
Thank you - indeed it works with CAP_SYS_ADMIN.
FWIW, I was able to reproduce this. Feel free to track the PR (#3994). Using the changes from the PR, I was able to build an NVIDIA variant and launch a g5 instance. By enabling the new setting through the userdata below, I got the nvidia-smi output below:
nvidia-smi output
Userdata
Neat!
@vitaly-dt, please keep in mind that doing ⬆️ will grant all the containers with NVIDIA_VISIBLE_DEVICES=all access to all the GPUs, disregarding whatever resource constraints were set in the pod. So even if in your pod you set a limit such as nvidia.com/gpu: 1, in a container created with NVIDIA_VISIBLE_DEVICES=all you will still see all the GPUs on the host.
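As a concrete sketch of that warning (the names and image below are illustrative, not from the thread), a pod that requests a single GPU but sets NVIDIA_VISIBLE_DEVICES=all would still see every GPU on the node under this opt-in behavior:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limits-example              # hypothetical name, for illustration
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA base image works here
      command: ["nvidia-smi"]           # prints every GPU the container can see
      resources:
        limits:
          nvidia.com/gpu: 1             # the scheduler allocates one GPU...
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"                  # ...but the container still sees all of them
```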
Image I'm using:
Bottlerocket OS 1.20.0 (aws-k8s-1.29-nvidia)
What I expected to happen:
According to the documentation, all NVIDIA tools should work on the Bottlerocket aws-k8s-*-nvidia variants (see the variants documentation).
What actually happened:
The DCGM exporter and other NVIDIA tools are not able to run, probably because the NVIDIA container runtime is missing.
How to reproduce the problem:
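As a sketch (not copied from the issue, and assuming the upstream dcgm-exporter image): deploying a plain dcgm-exporter DaemonSet to a Bottlerocket NVIDIA GPU node, without any of the workarounds discussed above, should hit the same failure.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter                   # hypothetical name, for illustration
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g5.2xlarge   # adjust to match your GPU nodes
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:latest   # replace with a pinned tag
          ports:
            - containerPort: 9400       # dcgm-exporter's default metrics port
```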
Error message:
Similar issues:
#3967
#2347