DCGM will not run on GPU nodes with Bottlerocket OS #3992

Closed
vitaly-dt opened this issue May 28, 2024 · 7 comments
Labels
status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@vitaly-dt

Image I'm using:
Bottlerocket OS 1.20.0 (aws-k8s-1.29-nvidia)

What I expected to happen:
According to the documentation, all NVIDIA tools should work on Bottlerocket aws-k8s-*-nvidia variants.
Link

What actually happened:
The DCGM exporter, like other NVIDIA tools, is unable to run, most likely because the NVIDIA container runtime is missing.

How to reproduce the problem:

  • Add a g6 GPU node based on the Bottlerocket Kubernetes NVIDIA AMI.
  • Deploy the DCGM exporter DaemonSet to monitor the GPU metrics (for example via Helm, as sketched below).
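
For reference, one common way to deploy the exporter is via its Helm chart (a sketch based on the NVIDIA/dcgm-exporter README; the chart repo and release values may need adjusting for your cluster):

# Add the DCGM exporter chart repo and install it with a generated release name
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter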

Error message:

time="2024-05-28T06:19:55Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-28T06:19:55Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc000103520)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc00021e810)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc0004c7e50}, 0xc000131b70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0003c15c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc00019b080?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc00019b080, 0xc0003c15c0, {0xc0000400a0, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc0001cd000, {0x1cf3300?, 0x2a0c420}, {0xc0000400a0, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc000131f20?, {0xc0000400a0?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
2024/05/28 06:24:56 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
time="2024-05-28T06:24:56Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-28T06:24:56Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc0003c76c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc000127fc0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc0000fd680}, 0xc0004e1b70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0003c19c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000342c0?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc0000342c0, 0xc0003c19c0, {0xc000126050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc0004c0400, {0x1cf3300?, 0x2a0c420}, {0xc000126050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc0004e1f20?, {0xc000126050?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"

Similar issues:
#3967
#2347

@vitaly-dt added the status/needs-triage and type/bug labels on May 28, 2024
@arnaldo2792
Contributor

Hey @vitaly-dt, thanks for reporting this. We do have the NVIDIA containerd runtime configured:

[plugins."io.containerd.grpc.v1.cri".containerd]

@ytsssun, let's try to reproduce this case by following https://github.com/NVIDIA/dcgm-exporter

@arnaldo2792
Contributor

@ytsssun, I think I know why DCGM might not work as expected: it is related to the NVIDIA_VISIBLE_DEVICES problem observed by others (see #3937 (comment)).

@vitaly-dt, we are working on an API that allows opting in to the previous behavior, where unprivileged pods (i.e., normal workloads) have access to all the GPUs in an instance if the pod is configured with NVIDIA_VISIBLE_DEVICES=all. Please see this doc provided by NVIDIA to read about when NVIDIA_VISIBLE_DEVICES=all is an acceptable configuration.
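
For context, "configured with NVIDIA_VISIBLE_DEVICES=all" means a pod spec along these lines (a minimal sketch; the pod name and image tag are examples, and this only grants access to all GPUs once the opt-in behavior is in place):

# Sketch: unprivileged pod that explicitly requests all GPUs via the env var
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: visible-devices-example   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image tag
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
EOF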

@vitaly-dt
Author

vitaly-dt commented May 29, 2024

Thank you @arnaldo2792 for your quick response.

  1. Is there an estimation of when this API change will be released?
  2. From your answer above, if I run highly privileged pods, will they have access to all the GPUs?

I can confirm that, overall, none of the services required by the NVIDIA operator are running on Bottlerocket in Kubernetes, except for node-feature-discovery-worker.

Here are the logs, and also the nvidia-smi output.

nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   28C    P8              11W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvidia-device-plugin:

root@nvidia-device-plugin-daemonset-8qbx4:/# nvidia-device-plugin 
I0529 06:20:09.884988      66 main.go:178] Starting FS watcher.
I0529 06:20:09.885055      66 main.go:185] Starting OS watcher.
I0529 06:20:09.885276      66 main.go:200] Starting Plugins.
I0529 06:20:09.885304      66 main.go:257] Loading configuration.
I0529 06:20:09.885738      66 main.go:265] Updating config with default resource matching patterns.
I0529 06:20:09.885912      66 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0529 06:20:09.885938      66 main.go:279] Retrieving plugins.
W0529 06:20:09.885991      66 factory.go:31] No valid resources detected, creating a null CDI handler
I0529 06:20:09.886040      66 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0529 06:20:09.886057      66 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0529 06:20:09.886064      66 factory.go:112] Incompatible platform detected
E0529 06:20:09.886068      66 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0529 06:20:09.886074      66 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0529 06:20:09.886083      66 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0529 06:20:09.886090      66 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0529 06:20:09.886095      66 main.go:308] No devices found. Waiting indefinitely.

@arnaldo2792
Contributor

arnaldo2792 commented May 29, 2024

Is there an estimation of when this API change will be released?

There is a PR with the implementation, but we are still planning when this feature will land in a release.

From your answer above, if I run highly privileged pods I should have access to all the GPUs?

With at least CAP_SYS_ADMIN, correct; you don't need full privileged: true (that grants far more privilege than needed). Once the API lands, you can go back to the previous behavior if you still want it (that is, any container with NVIDIA_VISIBLE_DEVICES=all has access to all the GPUs).
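
For anyone following along, granting just that capability to the dcgm-exporter container looks roughly like this (a sketch; the image tag is an example, and the official Helm chart may expose an equivalent setting):

# Sketch: dcgm-exporter pod with only CAP_SYS_ADMIN added, no privileged: true
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dcgm-exporter-example   # hypothetical name
spec:
  containers:
  - name: dcgm-exporter
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04   # example tag, use a current release
    ports:
    - containerPort: 9400   # default metrics port
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]   # enough for profiling metrics without full privileged mode
EOF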

FWIW, in Bottlerocket, we run the NVIDIA device plugin as a system service, so you don't have to deploy it in your cluster. Running a second copy of it could cause unexpected behavior at runtime.

@vitaly-dt
Author

Thank you - indeed it works with CAP_SYS_ADMIN.
I will follow the PR regarding the API changes.

@ytsssun
Contributor

ytsssun commented May 30, 2024

FWIW, I was able to reproduce this.

Feel free to track the PR (#3994). Using the changes from the PR, I was able to build an NVIDIA variant and launch a g5 instance. After enabling visible-devices-envvar-when-unprivileged, I was able to verify that DCGM gets initialized.

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
2024/05/30 06:29:53 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
time="2024-05-30T06:29:54Z" level=info msg="Starting dcgm-exporter"
time="2024-05-30T06:29:54Z" level=info msg="DCGM successfully initialized!"
time="2024-05-30T06:29:54Z" level=info msg="Collecting DCP Metrics"

nvidia-smi output

# nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-49f7dd52-5ee9-a5f0-e89c-068994aa6db4)

Userdata

[settings.kubernetes.nvidia.container-runtime]
  "visible-devices-as-volume-mounts" = false
  "visible-devices-envvar-when-unprivileged" = true

@arnaldo2792
Contributor

Thank you - indeed it works with CAP_SYS_ADMIN. I will follow the PR regarding the API changes.

Neat!

After enabling visible-devices-envvar-when-unprivileged, I was able to verify that DCGM gets initialized.

@vitaly-dt, please keep in mind that doing ⬆️ will grant any container with NVIDIA_VISIBLE_DEVICES=all access to all the GPUs, disregarding whatever resource constraints were set in the pod. So if in your pod you set:

"nvidia.com/gpu": 1

In a container created from an image that sets ENV NVIDIA_VISIBLE_DEVICES=all, that container will have access to all the GPUs, not just the one it was constrained to. This is particularly troublesome if your pods use CUDA images, since they are built with NVIDIA_VISIBLE_DEVICES=all:

docker inspect docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04 | jq '.[0]."Config"."Env"' | rg NVIDIA_VISIBLE_DEVICES
  "NVIDIA_VISIBLE_DEVICES=all",
