DCGM will not run on GPU nodes with Bottlerocket OS #3992

Closed
vitaly-dt opened this issue May 28, 2024 · 7 comments
Labels
status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@vitaly-dt

Image I'm using:
Bottlerocket OS 1.20.0 (aws-k8s-1.29-nvidia)

What I expected to happen:
According to the documentation, all NVIDIA tools should work on Bottlerocket aws-k8s-*-nvidia variants.
Link

What actually happened:
The DCGM exporter, like other NVIDIA tools, is unable to run, most likely because the NVIDIA container runtime is missing.

How to reproduce the problem:

  • Add a g6 GPU node based on the Bottlerocket Kubernetes NVIDIA AMI.
  • Deploy the DCGM exporter DaemonSet to monitor the GPU metrics (for example via Helm, as sketched below).
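
For reference, one common way to deploy the exporter is via its Helm chart (a sketch based on the NVIDIA/dcgm-exporter README; the chart repo and release values may need adjusting for your cluster):

# Add the DCGM exporter chart repo and install it with a generated release name
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter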

Error message:

time="2024-05-28T06:19:55Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-28T06:19:55Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc000103520)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc00021e810)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc0004c7e50}, 0xc000131b70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0003c15c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc00019b080?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc00019b080, 0xc0003c15c0, {0xc0000400a0, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc0001cd000, {0x1cf3300?, 0x2a0c420}, {0xc0000400a0, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc000131f20?, {0xc0000400a0?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"
Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
2024/05/28 06:24:56 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
time="2024-05-28T06:24:56Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-28T06:24:56Z" level=error msg="Encountered a failure." stacktrace="goroutine 1 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1.1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:276 +0x3d\npanic({0x18058a0?, 0x2945390?})\n\t/usr/local/go/src/runtime/panic.go:914 +0x21f\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.initDCGM(0xc0003c76c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:516 +0x9b\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.startDCGMExporter(0x47c312?, 0xc000127fc0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:296 +0xb2\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action.func1()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:280 +0x5b\ngithub.com/NVIDIA/dcgm-exporter/pkg/stdout.Capture({0x1cf3418?, 0xc0000fd680}, 0xc0004e1b70)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/stdout/capture.go:77 +0x1f5\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.action(0xc0003c19c0)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:271 +0x67\ngithub.com/NVIDIA/dcgm-exporter/pkg/cmd.NewApp.func1(0xc0000342c0?)\n\t/go/src/github.com/NVIDIA/dcgm-exporter/pkg/cmd/app.go:256 +0x13\ngithub.com/urfave/cli/v2.(*Command).Run(0xc0000342c0, 0xc0003c19c0, {0xc000126050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:279 +0x9dd\ngithub.com/urfave/cli/v2.(*App).RunContext(0xc0004c0400, {0x1cf3300?, 0x2a0c420}, {0xc000126050, 0x1, 0x1})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:337 +0x5db\ngithub.com/urfave/cli/v2.(*App).Run(0xc0004e1f20?, {0xc000126050?, 0x1?, 0x163cbb0?})\n\t/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:311 +0x2f\nmain.main()\n\t/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:35 +0x5f\n"

Similar issues:
#3967
#2347

@vitaly-dt added the status/needs-triage and type/bug labels on May 28, 2024
@arnaldo2792
Contributor

Hey @vitaly-dt, thanks for reporting this. We do have the NVIDIA containerd runtime configured:

[plugins."io.containerd.grpc.v1.cri".containerd]

@ytsssun, let's try to reproduce this case by following https://github.com/NVIDIA/dcgm-exporter

@arnaldo2792
Contributor

@ytsssun, I think I know why DCGM might not work as expected: it is related to the NVIDIA_VISIBLE_DEVICES problem observed by others (see #3937 (comment)).

@vitaly-dt, we are working on an API that allows opting in to the previous behavior, where unprivileged pods (i.e., normal workloads) have access to all the GPUs in an instance if the pod is configured with NVIDIA_VISIBLE_DEVICES=all. Please see this doc provided by NVIDIA to read about when NVIDIA_VISIBLE_DEVICES=all is an acceptable configuration.
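
For context, "configured with NVIDIA_VISIBLE_DEVICES=all" means a pod spec along these lines (a minimal sketch; the pod name and image tag are examples, and this only grants access to all GPUs once the opt-in behavior is in place):

# Sketch: unprivileged pod that explicitly requests all GPUs via the env var
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: visible-devices-example   # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image tag
    command: ["nvidia-smi"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
EOF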

@vitaly-dt
Author

vitaly-dt commented May 29, 2024

Thank you @arnaldo2792 for your quick response.

  1. Is there an estimation of when this API change will be released?
  2. From your answer above, if I run highly privileged pods, will they have access to all the GPUs?

I can confirm that, overall, none of the services required by the NVIDIA operator are running on Bottlerocket in Kubernetes, except for node-feature-discovery-worker.

Here are the logs, and also the nvidia-smi output.

nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      Off | 00000000:35:00.0 Off |                    0 |
| N/A   28C    P8              11W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvidia-device-plugin:

root@nvidia-device-plugin-daemonset-8qbx4:/# nvidia-device-plugin 
I0529 06:20:09.884988      66 main.go:178] Starting FS watcher.
I0529 06:20:09.885055      66 main.go:185] Starting OS watcher.
I0529 06:20:09.885276      66 main.go:200] Starting Plugins.
I0529 06:20:09.885304      66 main.go:257] Loading configuration.
I0529 06:20:09.885738      66 main.go:265] Updating config with default resource matching patterns.
I0529 06:20:09.885912      66 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0529 06:20:09.885938      66 main.go:279] Retrieving plugins.
W0529 06:20:09.885991      66 factory.go:31] No valid resources detected, creating a null CDI handler
I0529 06:20:09.886040      66 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0529 06:20:09.886057      66 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0529 06:20:09.886064      66 factory.go:112] Incompatible platform detected
E0529 06:20:09.886068      66 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0529 06:20:09.886074      66 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0529 06:20:09.886083      66 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0529 06:20:09.886090      66 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0529 06:20:09.886095      66 main.go:308] No devices found. Waiting indefinitely.

@arnaldo2792
Contributor

arnaldo2792 commented May 29, 2024

Is there an estimation of when this API change will be released?

There is a PR with the implementation, but we are still planning when this feature will land in a release.

From your answer above, if I run highly privileged pods I should have access to all the GPUs?

With at least CAP_SYS_ADMIN, correct; you don't need full privileged: true (that grants far more privilege than needed). Once the API lands, you can go back to the previous behavior if you still want it (that is, any container with NVIDIA_VISIBLE_DEVICES=all has access to all the GPUs).
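
For anyone following along, granting just that capability to the dcgm-exporter container looks roughly like this (a sketch; the image tag is an example, and the official Helm chart may expose an equivalent setting):

# Sketch: dcgm-exporter pod with only CAP_SYS_ADMIN added, no privileged: true
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: dcgm-exporter-example   # hypothetical name
spec:
  containers:
  - name: dcgm-exporter
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04   # example tag, use a current release
    ports:
    - containerPort: 9400   # default metrics port
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]   # enough for profiling metrics without full privileged mode
EOF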

FWIW, in Bottlerocket, we run the NVIDIA device plugin as a system service, so you don't have to deploy it in your cluster. Running a second copy of it could cause unexpected behavior at runtime.

@vitaly-dt
Author

Thank you - indeed it works with CAP_SYS_ADMIN.
I will follow the PR regarding the API changes.

@ytsssun
Contributor

ytsssun commented May 30, 2024

FWIW, I was able to reproduce this.

Feel free to track the PR (#3994). Using the changes from the PR, I was able to build an NVIDIA variant and launch a g5 instance. After enabling visible-devices-envvar-when-unprivileged, I was able to verify that DCGM gets initialized.

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
2024/05/30 06:29:53 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
time="2024-05-30T06:29:54Z" level=info msg="Starting dcgm-exporter"
time="2024-05-30T06:29:54Z" level=info msg="DCGM successfully initialized!"
time="2024-05-30T06:29:54Z" level=info msg="Collecting DCP Metrics"

nvidia-smi output

# nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-49f7dd52-5ee9-a5f0-e89c-068994aa6db4)

Userdata

[settings.kubernetes.nvidia.container-runtime]
  "visible-devices-as-volume-mounts" = false
  "visible-devices-envvar-when-unprivileged" = true

@arnaldo2792
Contributor

Thank you - indeed it works with CAP_SYS_ADMIN. I will follow the PR regarding the API changes.

Neat!

After enabling visible-devices-envvar-when-unprivileged, I was able to verify that DCGM gets initialized.

@vitaly-dt, please keep in mind that doing ⬆️ will grant any container with NVIDIA_VISIBLE_DEVICES=all access to all the GPUs, disregarding whatever resource constraints were set in the pod. So if in your pod you set:

"nvidia.com/gpu": 1

In a container created from an image that sets ENV NVIDIA_VISIBLE_DEVICES=all, that container will have access to all the GPUs, not just the one it was constrained to. This is particularly troublesome if your pods use CUDA images, since they are built with NVIDIA_VISIBLE_DEVICES=all:

docker inspect docker.io/nvidia/cuda:11.7.1-devel-ubuntu22.04 | jq '.[0]."Config"."Env"' | rg NVIDIA_VISIBLE_DEVICES
  "NVIDIA_VISIBLE_DEVICES=all",
