AvailableGPUDevices mismatched with /proc/driver/nvidia/gpus/ #984

Open
NguyenIconAI opened this issue Feb 26, 2025 · 1 comment

Bug description

  • nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits returns a domain and a bus_id. According to the nvidia-smi documentation, bus_id is already domain:bus:device.function in hex, so it seems we don't need to query pci.domain separately.
  • The logic here removes the domain from bus_id. I see two problems with it:
  1. The assumption that the folder name under /proc/driver/nvidia/gpus is just "bus:device.function" does not match my machine or an Azure VM (with an A100 GPU). Both have /proc/driver/nvidia/gpus/<domain:bus:device.function>. Do you have any reference for this naming convention? It makes the following case fail:
$ nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits
0x0001, 0001:00:00.0, 0, GPU-<...>
$ ls /proc/driver/nvidia/gpus/
0001:00:00.0 
  2. The domain reported by nvidia-smi is 4 hex digits, while the domain portion of bus_id can be 8 digits, so the trimming logic won't work in this case:
$ nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits
0x0001, 00000001:00:00.0, 0, GPU-<...>

The reason this logic has been working so far is that the domain ID is usually 0x0000, and the trimming coincidentally reduces bus_id to the 4-digit-domain form that the proc directory uses. In my case, since I use AKS with A100 GPUs on PCI domain 1, the cluster cannot find an available GPU:

11:05PM INF acquired port: 43743 container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR unable to run container error="not enough GPUs available, requested: 1, allocable: 1 out of 0" container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR failed to clean up container network: redis: nil container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM INF finalized container shutdown container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
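
To make the failure mode concrete, here is a simplified sketch (not the exact beta9 code) of why a prefix-style trim of the domain only appears to work when the domain is 0x0000:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Domain 0x0000: trimming the 4-digit domain from the front of bus_id
    // happens to leave "0000:00:00.0", which matches the proc entry name.
    fmt.Println(strings.TrimPrefix("00000000:00:00.0", "0000"))

    // Domain 0x0001: "0001" is not a prefix of "00000001:00:00.0", so nothing
    // is trimmed and the lookup under /proc/driver/nvidia/gpus/ fails.
    fmt.Println(strings.TrimPrefix("00000001:00:00.0", "0001"))
}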

How to reproduce

  1. Find a machine whose GPU is on PCI domain 1 (a quick way to check is sketched after this list).
  2. Deploy beta9 v0.1.305 with GPU support. I believe this bug is in the latest version too (v0.1.318):
...
      nvidia:                      
        mode: local
        gpuType: "A100-80"
        runtime: nvidia
        jobSpec:
          nodeSelector:
            namespace: beta9gpu
        poolSizing:
          defaultWorkerCpu: 8000m
          defaultWorkerGpuType: "A100-80"
          defaultWorkerMemory: 32Gi
          minFreeCpu: 8000m
          minFreeGpu: 1
          minFreeMemory: 32Gi
          sharedMemoryLimitPct: 100%
  3. Run a workload that requests a GPU.
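
For step 1, a minimal Go sketch (just for checking, assuming /proc/driver/nvidia/gpus exists on the host) that lists the proc entries so you can see which PCI domain your GPUs are on:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Directory names under /proc/driver/nvidia/gpus encode
    // domain:bus:device.function; an entry that does not start with "0000:"
    // means the GPU sits on a non-zero PCI domain and should hit this bug.
    entries, err := os.ReadDir("/proc/driver/nvidia/gpus")
    if err != nil {
        panic(err)
    }
    for _, e := range entries {
        fmt.Println(e.Name())
    }
}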

Environment

  • [x] Beam Self-Hosted

Additional context

I think the fix is simply to trim the extra zeroes from the domain portion of bus_id, since the nvidia-smi output uses 8 digits while Linux (or at least Ubuntu 22.04) uses 4 digits for the PCI domain.
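
For illustration, a minimal sketch (not the actual beta9 code, and assuming the proc entries always use a 4-digit lowercase domain as on Ubuntu 22.04) of how the nvidia-smi bus_id could be normalized to match the directory names under /proc/driver/nvidia/gpus:

package main

import (
    "fmt"
    "strings"
)

// normalizeBusID converts an nvidia-smi bus_id such as "00000001:00:00.0"
// (8-digit domain) into "0001:00:00.0" (4-digit domain), the form the kernel
// uses for directory names under /proc/driver/nvidia/gpus.
func normalizeBusID(busID string) string {
    parts := strings.SplitN(busID, ":", 2)
    if len(parts) != 2 {
        return busID
    }
    domain := parts[0]
    if len(domain) > 4 {
        domain = domain[len(domain)-4:] // keep the last 4 hex digits
    }
    // Lowercase to match the proc entry names, which appear to use lowercase hex.
    return strings.ToLower(domain + ":" + parts[1])
}

func main() {
    fmt.Println(normalizeBusID("00000001:00:00.0")) // 0001:00:00.0
    fmt.Println(normalizeBusID("00000000:41:00.0")) // 0000:41:00.0
}

With this kind of normalization there is also no need to query pci.domain at all; the normalized bus_id can be matched directly against the entries in /proc/driver/nvidia/gpus.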

Thank you for sharing this -- we'll fix it ASAP.

dleviminzi self-assigned this Feb 26, 2025
dleviminzi (Collaborator) commented:

Hey @NguyenIconAI, thank you for bringing this to our attention. I will merge a fix for this today, and we will release it later today or tomorrow.

dleviminzi added a commit that referenced this issue Feb 26, 2025
There was some incorrect logic around the parsing of output from
`nvidia-smi`. Refer to issue #984 for more context.