AvailableGPUDevices mismatched with /proc/driver/nvidia/gpus/ #984

Open
NguyenIconAI opened this issue Feb 26, 2025 · 1 comment

Bug description

  • nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits returns a domain and a bus_id. According to the nvidia-smi documentation, bus_id is already domain:bus:device.function in hex, so it seems we don't need to query pci.domain separately.
  • The logic here removes the domain from bus_id. I see two problems with it:
  1. The assumption that the folder name under /proc/driver/nvidia/gpus is just "bus:device.function" does not match my machine or an Azure VM (with an A100 GPU). Both have /proc/driver/nvidia/gpus/<domain:bus:device.function>. Do you have any reference for this naming convention? It makes the following case fail:
$ nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits
0x0001, 0001:00:00.0, 0, GPU-<...>
$ ls /proc/driver/nvidia/gpus/
0001:00:00.0 
  2. The domain reported by nvidia-smi is 4 hex digits, while the domain portion of bus_id can be 8 digits, so the trimming logic won't work in this case:
$ nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits
0x0001, 00000001:00:00.0, 0, GPU-<...>

The reason this logic has been working so far is that the domain ID is usually 0x0000, and the trimming coincidentally reduces bus_id to the 4-digit-domain form that the proc directory uses. In my case, since I use AKS with A100 GPUs on PCI domain 1, the cluster cannot find an available GPU:

11:05PM INF acquired port: 43743 container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR unable to run container error="not enough GPUs available, requested: 1, allocable: 1 out of 0" container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR failed to clean up container network: redis: nil container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM INF finalized container shutdown container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
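
To make the failure mode concrete, here is a simplified sketch (not the exact beta9 code) of why a prefix-style trim of the domain only appears to work when the domain is 0x0000:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Domain 0x0000: trimming the 4-digit domain from the front of bus_id
    // happens to leave "0000:00:00.0", which matches the proc entry name.
    fmt.Println(strings.TrimPrefix("00000000:00:00.0", "0000"))

    // Domain 0x0001: "0001" is not a prefix of "00000001:00:00.0", so nothing
    // is trimmed and the lookup under /proc/driver/nvidia/gpus/ fails.
    fmt.Println(strings.TrimPrefix("00000001:00:00.0", "0001"))
}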

How to reproduce

  1. Find a machine whose GPU is on PCI domain 1 (a quick way to check is sketched after this list).
  2. Deploy beta9 v0.1.305 with GPU support. I believe this bug is in the latest version too (v0.1.318):
...
      nvidia:                      
        mode: local
        gpuType: "A100-80"
        runtime: nvidia
        jobSpec:
          nodeSelector:
            namespace: beta9gpu
        poolSizing:
          defaultWorkerCpu: 8000m
          defaultWorkerGpuType: "A100-80"
          defaultWorkerMemory: 32Gi
          minFreeCpu: 8000m
          minFreeGpu: 1
          minFreeMemory: 32Gi
          sharedMemoryLimitPct: 100%
  3. Run a workload that requests a GPU.
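
For step 1, a minimal Go sketch (just for checking, assuming /proc/driver/nvidia/gpus exists on the host) that lists the proc entries so you can see which PCI domain your GPUs are on:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Directory names under /proc/driver/nvidia/gpus encode
    // domain:bus:device.function; an entry that does not start with "0000:"
    // means the GPU sits on a non-zero PCI domain and should hit this bug.
    entries, err := os.ReadDir("/proc/driver/nvidia/gpus")
    if err != nil {
        panic(err)
    }
    for _, e := range entries {
        fmt.Println(e.Name())
    }
}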

Environment

  • [x] Beam Self-Hosted

Additional context

I think the fix is simply to trim the extra zeroes from the domain portion of bus_id, since the nvidia-smi output uses 8 digits while Linux (or at least Ubuntu 22.04) uses 4 digits for the PCI domain.
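
For illustration, a minimal sketch (not the actual beta9 code, and assuming the proc entries always use a 4-digit lowercase domain as on Ubuntu 22.04) of how the nvidia-smi bus_id could be normalized to match the directory names under /proc/driver/nvidia/gpus:

package main

import (
    "fmt"
    "strings"
)

// normalizeBusID converts an nvidia-smi bus_id such as "00000001:00:00.0"
// (8-digit domain) into "0001:00:00.0" (4-digit domain), the form the kernel
// uses for directory names under /proc/driver/nvidia/gpus.
func normalizeBusID(busID string) string {
    parts := strings.SplitN(busID, ":", 2)
    if len(parts) != 2 {
        return busID
    }
    domain := parts[0]
    if len(domain) > 4 {
        domain = domain[len(domain)-4:] // keep the last 4 hex digits
    }
    // Lowercase to match the proc entry names, which appear to use lowercase hex.
    return strings.ToLower(domain + ":" + parts[1])
}

func main() {
    fmt.Println(normalizeBusID("00000001:00:00.0")) // 0001:00:00.0
    fmt.Println(normalizeBusID("00000000:41:00.0")) // 0000:41:00.0
}

With this kind of normalization there is also no need to query pci.domain at all; the normalized bus_id can be matched directly against the entries in /proc/driver/nvidia/gpus.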

Thank you for sharing this -- we'll fix it ASAP.

dleviminzi self-assigned this Feb 26, 2025
dleviminzi (Collaborator) commented:

Hey @NguyenIconAI, thank you for bringing this to our attention. I will merge a fix for this today, and we will release it later today or tomorrow.

dleviminzi added a commit that referenced this issue Feb 26, 2025
There was some incorrect logic around the parsing of output from
`nvidia-smi`. Refer to issue #984 for more context.