Bug description
nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits returns domain_id and bus_id. According to the nvidia-smi documentation, bus_id is composed of domain:bus:device.function in hex, so it seems we don't need to query pci.domain at all.
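For reference, the query looks like this (the output line below is illustrative, not taken from the affected machine):

```
$ nvidia-smi --query-gpu=pci.domain,pci.bus_id,index,uuid --format=csv,noheader,nounits
0x0000, 00000000:0B:00.0, 0, GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```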
The logic here removes the domain from bus_id. I see two problems with it:
1. The justification that the folder in /proc/driver/nvidia/gpus is just "bus:device.function" does not match my machine or an Azure VM (with an A100 GPU). They both have /proc/driver/nvidia/gpus/<domain:bus:device.function>. Do you have any references for this naming convention? It makes the following case not work (see the illustration after the log output below).
2. The reason this logic has worked so far is that, most of the time, the domain ID is 0x0000, and the logic unintentionally removes the trailing zeroes from bus_id. In my case, since I use AKS with A100 GPUs in domain 1, the cluster cannot find an available GPU:
11:05PM INF acquired port: 43743 container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR unable to run container error="not enough GPUs available, requested: 1, allocable: 1 out of 0" container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM ERR failed to clean up container network: redis: nil container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
11:05PM INF finalized container shutdown container_id=endpoint-0a7126f4-bb1b-435e-90c2-8b8e46e8bb62-75bd7829
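To make the mismatch concrete, here is a hypothetical domain-1 GPU (the IDs below are illustrative, not copied from my cluster). The proc entry keeps a 4-digit domain, while nvidia-smi reports an 8-digit, zero-padded one:

```
$ ls /proc/driver/nvidia/gpus/
0001:0b:00.0
$ nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader
00000001:0B:00.0
```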
How to reproduce
Find a machine with domain ID 1
Deploy beta9 v0.1.305 with a GPU. I believe this bug is present in the latest version too (v0.1.318).
Additional context
I think the fix is simply to trim the domain down to 4 digits: the nvidia-smi output pads it to 8 digits, while Linux (or at least Ubuntu 22.04) uses 4 digits for PCI devices.
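A minimal sketch of what I have in mind (a hypothetical helper, not the actual beta9 code; it assumes the proc entries use a 4-digit domain and lowercase hex, as on my machines):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeBusID converts an nvidia-smi PCI bus ID such as
// "00000001:0B:00.0" (8-digit domain, uppercase hex) into the form used
// under /proc/driver/nvidia/gpus, e.g. "0001:0b:00.0" (4-digit domain,
// lowercase hex).
func normalizeBusID(busID string) (string, error) {
	parts := strings.SplitN(busID, ":", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("unexpected bus ID format: %q", busID)
	}
	domain := parts[0]
	if len(domain) > 4 {
		// nvidia-smi pads the domain to 8 hex digits; keep the last 4,
		// which is what the kernel uses in the proc/sysfs path.
		domain = domain[len(domain)-4:]
	}
	return strings.ToLower(domain + ":" + parts[1]), nil
}

func main() {
	for _, id := range []string{"00000000:00:1E.0", "00000001:0B:00.0"} {
		normalized, err := normalizeBusID(id)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", id, normalized)
	}
}
```

With this, a domain-1 GPU maps to 0001:0b:00.0 and matches its /proc/driver/nvidia/gpus entry, while domain-0 GPUs keep working as before.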
Thank you for sharing this -- we'll fix it ASAP.