Description
I am currently using the Triton vLLM backend in my Kubernetes cluster. There are 2 GPUs that Triton is able to see, however it only ever chooses GPU 0 to load the model weights.
I have set my config.pbtxt instance groups to be:
Model A
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]
Model B
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [1]
  }
]
The expectation was for Model A to be loaded onto the GPU with index 0 and Model B onto the GPU with index 1. With Triton's verbose logging turned on, I could see that Triton detected both GPUs and reported identifying and loading onto them, yet nvidia-smi showed that my models were being loaded only onto GPU 0.
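For context, the check I did with nvidia-smi amounts to roughly the following (an illustrative sketch using the pynvml bindings, assuming the nvidia-ml-py package is available in the pod; I actually just ran nvidia-smi directly):

# Sketch: list which processes occupy each GPU, similar to nvidia-smi output.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        # I expected a vLLM worker for Model B on GPU 1, but it was empty.
        print(f"GPU {i}: {[(p.pid, p.usedGpuMemory) for p in procs]}")
finally:
    pynvml.nvmlShutdown()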
My hypothesis for why it is not working is that the bridge between Triton's GPU assignment and vLLM's GPU selection has a bug in its implementation.
So I took a look at the vLLM backend's model.py file (particularly the validate_device_config method).
The method identifies the GPU that should be used, but the only call that actually sets the GPU is torch.cuda.set_device().
From my research on pinning vLLM to a specific GPU when it runs as a standalone backend, however, setting CUDA_VISIBLE_DEVICES is described as the only way to control GPU selection, and I do not see that implemented here.
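To make the idea concrete, below is a rough sketch of the kind of change I had in mind inside the backend's initialize(): restrict CUDA_VISIBLE_DEVICES to the device Triton assigned before the vLLM engine is created. This is only an illustration of the concept, not the backend's actual code; the engine-argument handling is simplified and assumes a model.json containing AsyncEngineArgs fields in the model's version directory.

# Sketch only: pin this Triton model instance's vLLM engine to the GPU
# that Triton assigned via the instance_group "gpus" setting.
import json
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine


class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the instance kind ("GPU"/"CPU") and the device index
        # chosen from config.pbtxt in the args dict.
        if args["model_instance_kind"] == "GPU":
            device_id = args["model_instance_device_id"]
            # Must happen before the engine (and its workers) are created,
            # otherwise vLLM still enumerates all GPUs and defaults to GPU 0.
            os.environ["CUDA_VISIBLE_DEVICES"] = device_id

        # Simplified engine construction; path handling is an assumption,
        # the real backend builds its engine args from the shipped model.json.
        model_json = os.path.join(
            args["model_repository"], args["model_version"], "model.json"
        )
        with open(model_json) as f:
            engine_config = json.load(f)
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(**engine_config)
        )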
I have done single-GPU serving before without selecting a specific GPU, and that setup works fine, with my models functioning well.
I am well aware that I could control GPU use with my own Kubernetes resource assignments (nvidia.com/gpu), or use tensor_parallel_size to split each model across both GPUs.
My current goal is to get at least some confirmation as to why Triton's supposed GPU selection in config.pbtxt is not working for vLLM.
Triton Information
What version of Triton are you using?
The version I'm using is: "nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3"
Are you using the Triton container or did you build it yourself?
I am using a pre-built container from the official NGC registry.
To Reproduce
Steps to reproduce the behavior.
My setup uses OpenShift on top of Kubernetes, so it may be challenging to recreate exactly. However, I am simply loading two models, each assigned to a separate GPU, on the same Triton vLLM server.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
Models:
The model configs are simple, bare minimum, and non-ensemble.
Expected behavior
A clear and concise description of what you expected to happen.
I expect the Triton vLLM backend to load each of my models onto its own assigned GPU.
Hi @rmccorm4, thanks for the prompt reply. I am currently unable to do so due to my tight schedule. However, I came across an earlier post that you replied to, around February, regarding a similar issue.
The original poster stated in a reply to you: "Thank for your reply.
Using KIND_GPU and set CUDA_VISIBLE_DEVICES before initializing vllm engine make it works as expected.
I will try starting 4 instances with KIND_MODEL and parsing the model_instance_name."
Based on that older poster's findings and the documentation on making vLLM choose a GPU via CUDA_VISIBLE_DEVICES, I agree that it will likely work.
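For my own notes, this is roughly how I picture the KIND_MODEL variant of that workaround: derive a GPU from the instance name before the engine is built. This is an untested sketch; the instance-name format and the pick_gpu helper are my assumptions, not anything from the backend.

# Untested sketch of the KIND_MODEL workaround quoted above.
# Assumption: Triton instance names end with an integer index (e.g. "model_b_0"),
# which can be mapped onto the GPUs visible to the pod.
import os
import re


def pick_gpu(model_instance_name: str, num_gpus: int) -> str:
    """Map a Triton instance name to a GPU index (hypothetical helper)."""
    match = re.search(r"(\d+)$", model_instance_name)
    index = int(match.group(1)) if match else 0
    return str(index % num_gpus)


class TritonPythonModel:
    def initialize(self, args):
        # With KIND_MODEL Triton does not hand us a device id, so derive one
        # from the instance name and pin vLLM to it before engine creation.
        os.environ["CUDA_VISIBLE_DEVICES"] = pick_gpu(
            args["model_instance_name"], num_gpus=2
        )
        # ... build the vLLM engine here, as in the sketch further above ...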
I do not need an immediate fix right now, as I can manage my GPUs with the methods currently available to me, but I am curious about future plans for this bug. Despite that post being from quite a few months ago, there is still no corresponding change in the official NGC image releases.
I just hope you can give me some idea of when this will be officially fixed so that I can inform my immediate superiors. Thanks!