
Triton x vLLM backend GPU selection issue #7786

Open
Tedyang2003 opened this issue Nov 13, 2024 · 2 comments
Assignees
rmccorm4
Labels
module: backends (Issues related to the backends)

Comments

@Tedyang2003

Description
I am currently using the Triton vLLM backend on my Kubernetes cluster. Triton can see 2 GPUs, but it seems to choose only GPU 0 to load the model weights.

I have set the instance_group in each model's config.pbtxt as follows:

Model A

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

Model B

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [1]
  }
]

The expectation was for Model A to be loaded onto the GPU with index 0 and Model B onto the GPU with index 1. With Triton's verbose logging turned on, I could see that Triton detected and identified both GPUs while loading the models; however, using "nvidia-smi" I could see that my models were being loaded only onto GPU 0.

Hence my hypothesis for why it isn't working: the bridge between Triton's GPU selection and vLLM's GPU selection may have a bug in its implementation.

So I took a look at the Triton vLLM backend's model.py file (in particular, the validate_device_config method):

[Screenshot of the validate_device_config method in model.py]

The method identifies the GPU to be used, but the only line that actually sets the GPU is a call to torch.cuda.set_device().

From my research on selecting GPUs for vLLM on its own, however, CUDA_VISIBLE_DEVICES is described as the only way to control GPU selection, and I do not see that being set here.
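
For illustration, a minimal sketch of the difference being described (the names here are illustrative, not the actual backend code): torch.cuda.set_device() only changes PyTorch's default device inside the process, while CUDA_VISIBLE_DEVICES hides the other GPUs from the process entirely, which appears to be the mechanism the vLLM guidance relies on.

import os

# Hypothetical device id for this model instance; in the real backend it is
# derived from the instance_group settings in config.pbtxt.
triton_device_id = 1

def select_gpu_with_torch(device_id: int) -> None:
    # Only changes PyTorch's *default* CUDA device for this process.
    # Every GPU remains visible, so vLLM may still allocate on GPU 0.
    import torch
    torch.cuda.set_device(device_id)

def select_gpu_with_env(device_id: int) -> None:
    # Hides all other GPUs from this process. Must run before CUDA is
    # initialized (i.e. before torch/vLLM touch the GPU) to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)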


I have done single-GPU serving before, where I did not select a specific GPU, and that setup works fine, with my models functioning well.

**I am well aware that I could instead control GPU use through my own Kubernetes resource assignments (nvidia.com/gpu), or use tensor_parallel_size to split each model across GPUs.**

**My current goal is to get at least some confirmation as to why Triton's GPU selection in config.pbtxt is not working for vLLM.**

Triton Information
What version of Triton are you using?
The version I'm using is: "nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3"

Are you using the Triton container or did you build it yourself?
I am using a pre-built container from the official NGC registry.

To Reproduce
Steps to reproduce the behavior.
My setup uses OpenShift on top of Kubernetes, so it may be challenging to recreate exactly. However, I am simply loading 2 models, each assigned to a separate GPU, on the same Triton vLLM server.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Models:

The model configs are simple, bare-minimum, and non-ensemble, containing only:

  • name
  • backend
  • instance_group

Expected behavior
A clear and concise description of what you expected to happen.

I expect the Triton vLLM backend to load each model onto its own GPU (Model A on GPU 0, Model B on GPU 1).

@rmccorm4 added the "module: backends" label Nov 14, 2024
@rmccorm4
Collaborator

Hi @Tedyang2003, thanks for raising this issue!

Do you mind trying to replace that line you've identified

torch.cuda.set_device(triton_device_id)

with something like this:

os.environ["CUDA_VISIBLE_DEVICES"] = triton_device_id

and report back whether it behaves as you'd expect or not?
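
For reference, a rough sketch of how that swap might sit in the backend's initialization path (the surrounding code and signature are illustrative, not the exact contents of model.py; note that os.environ values must be strings, and that the variable has to be set before the vLLM engine initializes CUDA):

import os

def validate_device_config(triton_device_id):
    # Previous behavior, for comparison:
    #   torch.cuda.set_device(triton_device_id)
    # Suggested experiment: restrict GPU visibility instead. os.environ only
    # accepts strings, and this must happen before the vLLM engine is built.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(triton_device_id)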

@rmccorm4 self-assigned this Nov 14, 2024
@Tedyang2003
Author

Tedyang2003 commented Nov 14, 2024

Hi @rmccorm4, thanks for the prompt reply. I am currently unable to try that due to a tight schedule. However, I came across an earlier post that you replied to around February regarding a similar issue:

#6855

The original poster stated in a reply to you: "Thank for your reply. Using KIND_GPU and set CUDA_VISIBLE_DEVICES before initializing vllm engine make it works as expected. I will try starting 4 instances with KIND_MODEL and parsing the model_instance_name."

Based on that older poster's confirmation and the documentation on how to make vLLM choose a GPU using CUDA_VISIBLE_DEVICES, I agree that it will likely work.
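
For context, a hedged sketch of the KIND_MODEL workaround quoted above, which derives a GPU from the instance name Triton passes to the Python model (the name format and GPU count here are assumptions for illustration, not confirmed behavior):

import os

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the instance name in args; instance names typically
        # end with the instance number (assumption for this sketch).
        instance_name = args["model_instance_name"]
        instance_index = int(instance_name.rsplit("_", 1)[-1])
        num_gpus = 2  # illustrative; match the GPUs available to the pod
        # Pin this instance to one GPU before the vLLM engine is created.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(instance_index % num_gpus)
        # ...the usual vLLM engine setup would follow here.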

Currently I do not need an immediate fix, as I can work with the methods available to me to manage my GPUs; I am just curious about future updates for this bug. Despite that post being from quite a few months ago, there still seems to be no official change in the NGC image releases.

I just hope you can give me an idea of when this might become an official bug fix, so I can tell my immediate superiors. Thanks!
