Updated vllm tutorial to use vllm based container from NGC #65

Merged
merged 7 commits into from
Oct 31, 2023
28 changes: 0 additions & 28 deletions Quick_Deploy/vLLM/Dockerfile

This file was deleted.

153 changes: 114 additions & 39 deletions Quick_Deploy/vLLM/README.md
@@ -31,38 +31,43 @@

The following tutorial demonstrates how to deploy a simple
[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model on
Triton Inference Server using Triton's [Python backend](https://github.com/triton-inference-server/python_backend) and the
[vLLM](https://github.com/vllm-project/vllm) library.
Triton Inference Server using Triton's
[Python-based](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends)
[vLLM](https://github.com/triton-inference-server/vllm_backend/tree/main)
backend.

*NOTE*: The tutorial is intended to be a reference example only and has [known limitations](#limitations).


## Step 1: Build a Triton Container Image with vLLM
## Step 1: Prepare your model repository

We will build a new container image derived from tritonserver:23.08-py3 with vLLM.
To use Triton, we need to build a model repository. For this tutorial we will
use the model repository provided in the [samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
folder of the [vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
repository.

The following set of commands will create a `model_repository/vllm_model/1`
directory and download two files:
[`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json)
and
[`config.pbtxt`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/config.pbtxt),
required to serve the [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model.
```
docker build -t tritonserver_vllm .
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model/ https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/config.pbtxt
```

The above command should create the tritonserver_vllm image with vLLM and all of its dependencies.


## Step 2: Start Triton Inference Server

A sample model repository for deploying `facebook/opt-125m` using vLLM in Triton is
included with this demo as `model_repository` directory.
The model repository should look like this:
```
model_repository/
`-- vllm
|-- 1
| `-- model.py
|-- config.pbtxt
|-- vllm_engine_args.json
└── vllm_model
├── 1
│   └── model.json
└── config.pbtxt
```
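
Before launching the server, the layout above can be sanity-checked with a short script. This is a minimal sketch rather than part of the tutorial: the helper name `missing_repository_files` is hypothetical, and it only looks for the two files that this tutorial downloads.

```python
from pathlib import Path

# Files this tutorial expects, relative to the model repository root.
EXPECTED_FILES = (
    Path("vllm_model") / "1" / "model.json",
    Path("vllm_model") / "config.pbtxt",
)


def missing_repository_files(root="model_repository"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel.as_posix() for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_repository_files()` from the directory that contains `model_repository` returns an empty list when both files are in place.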

The content of `vllm_engine_args.json` is:
The content of `model.json` is:

```json
{
@@ -71,53 +76,116 @@ The content of `vllm_engine_args.json` is:
"gpu_memory_utilization": 0.5
}
```

This file can be modified to provide further settings to the vLLM engine. See vLLM
[AsyncEngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L165)
and
[EngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L11)
for supported key-value pairs.
for supported key-value pairs. Inflight batching and paged attention are handled
by the vLLM engine.

For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified
in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).

*Note*: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
This tutorial updates this behavior by setting `gpu_memory_utilization` to 50%.
You can tweak this behavior using fields like `gpu_memory_utilization` and other settings
in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
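
As an illustration of adjusting these settings, the sketch below merges new key-value pairs into an existing `model.json` and writes the file back. The helper name `update_engine_args` is hypothetical and not part of the vLLM backend. The keys `tensor_parallel_size` and `gpu_memory_utilization` are the ones named above; any values you pass are examples only.

```python
import json
from pathlib import Path


def update_engine_args(model_json_path, **overrides):
    """Merge `overrides` into an existing model.json and write it back.

    Hypothetical helper: existing keys are kept unless overridden.
    """
    path = Path(model_json_path)
    engine_args = json.loads(path.read_text())
    engine_args.update(overrides)
    path.write_text(json.dumps(engine_args, indent=4) + "\n")
    return engine_args
```

For example, `update_engine_args("model_repository/vllm_model/1/model.json", tensor_parallel_size=2)` would request two-way tensor parallelism, assuming two GPUs are available to the container.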

Read through the documentation in [`model.py`](model_repository/vllm/1/model.py) to understand how
to configure this sample for your use-case.
Read through the documentation in [`model.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py)
to understand how to configure this sample for your use-case.

Run the following commands to start the server container:
## Step 2: Launch Triton Inference Server

Once you have the model repository set up, it is time to launch the Triton server.
Starting with the 23.10 release, a dedicated container with vLLM pre-installed
is available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
To launch Triton with this container, use the docker command below.
```
docker run --gpus all -it --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work tritonserver_vllm tritonserver --model-store ./model_repository
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-store ./model_repository
```
Throughout the tutorial, \<xx.yy\> is the version of Triton
that you want to use. Note that Triton's vLLM
container was first published in the 23.10 release, so earlier versions
will not work.

Upon successful start of the server, you should see the following at the end of the output.
After you start Triton you will see output on the console showing
the server starting up and loading the model. When you see output
like the following, Triton is ready to accept inference requests.

```
I0901 23:39:08.729123 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0901 23:39:08.729640 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0901 23:39:08.772522 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```
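
If you capture the server output in a script, the three "Started ..." lines above can be detected programmatically. The following is a minimal sketch that assumes the log format shown above; the helper names `started_services` and `server_is_up` are hypothetical.

```python
import re

# Matches the "Started <service> at <host>:<port>" tail of a Triton log line.
STARTED_RE = re.compile(r"Started (?P<service>.+?) at (?P<address>\S+:\d+)\s*$")

# Service names as they appear in the sample output above.
REQUIRED_SERVICES = {"GRPCInferenceService", "HTTPService", "Metrics Service"}


def started_services(log_text):
    """Map each started service name to the address it reports."""
    services = {}
    for line in log_text.splitlines():
        match = STARTED_RE.search(line)
        if match:
            services[match["service"]] = match["address"]
    return services


def server_is_up(log_text):
    """True once the gRPC, HTTP, and metrics services have all been reported."""
    return REQUIRED_SERVICES.issubset(started_services(log_text))
```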

## Step 3: Use a Triton Client to Query the Server
## Step 3: Use a Triton Client to Send Your First Inference Request

We will run the client within Triton's SDK container to issue multiple async requests using the
In this tutorial, we will show how to send an inference request to the
[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model in two ways:

* [Using the generate endpoint](#using-the-generate-endpoint)
* [Using the gRPC asyncio client](#using-the-grpc-asyncio-client)

### Using the Generate Endpoint
After you start Triton with the sample model_repository,
you can quickly run your first inference request with the
[generate](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
endpoint.

Start Triton's SDK container with the following command:
```
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
```

Now, let's send an inference request:
```
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
```
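
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from Step 2 is reachable at `localhost:8000` and the model is named `vllm_model` as above; the helper names `build_generate_request` and `extract_text` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, host="localhost:8000", model="vllm_model"):
    """Build a POST request carrying the same JSON body as the curl command."""
    payload = {
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": 0},
    }
    return urllib.request.Request(
        f"http://{host}/v2/models/{model}/generate",
        data=json.dumps(payload).encode("utf-8"),
        # Explicit JSON content type; this header is an addition to the curl example.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of a generate-endpoint JSON response."""
    return json.loads(response_body)["text_output"]
```

To send the request, pass the result of `build_generate_request("What is Triton Inference Server?")` to `urllib.request.urlopen` and feed the response body to `extract_text`.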

### Using the gRPC Asyncio Client
Now, we will see how to run the client within Triton's SDK container
to issue multiple async requests using the
[gRPC asyncio client](https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/aio/__init__.py)
library.

This method requires a
[client.py](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
script and a set of
[prompts](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt),
which are provided in the
[samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
folder of the
[vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
repository.

Use the following command to download `client.py` and `prompts.txt` to your
current directory:
```
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:23.08-py3-sdk bash
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
```

Within the container, run [`client.py`](client.py) with:
Now, we are ready to start Triton's SDK container:
```
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
```

Within the container, run
[`client.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
with:
```
python3 client.py
```

The client reads prompts from the [prompts.txt](prompts.txt) file, sends them to Triton server for
The client reads prompts from the
[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt)
file, sends them to the Triton server for
inference, and stores the results in a file named `results.txt` by default.

The output of the client should look like the following:
@@ -128,15 +196,22 @@ Storing results into `results.txt`...
PASS: vLLM example
```

You can inspect the contents of the `results.txt` for the response from the server. The `--iterations`
flag can be used with the client to increase the load on the server by looping through the list of
provided prompts in [`prompts.txt`](prompts.txt).
You can inspect the contents of `results.txt` for the response
from the server. The `--iterations` flag can be used with the client
to increase the load on the server by looping through the list of
provided prompts in
[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt).

When you run the client in verbose mode with the `--verbose` flag, the client will print more details
about the request/response transactions.
When you run the client in verbose mode with the `--verbose` flag,
the client will print more details about the request/response transactions.

## Limitations

- We use the decoupled streaming protocol even if there is exactly 1 response for each request.
- The asyncio implementation is exposed to model.py.
- Does not support providing a specific subset of GPUs to be used.
- If you are running multiple instances of Triton server with
a Python-based vLLM backend, you need to specify a different
`shm-region-prefix-name` for each server. See
[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
for more information.