triton-inference-server · oandreeva-nv · Oct 31, 2023 · Oct 30, 2023 · Oct 31, 2023 · Oct 31, 2023
diff --git a/Quick_Deploy/vLLM/Dockerfile b/Quick_Deploy/vLLM/Dockerfile
diff --git a/Quick_Deploy/vLLM/README.md b/Quick_Deploy/vLLM/README.md
@@ -31,38 +31,43 @@
 
 The following tutorial demonstrates how to deploy a simple
 [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model on
-Triton Inference Server using Triton's [Python backend](https://github.com/triton-inference-server/python_backend) and the
-[vLLM](https://github.com/vllm-project/vllm) library.
+Triton Inference Server using Triton's
+[Python-based](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends)
+[vLLM](https://github.com/triton-inference-server/vllm_backend/tree/main)
+backend.
 
 *NOTE*: The tutorial is intended to be a reference example only and has [known limitations](#limitations).
 
 
-## Step 1: Build a Triton Container Image with vLLM
+## Step 1: Prepare your model repository
 
-We will build a new container image derived from tritonserver:23.08-py3 with vLLM.
+To use Triton, we need to build a model repository. For this tutorial, we will
+use the model repository provided in the [samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
+folder of the [vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
+repository.
 
+The following commands create a `model_repository/vllm_model/1`
+directory and download two files:
+[`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json)
+and
+[`config.pbtxt`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/config.pbtxt),
+required to serve the [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model.
 ```
-docker build -t tritonserver_vllm .
+mkdir -p model_repository/vllm_model/1
+wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/1/model.json
+wget -P model_repository/vllm_model/ https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/config.pbtxt
 ```
 
-The above command should create the tritonserver_vllm image with vLLM and all of its dependencies.
-
-
-## Step 2: Start Triton Inference Server
-
-A sample model repository for deploying `facebook/opt-125m` using vLLM in Triton is
-included with this demo as `model_repository` directory.
 The model repository should look like this:
 ```
 model_repository/
-`-- vllm
-    |-- 1
-    |   `-- model.py
-    |-- config.pbtxt
-    |-- vllm_engine_args.json
+└── vllm_model
+    ├── 1
+    │   └── model.json
+    └── config.pbtxt
 ```
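
Before launching the server, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the vLLM backend; `missing_files` is a hypothetical helper name, and the two paths are taken from the tree shown above.

```python
from pathlib import Path

# Files the tutorial expects, relative to the model repository root.
EXPECTED_FILES = (
    Path("vllm_model") / "1" / "model.json",
    Path("vllm_model") / "config.pbtxt",
)


def missing_files(repo_root):
    """Return the expected files that are absent under repo_root."""
    root = Path(repo_root)
    return [str(rel) for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("model_repository")
    print("layout OK" if not missing else f"missing: {missing}")
```

An empty list means both files are where Triton will look for them.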
 
-The content of `vllm_engine_args.json` is:
+The content of `model.json` is:
 
 ```json
 {
@@ -71,53 +76,116 @@ The content of `vllm_engine_args.json` is:
     "gpu_memory_utilization": 0.5
 }
 ```
+
 This file can be modified to provide further settings to the vLLM engine. See vLLM
 [AsyncEngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L165)
 and
 [EngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L11)
-for supported key-value pairs.
+for supported key-value pairs. Inflight batching and paged attention are handled
+by the vLLM engine.
 
-For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
+For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified
+in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
 
 *Note*: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
 This tutorial updates this behavior by setting `gpu_memory_utilization` to 50%.
 You can tweak this behavior using fields like `gpu_memory_utilization` and other settings
-in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
+in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
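
Editing `model.json` by hand works fine; if you prefer to script the change, a sketch along these lines may help. `update_engine_args` is a hypothetical helper, and the values in the usage section are examples only, not recommendations. The keys `tensor_parallel_size` and `gpu_memory_utilization` are the ones named above.

```python
import json
from pathlib import Path


def update_engine_args(model_json, **overrides):
    """Merge overrides into an existing model.json and return the result."""
    path = Path(model_json)
    args = json.loads(path.read_text())
    # Existing keys are kept; only the given keys are added or replaced.
    args.update(overrides)
    path.write_text(json.dumps(args, indent=4) + "\n")
    return args


if __name__ == "__main__":
    # Example values only; choose ones that match your hardware.
    update_engine_args(
        "model_repository/vllm_model/1/model.json",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.8,
    )
```

Because the helper reads the file first, any settings already present in `model.json` are preserved.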
 
-Read through the documentation in [`model.py`](model_repository/vllm/1/model.py) to understand how
-to configure this sample for your use-case.
+Read through the documentation in [`model.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py)
+to understand how to configure this sample for your use-case.
 
-Run the following commands to start the server container:
+## Step 2: Launch Triton Inference Server
 
+Once you have the model repository set up, it is time to launch the Triton server.
+Starting with the 23.10 release, a dedicated container with vLLM pre-installed
+is available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
+To launch Triton with this container, use the docker command below.
 ```
-docker run --gpus all -it --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work tritonserver_vllm tritonserver --model-store ./model_repository
+docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-store ./model_repository
 ```
+Here and throughout the rest of the tutorial, \<xx.yy\> is the version of Triton
+that you want to use. Note that Triton's vLLM container was first
+published in the 23.10 release, so earlier versions
+will not work.
 
-Upon successful start of the server, you should see the following at the end of the output.
+After you start Triton, you will see output on the console showing
+the server starting up and loading the model. When you see output
+like the following, Triton is ready to accept inference requests.
 
 ```
-I0901 23:39:08.729123 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
-I0901 23:39:08.729640 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
-I0901 23:39:08.772522 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
+I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
+I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
+I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
 ```
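
Instead of watching the log, you can poll the server from a script. The sketch below assumes Triton's standard HTTP readiness endpoint, `/v2/health/ready`, on the HTTP port shown in the log above (8000); this tutorial itself does not use that endpoint, and `is_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:8000"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay_s=2.0):
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "not ready")
```

The `probe` parameter exists only so the retry loop can be exercised without a running server.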
 
-## Step 3: Use a Triton Client to Query the Server
+## Step 3: Use a Triton Client to Send Your First Inference Request
 
-We will run the client within Triton's SDK container to issue multiple async requests using the
+In this tutorial, we will show how to send an inference request to the
+[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model in 2 ways:
+
+* [Using the generate endpoint](#using-the-generate-endpoint)
+* [Using the gRPC asyncio client](#using-the-grpc-asyncio-client)
+
+### Using the Generate Endpoint
+After you start Triton with the sample model_repository,
+you can quickly run your first inference request with the
+[generate](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
+endpoint.
+
+Start Triton's SDK container with the following command:
+```
+docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
+```
+
+Now, let's send an inference request:
+```
+curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
+```
+
+Upon success, you should see a response from the server like this one:
+```
+{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
+```
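
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the `curl` call above: the URL, the `text_input` and `parameters` fields, and the `text_output` response field come from that example, while the helper names and the explicit `Content-Type` header are assumptions made for illustration.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/vllm_model/generate"


def build_request(prompt, temperature=0, stream=False):
    """Build the POST request for the generate endpoint without sending it."""
    payload = {
        "text_input": prompt,
        "parameters": {"stream": stream, "temperature": temperature},
    }
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the `text_output` field of the response."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        return json.loads(resp.read())["text_output"]


if __name__ == "__main__":
    print(generate("What is Triton Inference Server?"))
```

Splitting request construction from sending keeps the payload easy to inspect before anything goes over the network.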
+
+### Using the gRPC Asyncio Client
+Now, we will see how to run the client within Triton's SDK container
+to issue multiple async requests using the
 [gRPC asyncio client](https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/aio/__init__.py)
 library.
 
+This method requires a
+[client.py](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
+script and a set of
+[prompts](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt),
+which are provided in the
+[samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
+folder of the
+[vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
+repository.
+
+Use the following command to download `client.py` and `prompts.txt` to your
+current directory:
 ```
-docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:23.08-py3-sdk bash
+wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
+wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
 ```
 
-Within the container, run [`client.py`](client.py) with:
+Now, we are ready to start Triton's SDK container:
+```
+docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
+```
 
+Within the container, run
+[`client.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
+with:
 ```
 python3 client.py
 ```
 
-The client reads prompts from the [prompts.txt](prompts.txt) file, sends them to Triton server for
+The client reads prompts from the
+[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt)
+file, sends them to the Triton server for
 inference, and stores the results into a file named `results.txt` by default.
 
 The output of the client should look like below:
@@ -128,15 +196,22 @@ Storing results into `results.txt`...
 PASS: vLLM example
 ```
 
-You can inspect the contents of the `results.txt` for the response from the server. The `--iterations`
-flag can be used with the client to increase the load on the server by looping through the list of
-provided prompts in [`prompts.txt`](prompts.txt).
+You can inspect the contents of `results.txt` for the response
+from the server. The `--iterations` flag can be used with the client
+to increase the load on the server by looping through the list of
+provided prompts in
+[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt).
 
-When you run the client in verbose mode with the `--verbose` flag, the client will print more details
-about the request/response transactions.
+When you run the client in verbose mode with the `--verbose` flag,
+the client will print more details about the request/response transactions.
 
 ## Limitations
 
 - We use decoupled streaming protocol even if there is exactly 1 response for each request.
 - The asyncio implementation is exposed to model.py.
 - Does not support providing specific subset of GPUs to be used.
+- If you are running multiple instances of Triton server with
+  a Python-based vLLM backend, you need to specify a different
+  `shm-region-prefix-name` for each server. See
+  [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
+  for more information.