Update GAP tutorial for the vLLM backend (#743)
* Update GAP tutorial to be testable

---------

Co-authored-by: tgerdes <[email protected]>
Co-authored-by: Timothy Gerdes <[email protected]>
Co-authored-by: David Yastremsky <[email protected]>
4 people authored Jul 12, 2024
1 parent b9bab20 commit db888f1
Showing 1 changed file with 44 additions and 69 deletions.
113 changes: 44 additions & 69 deletions src/c++/perf_analyzer/genai-perf/docs/tutorial.md

- [Profile GPT2 running on Triton + TensorRT-LLM](#tensorrt-llm)
- [Profile GPT2 running on Triton + vLLM](#triton-vllm)
- [Profile GPT2 running on OpenAI Chat Completions API-Compatible Server](#openai-chat)
- [Profile GPT2 running on OpenAI Completions API-Compatible Server](#openai-completions)

---

## Profile GPT2 running on Triton + TensorRT-LLM <a id="tensorrt-llm"></a>

### Run GPT2 on Triton Inference Server using TensorRT-LLM

<details>
<summary>See instructions</summary>

Run Triton Inference Server with TensorRT-LLM backend container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=all --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-trtllm-python-py3

# Install Triton CLI (~5 min):
pip install "git+https://github.com/triton-inference-server/[email protected]"

# Download model:
triton import -m gpt2 --backend tensorrtllm

# Run server:
triton start
```

</details>
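
Before profiling, you can optionally confirm that the server and the model are ready. This is a minimal check sketch, assuming Triton's default HTTP port 8000 on the same host:

```bash
# Optional sanity check from the host (assumes Triton's default HTTP port 8000).
# Both endpoints return HTTP 200 once the server and the gpt2 model are ready.
curl -s -o /dev/null -w "server ready: %{http_code}\n" localhost:8000/v2/health/ready
curl -s -o /dev/null -w "model ready:  %{http_code}\n" localhost:8000/v2/models/gpt2/ready
```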

### Run GenAI-Perf

Run GenAI-Perf from the Triton Inference Server SDK container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m gpt2 \
  ...
```

Example output (abridged):

```
...
Request throughput (per sec): 4.44
```
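
The arguments after `-m gpt2` are collapsed in this view. As a rough sketch only (the exact flag set here is an assumption, not the tutorial's command), a Triton + TensorRT-LLM run is typically pointed at the server's gRPC port:

```bash
# Hedged sketch, not the tutorial's exact command: profile gpt2 served by the
# TensorRT-LLM backend over Triton's default gRPC port 8001, with streaming.
genai-perf profile \
  -m gpt2 \
  --service-kind triton \
  --backend tensorrtllm \
  --streaming \
  --concurrency 1 \
  --url localhost:8001
```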
## Profile GPT2 running on Triton + vLLM <a id="triton-vllm"></a>

### Run GPT2 on Triton Inference Server using vLLM

<details>
<summary>See instructions</summary>

Run Triton Inference Server with vLLM backend container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=1 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:${RELEASE}-vllm-python-py3

# Install Triton CLI (~5 min):
pip install "git+https://github.com/triton-inference-server/[email protected]"

# Download model:
triton import -m gpt2 --backend vllm

# Run server:
triton start
```

</details>
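
Optionally, send a single request to the model before benchmarking. This is a sketch that assumes the default HTTP port 8000 and the request shape used by Triton's generate endpoint with the vLLM backend; adjust field names if your model configuration differs:

```bash
# Optional smoke test (assumptions: default HTTP port 8000, vLLM backend's
# text_input field). Returns a JSON response containing the generated text.
curl -X POST localhost:8000/v2/models/gpt2/generate \
  -d '{"text_input": "What is machine learning?", "parameters": {"stream": false, "temperature": 0}}'
```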

### Run GenAI-Perf

Run GenAI-Perf from the Triton Inference Server SDK container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=1 nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m gpt2 \
  ...
```

Example output (abridged):

```
...
Output token throughput (per sec): 290.24
Request throughput (per sec): 2.57
```
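
GenAI-Perf comes preinstalled in the SDK container. If a run fails or a flag is unclear, the built-in help is the quickest reference:

```bash
# Inside the SDK container: list GenAI-Perf's subcommands and the profile options.
genai-perf --help
genai-perf profile --help
```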

## Profile GPT2 running on OpenAI Chat API-Compatible Server <a id="openai-chat"></a>

### Run GPT2 on [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)-compatible server

<details>
<summary>See instructions</summary>

Run the vLLM inference server:

```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```

</details>
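
To verify the server responds before benchmarking, you can send one chat request by hand. A minimal sketch, assuming vLLM's default port 8000 and the standard OpenAI chat request shape:

```bash
# Minimal sanity check against the OpenAI-compatible chat endpoint
# (assumes the vLLM server above is listening on localhost:8000).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'
```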

### Run GenAI-Perf

Run GenAI-Perf from the Triton Inference Server SDK container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m gpt2 \
  ...
```

Example output (abridged):

```
...
Output token throughput (per sec): 401.62
Request throughput (per sec): 3.52
```
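
The collapsed arguments differ for OpenAI-compatible endpoints. As a sketch only (the flags here are assumptions based on typical GenAI-Perf usage, not the tutorial's exact command), a chat endpoint is usually targeted like this:

```bash
# Hedged sketch, not the tutorial's exact command: profile the OpenAI-compatible
# chat endpoint exposed by the vLLM server on localhost:8000.
genai-perf profile \
  -m gpt2 \
  --service-kind openai \
  --endpoint-type chat \
  --streaming \
  --url localhost:8000
```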

## Profile GPT2 running on OpenAI Completions API-Compatible Server <a id="openai-completions"></a>

### Run GPT2 on [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions)-compatible server

<details>
<summary>See instructions</summary>

Run the vLLM inference server:

```bash
docker run -it --net=host --gpus=all vllm/vllm-openai:latest --model gpt2 --dtype float16 --max-model-len 1024
```

</details>
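
As with the chat variant, a single manual request is an easy way to confirm the server is up. A minimal sketch, again assuming the default port 8000:

```bash
# Minimal sanity check against the OpenAI-compatible completions endpoint
# (assumes the vLLM server above is listening on localhost:8000).
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt2", "prompt": "Machine learning is", "max_tokens": 32}'
```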

### Run GenAI-Perf

Run GenAI-Perf from the Triton Inference Server SDK container:

```bash
export RELEASE="yy.mm" # e.g. export RELEASE="24.06"

docker run -it --net=host --gpus=all nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

# Run GenAI-Perf in the container:
genai-perf profile \
  -m gpt2 \
  ...
```
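
Whichever backend you profile, GenAI-Perf prints a metrics summary to the console and also exports its measurements to disk. In recent releases these exports land under an `artifacts/` directory in the working directory (the exact layout can vary between versions):

```bash
# After a run, inspect what GenAI-Perf exported (assumption: the default output
# location is an artifacts/ directory under the current working directory).
ls -R artifacts/
```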
