update Llama2 tutorial for running with trtllm #73

Popular_Models_Guide/Llama2/trtllm_guide.md

Note that there are some version mismatches between the `tutorials` and `tensorrt_backend` repos.
Refer to [llama.md](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md)
for more detailed modifications if necessary.


## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained
weights. Please follow the [README.md](README.md) for pre-build instructions
and links for how to run Llama with other backends.

## Installation

1. The installation starts with cloning the TensorRT-LLM Backend and updating the
TensorRT-LLM submodule. Note that the release that has been tested is `v0.5.0`.
```bash
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch <release branch>
# Update the submodules
cd tensorrtllm_backend
# Install git-lfs if needed
git lfs install
git submodule update --init --recursive
```

2. Launch the Triton docker container with the TensorRT-LLM backend. Note that we mount
`tensorrtllm_backend` to `/tensorrtllm_backend` and the Llama2 model to
`/Llama-2-7b-hf` in the docker container for simplicity. Create an `engines`
folder outside the container so that engines can be reused for future runs.
```bash
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /your_path_to/tensorrtllm_backend:/tensorrtllm_backend \
-v /your_path_to/Llama-2-7b-hf:/Llama-2-7b-hf \
-v /your_path_to/engines:/engines \
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

# Install SentencePiece
pip3 install SentencePiece protobuf
```

Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md)
to build Triton Server with the TensorRT-LLM Backend if you want to build a specialized
container.

Don't forget to allow GPU usage (e.g. `--gpus all`) when you launch the container.

## Create Engines for each model [skip this step if you already have an engine]
TensorRT-LLM requires each model to be compiled for the configuration you need
before running. Before you run your model for the first time on Triton Server,
you will need to create a TensorRT-LLM engine for the configuration you want
with the following steps:

1. Install the TensorRT-LLM Python package
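The exact install command depends on the release you are working with; the
sketch below assumes the `tensorrt_llm` wheel is available from NVIDIA's PyPI
index, so check the TensorRT-LLM backend README for the command that matches
your branch.

```bash
# Assumption: the tensorrt_llm wheel is published on NVIDIA's package index.
# Consult the tensorrtllm_backend README if your release ships it differently.
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

# Quick sanity check that the package imports
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```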

2. Compile model engines

The script to build Llama models is located in the [TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples).
We use the copy located in the docker container at `/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py`.
This command compiles the model with inflight batching and 1 GPU. To run
with more GPUs, you will need to change the build command to use `--world_size X`.
For more details on the script, please see the documentation for the Llama
example [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama/README.md).

```bash
BUILD_SCRIPT=/tensorrtllm_backend/tensorrt_llm/examples/llama/build.py
export TOKENIZER_DIR=/Llama-2-7b-hf
export ENGINE_DIR=/engines/h100/batch_128
export MAX_BATCH_SIZE=128
python ${BUILD_SCRIPT} \
--model_dir ${TOKENIZER_DIR} \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir ${ENGINE_DIR} \
--paged_kv_cache \
--max_batch_size ${MAX_BATCH_SIZE}
```
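Once the build finishes, you can quickly confirm that the engine artifacts were
written out (the exact file names depend on the build settings):

```bash
# List the generated engine files and build configuration
ls ${ENGINE_DIR}
```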

> Optional: You can test the output of the model with `run.py`
> located in the examples folder.
>
> ```bash
> python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py --engine_dir=${ENGINE_DIR} --max_output_len 100 --tokenizer_dir ${TOKENIZER_DIR} --input_text "How do I count to ten in French?"
> ```

## Serving with Triton
To run our Llama2-7B model, you will need to:

1. Copy over the inflight batcher models repository

```bash
mkdir -p /opt/tritonserver/model_repository
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /opt/tritonserver/model_repository/.
```

2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps.
See details in the [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#create-the-model-repository).
The `config.pbtxt` has a lot of variables, so it might be easier to use the
provided `fill_template.py` script.

```bash
export TOKENIZER_DIR=/Llama-2-7b-hf
export ENGINE_DIR=/engines/h100/batch_128
export MAX_BATCH_SIZE=128
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
MODEL_FOLDER=/opt/tritonserver/model_repository
TOKENIZER_TYPE="llama"
# Batch size here is the same as the ${MAX_BATCH_SIZE} we specified when building the engine
INSTANCE_COUNT=1
BATCHING_STRATEGY="inflight_batching"
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt \
tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt \
triton_max_batch_size:${MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt \
triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:${BATCHING_STRATEGY},max_queue_delay_microseconds:600
```

3. Launch Tritonserver

Use the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. This launches multiple instances of `tritonserver` with MPI.
```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/model_repository
```
The server has launched successfully when you see the following outputs in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
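Optionally, before moving on to the client, you can confirm the server is
responding. This is a quick sanity check that assumes the default HTTP port 8000:

```bash
# Expect an HTTP 200 response once all models have loaded
curl -v localhost:8000/v2/health/ready
```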

## Client

You can test the results of the run with:

1. The `inflight_batcher_llm_client.py` script shipped in the TensorRT-LLM backend repository:

```bash
# Using the SDK container as an example
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /your_path_to/tensorrtllm_backend:/tensorrtllm_backend \
-v /your_path_to/Llama2/repo:/Llama-2-7b-hf \
-v /your_path_to/engines:/engines \
nvcr.io/nvidia/tritonserver:23.10-py3-sdk
# Install extra dependencies for the script
pip3 install transformers sentencepiece
python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_type llama --tokenizer_dir /Llama-2-7b-hf
```

2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint) if you are using the Triton TensorRT-LLM Backend container with versions greater than `r23.10`.

```bash
$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
"model_name":"ensemble",
"model_version":"1",
"text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
}
```

Popular_Models_Guide/Llama2/vllm_guide.md


The vLLM Backend uses vLLM to do inference. Read more about vLLM
[here](https://blog.vllm.ai/2023/06/20/vllm.html) and the vLLM Backend
[here](https://github.com/triton-inference-server/vllm_backend).

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained
weights. Please follow the [README.md](README.md) for pre-build instructions
and links for how to run Llama with other backends.

## Installation

```bash
docker run --rm -it --net host --shm-size=2g \
-v $PWD/llama2vllm:/opt/tritonserver/model_repository/llama2vllm \
nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3
```
This will create a `/opt/tritonserver/model_repository` folder that contains the
`llama2vllm` model. The model itself will be pulled from the HuggingFace Hub.

Once in the container, install the `huggingface-cli` and login with your own credentials.
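A minimal sketch of that step, assuming the CLI is provided by the
`huggingface_hub` package and that you already created an access token at
[huggingface.co/settings/tokens](https://huggingface.co/settings/tokens):

```bash
# huggingface-cli ships with the huggingface_hub package
pip install --upgrade huggingface_hub
# Paste your access token when prompted
huggingface-cli login
```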

## Sending requests via the `generate` endpoint

As a simple example to make sure the server works, you can use the `generate`
endpoint to test. More about the generate endpoint [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

```bash
$ curl -X POST localhost:8000/v2/models/llama2vllm/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```