
make language in all tutorials more similar
jbkyang-nvi committed Dec 7, 2023
1 parent 27e8259 commit d69b68f
Showing 4 changed files with 60 additions and 10 deletions.
1 change: 0 additions & 1 deletion Popular_Models_Guide/Llama2/README.md
@@ -28,7 +28,6 @@

# Deploying Hugging Face Transformer Models in Triton

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights.
There are multiple ways to run Llama2 with Tritonserver.
1. Infer with [TensorRT-LLM Backend](trtllm_guide.md#infer-with-tensorrt-llm-backend)
2. Infer with [vLLM Backend](vllm_guide.md#infer-with-vllm-backend)
25 changes: 21 additions & 4 deletions Popular_Models_Guide/Llama2/trtllm_guide.md
@@ -26,6 +26,8 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

TensorRT-LLM is NVIDIA's recommended solution for running Large Language Models (LLMs) on NVIDIA GPUs. Read more about TensorRT-LLM [here](https://github.com/NVIDIA/TensorRT-LLM) and Triton's TensorRT-LLM Backend [here](https://github.com/triton-inference-server/tensorrtllm_backend).

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Please follow the [README.md](README.md) for pre-build instructions and links for how to run Llama with other backends.
@@ -110,9 +112,9 @@ To run our Llama2-7B model, you will need to:

1. Copy over the inflight batcher models repository

```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```

2. Modify the config.pbtxt files for the preprocessing, postprocessing, and processing steps. See details in the [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#create-the-model-repository):
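
As a concrete illustration of this step (not part of the original tutorial), the `tensorrtllm_backend` repository ships a `tools/fill_template.py` helper that substitutes values into these template configs. The template keys and the tokenizer path below are assumptions; check them against the README that matches your backend release:

```bash
# Sketch only: fill the template variables of each config.pbtxt in place.
# The keys (tokenizer_dir, tokenizer_type) and the /Llama-2-7b-hf/ path are
# assumptions; confirm them against the tensorrtllm_backend release you use.
python3 /tensorrtllm_backend/tools/fill_template.py -i \
    /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:/Llama-2-7b-hf/,tokenizer_type:llama
python3 /tensorrtllm_backend/tools/fill_template.py -i \
    /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_dir:/Llama-2-7b-hf/,tokenizer_type:llama
# The tensorrt_llm model's config.pbtxt also needs its engine location filled
# in the same way; see the backend documentation for the exact key.
```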

@@ -134,6 +136,13 @@ To run our Llama2-7B model, you will need to:
```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
```
The server has launched successfully when you see the following outputs in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
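
As an optional check that is not part of the original tutorial, Triton's standard HTTP health endpoint can be queried from another shell to confirm that the server and its models are ready:

```bash
# Returns HTTP 200 once the server and all loaded models are ready to serve.
curl -v localhost:8000/v2/health/ready
```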

## Client

@@ -154,6 +163,14 @@ python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_cl
```

2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint), if you are using a Triton TensorRT-LLM Backend container with a version newer than `r23.10`.

```bash
$ curl -X POST localhost:8000/v2/models/llama7b/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
"model_name":"llama2vllm",
"model_version":"1",
"text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
}
```
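
For token-by-token output, containers that support the generate extension also expose a streaming variant. This is a minimal sketch, assuming the `generate_stream` endpoint is available in your container version and that `llama7b` matches your deployed model name:

```bash
# Sketch: stream the response as server-sent events, one "data: {...}" chunk at a time.
curl -X POST localhost:8000/v2/models/llama7b/generate_stream \
    -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'
```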


42 changes: 37 additions & 5 deletions Popular_Models_Guide/Llama2/vllm_guide.md
@@ -25,24 +25,56 @@
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

The vLLM backend uses vLLM to do inference. Read more about vLLM [here](https://blog.vllm.ai/2023/06/20/vllm.html) and the vLLM Backend [here](https://github.com/triton-inference-server/vllm_backend).

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Please follow the [README.md](README.md) for pre-build instructions and links for how to run Llama with other backends.

## Infer with vLLM Backend
## Installation

The vLLM backend uses vLLM to do inference. The triton vLLM container can be cloned with
The Triton vLLM container can be pulled from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) with

```bash
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /path/to/Llama2/repo:/Llama-2-7b-hf \
-v $PWD/llama2vllm:/opt/tritonserver/model_repository/llama2vllm \
nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3
```
This will create a `/opt/tritonserver/model_repository` folder that contains the `llama2vllm` model. The model weights themselves will be pulled from the HuggingFace Hub when the model is loaded.
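
For reference, here is a minimal sketch of what the mounted `llama2vllm` folder can look like for the vLLM backend; the file names and engine arguments below are assumptions rather than the tutorial's exact files:

```bash
# Sketch only: an assumed minimal vLLM-backend model directory.
mkdir -p llama2vllm/1
cat > llama2vllm/1/model.json <<'EOF'
{
    "model": "meta-llama/Llama-2-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9
}
EOF
# A minimal config.pbtxt only names the backend; the vLLM backend
# auto-completes the rest of the model configuration.
cat > llama2vllm/config.pbtxt <<'EOF'
backend: "vllm"
instance_group [{ kind: KIND_MODEL }]
EOF
```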

Once in the container, install the `huggingface-cli` and login.
Once in the container, install the `huggingface-cli` and log in with your own credentials.
```bash
pip install --upgrade huggingface_hub
huggingface-cli login --token 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
huggingface-cli login --token <your access token here>
```
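
Optionally (this is not part of the original steps), you can confirm the token was accepted before moving on:

```bash
# Prints the account associated with the stored token.
huggingface-cli whoami
```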


## Serving with Triton

Then you can run `tritonserver` as usual:
```bash
tritonserver --model-repository model_repository
```
The server has launched successfully when you see the following outputs in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
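
The third log line shows the metrics port; as an optional aside, Prometheus-format metrics can be scraped from it to verify that the service is up:

```bash
# Prints Prometheus-format metrics such as request counts and GPU utilization.
curl localhost:8002/metrics
```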

## Sending requests

As a simple check that the server works, you can send a request to the `generate` endpoint. Read more about the generate endpoint [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

```bash
$ curl -X POST localhost:8000/v2/models/llama2vllm/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
"model_name":"llama2vllm",
"model_version":"1",
"text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
}
```
2 changes: 2 additions & 0 deletions Quick_Deploy/HuggingFaceTransformers/README.md
@@ -42,6 +42,8 @@ sufficient infrastructure.
*NOTE*: The tutorial is intended to be a reference example only. It may not be tuned for
optimal performance.

*NOTE*: Llama 2 models are not specifically mentioned in the steps below, but they can be run if `tiiuae/falcon-7b` is replaced with `meta-llama/Llama-2-7b-hf` and the `falcon7b` folder is replaced by a `llama7b` folder, as sketched below.
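
A hedged sketch of that substitution, assuming this tutorial's repository layout and that the Falcon model name is hardcoded in the example `model.py` (both are assumptions to verify against the files you actually have):

```bash
# Sketch only: reuse the falcon7b example model directory for Llama 2.
cp -r model_repository/falcon7b model_repository/llama7b
# Swap the hardcoded HuggingFace model id (assumed to live in model.py).
sed -i 's#tiiuae/falcon-7b#meta-llama/Llama-2-7b-hf#g' model_repository/llama7b/1/model.py
# If config.pbtxt names the model explicitly, keep it in sync with the folder name.
sed -i 's#falcon7b#llama7b#g' model_repository/llama7b/config.pbtxt
```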

## Step 1: Create a Model Repository

The first step is to create a model repository containing the models we want the Triton
