
make language in all tutorials more similar
jbkyang-nvi committed Dec 7, 2023
1 parent 27e8259 commit d69b68f
Showing 4 changed files with 60 additions and 10 deletions.
1 change: 0 additions & 1 deletion Popular_Models_Guide/Llama2/README.md
@@ -28,7 +28,6 @@

# Deploying Hugging Face Transformer Models in Triton

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights.
There are multiple ways to run Llama2 with Tritonserver.
1. Infer with [TensorRT-LLM Backend](trtllm_guide.md#infer-with-tensorrt-llm-backend)
2. Infer with [vLLM Backend](vllm_guide.md#infer-with-vllm-backend)
25 changes: 21 additions & 4 deletions Popular_Models_Guide/Llama2/trtllm_guide.md
@@ -26,6 +26,8 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

TensorRT-LLM is NVIDIA's recommended solution for running Large Language Models (LLMs) on NVIDIA GPUs. Read more about TensorRT-LLM [here](https://github.com/NVIDIA/TensorRT-LLM) and Triton's TensorRT-LLM Backend [here](https://github.com/triton-inference-server/tensorrtllm_backend).

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Please follow the [README.md](README.md) for pre-build instructions and links for how to run Llama with other backends.
@@ -110,9 +112,9 @@ To run our Llama2-7B model, you will need to:

1. Copy over the inflight batcher models repository

```bash
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
```

2. Modify the config.pbtxt files for the preprocessing, postprocessing, and processing steps. See details in the [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#create-the-model-repository):
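
As a concrete illustration of this step (not part of the original tutorial), the `tensorrtllm_backend` repository ships a `tools/fill_template.py` helper that substitutes values into these template configs. The template keys and the tokenizer path below are assumptions; check them against the README that matches your backend release:

```bash
# Sketch only: fill the template variables of each config.pbtxt in place.
# The keys (tokenizer_dir, tokenizer_type) and the /Llama-2-7b-hf/ path are
# assumptions; confirm them against the tensorrtllm_backend release you use.
python3 /tensorrtllm_backend/tools/fill_template.py -i \
    /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:/Llama-2-7b-hf/,tokenizer_type:llama
python3 /tensorrtllm_backend/tools/fill_template.py -i \
    /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_dir:/Llama-2-7b-hf/,tokenizer_type:llama
# The tensorrt_llm model's config.pbtxt also needs its engine location filled
# in the same way; see the backend documentation for the exact key.
```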

@@ -134,6 +136,13 @@ To run our Llama2-7B model, you will need to:
```bash
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=<world size of the engine> --model_repo=/opt/tritonserver/inflight_batcher_llm
```
The server has launched successfully when you see the following outputs in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
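
As an optional check that is not part of the original tutorial, Triton's standard HTTP health endpoint can be queried from another shell to confirm that the server and its models are ready:

```bash
# Returns HTTP 200 once the server and all loaded models are ready to serve.
curl -v localhost:8000/v2/health/ready
```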

## Client

@@ -154,6 +163,14 @@ python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_cl
```

2. The [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint), if you are using a Triton TensorRT-LLM Backend container with a version newer than `r23.10`.

```bash
$ curl -X POST localhost:8000/v2/models/llama7b/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
"model_name":"llama2vllm",
"model_version":"1",
"text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
}
```
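
For token-by-token output, containers that support the generate extension also expose a streaming variant. This is a minimal sketch, assuming the `generate_stream` endpoint is available in your container version and that `llama7b` matches your deployed model name:

```bash
# Sketch: stream the response as server-sent events, one "data: {...}" chunk at a time.
curl -X POST localhost:8000/v2/models/llama7b/generate_stream \
    -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'
```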


42 changes: 37 additions & 5 deletions Popular_Models_Guide/Llama2/vllm_guide.md
@@ -25,24 +25,56 @@
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

The vLLM backend uses vLLM to do inference. Read more about vLLM [here](https://blog.vllm.ai/2023/06/20/vllm.html) and the vLLM Backend [here](https://github.com/triton-inference-server/vllm_backend).

## Pre-build instructions

For this tutorial, we are using the Llama2-7B HuggingFace model with pre-trained weights. Please follow the [README.md](README.md) for pre-build instructions and links for how to run Llama with other backends.

## Infer with vLLM Backend
## Installation

The vLLM backend uses vLLM to do inference. The triton vLLM container can be cloned with
The Triton vLLM container can be pulled from [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) with

```bash
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v /path/to/Llama2/repo:/Llama-2-7b-hf \
-v $PWD/llama2vllm:/opt/tritonserver/model_repository/llama2vllm \
nvcr.io/nvidia/tritonserver:23.11-vllm-python-py3
```
This will create a `/opt/tritonserver/model_repository` folder that contains the `llama2vllm` model. The model weights themselves will be pulled from the HuggingFace Hub when the model is loaded.
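
For reference, here is a minimal sketch of what the mounted `llama2vllm` folder can look like for the vLLM backend; the file names and engine arguments below are assumptions rather than the tutorial's exact files:

```bash
# Sketch only: an assumed minimal vLLM-backend model directory.
mkdir -p llama2vllm/1
cat > llama2vllm/1/model.json <<'EOF'
{
    "model": "meta-llama/Llama-2-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9
}
EOF
# A minimal config.pbtxt only names the backend; the vLLM backend
# auto-completes the rest of the model configuration.
cat > llama2vllm/config.pbtxt <<'EOF'
backend: "vllm"
instance_group [{ kind: KIND_MODEL }]
EOF
```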

Once in the container, install the `huggingface-cli` and login.
Once in the container, install the `huggingface-cli` and log in with your own credentials.
```bash
pip install --upgrade huggingface_hub
huggingface-cli login --token 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
huggingface-cli login --token <your access token here>
```
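
Optionally (this is not part of the original steps), you can confirm the token was accepted before moving on:

```bash
# Prints the account associated with the stored token.
huggingface-cli whoami
```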


## Serving with Triton

Then you can run `tritonserver` as usual:
```bash
tritonserver --model-repository model_repository
```
The server has launched successfully when you see the following outputs in your console:

```
I0922 23:28:40.351809 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0922 23:28:40.352017 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
```
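
The third log line shows the metrics port; as an optional aside, Prometheus-format metrics can be scraped from it to verify that the service is up:

```bash
# Prints Prometheus-format metrics such as request counts and GPU utilization.
curl localhost:8002/metrics
```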

## Sending requests

As a simple check that the server works, you can send a request to the `generate` endpoint. Read more about the generate endpoint [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

```bash
$ curl -X POST localhost:8000/v2/models/llama2vllm/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
# returns (formatted for better visualization)
> {
"model_name":"llama2vllm",
"model_version":"1",
"text_output":"What is Triton Inference Server?\nTriton Inference Server is a lightweight, high-performance"
}
```
2 changes: 2 additions & 0 deletions Quick_Deploy/HuggingFaceTransformers/README.md
@@ -42,6 +42,8 @@ sufficient infrastructure.
*NOTE*: The tutorial is intended to be a reference example only. It may not be tuned for
optimal performance.

*NOTE*: Llama 2 models are not specifically mentioned in the steps below, but they can be run if `tiiuae/falcon-7b` is replaced with `meta-llama/Llama-2-7b-hf` and the `falcon7b` folder is replaced by a `llama7b` folder, as sketched below.
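
A hedged sketch of that substitution, assuming this tutorial's repository layout and that the Falcon model name is hardcoded in the example `model.py` (both are assumptions to verify against the files you actually have):

```bash
# Sketch only: reuse the falcon7b example model directory for Llama 2.
cp -r model_repository/falcon7b model_repository/llama7b
# Swap the hardcoded HuggingFace model id (assumed to live in model.py).
sed -i 's#tiiuae/falcon-7b#meta-llama/Llama-2-7b-hf#g' model_repository/llama7b/1/model.py
# If config.pbtxt names the model explicitly, keep it in sync with the folder name.
sed -i 's#falcon7b#llama7b#g' model_repository/llama7b/config.pbtxt
```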

## Step 1: Create a Model Repository

The first step is to create a model repository containing the models we want the Triton
