perf: use neuron TGI upstream image
dacorvo committed Feb 24, 2025
1 parent ccf3b45 commit c1cf0f0
Showing 8 changed files with 19 additions and 37 deletions.
34 changes: 7 additions & 27 deletions benchmark/text-generation-inference/performance/README.md
@@ -16,31 +16,11 @@ $ cd optimum-neuron/benchmark/text-generation-inference/

Edit the `.env` file to select the model to use for the benchmark and its configuration.
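
As a sketch, the settings involved look something like this (hypothetical values; the variable names are the ones referenced later in this README, and they must stay consistent with the compilation settings discussed below):

```shell
# Hypothetical .env sketch; keep values consistent with the compilation settings below.
# MAX_BATCH_SIZE must match the compilation batch_size,
# MAX_TOTAL_TOKENS must match the compilation sequence_length,
# and MAX_INPUT_LENGTH must be less than MAX_TOTAL_TOKENS.
MODEL_ID=NousResearch/Meta-Llama-3-70B-Instruct
MAX_BATCH_SIZE=4
MAX_TOTAL_TOKENS=4096
MAX_INPUT_LENGTH=2048
```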

-The following instructions assume that you are testing a locally built image, so Docker would have stored the image neuronx-tgi:latest.
-
-You can confirm this by running:
-
-```shell
-$ docker image ls
-```
-
-If you have not built it locally, you can download it and retag it using the following commands:
+Download the [text-generation-inference image](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) you want to use. Here we will simply use the latest one:

```shell
-$ docker pull ghcr.io/huggingface/neuronx-tgi:latest
-$ docker tag ghcr.io/huggingface/neuronx-tgi:latest neuronx-tgi:latest
+$ docker pull ghcr.io/huggingface/text-generation-inference:latest-neuron
```
-You should then see a single IMAGE ID with two different sets of tags:
-
-```shell
-$ docker image ls
-REPOSITORY                       TAG     IMAGE ID      CREATED       SIZE
-neuronx-tgi                      latest  f5ba57f8517b  12 hours ago  11.3GB
-ghcr.io/huggingface/neuronx-tgi  latest  f5ba57f8517b  12 hours ago  11.3GB
-```
-
-Alternatively, you can edit the appropriate docker-compose.yaml to supply the full path by changing `neuronx-tgi:latest` to `ghcr.io/huggingface/neuronx-tgi:latest`.
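
In either case, you can check that the image is present locally before starting the benchmark:

```shell
$ docker image ls ghcr.io/huggingface/text-generation-inference
```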

## Start the servers
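
As a minimal sketch (assuming the docker-compose files shown further down, run from the directory containing the `.env` file; the compose file path here is hypothetical):

```shell
# Hypothetical: start the TGI server(s) defined in the compose file for your
# instance type, with MODEL_ID and the MAX_* settings read from the .env file.
$ docker compose --env-file .env -f llama3-70b-trn1.32xlarge/docker-compose.yaml up
```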

@@ -94,10 +74,10 @@ See the [llama3-70b-trn1.32xlarge](llama3-70b-trn1.32xlarge) as an example.

It is best to compile the model with the software in the container you will be using to ensure all library versions match.

As an example, you can compile with the following command.

**If you make changes, make sure the batch_size, sequence_length, and num_cores you compile with match the MAX_BATCH_SIZE and MAX_TOTAL_TOKENS settings in the .env file and the HF_NUM_CORES setting in the docker-compose file. MAX_INPUT_LENGTH needs to be less than the sequence length (MAX_TOTAL_TOKENS). The directory at the end of the compile command needs to match the MODEL_ID in the .env file.**
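
For the example command below, that means MAX_BATCH_SIZE=4, MAX_TOTAL_TOKENS=4096, and HF_NUM_CORES=32 to match `--batch_size 4`, `--sequence_length 4096`, and `--num_cores 32`, with MAX_INPUT_LENGTH set to less than 4096.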

```shell
docker run -p 8080:80 \
@@ -119,14 +99,14 @@ docker run -p 8080:80 \
--device=/dev/neuron14 \
--device=/dev/neuron15 \
-ti \
--entrypoint "optimum-cli" neuronx-tgi:latest \
--entrypoint "optimum-cli" ghcr.io/huggingface/text-generation-inference:latest-neuron \
export neuron --model NousResearch/Meta-Llama-3-70B-Instruct \
--sequence_length 4096 \
--batch_size 4 \
--num_cores 32 \
/data/exportedmodel/
```
See the [Hugging Face documentation](https://huggingface.co/docs/optimum-neuron/en/guides/export_model#exporting-a-model-to-neuron-using-the-cli) for more information on compilation.

Note that the .env file sets MODEL_ID to a path in the /data directory, so the pre-compiled model is loaded from there.
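
For example (a hypothetical sketch, assuming the compile command above, which exports to /data/exportedmodel/):

```shell
# Hypothetical: MODEL_ID pointing at the pre-compiled model under the /data mount
MODEL_ID=/data/exportedmodel/
```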

@@ -2,7 +2,7 @@ version: '3.7'

services:
tgi-1:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8080:8080"
environment:
@@ -26,4 +26,4 @@ services:
- "/dev/neuron8"
- "/dev/neuron9"
- "/dev/neuron10"
- "/dev/neuron11"
- "/dev/neuron11"
@@ -2,7 +2,7 @@ version: '3.7'

services:
tgi-1:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8080:8080"
environment:
@@ -33,4 +33,3 @@ services:
- "/dev/neuron13"
- "/dev/neuron14"
- "/dev/neuron15"

@@ -2,7 +2,7 @@ version: '3.7'

services:
tgi-1:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8081:8081"
environment:
@@ -22,7 +22,7 @@ services:
- "/dev/neuron3"

tgi-2:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8082:8082"
environment:
@@ -42,7 +42,7 @@ services:
- "/dev/neuron7"

tgi-3:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8083:8083"
environment:

Large diffs are not rendered by default (3 files).

@@ -2,7 +2,7 @@ version: '3.7'

services:
tgi-1:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8081:8081"
environment:
@@ -22,7 +22,7 @@ services:
- "/dev/neuron3"

tgi-2:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8082:8082"
environment:
@@ -42,7 +42,7 @@ services:
- "/dev/neuron7"

tgi-3:
-    image: neuronx-tgi:latest
+    image: ghcr.io/huggingface/text-generation-inference:latest-neuron
ports:
- "8083:8083"
environment:
