
Testing TensorRT-LLM backend

Tests in this CI directory can be run manually to provide extensive testing.

Run QA Tests

Run the testing within the Triton container.

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/opt/tritonserver/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 bash

# Change directory to the test and run the test.sh script
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash -x ./test.sh

Run the e2e/benchmark_core_model to benchmark

These two tests are run as part of the L0_backend_trtllm test. Below are instructions to run them manually.

Generate the model repository

Follow the instructions in the Create the model repository section to prepare the model repository.

Modify the model configuration

Follow the instructions in the Modify the model configuration section to modify the model configuration based on the needs.

End to end test

End to end test script sends requests to the deployed ensemble model.

The ensemble model is composed of three models: preprocessing, tensorrt_llm, and postprocessing:

  • "preprocessing": This model tokenizes the input, converting prompts (string) to input_ids (list of ints).
  • "tensorrt_llm": This model is a wrapper around your TensorRT-LLM model and is used for inference.
  • "postprocessing": This model de-tokenizes the output, converting output_ids (list of ints) back to outputs (string).

The end-to-end latency is the total latency across all three parts of the ensemble model.
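Conceptually, the pre- and post-processing steps are a tokenization round trip. The sketch below illustrates the idea with a toy vocabulary; the real preprocessing/postprocessing models use a full tokenizer (e.g. a HuggingFace tokenizer), so the names and vocabulary here are stand-ins, not the actual implementation.

```python
# Illustrative sketch only: a toy tokenizer showing what the
# preprocessing and postprocessing models do conceptually.
vocab = {"hello": 1, "world": 2, "<unk>": 0}
inv_vocab = {v: k for k, v in vocab.items()}

def preprocess(prompt: str) -> list[int]:
    """Tokenize: prompt (string) -> input_ids (list of ints)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in prompt.split()]

def postprocess(output_ids: list[int]) -> str:
    """De-tokenize: output_ids (list of ints) -> output (string)."""
    return " ".join(inv_vocab[i] for i in output_ids)

ids = preprocess("hello world")
print(ids)               # [1, 2]
print(postprocess(ids))  # hello world
```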

cd tools/inflight_batcher_llm
python3 end_to_end_test.py --dataset <dataset path>

Expected outputs

[INFO] Functionality test succeed.
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 11099.243 ms

benchmark_core_model

The benchmark_core_model script sends requests directly to the deployed tensorrt_llm model. Its reported latency therefore reflects only the inference latency of TensorRT-LLM, excluding the pre/post-processing latency, which is usually handled by a third-party library such as HuggingFace.

cd tools/inflight_batcher_llm
python3 benchmark_core_model.py dataset --dataset <dataset path>

Expected outputs

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 125 prompts.
[INFO] Total Latency: 10213.462 ms
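Comparing the two reference runs above gives a rough sense of the pre/post-processing overhead. The arithmetic below uses only the example numbers printed above (125 prompts, 11099.243 ms end-to-end, 10213.462 ms core); your own numbers will differ by GPU.

```python
# Back-of-the-envelope comparison of the two reference runs above.
num_prompts = 125
e2e_total_ms = 11099.243   # end_to_end_test.py total latency
core_total_ms = 10213.462  # benchmark_core_model.py total latency

e2e_per_prompt = e2e_total_ms / num_prompts        # ~88.79 ms/prompt
core_per_prompt = core_total_ms / num_prompts      # ~81.71 ms/prompt
# Difference attributable to pre/post-processing in the ensemble:
overhead_per_prompt = e2e_per_prompt - core_per_prompt  # ~7.09 ms/prompt
print(round(overhead_per_prompt, 2))
```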

Please note that the expected outputs in this document are for reference only; actual performance numbers depend on the GPU you are using.