
Failed to launch triton-server: "error: creating server: Internal - failed to load all models" #7950

Open
pzydzh opened this issue Jan 17, 2025 · 2 comments
Labels
module: backends Issues related to the backends

Comments

pzydzh commented Jan 17, 2025

Environment

CPU architecture: x86_64
CPU/Host memory size: 1.0Ti
GPU compute capability: 8.0
GPU name: NVIDIA A800-SXM4-80GB
GPU memory size: 81920 MiB
Clock frequencies used: 210 MHz

Libraries

TensorRT-LLM: v0.10.0
CUDA: 12.3
Container used: 24.03-trtllm-python-py3
NVIDIA driver version: 535.183.06
OS : Ubuntu 22.04

Reproduction Steps

  1. Build a single-GPU float16 engine from HF weights.
    docker image: nvcr.io/nvidia/tensorrt:24.12-py3
python3 convert_checkpoint.py --model_dir /path/to/Qwen2-7B-Instruct \
                              --output_dir /path/to/trt_llm_model/tllm_checkpoint_1gpu_qwen2_7b \
                              --dtype float16
                              
trtllm-build --checkpoint_dir /path/to/trt_llm_model/tllm_checkpoint_1gpu_qwen2_7b \
                --output_dir /path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/ \
                --gemm_plugin float16
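(As a quick sanity check, assuming the paths above, the build output directory should now contain the serialized engine and its config, for example:)

ls /path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/
# expected to contain something like: config.json  rank0.engine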

  2. Run (successful)

python3 ../run.py --input_text "你好,你是谁?" \
                  --max_output_len=100 \
                  --tokenizer_dir=/path/to/Qwen2-7B-Instruct \
                  --engine_dir=/path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/
  3. Launch triton-server
cd /
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd /tensorrtllm_backend
cp /path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

export HF_LLAMA_MODEL=/path/to/Qwen2-7B-Instruct
export ENGINE_PATH=/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:1,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:1,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:1,decoupled_mode:True,repetition_penalty:0.9,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:1,decoupled_mode:False,decoding_mode:top_p,enable_chunked_context:True,batch_scheduler_policy:max_utilization,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.9,exclude_input_in_output:True,enable_kv_cache_reuse:True,batching_strategy:v1,enable_trt_overlap:True,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1

Error log

I0117 09:50:01.229588 4096 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fadda000000' with size 268435456
I0117 09:50:01.232842 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0117 09:50:01.232848 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0117 09:50:01.232851 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0117 09:50:01.232853 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
I0117 09:50:01.232856 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 4 with size 67108864
I0117 09:50:01.232858 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 5 with size 67108864
I0117 09:50:01.232861 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 6 with size 67108864
I0117 09:50:01.232863 4096 cuda_memory_manager.cc:107] CUDA memory pool is created on device 7 with size 67108864
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 263:16: Expected integer or identifier, got: $
E0117 09:50:01.977892 4096 model_repository_manager.cc:1335] Poll failed for model directory 'ensemble': failed to read text proto from all_models/inflight_batcher_llm/ensemble/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 51:16: Expected integer or identifier, got: $
E0117 09:50:01.978300 4096 model_repository_manager.cc:1335] Poll failed for model directory 'tensorrt_llm': failed to read text proto from all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 311:16: Expected integer or identifier, got: $
E0117 09:50:01.978485 4096 model_repository_manager.cc:1335] Poll failed for model directory 'tensorrt_llm_bls': failed to read text proto from all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
I0117 09:50:01.978537 4096 model_lifecycle.cc:469] loading: preprocessing:1
I0117 09:50:01.978558 4096 model_lifecycle.cc:469] loading: postprocessing:1
I0117 09:50:01.987773 4096 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0117 09:50:01.987866 4096 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I0117 09:50:04.022585 4096 model_lifecycle.cc:835] successfully loaded 'postprocessing'
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
I0117 09:50:04.888933 4096 model_lifecycle.cc:835] successfully loaded 'preprocessing'
I0117 09:50:04.889011 4096 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0117 09:50:04.889050 4096 server.cc:634]
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                  |
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-re |
|         |                                                       | gion-prefix-name":"prefix0_","default-max-batch-size":"4"}}                                                                             |
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+

I0117 09:50:04.889072 4096 server.cc:677]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
+----------------+---------+--------+

I0117 09:50:04.978006 4096 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978028 4096 metrics.cc:877] Collecting metrics for GPU 1: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978034 4096 metrics.cc:877] Collecting metrics for GPU 2: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978039 4096 metrics.cc:877] Collecting metrics for GPU 3: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978043 4096 metrics.cc:877] Collecting metrics for GPU 4: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978048 4096 metrics.cc:877] Collecting metrics for GPU 5: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978052 4096 metrics.cc:877] Collecting metrics for GPU 6: NVIDIA A800-SXM4-80GB
I0117 09:50:04.978057 4096 metrics.cc:877] Collecting metrics for GPU 7: NVIDIA A800-SXM4-80GB
I0117 09:50:05.015581 4096 metrics.cc:770] Collecting CPU metrics
I0117 09:50:05.015715 4096 tritonserver.cc:2538]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                 |
| server_version                   | 2.44.0                                                                                                                                                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor |
|                                  | _data parameters statistics trace logging                                                                                                                              |
| model_repository_path[0]         | all_models/inflight_batcher_llm                                                                                                                                        |
| model_control_mode               | MODE_NONE                                                                                                                                                              |
| strict_model_config              | 1                                                                                                                                                                      |
| rate_limit                       | OFF                                                                                                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                              |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{4}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{5}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{6}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{7}    | 67108864                                                                                                                                                               |
| min_supported_compute_capability | 6.0                                                                                                                                                                    |
| strict_readiness                 | 1                                                                                                                                                                      |
| exit_timeout                     | 30                                                                                                                                                                     |
| cache_enabled                    | 0                                                                                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0117 09:50:05.015722 4096 server.cc:307] Waiting for in-flight requests to complete.
I0117 09:50:05.015727 4096 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0117 09:50:05.015890 4096 server.cc:338] All models are stopped, unloading models
I0117 09:50:05.015896 4096 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0117 09:50:06.015958 4096 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
I0117 09:50:06.605298 4096 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
I0117 09:50:06.749957 4096 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
I0117 09:50:07.016032 4096 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[12193,1],0]
  Exit code:    1
--------------------------------------------------------------------------

rmccorm4 commented Jan 17, 2025

Hi @pzydzh, from the server logs, it looks like there are some template values in the config.pbtxt files that were missed by the fill_template.py commands you ran:

E0117 09:50:01.977892 4096 model_repository_manager.cc:1335] Poll failed for model directory 'ensemble': failed to read text proto from all_models/inflight_batcher_llm/ensemble/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 51:16: Expected integer or identifier, got: $
E0117 09:50:01.978300 4096 model_repository_manager.cc:1335] Poll failed for model directory 'tensorrt_llm': failed to read text proto from all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 311:16: Expected integer or identifier, got: $
E0117 09:50:01.978485 4096 model_repository_manager.cc:1335] Poll failed for model directory 'tensorrt_llm_bls': failed to read text proto from all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt

You can look through those files manually and search for the $ symbol to find any remaining placeholders.
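For example (just a quick check, not part of the official tooling), something like this should list any placeholders that fill_template.py left unfilled:

grep -n '\${' all_models/inflight_batcher_llm/*/config.pbtxt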


Something that may help avoid this issue: if you're running the 24.03 version of tritonserver / 0.10.0 trtllm, make sure to clone the matching versions of any git repositories so everything lines up, e.g. git clone -b v0.10.0 https://github.com/triton-inference-server/tensorrtllm_backend.git.

Also, I noticed you mentioned 24.03 for the Triton version but 24.12 for the TensorRT version. You generally want to stick to matching versions everywhere for the best support.

You can try following the latest README quickstart steps here as a sanity check: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main?tab=readme-ov-file#quick-start

CC @krishung5 @schetlur-nv

rmccorm4 added the module: backends label on Jan 17, 2025

pzydzh commented Jan 20, 2025


Thank you for responding.

I have tried downloading the latest image to deploy the model.

Libraries
TensorRT-LLM: 0.16.0 ---> git clone -b v0.16.0 https://github.com/NVIDIA/TensorRT-LLM.git
tensorrtllm_backend: 0.16.0 ---> git clone -b v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
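(To confirm which TensorRT-LLM build is actually installed inside the container, a quick check, assuming the pip package is present:)

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"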

Steps

  1. Convert the checkpoint and build the engine using image nvcr.io/nvidia/tensorrt:24.12-py3 (the same as before)
python3 convert_checkpoint.py --model_dir /path/to/Qwen2-7B-Instruct \
                              --output_dir /path/to/trt_llm_model/tllm_checkpoint_1gpu_qwen2_7b \
                              --dtype float16
                              
trtllm-build --checkpoint_dir /path/to/trt_llm_model/tllm_checkpoint_1gpu_qwen2_7b \
                --output_dir /path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/ \
                --gemm_plugin float16
  2. Launch triton-server using image nvcr.io/nvidia/tritonserver:24.12-py3 ("the same version of TensorRT-LLM backend as the version of TensorRT-LLM in the container")
cd /
git clone -b v0.16.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd /tensorrtllm_backend
git submodule update --init --recursive
git lfs install
git lfs pull
pip install -r requirements.txt

cp /path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

ENGINE_DIR=/path/to/trt_llm_model/tmp/Qwen/7B/trt_engines/fp16/1-gpu
TOKENIZER_DIR=/path/to/Qwen2-7B-Instruct
MODEL_FOLDER=/tensorrtllm_backend/all_models/inflight_batcher_llm
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}

# launch triton-server
python3 scripts/launch_triton_server.py --model_repo=all_models/inflight_batcher_llm --world_size 1

But I still got the error:

I0120 08:28:08.584441 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0120 08:28:08.584452 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0120 08:28:08.584456 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0120 08:28:08.584460 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0120 08:28:08.584464 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 4 with size 67108864"
I0120 08:28:08.584468 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 5 with size 67108864"
I0120 08:28:08.584472 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 6 with size 67108864"
I0120 08:28:08.584476 2049 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 7 with size 67108864"
I0120 08:28:09.425338 2049 model_lifecycle.cc:473] "loading: postprocessing:1"
I0120 08:28:09.425397 2049 model_lifecycle.cc:473] "loading: preprocessing:1"
I0120 08:28:09.425458 2049 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I0120 08:28:09.425498 2049 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
E0120 08:28:09.425599 2049 model_lifecycle.cc:654] "failed to load 'tensorrt_llm' version 1: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration."
I0120 08:28:09.425626 2049 model_lifecycle.cc:789] "failed to load 'tensorrt_llm'"
I0120 08:28:09.437259 2049 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I0120 08:28:09.440795 2049 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0120 08:28:09.441372 2049 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0120 08:28:11.433546 2049 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I0120 08:28:12.280111 2049 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][WARNING] 'max_num_images' parameter is not set correctly (value is ${max_num_images}). Will be set to None
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
I0120 08:28:13.364362 2049 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
E0120 08:28:13.364430 2049 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration.;"
I0120 08:28:13.364498 2049 server.cc:604]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0120 08:28:13.364521 2049 server.cc:631]
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                                                                  |
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-re |
|         |                                                       | gion-prefix-name":"prefix0_","default-max-batch-size":"4"}}                                                                             |
+---------+-------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+

I0120 08:28:13.364572 2049 server.cc:674]
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                      |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                       |
| preprocessing    | 1       | READY                                                                                                                                       |
| tensorrt_llm     | 1       | UNAVAILABLE: Invalid argument: unable to find backend library for backend 'tensorrtllm', try specifying runtime on the model configuration. |
| tensorrt_llm_bls | 1       | READY                                                                                                                                       |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+

I0120 08:28:13.466631 2049 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466663 2049 metrics.cc:890] "Collecting metrics for GPU 1: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466671 2049 metrics.cc:890] "Collecting metrics for GPU 2: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466678 2049 metrics.cc:890] "Collecting metrics for GPU 3: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466686 2049 metrics.cc:890] "Collecting metrics for GPU 4: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466693 2049 metrics.cc:890] "Collecting metrics for GPU 5: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466699 2049 metrics.cc:890] "Collecting metrics for GPU 6: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.466706 2049 metrics.cc:890] "Collecting metrics for GPU 7: NVIDIA A800-SXM4-80GB"
I0120 08:28:13.507846 2049 metrics.cc:783] "Collecting CPU metrics"
I0120 08:28:13.507967 2049 tritonserver.cc:2598]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                 |
| server_version                   | 2.53.0                                                                                                                                                                 |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor |
|                                  | _data parameters statistics trace logging                                                                                                                              |
| model_repository_path[0]         | all_models/inflight_batcher_llm                                                                                                                                        |
| model_control_mode               | MODE_NONE                                                                                                                                                              |
| strict_model_config              | 1                                                                                                                                                                      |
| model_config_name                |                                                                                                                                                                        |
| rate_limit                       | OFF                                                                                                                                                                    |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                              |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{4}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{5}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{6}    | 67108864                                                                                                                                                               |
| cuda_memory_pool_byte_size{7}    | 67108864                                                                                                                                                               |
| min_supported_compute_capability | 6.0                                                                                                                                                                    |
| strict_readiness                 | 1                                                                                                                                                                      |
| exit_timeout                     | 30                                                                                                                                                                     |
| cache_enabled                    | 0                                                                                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0120 08:28:13.508038 2049 server.cc:305] "Waiting for in-flight requests to complete."
I0120 08:28:13.508048 2049 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0120 08:28:13.508600 2049 server.cc:336] "All models are stopped, unloading models"
I0120 08:28:13.508609 2049 server.cc:345] "Timeout 30: Found 3 live models and 0 in-flight non-inference requests"
I0120 08:28:14.508675 2049 server.cc:345] "Timeout 29: Found 3 live models and 0 in-flight non-inference requests"
Cleaning up...
Cleaning up...
Cleaning up...
I0120 08:28:15.120311 2049 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm_bls' version 1"
I0120 08:28:15.169481 2049 model_lifecycle.cc:636] "successfully unloaded 'postprocessing' version 1"
I0120 08:28:15.472158 2049 model_lifecycle.cc:636] "successfully unloaded 'preprocessing' version 1"
I0120 08:28:15.508751 2049 server.cc:345] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31871,1],0]
  Exit code:    1
--------------------------------------------------------------------------

I noticed the error message "failed to load 'tensorrt_llm' version 1". I have already installed tensorrt_llm, and it shows up in pip. Why does loading still fail?
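(A check like the following, assuming the standard Triton layout, should show whether the 'tensorrtllm' backend shared library is actually present in the container; the backend library is separate from the tensorrt_llm pip package:)

ls /opt/tritonserver/backends/
ls /opt/tritonserver/backends/tensorrtllm/   # should contain libtriton_tensorrtllm.so if the backend is installed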

Thank you.
