Performance issue - High queue times in perf_analyzer #7986

Open
asaff1 opened this issue Feb 4, 2025 · 4 comments
Labels: performance (A possible performance tune-up), question (Further information is requested)

Comments


asaff1 commented Feb 4, 2025

I've used trtexec to improve a model's performance.
perf_analyzer shows that the infer compute time is very low (a few milliseconds), yet the queue and wait times are high (300ms). What is the reason for requests spending a long time in the queue? A detailed explanation here would be appreciated.
Ideally I want the request time to match the inference time. Any ideas?
I've tried playing with instance_groups with no success.

root@5d6049652465:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 72262
    Throughput: 4005.66 infer/sec
    Avg latency: 247602 usec (standard deviation 9319 usec)
    p50 latency: 245439 usec
    p90 latency: 255399 usec
    p95 latency: 282710 usec
    p99 latency: 345187 usec
    Avg gRPC time: 247578 usec ((un)marshal request/response 11 usec + response wait 247567 usec)
  Server:
    Inference count: 72262
    Execution count: 4518
    Successful request count: 72262
    Avg request latency: 247496 usec (overhead 844 usec + queue 242704 usec + compute input 2241 usec + compute infer 1613 usec + compute output 93 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 4005.66 infer/sec, latency 247602 usec

config.pbtxt:

platform: "tensorrt_plan"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 100
}
optimization {
  cuda { graphs: true }
}

model_warmup {
  batch_size: 1
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 2
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 3
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
model_warmup {
  batch_size: 4
  inputs {
    key: "input"
    value {
      data_type: TYPE_FP32
      dims: [3, 256, 256]
      zero_data: true
    }
  }
}
asaff1 changed the title from "high queue times in perf_analyzer" to "Performance issue - High queue times in perf_analyzer" on Feb 4, 2025
rmccorm4 added the question (Further information is requested) and performance (A possible performance tune-up) labels on Feb 5, 2025

rmccorm4 commented Feb 5, 2025

Hi @asaff1, the queue time is likely so high compared to compute times because the model config has defined a max batch size of 16, but is being hit by PA with a concurrency of 1000, leaving many requests to be queued while at most 16 requests at a time are executed.

Can you build an engine that supports a greater max batch size?
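For example, rebuilding the engine with a larger max batch dimension could look like the following (illustrative only; the ONNX path and exact shape values are placeholders, not taken from your setup):

trtexec --onnx=model.onnx --fp16 \
        --minShapes=input:1x3x256x256 \
        --optShapes=input:64x3x256x256 \
        --maxShapes=input:128x3x256x256 \
        --saveEngine=model_trt_fp16.plan

The model config's max_batch_size can then be raised up to the engine's maximum batch dimension.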

I've tried playing with instance_groups with no success.

Can you elaborate on this? What instance group configurations have you tried, and how did they affect the results?
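
For reference, an instance_group entry in config.pbtxt takes this form (the count of 4 here is only an illustration, not a recommendation):

instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]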


asaff1 commented Feb 6, 2025

@rmccorm4 I understand. Thanks.

The system has one RTX 4090. My goal is to reduce latency to below 10ms while handling 1000 requests per second. I've tried using PA with --request-rate 1000, but I get the warning

[WARNING] Perf Analyzer was not able to keep up with the desired request rate. 99.96% of the requests were delayed.

And then I see fewer infer/sec than when using --concurrency 1000.

  • What is better in this case: increasing model instances, or increasing the batch size?

I've tried increasing max_batch_size to 128 (and disabled preferred_batch_size), and latency is still high; perf_analyzer output below.
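The batching section of the config then looks roughly like this (sketch; the queue delay is assumed unchanged from the original config):

max_batch_size: 128
dynamic_batching {
  # preferred_batch_size removed; queue delay assumed unchanged
  max_queue_delay_microseconds: 100
}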

root@c13a7cbe8571:/opt/shlomo/benchmark_models# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 90978
    Throughput: 5044.3 infer/sec
    Avg latency: 197140 usec (standard deviation 11611 usec)
    p50 latency: 196110 usec
    p90 latency: 217432 usec
    p95 latency: 227722 usec
    p99 latency: 245582 usec
    Avg gRPC time: 197135 usec ((un)marshal request/response 5 usec + response wait 197130 usec)
  Server:
    Inference count: 90888
    Execution count: 712
    Successful request count: 90888
    Avg request latency: 197695 usec (overhead 3773 usec + queue 168707 usec + compute input 13714 usec + compute infer 11225 usec + compute output 275 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 5044.3 infer/sec, latency 197140 usec

I tried increasing the instance_group count to 8 and max_batch_size to 64, and did see some latency improvement, though still not at my goal; results below.
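The corresponding config changes look roughly like this (sketch):

max_batch_size: 64
instance_group [
  {
    # 8 model instances on the GPU, as described above
    count: 8
    kind: KIND_GPU
  }
]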

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000
  Client:
    Request count: 152714
    Throughput: 8454.02 infer/sec
    Avg latency: 117871 usec (standard deviation 6402 usec)
    p50 latency: 117517 usec
    p90 latency: 133338 usec
    p95 latency: 139219 usec
    p99 latency: 149502 usec
    Avg gRPC time: 117860 usec ((un)marshal request/response 5 usec + response wait 117855 usec)
  Server:
    Inference count: 152714
    Execution count: 2388
    Successful request count: 152714
    Avg request latency: 117684 usec (overhead 2016 usec + queue 94008 usec + compute input 8234 usec + compute infer 12207 usec + compute output 1218 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8454.02 infer/sec, latency 117871 usec

Then I tried max_batch_size = 64 and instance_group count = 16, which I assume should be able to handle 1024 requests concurrently (64 x 16 = 1024)? Yet the queue time is still high:

root@5f8b48be57d2:/opt/tritonserver# perf_analyzer -i grpc -m model_trt_fp16 --concurrency 1000 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1000

  Client:
    Request count: 155884
    Throughput: 8631.19 infer/sec
    Avg latency: 115285 usec (standard deviation 8749 usec)
    p50 latency: 115097 usec
    p90 latency: 132436 usec
    p95 latency: 138639 usec
    p99 latency: 151940 usec
    Avg gRPC time: 115276 usec ((un)marshal request/response 5 usec + response wait 115271 usec)
  Server:
    Inference count: 155881
    Execution count: 2438
    Successful request count: 155881
    Avg request latency: 114746 usec (overhead 2320 usec + queue 91500 usec + compute input 8227 usec + compute infer 11599 usec + compute output 1099 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1000, throughput: 8631.19 infer/sec, latency 115285 usec
  • How should I proceed?

  • Also, when dynamic_batching is enabled, what are the best values to give trtexec for --optShapes? Today I use:

--minShapes=input:1x3x256x256 --optShapes=input:16x3x256x256 --maxShapes=input:256x3x256x256

What performance impact does this have? In my application, requests arrive one by one.


rmccorm4 commented Feb 7, 2025

Hi @asaff1,

Have you tried Model Analyzer for finding an optimal model config (instance count, batching settings, etc.)?
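
A minimal profiling run looks roughly like this (sketch; the repository paths are placeholders):

model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models model_trt_fp16 \
    --output-model-repository-path /path/to/output_repository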


asaff1 commented Feb 18, 2025

@rmccorm4 Thanks. I've finally managed to play with Model Analyzer a bit. (For some reason, the 24.12 release didn't work and consumed all system memory, so I used release 24.01.)

I'm now running with a GPU instance count of 5 and batch size = 4 (found to be the best fit for my latency budget).
Using this optimal config with perf_analyzer, I see a big difference depending on whether or not I use --shared-memory system.

Without shared memory:

# perf_analyzer -i grpc -m model_trt_fp16 --request-rate 1500
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using synchronous calls for inference
  Stabilizing using average latency

Request Rate: 1500 inference requests per seconds
  Client:
    Request count: 21900
    Avg send request rate: 1216.35 infer/sec
    [WARNING] Perf Analyzer was not able to keep up with the desired request rate. 99.81% of the requests were delayed.
    Throughput: 1215.82 infer/sec
    Avg latency: 3286 usec (standard deviation 581 usec)
    p50 latency: 3253 usec
    p90 latency: 4038 usec
    p95 latency: 4237 usec
    p99 latency: 4765 usec
    Avg gRPC time: 3276 usec ((un)marshal request/response 51 usec + response wait 3225 usec)
  Server:
    Inference count: 21902
    Execution count: 20398
    Successful request count: 21902
    Avg request latency: 1381 usec (overhead 128 usec + queue 256 usec + compute input 129 usec + compute infer 783 usec + compute output 84 usec)

Inferences/Second vs. Client Average Batch Latency
Request Rate: 1500.00, throughput: 1215.82 infer/sec, latency 3286 usec

With --shared-memory system:

# perf_analyzer -i grpc -m model_trt_fp16 --request-rate 1500 --shared-memory system
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using uniform distribution on request generation
  Using synchronous calls for inference
  Stabilizing using average latency

Request Rate: 1500 inference requests per seconds
  Client:
    Request count: 27011
    Throughput: 1499.85 infer/sec
    Avg latency: 1197 usec (standard deviation 124 usec)
    p50 latency: 1202 usec
    p90 latency: 1270 usec
    p95 latency: 1297 usec
    p99 latency: 1362 usec
    Avg gRPC time: 1188 usec ((un)marshal request/response 5 usec + response wait 1183 usec)
  Server:
    Inference count: 27012
    Execution count: 26996
    Successful request count: 27012
    Avg request latency: 1050 usec (overhead 53 usec + queue 184 usec + compute input 101 usec + compute infer 655 usec + compute output 56 usec)

Inferences/Second vs. Client Average Batch Latency
Request Rate: 1500, throughput: 1499.85 infer/sec, latency 1197 usec

Also, when running the test multiple times, I see that the shared-memory measurement is much more stable.
When using plain gRPC (without shared memory), the latency fluctuates a lot between test runs.
I understand that shared memory involves fewer copies, but is it possible that gRPC is that slow at sending the image? Everything is running on the same PC. What are the options for getting shared-memory-like performance when running with multiple hosts (connected with fast Ethernet)?
