
Decoupled Async Execute #7062

Merged: 8 commits merged into main from the jacky-py-aio branch on Apr 11, 2024

Conversation

@kthui (Contributor) commented on Apr 1, 2024

Related PR: triton-inference-server/python_backend#350

Tests for the Python backend decoupled async execute function. Covered cases (an illustrative model sketch follows this list):

  • An async execute function may start multiple coroutines, and they run concurrently.
  • Multiple async execute coroutines run concurrently.
  • An async execute coroutine that starts late but finishes early responds before another that starts early but finishes late.
  • A coroutine may raise an exception, and the exception is logged.
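
For context, a minimal sketch of what a decoupled async execute model could look like is shown below. It is an illustration only, not the actual test model from this PR: the tensor names ("WAIT_SECONDS", "DUMMY_OUT") and the sleep-based delay are assumptions, and it relies on the async execute support added in triton-inference-server/python_backend#350.

```python
# model.py -- illustrative sketch only, not the test model from this PR.
import asyncio

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        # Handle every request in its own coroutine so that a slow request
        # does not block a fast one (decoupled transaction policy).
        await asyncio.gather(*[self._respond(r) for r in requests])
        # Decoupled models return None; responses flow through the sender.
        return None

    async def _respond(self, request):
        sender = request.get_response_sender()
        # Hypothetical "WAIT_SECONDS" input controlling how long to wait.
        wait = pb_utils.get_input_tensor_by_name(request, "WAIT_SECONDS")
        await asyncio.sleep(float(wait.as_numpy().flatten()[0]))
        out = pb_utils.Tensor("DUMMY_OUT", np.array([1], dtype=np.float32))
        sender.send(
            pb_utils.InferenceResponse(output_tensors=[out]),
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
        )
```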

@kthui marked this pull request as ready for review on Apr 3, 2024 at 18:28
@Tabrizian (Member) left a comment:

Nice testing. Could you also profile this model with perf analyzer with high concurrencies just to make sure everything works fine?

@kthui (Contributor, Author) commented on Apr 8, 2024

> Nice testing. Could you also profile this model with perf analyzer with high concurrencies just to make sure everything works fine?

Yes, the perf_analyzer run on the "async_execute_decouple" model works great:

# perf_analyzer -i grpc -m async_execute_decouple --async --streaming true -b 8 --concurrency-range 128:128 --input-data zero
*** Measurement Settings ***
  Batch size: 8
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Latency limit: 0 msec
  Concurrency limit: 128 concurrent requests
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 128
  Client: 
    Request count: 44624
    Throughput: 19743.8 infer/sec
    Response Throughput: 2467.97 infer/sec
    Avg latency: 51762 usec (standard deviation 4416 usec)
    p50 latency: 53131 usec
    p90 latency: 54035 usec
    p95 latency: 54909 usec
    p99 latency: 55042 usec
    
  Server: 
    Inference count: 357016
    Execution count: 44627
    Successful request count: 44627
    Avg request latency: 49176 usec (overhead 1 usec + queue 48797 usec + compute input 25 usec + compute infer 349 usec + compute output 3 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 19743.8 infer/sec, latency 51762 usec
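
For anyone driving the model outside of perf_analyzer, a rough client-side sketch using tritonclient's gRPC streaming API follows; the model name matches "async_execute_decouple" from the run above, but the tensor name, shape, and datatype are assumptions carried over from the model sketch earlier in this thread.

```python
# Rough tritonclient gRPC streaming sketch; tensor details are assumptions.
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Every decoupled response (or error) arrives through this callback.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=callback)

inp = grpcclient.InferInput("WAIT_SECONDS", [1], "FP32")
inp.set_data_from_numpy(np.array([0.1], dtype=np.float32))
client.async_stream_infer(model_name="async_execute_decouple", inputs=[inp])

print(responses.get())  # the single (final) response from the model
client.stop_stream()
client.close()
```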

@kthui merged commit f1a515d into main on Apr 11, 2024
3 checks passed
@kthui deleted the jacky-py-aio branch on Apr 11, 2024 at 17:55
kthui added a commit that referenced this pull request Apr 11, 2024
* Add async execute decoupled test

* Add decoupled bls async exec test

* Enhance test with different durations for concurrent executes
mc-nv pushed a commit that referenced this pull request Apr 11, 2024
* Add async execute decoupled test

* Add decoupled bls async exec test

* Enhance test with different durations for concurrent executes