Description
A blank Triton Python model incurs anywhere between 11ms and 20ms of overhead per request, even when no internal processing happens. This overhead is expensive for applications that run under really tight latency SLAs (sub-100ms per request). Note that the inner core of the model handler completes in under 0.5ms. See the code below.
In addition, the overhead is not consistent and almost looks cyclical (see the logs below).
Triton Information
24.04-py3
Are you using the Triton container or did you build it yourself? Using NGC's Triton container
To Reproduce
Run this snippet as-is. It's a stand-alone repro; there are no additional config files or model artifacts.
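The repro snippet itself was not captured in this copy of the report. A "blank" Python-backend model of the kind described would look roughly like the sketch below — this is not the author's exact code; the model/output names and the BYTES [1,1] shape are inferred from the response logs, and it only runs inside the Triton Python backend runtime:

```python
# model.py -- minimal "blank" Triton Python-backend model (illustrative sketch).
# Requires the Triton Python backend runtime; it cannot run standalone.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # No real work: echo a constant BYTES tensor, matching the
        # OUTPUT shape [1,1] seen in the response logs.
        responses = []
        for _ in requests:
            out = pb_utils.Tensor("OUTPUT", np.array([["test"]], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

For my local server, I run a request loop against this model (timed with `time`), which returns: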
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.020s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.011s
user 0m0.003s
sys 0m0.004s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.019s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.016s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.018s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.004s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.005s
sys 0m0.006s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.003s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.012s
user 0m0.003s
sys 0m0.004s
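For reference, the per-request wall-clock times from the run above can be summarized with a short script (values transcribed from the `real` lines in the log; a quick sanity check of the 11ms–20ms claim):

```python
# Per-request wall-clock times transcribed from the `real` lines above, in seconds.
times = [0.020, 0.017, 0.011, 0.019, 0.016, 0.018, 0.013, 0.017, 0.013, 0.012]

min_ms = min(times) * 1000
max_ms = max(times) * 1000
mean_ms = sum(times) / len(times) * 1000

print(f"min={min_ms:.0f}ms max={max_ms:.0f}ms mean={mean_ms:.1f}ms")
# → min=11ms max=20ms mean=15.6ms
```

So even excluding warm-up, requests against an effectively empty model span 11ms to 20ms, averaging about 15.6ms — roughly 30x the sub-0.5ms inner execution time.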
Expected behavior
If the inner model takes less than 0.5ms to run (uncomment the time lines to verify):
Why is there an additional ~10ms of overhead per request? (I understand this differs between machines, but the lowest I've seen is 7ms per request.)
Why is the overhead volatile, peaking at 19ms in this case (disregarding warm-up)?
Given a really tight SLA, every 1ms we can save matters.