
Expensive & Volatile Triton Server latency #7739

Open
jadhosn opened this issue Oct 24, 2024 · 1 comment
Labels
performance A possible performance tune-up

Comments

jadhosn commented Oct 24, 2024

Description
A blank Triton Python model incurs anywhere between 11 ms and 20 ms of end-to-end latency per request, even when no internal processing happens. This overhead is expensive for applications that run under really tight latency SLAs (sub-100 ms per request). Note that the inner core of the model, the execute body, takes less than 0.5 ms to complete; see the code below.

In addition, the overhead is not consistent and almost looks cyclical (see the logs below).

Triton Information
24.04-py3

Are you using the Triton container or did you build it yourself? Using NGC's Triton container

To Reproduce
Run this snippet as-is; it is a stand-alone repro, and there are no additional config or model artifacts that come with it.

import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Single variable-length string input and output; max_batch_size 0 disables batching.
        auto_complete_model_config.add_input({"name": "INPUT",  "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.add_output({"name": "OUTPUT", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def execute(self, requests):
        responses = []
        # __start = time.time()
        for request in requests:
            # Echo the input string tensor straight back as the output.
            in_numpy = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            out_numpy = np.array([in_numpy], dtype=np.object_)
            out_pb = pb_utils.Tensor("OUTPUT", out_numpy)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_pb]))
        # print(f"Elapsed Time: {(time.time() - __start)*1000}", flush=True)
        return responses
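
The snippet above is deployed as model.py under the standard Python-backend layout (<model-repository>/dummy/<version>/model.py). Before timing anything, the readiness endpoints can be used to confirm that server and model start-up are out of the way; a small sketch, assuming the default HTTP port:

import requests

BASE = "http://localhost:8000"

# Both endpoints return HTTP 200 once the server and the dummy model are up.
assert requests.get(f"{BASE}/v2/health/ready").status_code == 200
assert requests.get(f"{BASE}/v2/models/dummy/ready").status_code == 200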

For my local server, I run the following loop:

 for i in {1..10}; do time curl -X POST -k localhost:8000/v2/models/dummy/infer -d '{"inputs":[{"name":"INPUT","datatype":"BYTES","shape":[1],"data":["test"]}]}' && sleep 0.1; done

which returns:

{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.020s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.011s
user 0m0.003s
sys 0m0.004s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.019s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.016s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.018s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.004s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.005s
sys 0m0.006s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.003s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.012s
user 0m0.003s
sys 0m0.004s
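
For reference, roughly the same numbers can be cross-checked over a single persistent HTTP connection, which keeps curl's per-process startup and connection setup out of each sample. A minimal sketch, assuming the requests package is installed and the same model name and port as above:

import time

import requests

URL = "http://localhost:8000/v2/models/dummy/infer"
PAYLOAD = {
    "inputs": [
        {"name": "INPUT", "datatype": "BYTES", "shape": [1], "data": ["test"]}
    ]
}

session = requests.Session()      # reuse a single TCP connection for all requests
session.post(URL, json=PAYLOAD)   # warm-up request, not timed

for _ in range(10):
    start = time.perf_counter()
    response = session.post(URL, json=PAYLOAD)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"HTTP {response.status_code}, end-to-end: {elapsed_ms:.2f} ms")
    time.sleep(0.1)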

Expected behavior
Given that the inner model takes less than 0.5 ms to run (uncomment the timing lines above to verify):

  1. Why is there an additional ~10 ms of overhead per request? (I understand this differs between machines, but the lowest I've seen is 7 ms per request.)
  2. Why is the overhead volatile, peaking at 19 ms in this case (disregarding warm-up)?

Given a really tight SLA, shaving off even 1 ms of latency matters to us.
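
One way to break that overhead down is Triton's per-model statistics endpoint, which reports cumulative queue and compute times. A rough sketch, assuming the default HTTP port and the field names of the statistics extension:

import requests

stats = requests.get("http://localhost:8000/v2/models/dummy/stats").json()
inference_stats = stats["model_stats"][0]["inference_stats"]

count = inference_stats["success"]["count"]
for stage in ("queue", "compute_input", "compute_infer", "compute_output"):
    avg_ms = inference_stats[stage]["ns"] / count / 1e6 if count else 0.0
    print(f"{stage:>14}: {avg_ms:.3f} ms average over {count} successful requests")

Anything not accounted for by these stages would be time spent outside the backend (HTTP handling, request/response serialization, and so on).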

nnshah1 (Contributor) commented Oct 24, 2024

@statiraju for viz

nnshah1 added the performance (A possible performance tune-up) label on Oct 24, 2024