
Expensive & Volatile Triton Server latency #7739

Open
jadhosn opened this issue Oct 24, 2024 · 1 comment
Labels
performance A possible performance tune-up

Comments

jadhosn commented Oct 24, 2024

Description
A blank Triton Python model incurs anywhere between 11 ms and 20 ms of end-to-end latency per request, even when no internal processing happens. This overhead is expensive for applications that run under really tight latency SLAs (sub-100 ms per request). Note that the inner core of the model, the execute body, takes less than 0.5 ms to complete; see the code below.

In addition, the overhead is not consistent and almost looks cyclical (see the logs below).

Triton Information
24.04-py3

Are you using the Triton container or did you build it yourself? Using NGC's Triton container

To Reproduce
Run this snippet as-is; it is a stand-alone repro, and there are no additional config or model artifacts that come with it.

import time
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Single variable-length string input and output; max_batch_size 0 disables batching.
        auto_complete_model_config.add_input({"name": "INPUT",  "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.add_output({"name": "OUTPUT", "data_type": "TYPE_STRING", "dims": [-1]})
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def execute(self, requests):
        responses = []
        # __start = time.time()
        for request in requests:
            # Echo the input string tensor straight back as the output.
            in_numpy = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            out_numpy = np.array([in_numpy], dtype=np.object_)
            out_pb = pb_utils.Tensor("OUTPUT", out_numpy)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_pb]))
        # print(f"Elapsed Time: {(time.time() - __start)*1000}", flush=True)
        return responses
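
The snippet above is deployed as model.py under the standard Python-backend layout (<model-repository>/dummy/<version>/model.py). Before timing anything, the readiness endpoints can be used to confirm that server and model start-up are out of the way; a small sketch, assuming the default HTTP port:

import requests

BASE = "http://localhost:8000"

# Both endpoints return HTTP 200 once the server and the dummy model are up.
assert requests.get(f"{BASE}/v2/health/ready").status_code == 200
assert requests.get(f"{BASE}/v2/models/dummy/ready").status_code == 200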

For my local server, I run the following loop:

 for i in {1..10}; do time curl -X POST -k localhost:8000/v2/models/dummy/infer -d '{"inputs":[{"name":"INPUT","datatype":"BYTES","shape":[1],"data":["test"]}]}' && sleep 0.1; done

which returns:

{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.020s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.011s
user 0m0.003s
sys 0m0.004s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.019s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.016s
user 0m0.004s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.018s
user 0m0.005s
sys 0m0.007s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.004s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.017s
user 0m0.005s
sys 0m0.006s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.013s
user 0m0.003s
sys 0m0.005s
{"model_name":"dummy","model_version":"4","outputs":[{"name":"OUTPUT","datatype":"BYTES","shape":[1,1],"data":["test"]}]}
real 0m0.012s
user 0m0.003s
sys 0m0.004s
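
For reference, roughly the same numbers can be cross-checked over a single persistent HTTP connection, which keeps curl's per-process startup and connection setup out of each sample. A minimal sketch, assuming the requests package is installed and the same model name and port as above:

import time

import requests

URL = "http://localhost:8000/v2/models/dummy/infer"
PAYLOAD = {
    "inputs": [
        {"name": "INPUT", "datatype": "BYTES", "shape": [1], "data": ["test"]}
    ]
}

session = requests.Session()      # reuse a single TCP connection for all requests
session.post(URL, json=PAYLOAD)   # warm-up request, not timed

for _ in range(10):
    start = time.perf_counter()
    response = session.post(URL, json=PAYLOAD)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"HTTP {response.status_code}, end-to-end: {elapsed_ms:.2f} ms")
    time.sleep(0.1)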

Expected behavior
Given that the inner model takes less than 0.5 ms to run (uncomment the timing lines above to verify):

  1. Why is there an additional ~10 ms of overhead per request? (I understand this differs between machines, but the lowest I've seen is 7 ms per request.)
  2. Why is the overhead volatile, peaking at 19 ms in this case (disregarding warm-up)?

Given a really tight SLA, shaving off even 1 ms of latency matters to us.
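
One way to break that overhead down is Triton's per-model statistics endpoint, which reports cumulative queue and compute times. A rough sketch, assuming the default HTTP port and the field names of the statistics extension:

import requests

stats = requests.get("http://localhost:8000/v2/models/dummy/stats").json()
inference_stats = stats["model_stats"][0]["inference_stats"]

count = inference_stats["success"]["count"]
for stage in ("queue", "compute_input", "compute_infer", "compute_output"):
    avg_ms = inference_stats[stage]["ns"] / count / 1e6 if count else 0.0
    print(f"{stage:>14}: {avg_ms:.3f} ms average over {count} successful requests")

Anything not accounted for by these stages would be time spent outside the backend (HTTP handling, request/response serialization, and so on).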

nnshah1 (Contributor) commented Oct 24, 2024

@statiraju for viz

nnshah1 added the performance (A possible performance tune-up) label on Oct 24, 2024