perf: Python binding inference performance improvement #426
base: main
Conversation
* fix request tensor lifecycle
* remove prints for benchmarking
* remove one h2h response copy
* add allocator delete
* schedule future.set_result to be called in event loop
* Use tensor object in output
* Enable infer() to use the improved binding
* Pass the C pointer directly to Python
* Move output device if differ from request setting
* Copyright and pre-commit
300b0b6 to 131078a
* Fix py_future object lifecycle
* Fix request released after complete final
python/test/test_api.py (Outdated)
@@ -137,6 +137,7 @@ def test_memory_fallback_to_cpu(self, server_options):
        tritonserver.default_memory_allocators[tritonserver.MemoryType.GPU] = allocator
should this be removed - or changed in some way to indicate the allocator is internal ....
yes, updated the test to indicate the allocator is internal, and it will always use CPU memory regardless of the backend memory preference.
python/test/test_api.py (Outdated)
@@ -164,6 +165,7 @@ def test_memory_allocator_exception(self, server_options):
        ):
            pass

    @pytest.mark.skip(reason="Skipping test, infer no longer use allocator")
we should keep this but refactor it - if the user requests output memory type GPU and that is not supported by the internal allocator - we would still want to raise an exception during inference
yes, we should raise an exception if the output memory type specified on the request is not supported, but currently the bindings do not accept a requested output memory type, so I think we can skip this test for now and add proper testing after adding support for allocating GPU memory.
hmm, but what if we fail because, say, cupy is not available? Can we still make sure the right error gets propagated?
yes, the "test_unsupported_memory_type" test is repurposed for testing moving outputs to an unsupported memory type
python/test/test_api.py (Outdated)
@@ -418,6 +420,9 @@ def test_ready(self, server_options):
        server = tritonserver.Server(server_options).start()
        assert server.ready()

    @pytest.mark.skip(
        reason="Skipping test, some request/response object may not be released which may cause server stop to fail"
Can we use xfail instead of skip?
@rmccorm4 - what do you think - as we had spent time on fixing this - how much of an issue is this?
yes, switched to xfail.
* Indicate allocator is internal on test_memory_fallback_to_cpu()
* Remove test_memory_allocator_exception() as infer no longer use custom allocators
* Update reason for skipping test_unsupported_memory_type()
* Mark test_stop() with xfail instead of skip
    TRITONSERVER_MemoryType* actual_memory_type,
    int64_t* actual_memory_type_id)
{
  *buffer = malloc(byte_size * sizeof(uint8_t));
are there tritonserver apis we should be using for memory allocation instead of directly calling malloc?
I don't think there is one, because there are multiple calls to malloc in the core code base for different purposes, e.g.:
https://github.com/triton-inference-server/core/blob/r24.12/src/backend_memory_manager.cc#L98
https://github.com/triton-inference-server/core/blob/r24.12/src/backend_model_instance.cc#L73
https://github.com/triton-inference-server/core/blob/r24.12/src/cache_manager.cc#L95
There is one in the server repo, but it is out of reach from the core binding:
https://github.com/triton-inference-server/server/blob/r24.12/src/memory_alloc.cc#L96
@GuanLuo to double check - should we be using the backend_memory_manager ? Reason I ask is that these / and the server already do support the different types of memory gpu / cpu / cpu pinned - and we would want that kind of support here if we can get it - and we should reuse it ...
What is in the server certainly won't work; it is tightly tied to the server frontend for optimization, which is why the allocator is left for the user to define.
You probably could reuse the one from the backend, but that is kind of strange, and the one in the backend may not be as well crafted as you expect. Please evaluate whether it makes sense to proceed in this way.
Let's make that a todo as we bring in support for more memory types - we should look to see if we can reuse a common memory allocator in the core - of the ones, backend_memory_manager seems the closest - but agree it is odd to reuse here directly.
  *actual_memory_type = TRITONSERVER_MEMORY_CPU;
  *actual_memory_type_id = 0;
  // The response allocator needs to be kept alive until the allocated memory
  // is released via the release function.
is this true? - question: why not an allocator singleton for cpu - and reuse it for all requests?
yes, the core will need the allocator opaque object for finding the set release function. The allocator opaque object is basically this class that stores the release function pointer.
yes, updated to use a singleton allocator object for all instances of the request wrapper.
    int64_t memory_type_id)
{
  free(buffer);
  // Release ownership of the response allocator.
thinking a singleton could be simpler and avoid alloc / dealloc of the allocator object itself - if possible.
yes, updated to use a singleton response allocator.
  // Release ownership of the response allocator.
  std::unique_ptr<std::shared_ptr<struct TRITONSERVER_ResponseAllocator>>
      allocator_ptr(reinterpret_cast<
                    std::shared_ptr<struct TRITONSERVER_ResponseAllocator>*>(
actually - what do we use the buffer_userp for?
it was used to store a shared pointer to the response allocator object and was passed to the release function, so the reference count could be incremented on each allocation and decremented on each free; this ensured the allocator is destructed only after the last allocated response buffer is freed.
but it is no longer necessary with the singleton allocator.
  if (response_future_.get() != nullptr) {
    throw AlreadyExistsError("cannot call GetNextResponse concurrently");
  }
  response_future_.reset(new py::object(py_future));
not sure I follow the logic here - why do we create a new object based on the passed in object? why not a reference on the original object?
second: how does the swap below work?
I tried holding a raw pointer to the original object, but the pointer can no longer be dereferenced (segmentation fault) after GetNextResponse() returns, so the only safe way of holding on to the future object after GetNextResponse() returns is to increment the reference count on the future object, which is achieved by "copying" it into a new object.
the reason for using a unique pointer is to have a mechanism for checking whether a future object is pending to be set with the next response without accessing any Python variable, to avoid holding the GIL for that simple check. It could be achieved with a simple bool flag, but I think it is safer to have one variable than two, to avoid setting one variable but forgetting about the other one.
Experiment: Increment the reference count on the py_future object, and see if the object remains valid after the function returns.
TODO: The new will always allocate on the heap; can we keep the variable on the stack, to avoid a new/delete for every response?
Can we have some kind of small memory leak/growth pytest test that checks memory before, does some inferences, and checks memory after?
The implementation is updated to remove the pointer wrapper around response_future_ and add more comments clarifying the flow between GetNextResponse() and AddNextResponse(): Avoid extra malloc when tracking response_future_
cc @nnshah1 @rmccorm4
Can we have some kind of small memory leak/growth pytest test that checks memory before, does some inferences, and checks memory after?
Do you think something like this makes sense? I think there will always be some small growth, but we can set the acceptable growth on the test case and have the test fail if it grows beyond what we set.
More discussion here: #426 (comment)
Nice find! Looks worth a try to me - I see it's used by quite a few big projects - https://github.com/search?q=pytest-memray&type=code
- pydantic
- datadog
- huggingface
- ...
I'm curious how much time overhead it adds to run.
but we can set the acceptable growth on the test case and have the test fail if it grows beyond what we set.
Let's make sure whatever threshold we set isn't tailored to the specific test case or number of iterations, and doesn't get hit if we increased the iterations on the test case or something
  {
    py::gil_scoped_acquire gil;
    std::unique_ptr<py::object> response_future_local(nullptr);
    response_future_.swap(response_future_local);
not sure I follow this - we create a new py object from null, swap it with the stored one (releasing it?) - then set result on the nullptr based object?...
Or ... I guess swap works the opposite way :)
we swap the null ptr for the one stored in the future ....
the guarantee here that GetNextResponse() will not be called and receive a new future object (vs. this future object) is that this future object is not yet set with a result. So, as soon as this future object is set with a result, we need to be ready for the next future object from GetNextResponse(), which may take the place of this future object. Thus, the solution here is to move the current future object from the class to a local variable, making the class variable ready for the next future object before the current future object is set with a result.
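For readers following the thread, here is a minimal sketch of that hand-off, reusing the member and local names from the diff snippets above; the surrounding class is an illustrative stand-in rather than the actual wrapper, and the way the result is delivered to the asyncio future is simplified.

#include <pybind11/pybind11.h>
namespace py = pybind11;

class ResponseFutureHolder {  // illustrative stand-in for the request wrapper
 public:
  void AddNextResponse(py::object py_response)  // response already converted under the GIL
  {
    py::gil_scoped_acquire gil;
    // Move the pending future into a local; 'response_future_' becomes empty
    // (ptr() == nullptr), so the next GetNextResponse() call can store a new
    // future before this result is even delivered.
    py::object response_future_local(std::move(response_future_));
    // set_result() must run on the future's own event loop, so schedule it
    // there instead of calling it directly from this callback thread.
    response_future_local.attr("get_loop")().attr("call_soon_threadsafe")(
        response_future_local.attr("set_result"), py_response);
  }

 private:
  py::object response_future_;  // pending asyncio future stored by GetNextResponse()
};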
            inference_request.response_queue,
            raise_on_error,
        )
        self._server.infer_async(request)
how do we pass in the requested memory type to the C++ side?
I think we can add a new binding function to the request wrapper that allows setting a requested output memory type to the request wrapper object, then the response allocation function can access the requested memory type from the request wrapper object and allocate the correct memory type.
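A hypothetical sketch of what such a binding could look like; the method name set_output_memory_type and the struct members below do not exist in the current bindings and only illustrate the idea (it also assumes TRITONSERVER_MemoryType is already exposed to Python as an enum).

#include <pybind11/pybind11.h>
#include "triton/core/tritonserver.h"
namespace py = pybind11;

// Hypothetical request-wrapper fields for the requested output memory type.
struct PyInferenceRequestSketch {
  TRITONSERVER_MemoryType requested_output_memory_type{TRITONSERVER_MEMORY_CPU};
  int64_t requested_output_memory_type_id{0};
};

void BindRequestSketch(py::module_& m)
{
  py::class_<PyInferenceRequestSketch>(m, "InferenceRequestSketch")
      .def(py::init<>())
      .def("set_output_memory_type",
           [](PyInferenceRequestSketch& self, TRITONSERVER_MemoryType type,
              int64_t type_id) {
             // The response allocation callback would read these fields and
             // allocate (or reject) the requested memory type accordingly.
             self.requested_output_memory_type = type;
             self.requested_output_memory_type_id = type_id;
           });
}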
let's make sure the stories to follow up on customizing allocation are in place and scheduled. Priority would be to specify the memory type (device id, etc.)
python/test/test_api.py (Outdated)
    def test_unsupported_memory_type(self, server_options):
        # TODO: Revisit this test when GPU memory support is added, i.e. the request
Do we have a ticket for GPU follow-up? Can add DLIS-XXXX here if so.
Yes, added the ticket number to the code: [comment] Add ticket number to GPU follow-up TODOs
    @pytest.mark.xfail(
        run=False,
        reason="Some request/response object may not be released which may cause server stop to fail",
    )
Can you refresh my memory on the context of this one?
- What happens when server.stop() is called? Crash, hang, timeout, etc?
- Will it always happen, or only intermittently?
- Does letting the server go out of scope naturally without a call to .stop() work nicely? Or is it silently having issues behind the scenes too?
What happens when server.stop() is called?
According to test_api_logs/test_stop.server.log, the server tried to shut down, but it timed out at the end. There is no crash in the server log. The pytest itself crashed afterward with a segmentation fault.
Will it always happen, or only intermittently?
It happens 100% of the time.
Does letting the server go out of scope naturally without a call to .stop() work nicely?
There is no crash in pytest, but the server shutdown still times out, according to test_stop.server.log.
Awesome work on this PR @kthui !
Overall changes and concepts to reduce thrashing on python GIL acquisition make sense to me - this was an awesome find.
Left a couple comments on benchmarking, memory checks, tests, etc
        parameters={}, output_memory_type=None,
        output_memory_allocator=None, response_queue=None,
        _serialized_inputs={})
        parameters={}, output_memory_type=None, _serialized_inputs={})
Todo: for the future - we should find a way to document how to add a custom allocator - memory allocation is an area where customization is probably needed.
@kthui do we have follow on stories / epics identified for that?
                "asyncio.Queue must be used for async response iterator"
            )

        # The 'request' is referencing pointers managed by the 'inference_request', so
this note seems unnecessary here - would we add it instead to the AsyncResponseIterator object? That's where we maintain that relationship, is that correct?
                "queue.SimpleQueue must be used for response iterator"
            )

        # The 'request' is referencing pointers managed by the 'inference_request', so
see above - I think this more logically belongs with the comments for the Response Iterator - or could be omitted
            LogMessage(LogLevel.ERROR, message)
            # catastrophic failure
            raise e from None
DeviceOrMemoryType = (
nit: this may be better defined in _memorybuffer.py and reused
    ):
        result = InferenceResponse(
            model,
            request.id,
question: why omit this / change this - the idea here is that in case there is a fundamental error (i.e. an empty response, or some other error) - we get at least the request.id as stored in the original request to tie this response to the error - as in line 118 we aren't guaranteed a full response ...
        response,
        flags: TRITONSERVER_ResponseCompleteFlag,
        output_memory_type: Optional[DeviceOrMemoryType] = None,
maybe instead of passing output_memory_type alone - we could pass the entire original request object and also use it to set the response id - in case the response is None?
        # TODO: support classification
        # values["classification_label"] = response.output_classification_label()

        return result


class AsyncResponseIterator:
If I understand correctly - does this remove the ability for someone to set a queue / async queue in addition to the iterators?
            raise response.error
        return response

    def cancel(self) -> None:
aren't we missing cancel in the new iterator?
        self._complete = False
        self._request = request
        self._model = model
        self._raise_on_error = raise_on_error
are we missing raise_on_error with new iterator?
            raise StopIteration
        response = self._queue.get()
        self._complete = response.final
        if response.error is not None and self._raise_on_error:
this seems missing from the new code?
@@ -868,16 +844,15 @@ class PyInferenceResponse
      ThrowIfError(TRITONSERVER_InferenceResponseOutput(
          triton_object_, index, &name, &datatype, &shape, &dim_count, &base,
          &byte_size, &memory_type, &memory_type_id, &userp));
      // The base pointer is deallocated when the response is deallocated.
comment needed? are all pointers deallocated when response is deallocated?
  // The request wrapper needs to be kept alive until the release callback is
  // invoked, so the Triton request can be deleted after the core/backend is
  // done with it.
  std::unique_ptr<std::shared_ptr<PyInferenceRequest>> request_ptr(
This seems a little confusing to me,
we are creating a new shared pointer to an existing shared pointer and then creating a unique ptr to it - but then releasing it back and casting to a void pointer ...
why not do
std::shared_ptr<PyInferenceRequest> request_ptr = request; // increment reference count
void* user_p = reinterpret_cast<void*>(new std::shared_ptr<PyInferenceRequest>(request_ptr)); // convert to void*
TRITONSERVER_InferenceRequestSetReleaseCallback(triton_object_, ReleaseCallback, user_p);
if we don't want to increase the reference count we can remove the assignment.
I think what is confusing is the mix of unique and shared pointer and new here -
or should we be using make_unique instead of new? maybe not supported by our C++ version.
IIUC, the unique pointer is used here to make sure the resource is cleaned up if an exception is thrown below. But if that is the intention, I don't think the current code does the job, as ownership is released before calling into TRITONSERVER_InferenceRequestSetReleaseCallback, which does not safeguard against the exception.
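A small sketch of the exception-safe ordering being suggested, reusing the identifiers from the diff above (and assuming C++14 for std::make_unique): ownership stays with the unique_ptr until the C API call succeeds, and is only then handed off to the release callback.

// Keep ownership until the call below succeeds; if ThrowIfError throws, the
// unique_ptr still owns the heap-allocated shared_ptr and frees it.
auto request_ptr =
    std::make_unique<std::shared_ptr<PyInferenceRequest>>(request);
ThrowIfError(TRITONSERVER_InferenceRequestSetReleaseCallback(
    triton_object_, ReleaseCallback,
    reinterpret_cast<void*>(request_ptr.get())));
// Success: the core now owns the pointer and will pass it back to
// ReleaseCallback, which deletes it.
request_ptr.release();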
    struct TRITONSERVER_InferenceRequest* request, const uint32_t flags,
    void* userp)
{
  std::unique_ptr<std::shared_ptr<PyInferenceRequest>> request_ptr(
I almost prefer to be more verbose here,
std::shared_ptr<PyInferenceRequest>* request_ptr = reinterpret_cast<std::shared_ptr<PyInferenceRequest>*>(userp);
delete request_ptr;
Yea, at this point just reinterpret cast and delete should be fine.
  // circular inclusion.
  std::shared_ptr<PyInferenceRequest> request;
  if (allocator != nullptr) {
    ThrowIfError(TRITONSERVER_ResponseAllocatorDelete(allocator));
question - is it safe to call TRITONSERVER_ResponseAllocatorDelete(nullptr) ?
Generally for free / delete that is safe - just question if that is true here as well
{
  // The response allocator is shared among all instances of this class, so it
  // needs to be initialized once and once only.
  // TODO: Why 'numpy.ones(2**27)' gets stuck when the response allocator is
seems very odd like something is interacting between numpy and this allocator - and that doesn't seem to make sense to me ....
  auto cr = reinterpret_cast<CallbackResource*>(userp);
  cr->release_fn(cr->request, flags, cr->user_object);
  delete cr;
  ThrowIfError(TRITONSERVER_BufferAttributesSetMemoryType(
do we want to throw if error or return the error?
Should return error as that is what the caller expects
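A minimal sketch of the change being asked for, assuming the callback's buffer-attributes argument is named buffer_attributes: since this hook is invoked by the Triton core as a C callback that expects a TRITONSERVER_Error*, the error should be returned rather than thrown across the C boundary.

TRITONSERVER_Error* err = TRITONSERVER_BufferAttributesSetMemoryType(
    buffer_attributes, TRITONSERVER_MEMORY_CPU);
if (err != nullptr) {
  return err;  // hand the error back to the core instead of throwing
}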
    void* buffer_userp, size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id)
{
  free(buffer);
can we add assert here to ensure memory type and memory type id are cpu and 0?
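A sketch of the suggested sanity check at the top of the release callback; the parameter names follow the diff above.

#include <cassert>

// The internal allocator only ever hands out CPU memory with id 0, so anything
// else here would indicate a bookkeeping bug.
assert(memory_type == TRITONSERVER_MEMORY_CPU);
assert(memory_type_id == 0);
free(buffer);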
  // Response management
  void GetNextResponse(const py::object& py_future)
  {
    // Called exclusively from Python, so GIL is always held.
question: if we also acquire the GIL here, will it hang, or is it safe to do as a sanity check?
 public:
  std::unique_ptr<CallbackResource> request_callback_resource_{nullptr};
  // There are two cases:
  // 1. The backend is faster than the frontend:
Suggested change:
- // 1. The backend is faster than the frontend:
+ // 1. Responses have been generated and queued before GetNextResponse() has been called
+ // 2. GetNextResponse() is called before responses have been generated and queued.
I would remove the notion of "faster" and "slower" - I think that can be confusing here - it's not necessarily about speed of execution (i.e. the application may be purposefully not reading, handling something else, etc.) - faster and slower makes it seem like a rate-matching issue - but it's really just whether or not there are items available when GetNextResponse is called ....
also change the description to match the if / else order, so:
- GetNextResponse is called before / when no responses have been queued.
    if (response_future_.ptr() != nullptr) {
      throw AlreadyExistsError("cannot call GetNextResponse concurrently");
    }
    response_future_ = py_future;
question - does this increment the py_future ref count?
    std::pair<std::shared_ptr<PyInferenceResponse>, const uint32_t> py_response(
        std::move(managed_ptr), std::move(flags));

    // There are two cases:
Suggested change:
- // There are two cases:
+ // There are two cases:
+ // 1. GetNextResponse() hasn't been called and the future is not set.
+ // 2. GetNextResponse() has been called and future is set.
    // 'response_future_'.
    {
      py::gil_scoped_acquire gil;
      py::object response_future_local(std::move(response_future_));
still a little unsure why we need this if GetNextResponse holds the GIL - but I think we can explore that separately (the explicit move and release of the response_future_)
        allocator, allocater_user_object));
    response_callback_resource_.reset(new PyInferenceResponse::CallbackResource(
        response, allocator_callback_resource_.get(), response_user_object));
    // The caller is responsible for keeping the request wrapper alive until
this is maintained by the _api? is that correct - so the user doesn't have to worry about it explicitly - double checking
generally looks good to me - had a few questions about removing the queues, cancel and request id from the response iterators / creation -
    ThrowIfError(TRITONSERVER_ResponseAllocatorSetBufferAttributesFunction(
        allocator_.get(), ResponseAllocatorBufferAttributesFn));
  });
}
In this case, I think you can move the allocator initialization and the related callback functions out to a separate class and make it a singleton; then this SetResponseAllocator function simply grabs the Triton allocator pointer from the singleton.
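A rough sketch of that refactor; the class name ResponseAllocatorSingleton and the callback names ResponseAlloc / ResponseRelease are illustrative placeholders for whatever the binding already defines, not existing code.

#include <stdexcept>
#include <string>
#include "triton/core/tritonserver.h"

class ResponseAllocatorSingleton {
 public:
  static TRITONSERVER_ResponseAllocator* Get()
  {
    static ResponseAllocatorSingleton instance;  // constructed once, thread-safe since C++11
    return instance.allocator_;
  }

 private:
  ResponseAllocatorSingleton()
  {
    TRITONSERVER_Error* err = TRITONSERVER_ResponseAllocatorNew(
        &allocator_, ResponseAlloc, ResponseRelease, nullptr /* start_fn */);
    if (err != nullptr) {
      std::string msg(TRITONSERVER_ErrorMessage(err));
      TRITONSERVER_ErrorDelete(err);
      throw std::runtime_error(msg);
    }
  }

  // Intentionally never deleted: the allocator must outlive every in-flight response.
  TRITONSERVER_ResponseAllocator* allocator_{nullptr};
};

// SetResponseAllocator() would then reduce to grabbing the allocator pointer:
//   allocator_ = ResponseAllocatorSingleton::Get();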
What does the PR do?
Refactor infer() and async_infer() APIs to handle memory allocation and callbacks internally in C++, and only expose the basic interface for the Python iterator to fetch responses.
Checklist
<commit_type>: <Title>
Commit Type: Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
triton-inference-server/server#7949
Where should the reviewer start?
Start with the tritonserver_pybind.cc for the interface change, then move on to _model.py on how the Python iterator interacts with the interface. Finally, move on to the _request.py and _response.py on how they support the Python iterator.
For testing, start with test_binding.py and test_api.py, and then _tensor.py on the DLPack limitation regarding bytes.
Test plan:
Existing L0_python_api is sufficient for catching any regression from this performance improvement. It is modified to test from the new interface.
Caveats:
Users are no longer able to specify a custom output memory allocator or response queue.
Currently, only CPU memory output is supported at the binding level, so GPU memory output will involve an extra D2H copy at the backend and a H2D copy at the frontend. This will be resolved as a follow-up.
The test_stop failure will have to be triaged and fixed as a follow-up.
Background
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A