Description
There is a race condition that causes a segmentation fault (crash) when a request is cancelled.
Triton Information
r25.01
Are you using the Triton container or did you build it yourself?
I built a RelWithDebInfo build for debugging; the issue can also be reproduced with the official Triton container.
To Reproduce
Start the Triton server in GDB with the simple model (docs/examples/model_repository/simple/config.pbtxt); suppose the model repository is /models. Then send requests with the client script below.
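For reference, a typical way to launch the server under GDB so the SIGSEGV is caught with a full backtrace (the binary path and model repository location are assumptions; adjust them for your setup):

```shell
# Run Triton under GDB; the model repository path /models is from the steps above.
gdb --args tritonserver --model-repository=/models
# Inside GDB:
#   (gdb) run
#   ... wait for the crash while the client script runs ...
#   (gdb) bt
```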
#!/usr/bin/env python
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
import sys
import time
from functools import partial
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException
if __name__ == "__main__":
    # 2 ms cancellation timeout
    client_timeout = 0.002
    url = "localhost:8001"
    try:
        triton_client = grpcclient.InferenceServerClient(url=url)
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit()

    model_name = "simple"

    # Infer
    inputs = []
    outputs = []
    inputs.append(grpcclient.InferInput("INPUT0", [1, 16], "INT32"))
    inputs.append(grpcclient.InferInput("INPUT1", [1, 16], "INT32"))

    # Create the data for the two input tensors. Initialize the first
    # to unique integers and the second to all ones.
    input0_data = np.arange(start=0, stop=16, dtype=np.int32)
    input0_data = np.expand_dims(input0_data, axis=0)
    input1_data = np.ones(shape=(1, 16), dtype=np.int32)

    # Initialize the data
    inputs[0].set_data_from_numpy(input0_data)
    inputs[1].set_data_from_numpy(input1_data)

    outputs.append(grpcclient.InferRequestedOutput("OUTPUT0"))
    outputs.append(grpcclient.InferRequestedOutput("OUTPUT1"))

    # Define the callback function. Note the last two parameters should be
    # result and error. InferenceServerClient will provide the results of an
    # inference as grpcclient.InferResult in result. For a successful
    # inference, error will be None; otherwise it will be an object of
    # tritonclient.utils.InferenceServerException holding the error details.
    def callback(user_data, result, error):
        if error:
            user_data.append(error)
        else:
            user_data.append(result)

    # List to hold the results of inference.
    user_data = []

    # Inference calls; the short client_timeout cancels each request.
    for i in range(20):
        triton_client.async_infer(
            model_name=model_name,
            inputs=inputs,
            callback=partial(callback, user_data),
            outputs=outputs,
            client_timeout=client_timeout,
        )
        time.sleep(0.01)

    # Wait until the results are available in user_data
    time_out = 20
    while (len(user_data) == 0) and time_out > 0:
        time_out = time_out - 1
        time.sleep(1)
    print("results: ", len(user_data))

    # Display and validate the available results
    count = 1
    for result_data in user_data:
        # Check for errors
        if type(result_data) == InferenceServerException:
            print("request -", count, result_data.message())
        else:
            print("request -", count, "PASS")
        count += 1

    print("PASS: Async infer")
A SIGSEGV with the following stack trace should be observed:
(gdb) bt
#0 ___pthread_mutex_lock (mutex=0x20) at ./nptl/pthread_mutex_lock.c:80
#1 0x0000564daff72377 in __gthread_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#2 __gthread_recursive_mutex_lock (__mutex=0x20) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:811
#3 std::recursive_mutex::lock (this=0x20) at /usr/include/c++/13/mutex:120
#4 std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=<synthetic pointer>)
at /usr/include/c++/13/bits/std_mutex.h:249
#5 triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:677
#6 triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::IsGrpcContextCancelled (this=0x7f1918001960) at /workspace/src/grpc/infer_handler.h:1120
#7 triton::server::grpc::ModelInferHandler::Process (this=0x564db1fcf070, state=0x7f1918001960, rpc_ok=true,
is_notification=true) at /workspace/src/grpc/infer_handler.cc:702
#8 0x0000564daff69127 in _ZZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS4_27WithAsyncMethod_ServerReadyINS4_26WithAsyncMethod_ModelReadyINS4_30WithAsyncMethod_ServerMetadataINS4_29WithAsyncMethod_ModelMetadataINS4_26WithAsyncMethod_ModelInferINS4_32WithAsyncMethod_ModelStreamInferINS4_27WithAsyncMethod_ModelConfigINS4_31WithAsyncMethod_ModelStatisticsINS4_31WithAsyncMethod_RepositoryIndexINS4_35WithAsyncMethod_RepositoryModelLoadINS4_37WithAsyncMethod_RepositoryModelUnloadINS4_40WithAsyncMethod_SystemSharedMemoryStatusINS4_42WithAsyncMethod_SystemSharedMemoryRegisterINS4_44WithAsyncMethod_SystemSharedMemoryUnregisterINS4_38WithAsyncMethod_CudaSharedMemoryStatusINS4_40WithAsyncMethod_CudaSharedMemoryRegisterINS4_42WithAsyncMethod_CudaSharedMemoryUnregisterINS4_28WithAsyncMethod_TraceSettingINS4_27WithAsyncMethod_LogSettingsINS4_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS3_18ModelInferResponseEEENS3_17ModelInferRequestES1C_E5StartEvENKUlvE_clEv (__closure=0x564db1fd41c8) at /workspace/src/grpc/infer_handler.h:1462
#9 0x00007f1d92a96db4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007f1d9274baa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#11 0x00007f1d927d8a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
Expected behavior
The crash should not happen when requests are cancelled.
Root cause analysis
For a single inference request, three threads are involved:
gRPC handler thread A
gRPC handler thread B
Inference framework thread C, which calls InferResponseComplete when an inference finishes in the framework
There is a race condition in which the state object is released while the cancellation event is still being processed:
T0: The cancellation event is handled by thread A, which marks the notification as received
T1: The inference is finished by thread C: the step is changed to CANCELLED and a new event is enqueued
T2: The event enqueued at T1 is processed by thread B: the state object is released
T3: Thread A continues processing the cancellation event with the dangling state object, and the segfault happens
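The interleaving above can be modeled with a small, self-contained Python sketch. The State class, method names, and scheduling events are hypothetical simplifications of the C++ code: a released object is represented by a flag, and the segfault by an exception. Without the lock, thread B can release the state between T0 and T3; holding step_mtx_ across both steps forces the release to wait:

```python
import threading


class State:
    """Hypothetical stand-in for InferHandlerState."""

    def __init__(self):
        self.step_mtx_ = threading.Lock()
        self.received_notification = False
        self.released = False  # stands in for the C++ object being freed

    def is_grpc_context_cancelled(self):
        if self.released:
            raise RuntimeError("use-after-free")  # the segfault analogue
        return True


def handler_thread(state, log, a_set_flag, b_released, hold_lock):
    """Thread A: processes the cancellation notification."""
    if hold_lock:
        state.step_mtx_.acquire()
    state.received_notification = True  # SetReceivedNotification(true) at T0
    a_set_flag.set()
    if not hold_lock:
        b_released.wait()  # unlocked: B releases the state before T3
    try:
        state.is_grpc_context_cancelled()  # T3: dereference the state
        log.append("ok")
    except RuntimeError:
        log.append("crash")
    finally:
        if hold_lock:
            state.step_mtx_.release()


def release_thread(state, a_set_flag, b_released):
    """Thread B: releases the state once the inference completes (T2)."""
    a_set_flag.wait()
    with state.step_mtx_:  # the release path is already lock-protected
        state.released = True
    b_released.set()


def run(hold_lock):
    state, log = State(), []
    a_set, b_rel = threading.Event(), threading.Event()
    a = threading.Thread(
        target=handler_thread, args=(state, log, a_set, b_rel, hold_lock))
    b = threading.Thread(target=release_thread, args=(state, a_set, b_rel))
    a.start(); b.start(); a.join(); b.join()
    return log[0]


print(run(hold_lock=False))  # "crash": state released between T0 and T3
print(run(hold_lock=True))   # "ok": release waits until A is done
```

The events make the bad interleaving deterministic rather than timing-dependent, which is why the unlocked run reliably reproduces the use-after-free.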
Fix proposal
The call to SetReceivedNotification(), which marks a state as having received a notification, should also be protected by state->step_mtx_ to avoid the race described above.
So the fix is fairly simple:
diff --git a/src/grpc/infer_handler.cc b/src/grpc/infer_handler.cc
index 9c7eef48..f95922e2 100644
--- a/src/grpc/infer_handler.cc
+++ b/src/grpc/infer_handler.cc
@@ -697,6 +697,13 @@ ModelInferHandler::Process(
std::chrono::milliseconds(state->delay_process_ms_));
}
+ // SetReceivedNotification must be protected by state->step_mtx_ to avoid
+ // a race condition where the state is released before the cancellation event is processed.
+ if (is_notification) {
+ state->context_->SetReceivedNotification(true);
+ }
+
// Handle notification for cancellation which can be raised
// asynchronously if detected on the network.
if (state->IsGrpcContextCancelled()) {
diff --git a/src/grpc/infer_handler.h b/src/grpc/infer_handler.h
index 86428a51..017fb5f0 100644
--- a/src/grpc/infer_handler.h
+++ b/src/grpc/infer_handler.h
@@ -1437,7 +1437,6 @@ InferHandler<
if (state->step_ == Steps::WAITING_NOTIFICATION) {
State* state_wrapper = state;
state = state_wrapper->state_ptr_;
- state->context_->SetReceivedNotification(true);
is_notification = true;
LOG_VERBOSE(1) << "Received notification for " << Name() << ", "
<< state->unique_id_;
I have verified that the above fix works well.