Shared memory io bottleneck? #7905
This problem causes the performance to be lower with multiple GPUs: 3700 fps vs. 2500 fps. Any help?
A TensorRT model runs on a GPU device. It expects the input data to be available in specific GPU buffers and returns its results in GPU memory. Compute input time records the latency of moving the data from the source to the input GPU buffer that the model will consume. Compute output time records the latency of moving data from the GPU memory where the model wrote the results to the user-requested memory.
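For concreteness, the two client-side setups being compared look roughly like this with the Python gRPC client (a sketch, not taken from the issue; the model name, input name, shape, and FP16 dtype are copied from the perf_analyzer commands further down, and localhost:8001 is assumed):

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm            # system (CPU) shared memory
import tritonclient.utils.cuda_shared_memory as cudashm   # CUDA (GPU) shared memory

# FP16 NCHW input for one 3x384x640 image, matching --shape images:3,384,640
INPUT_BYTES = 1 * 3 * 384 * 640 * 2
data = np.zeros((1, 3, 384, 640), dtype=np.float16)

client = grpcclient.InferenceServerClient("localhost:8001")
client.unregister_system_shared_memory()
client.unregister_cuda_shared_memory()

# Case 1: system shared memory. The region lives in host memory, so before the
# TensorRT model can run, Triton still copies host -> GPU (counted as compute input)
# and copies the result GPU -> host afterwards (counted as compute output).
sys_handle = shm.create_shared_memory_region("input_sys", "/input_sys", INPUT_BYTES)
shm.set_shared_memory_region(sys_handle, [data])
client.register_system_shared_memory("input_sys", "/input_sys", INPUT_BYTES)

# Case 2: CUDA shared memory. The region already lives on a GPU, so the move into
# the model's input buffer is device-to-device rather than across PCIe from host memory.
cuda_handle = cudashm.create_shared_memory_region("input_cuda", INPUT_BYTES, 0)
cudashm.set_shared_memory_region(cuda_handle, [data])
client.register_cuda_shared_memory(
    "input_cuda", cudashm.get_raw_handle(cuda_handle), 0, INPUT_BYTES
)

# Either region is then attached to the request instead of sending raw tensor bytes:
infer_input = grpcclient.InferInput("images", [1, 3, 384, 640], "FP16")
infer_input.set_shared_memory("input_cuda", INPUT_BYTES)   # or "input_sys"
```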
This is not expected.
System shared memory:
CUDA shared memory:
Hence, the performance of CUDA shared memory is better than system shared memory for a fixed request concurrency of 100.
This is a challenge. Can you share perf_analyzer numbers for multiple GPUs for both cases, i.e. --shared-memory = [cuda, system]?
I am on vacation now, so I cannot provide the relevant information for the time being. However, when using CUDA shared memory, the Triton server does not necessarily run inference on the GPU where the data is located. Can you provide a binding, or give priority to that GPU for inference, to save IO?
Yes. This is a known area of performance optimization.
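In the meantime, a partial workaround sketch (assumptions noted in the comments, not an official recommendation): the client chooses the GPU a CUDA shared memory region is allocated on via the device_id argument, and model placement is controlled by instance_group in config.pbtxt, so when a model is pinned to a single GPU the two can be aligned manually:

```python
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm

# Assumptions: the model's config.pbtxt pins its instances to one GPU, e.g.
#   instance_group [ { kind: KIND_GPU, gpus: [ 1 ] } ]
# and the output tensor is named "output0" (hypothetical; not taken from the issue).
DEVICE_ID = 1          # the GPU the model instances are pinned to
REGION_BYTES = 846720  # same size as --output-shared-memory-size in the commands below

client = grpcclient.InferenceServerClient("localhost:8001")

# Allocate the output region on the same device the model runs on, so the copy that
# shows up as "compute output" stays on one GPU instead of crossing devices.
handle = cudashm.create_shared_memory_region("output_cuda", REGION_BYTES, DEVICE_ID)
client.register_cuda_shared_memory(
    "output_cuda", cudashm.get_raw_handle(handle), DEVICE_ID, REGION_BYTES
)

output = grpcclient.InferRequestedOutput("output0")
output.set_shared_memory("output_cuda", REGION_BYTES)
```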
Description
When using system shared memory, the inference speed is much lower than with CUDA shared memory. The trace log shows that the input & output time is greater than the infer time.
Triton Information
Docker image:
nvcr.io/nvidia/tritonserver:24.11-py3
To Reproduce
config:
docker-compose.yml:
Use the same deployment environment for the model conversion (remove .zip):
docker compose run -it --rm tritonserver sh
Conversion command
/usr/src/tensorrt/bin/trtexec --onnx=yolo11n.onnx --saveEngine=model.plan --minShapes=images:1x3x128x128 --optShapes=images:32x3x128x128 --maxShapes=images:32x3x640x640 --memPoolSize=workspace:1024 --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --useCudaGraph
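As a sanity check before benchmarking, the saved engine's IO can be inspected to confirm the FP16 formats and the dynamic images input (a sketch using the TensorRT Python API; it assumes the tensorrt Python wheel is available, which may require installing it or using a TensorRT container rather than the tritonserver image):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(
        name,
        engine.get_tensor_mode(name),   # INPUT / OUTPUT
        engine.get_tensor_dtype(name),  # expect DataType.HALF with the fp16 IO formats
        engine.get_tensor_shape(name),  # dynamic dims show up as -1
    )
```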
Testing with perf_analyzer
command
perf_analyzer -m yolo -b 1 --shared-memory cuda --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:
trace.json summary:
system shared memory command:
perf_analyzer -m yolo -b 1 --shared-memory system --output-shared-memory-size 846720 --shape images:3,384,640 --concurrency-range 100 -i Grpc
result:
trace.json summary:
Expected behavior
Shouldn't system shared memory be able to achieve the same throughput as CUDA shared memory?
Worse, due to IO limitations when using multiple GPUs, the throughput is almost the same as with a single GPU.
Currently, system shared memory seems to be limited by certain IO. What operations are included in the input & output times in the trace log?
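On the last question: per Triton's tracing documentation, each traced request records COMPUTE_START, COMPUTE_INPUT_END, COMPUTE_OUTPUT_START, and COMPUTE_END timestamps, and the "input" and "output" portions are exactly the spans described in the maintainer comment above (moving data into the model's GPU input buffer, and moving results out to the requested memory). A rough sketch for pulling those spans out of trace.json (assumptions: the default JSON trace format with per-entry "id"/"timestamps" fields; the trace_summary.py script referenced in Triton's trace docs is the proper tool):

```python
import json
from collections import defaultdict

# Per Triton's trace documentation:
#   compute input  = COMPUTE_INPUT_END    - COMPUTE_START        (data -> model input buffer)
#   compute infer  = COMPUTE_OUTPUT_START - COMPUTE_INPUT_END    (TensorRT execution)
#   compute output = COMPUTE_END          - COMPUTE_OUTPUT_START (model output -> requested memory)

def summarize(trace_path):
    with open(trace_path) as f:
        entries = json.load(f)

    # Entries sharing the same request id carry different subsets of timestamps; merge them.
    stamps = defaultdict(dict)
    for entry in entries:
        for ts in entry.get("timestamps", []):
            stamps[entry["id"]][ts["name"]] = ts["ns"]

    spans = defaultdict(list)
    needed = {"COMPUTE_START", "COMPUTE_INPUT_END", "COMPUTE_OUTPUT_START", "COMPUTE_END"}
    for names in stamps.values():
        if needed <= names.keys():
            spans["compute input"].append(names["COMPUTE_INPUT_END"] - names["COMPUTE_START"])
            spans["compute infer"].append(names["COMPUTE_OUTPUT_START"] - names["COMPUTE_INPUT_END"])
            spans["compute output"].append(names["COMPUTE_END"] - names["COMPUTE_OUTPUT_START"])

    for name, values in spans.items():
        print(f"{name}: avg {sum(values) / len(values) / 1e3:.1f} us over {len(values)} requests")

if __name__ == "__main__":
    summarize("trace.json")
```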