Unexpected throughput results: increasing instance group count vs. distributing the same count across pods on the same card using shared computing windows
#7956 · Open · ariel291888 opened this issue on Jan 21, 2025 · 0 comments
Description
Hey there, we are working on OpenShift.
When increasing the instance_group count beyond a specific amount X (in my case it was 6), there is no improvement in throughput or GPU utilization, and memory usage doesn't even reach half of the card's capacity.
On the other hand, when deploying two pods on the same GPU (using shared computing windows and fractional memory partitioning via Run:ai fractions), both the throughput and the GPU utilization were much higher.
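For clarity, a minimal sketch of the scaled-up variant in the instance-count approach, assuming everything except the instance_group count stays as in the config pasted under "To Reproduce" below (6 is the X from my runs):

instance_group {
  count: 6        # beyond this value we saw no further throughput gain
  kind: KIND_GPU
}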
Triton Information
r23.08
We were using Model Analyzer (24.10) to check the throughput.
Are you using the Triton container or did you build it yourself?
We used the containers for both (Triton and Model Analyzer); we did not build them ourselves.
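For reference on how throughput was measured: we pointed Model Analyzer at the model repository and let it sweep configurations. The invocation was along these lines (the repository path and model name below are placeholders, and the exact extra flags varied between runs):

# Profiles the model over a sweep of configurations and reports throughput per config.
# /models and my_model are placeholders; adjust to the actual repository and model name.
model-analyzer profile \
    --model-repository /models \
    --profile-models my_model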
To Reproduce
Steps to reproduce the behavior.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well):
platform: "pytorch_libtorch"
max_batch_size: 128
input {
  name: "input"
  data_type: TYPE_FP32
  dims: [ -1, 1 ]
}
input {
  name: "lengths"
  data_type: TYPE_INT64
  dims: [ 1 ]
}
output {
  name: "output"
  data_type: TYPE_FP32
  dims: [ -1, -1 ]
}
instance_group {
  count: 3
  kind: KIND_GPU
}
dynamic_batching {
  max_queue_delay_microseconds: 4000
}
Backend: "pytorch"
Expected behavior
My expectation was that the instance-count approach would reach at least the same throughput as running multiple pods on the same GPU.