Is your feature request related to a problem? Please describe.
The Python backend does not yet support implicit state management for sequence inference. When will this be supported?
Why we need this feature
We are from the Ant Group AI-Infra team. The majority of our models are implemented with the Python backend, and we have a critical demand for real-time streaming services such as speech-to-text conversion. After reviewing the Triton documentation, we found that the server framework and backends such as ONNX Runtime, PyTorch, and TensorRT already support this feature. However, the Python backend, which is the most widely used in our business, has not yet implemented it. We would like to know whether the team has any plans to support this in the future.
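For context, this is roughly what implicit state management looks like in a backend that already supports it: the server stores the state tensor and feeds it back on the next request of the same sequence, so the model itself stays stateless. A sketch of the `config.pbtxt` fragment, based on the Triton model-configuration docs; the tensor names `INPUT_STATE`/`OUTPUT_STATE` and the shapes are placeholders:

```
sequence_batching {
  state [
    {
      input_name: "INPUT_STATE"
      output_name: "OUTPUT_STATE"
      data_type: TYPE_FP32
      dims: [ -1 ]
      initial_state: {
        data_type: TYPE_FP32
        dims: [ 1 ]
        zero_data: true
        name: "initial state"
      }
    }
  ]
}
```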
Future benefits:
If this feature were also supported in the Python backend, our vLLM-based services could leverage sequence inference to achieve further performance optimizations. Until then, a workaround is to manage state explicitly in `model.py`, as sketched below.
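A minimal sketch of that explicit-state workaround using the sequence batcher's control inputs, which the Python backend does support. It assumes `config.pbtxt` declares `sequence_batching` with `CONTROL_SEQUENCE_START` / `CONTROL_SEQUENCE_END` / `CONTROL_SEQUENCE_CORRID` control inputs mapped to tensors named `START`, `END`, and `CORRID`, plus an `IN`/`OUT` tensor pair; all of these names are illustrative placeholders, not a fixed API:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Sketch of explicit (manual) sequence-state management in the
    Python backend, keyed by the sequence's correlation ID."""

    def initialize(self, args):
        # Per-sequence accumulated state, keyed by correlation ID.
        self._states = {}

    def execute(self, requests):
        responses = []
        for request in requests:
            start = pb_utils.get_input_tensor_by_name(request, "START").as_numpy().flat[0]
            end = pb_utils.get_input_tensor_by_name(request, "END").as_numpy().flat[0]
            corrid = int(pb_utils.get_input_tensor_by_name(request, "CORRID").as_numpy().flat[0])
            chunk = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()

            if start:
                # First request of the sequence: initialize its state.
                self._states[corrid] = np.zeros_like(chunk)

            # Toy "stateful" computation: a running sum over the sequence.
            state = self._states[corrid] + chunk
            self._states[corrid] = state

            if end:
                # Last request of the sequence: release its state.
                del self._states[corrid]

            out = pb_utils.Tensor("OUT", state)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

Note that this keeps all state in the Python process's memory, so it only works with a single model instance and gives up the transparent state transfer that implicit state management provides natively in the ONNX Runtime, PyTorch, and TensorRT backends, which is why native support in the Python backend would still be valuable.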
zhuichao001 changed the title from "Python Backend support Sequence Inference" to "Python Backend support implicit state management for Sequence Inference" on Feb 13, 2025.