
Python Backend support implicit state management for Sequence Inference #8006

Open
zhuichao001 opened this issue Feb 12, 2025 · 0 comments
zhuichao001 commented Feb 12, 2025

Is your feature request related to a problem? Please describe.
The current Python backend does not yet support implicit state management for sequence inference. When will this be supported?
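In the absence of implicit state management, Python backend models typically manage sequence state explicitly inside `execute()`, keyed by the sequence batcher's correlation ID and reset/freed on the START/END control signals. The sketch below simulates that pattern in plain Python; all class and variable names are illustrative, not part of any Triton API.

```python
# Explicit-state workaround: the model itself keeps per-sequence state
# in a dict keyed by correlation ID. START initializes the slot, END
# frees it. The "state" here is just a running sum standing in for
# real model state (e.g. an RNN hidden state).

class ExplicitSequenceState:
    def __init__(self):
        self._states = {}  # correlation_id -> accumulated state

    def infer(self, corr_id, start, end, value):
        if start:
            self._states[corr_id] = 0          # fresh state at sequence start
        state = self._states.get(corr_id, 0) + value
        self._states[corr_id] = state
        if end:
            del self._states[corr_id]          # release state at sequence end
        return state


model = ExplicitSequenceState()
# Two interleaved sequences, as the sequence batcher would deliver them:
# (correlation_id, START, END, input value)
out = [model.infer(cid, s, e, v) for cid, s, e, v in [
    (1, True,  False, 10),
    (2, True,  False, 1),
    (1, False, False, 10),
    (2, False, True,  1),
    (1, False, True,  10),
]]
print(out)  # [10, 1, 20, 2, 30]
```

With implicit state management, this bookkeeping (the dict, the START/END handling, the cleanup) would move into the server, and the model would simply receive the previous state as an input tensor and emit the next state as an output tensor.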

Why we need this feature
We are from the Ant Group AI-infra team, and the majority of our models are implemented with the Python backend. We currently have a critical demand for real-time streaming services, such as speech-to-text conversion. After reviewing the Triton community documentation, we found that the server framework, along with backends such as ONNX Runtime, PyTorch, and TensorRT, already supports this feature. However, the Python backend, which is the most widely used in our business, has not yet implemented it. Does the product team have any plans to address this in the future?
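For backends that do support the feature, implicit state is declared in the model's `config.pbtxt` under `sequence_batching`, and Triton then carries the state tensor between requests of the same sequence on the model's behalf. A minimal fragment of that configuration is shown below; the tensor names `INPUT_STATE`/`OUTPUT_STATE` and the FP32 shape are example values, not fixed names.

```
sequence_batching {
  state [
    {
      input_name: "INPUT_STATE"
      output_name: "OUTPUT_STATE"
      data_type: TYPE_FP32
      dims: [ -1 ]
    }
  ]
}
```

The request here is for the Python backend to honor the same `state` configuration, so that `model.py` receives the prior state as a regular input tensor and returns the updated state as a regular output tensor.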

Future benefits:
If this feature is also supported in the Python backend, our vLLM-based services could leverage sequence inference for further performance optimizations.

@zhuichao001 zhuichao001 changed the title Python Backend support Sequence Inference Python Backend support implicit state management for Sequence Inference Feb 13, 2025