The returned output starts with the original text of the input. This wastes network bandwidth, especially when the input is very long. Could a flag be provided so that only the generated text is returned? Thanks.
@BasicCoder Your approach also works, but it would be better if the tensorrt_llm backend could cut the input tokens out of the result by itself. I'm using the tensorrt_llm backend only (not the ensemble), because I moved my tokenizer to a separate server for business-logic reasons. If the tensorrt_llm backend supported this feature itself, we wouldn't need an additional Python backend.
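Until the backend supports this natively, one client-side workaround is to slice the echoed prefix off the returned `output_ids` using the known input length. A minimal sketch, assuming `output_ids` has shape `[beam_width, seq_len]` and that every beam is prefixed with the original input tokens (the behavior reported in this issue, not a documented contract):

```python
import numpy as np

def strip_input_tokens(output_ids: np.ndarray, input_length: int) -> np.ndarray:
    """Drop the echoed input prefix from a response.

    Assumes output_ids has shape [beam_width, seq_len] and that the backend
    prepends the original input tokens to every beam.
    """
    return output_ids[:, input_length:]

# Example: 5 input tokens were sent; keep only the newly generated tokens.
output_ids = np.array([[101, 2023, 2003, 1996, 7953, 345, 678, 910]])
generated = strip_input_tokens(output_ids, input_length=5)
print(generated)  # [[345 678 910]]
```

This keeps the tokenizer-free client untouched, but it still pays the cost of sending the input tokens back over the network, which is exactly what the requested flag would avoid.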