Hello! How can I enforce a limit on context length, the way it is done in vllm.entrypoints.openai.serving_engine.py? Right now, if my input is 7000 tokens long, the model's maximum context length is 8192 tokens, and I request 5000 generated tokens (max_tokens=5000), Triton returns no more than 8193 tokens (generation ends either when the stop token is reached or when the model's maximum context length is exhausted). If the same request is sent to vLLM, it returns an error:
"This model's maximum context length is 8192 tokens. However, you requested 12000 tokens (7000 in the messages, 5000 in the completion). Please reduce the length of the messages or completion."
How can I implement the same behavior in Triton?
ArtemBiliksin changed the title from "[Question] How can I make a limit on the length of the input context?" to "[Question] How can I make a limit on the length of the input context and the number of tokens to generate?" on Feb 24, 2025.
As far as I understand, vLLM has separate routing for each task (completion, embedding, etc.). Since this question concerns completion, the relevant place to look is ServingCompletion, specifically its create_completion method. The necessary validation happens in the _preprocess_completion method; tracing the call chain from there leads to the _validate_input method, where the actual check is implemented.
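For context, the essence of that check, stripped of vLLM's surrounding plumbing, is roughly the following (a simplified sketch; the function signature and variable names are illustrative, not vLLM's actual code):

```python
def validate_input(tokenizer, prompt: str, max_tokens: int, max_model_len: int) -> int:
    """Reject a request whose prompt plus requested completion exceeds the context window."""
    token_num = len(tokenizer(prompt).input_ids)  # assumes an HF-style tokenizer
    if token_num + max_tokens > max_model_len:
        raise ValueError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, you requested {token_num + max_tokens} tokens "
            f"({token_num} in the messages, {max_tokens} in the completion). "
            "Please reduce the length of the messages or completion."
        )
    return token_num
```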
My intuition is that we need to add a similar method that uses the tokenizer and performs the same validation as _validate_input. It seems appropriate to add this in execute, before or after the LoRA request check.
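Concretely, here is a rough sketch of what such a method could look like in the backend's model.py. The method name, self.tokenizer, and self.max_model_len are my assumptions about how the model class could be wired, not the backend's actual API:

```python
import triton_python_backend_utils as pb_utils


def _validate_request_length(self, prompt: str, sampling_params: dict):
    """Return a pb_utils.TritonError if prompt + max_tokens would exceed the
    model's context window, otherwise None. (Hypothetical method to add to the
    vLLM backend's TritonPythonModel class.)"""
    token_num = len(self.tokenizer(prompt).input_ids)  # self.tokenizer: assumed HF-style tokenizer
    max_tokens = sampling_params.get("max_tokens") or 0
    if token_num + max_tokens > self.max_model_len:    # self.max_model_len: assumed to hold the engine limit
        return pb_utils.TritonError(
            f"This model's maximum context length is {self.max_model_len} tokens. "
            f"However, you requested {token_num + max_tokens} tokens "
            f"({token_num} in the prompt, {max_tokens} in the completion). "
            "Please reduce the length of the prompt or completion."
        )
    return None
```

In execute, next to the LoRA check, this could be called per request; on failure the request would be answered immediately with pb_utils.InferenceResponse(output_tensors=[], error=error) instead of being forwarded to the vLLM engine.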
@oandreeva-nv, could you please share your thoughts on this approach?