
[Question] How can I make a limit on the length of the input context and the number of tokens to generate? #8029

Open
ArtemBiliksin opened this issue Feb 24, 2025 · 2 comments

Comments

@ArtemBiliksin

Hello! How can I enforce a limit on context length the way it is done in vllm.entrypoints.openai.serving_engine.py? Right now, if I have an input of 7000 tokens, a maximum LLM context length of 8192 tokens, and a request to generate 5000 tokens (max_tokens=5000), Triton returns no more than 8193 tokens (generation ends either when a stop token is produced or when the model's maximum context length is exhausted). If the same request is sent to vLLM, it returns an error:

"This model's maximum context length is 8192 tokens. However, you requested 12000 tokens (7000 in the messages, 5000 in the completion). Please reduce the length of the messages or completion."

How can I implement the same behavior in Triton?

@ArtemBiliksin ArtemBiliksin changed the title [Question] How can I make a limit on the length of the input context? [Question] How can I make a limit on the length of the input context and the number of tokens to generate? Feb 24, 2025
@Pavloveuge

Maybe @oandreeva-nv can help.

@azsh1725

As far as I understand, vLLM has separate routing for each task (completion, embedding, etc.). Since this question concerns completion, the relevant place to look is ServingCompletion, specifically its create_completion method. The validation in question happens in the _preprocess_completion method; tracing the call chain from there leads to the _validate_input method, where the actual check is implemented.

My intuition is that we need to add a similar method that uses the tokenizer and performs the same validation as _validate_input. It seems appropriate to add this in execute, before or after the LoRA request check; a rough sketch of such a check is below.
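For reference, here is a minimal sketch of what that check could look like, assuming the backend has access to the model's tokenizer and its max_model_len (e.g. from the engine config). The function name `validate_request_length` and the `ValueError`-based error handling are hypothetical and only illustrate the idea, mirroring the kind of check vLLM's _validate_input performs:

```python
# Hypothetical helper for the Triton vLLM backend's execute() path.
# `tokenizer` is assumed to be a Hugging Face-style tokenizer and
# `max_model_len` the model's maximum context length.
def validate_request_length(prompt: str, max_tokens: int,
                            tokenizer, max_model_len: int) -> None:
    """Reject requests whose prompt plus requested completion would
    exceed the model's context window."""
    prompt_tokens = len(tokenizer.encode(prompt))  # count prompt tokens
    requested = prompt_tokens + max_tokens
    if requested > max_model_len:
        raise ValueError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, you requested {requested} tokens "
            f"({prompt_tokens} in the messages, {max_tokens} in the completion). "
            "Please reduce the length of the messages or completion."
        )
```

In execute, a failed check could then be returned to the client as an error response (e.g. wrapped in a pb_utils.TritonError) instead of forwarding the request to the vLLM engine.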

@oandreeva-nv, could you please share your thoughts on this approach?
