
[Question] How can I make a limit on the length of the input context and the number of tokens to generate? #8029

Open
ArtemBiliksin opened this issue Feb 24, 2025 · 2 comments

Comments

@ArtemBiliksin

Hello! How can I enforce a limit on context length the way it is done in vllm.entrypoints.openai.serving_engine.py? Right now, if I have an input of 7000 tokens, a maximum LLM context length of 8192 tokens, and a request to generate 5000 tokens (max_tokens=5000), Triton returns no more than 8193 tokens (generation ends either when a stop token is produced or when the model's maximum context length is exhausted). If the same request is sent to vLLM, it returns an error:

"This model's maximum context length is 8192 tokens. However, you requested 12000 tokens (7000 in the messages, 5000 in the completion). Please reduce the length of the messages or completion."

How can I implement the same behavior in Triton?

@ArtemBiliksin ArtemBiliksin changed the title [Question] How can I make a limit on the length of the input context? [Question] How can I make a limit on the length of the input context and the number of tokens to generate? Feb 24, 2025
@Pavloveuge

Maybe @oandreeva-nv can help.

@azsh1725

As far as I understand, vLLM has separate routing for each task (completion, embedding, etc.). Since this question concerns completion, the relevant place to look is ServingCompletion, specifically its create_completion method. The validation in question happens in the _preprocess_completion method; tracing the call chain from there leads to the _validate_input method, where the actual check is implemented.

My intuition is that we need to add a similar method that uses the tokenizer and performs the same validation as _validate_input. It seems appropriate to add this in execute, before or after the LoRA request check; a rough sketch of such a check is below.
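For reference, here is a minimal sketch of what that check could look like, assuming the backend has access to the model's tokenizer and its max_model_len (e.g. from the engine config). The function name `validate_request_length` and the `ValueError`-based error handling are hypothetical and only illustrate the idea, mirroring the kind of check vLLM's _validate_input performs:

```python
# Hypothetical helper for the Triton vLLM backend's execute() path.
# `tokenizer` is assumed to be a Hugging Face-style tokenizer and
# `max_model_len` the model's maximum context length.
def validate_request_length(prompt: str, max_tokens: int,
                            tokenizer, max_model_len: int) -> None:
    """Reject requests whose prompt plus requested completion would
    exceed the model's context window."""
    prompt_tokens = len(tokenizer.encode(prompt))  # count prompt tokens
    requested = prompt_tokens + max_tokens
    if requested > max_model_len:
        raise ValueError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, you requested {requested} tokens "
            f"({prompt_tokens} in the messages, {max_tokens} in the completion). "
            "Please reduce the length of the messages or completion."
        )
```

In execute, a failed check could then be returned to the client as an error response (e.g. wrapped in a pb_utils.TritonError) instead of forwarding the request to the vLLM engine.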

@oandreeva-nv, could you please share your thoughts on this approach?
