↔️ GRPO: Set max_model_len when initializing vLLM instance #2728
Conversation
We could do that, but the memory limitation shouldn't come from generation. Or are you perhaps using the same device for training and generation?
Indeed, my use case is specifically running on a single consumer GPU. It might be wishful thinking, but with this patch I am able to run a training loop for a 1.5B model.
Have you tried reducing the vLLM GPU memory utilization?
Yes, in fact that's what led me to this change. Lowering it reduces the space left for the KV cache, and vLLM prints:
So my understanding is that with this change, the smaller KV cache is utilized more efficiently. Are there any downsides to setting this? We could make it opt-in through an arg if you think it could have negative implications.
Now it makes sense. Can you add an arg in the config instead?
Works for me, done.
Perfect
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think it would be cleaner to add a `vllm_init_kwargs` arg, something like:

```python
vllm_init_kwargs: Optional[dict] = field(
    default_factory=lambda: {
        "device": "auto",
        "gpu_memory_utilization": 0.9,
    },
    metadata={},
)
```
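(For illustration only: such a dict would presumably be unpacked straight into the vLLM constructor, e.g. `LLM(model=model_id, **training_args.vllm_init_kwargs)`; the `training_args` name here is hypothetical.)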
That sounds good to me, but I'm not sure if it can be done in a backwards-compatible manner (or if it's acceptable to make backwards-incompatible flag changes). @qgallouedec thoughts?
This would be the best solution for single-GPU (poor) training, as people might want to tune other parameters as well, as seen in this thread: https://x.com/robertshaw21/status/1885781591961571455
Now updated with `vllm_init_kwargs`.
@qgallouedec I'd recommend merging this before other PRs that change the vLLM init call, as it most likely covers all their needs (also, merging is hard).
In fact, I'm in favor of explicitly stating the parameters for two main reasons:
As for backwards compatibility, GRPO is a new trainer and the lib is still in alpha, so there's no real need to ensure that. This can be discussed again in the future if this lack of flexibility turns out to be a real problem. But adding parameters one by one should be good for now.
Now reverted to just adding a single new arg.
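For illustration, the single-arg approach could look roughly like the sketch below in the config dataclass; the field name `vllm_max_model_len` and the help text are assumptions for this sketch, not necessarily what was merged:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GRPOConfig:  # trimmed to the one field relevant to this sketch
    vllm_max_model_len: Optional[int] = field(
        default=None,
        metadata={
            "help": "If set, forwarded as `max_model_len` when initializing the vLLM "
            "instance; otherwise vLLM falls back to the model's full context length."
        },
    )
```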
Thanks, merging when CI is green :)
What does this PR do?
By default, the vLLM model will be set up to support the full context length of the base model.
However, during training we know we will observe at most `max_prompt_length + max_completion_length` tokens, so we can pass that as the maximum model length to reduce the memory footprint.
This is especially relevant when running on limited hardware, where the memory savings can be significant.
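A minimal sketch of the idea, assuming a vLLM `LLM` instance is created directly (the model name and numbers below are placeholders, not values from this PR):

```python
from vllm import LLM

# Hypothetical training settings; in GRPO these come from the trainer config.
max_prompt_length = 256
max_completion_length = 256

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder model id
    gpu_memory_utilization=0.7,          # leave headroom for the training process
    # Cap the context to what training can actually produce, so the KV cache
    # is sized for max_prompt_length + max_completion_length instead of the
    # model's full context window.
    max_model_len=max_prompt_length + max_completion_length,
)
```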
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.