Module Group Offloading #10503
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think this fits well in the offloading I'm working on in modular diffusers.
Maybe we should consolidate a bit; I will separate the offloading part into its own PR.
from .hooks import HookRegistry, ModelHook

_COMMON_STACK_IDENTIFIERS = {
I think it might be better to have this as an attribute within each model.
Actually, I will remove this completely. This should be applicable to any model containing a ModuleList or Sequential, because we know for sure, at least in Diffusers, that these layers are called sequentially and not in some unusual access pattern.
So the check will just look for those two classes with isinstance.
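A minimal sketch of that check (the helper name is hypothetical, not the PR's code):

```python
import torch.nn as nn

def _find_sequential_stacks(model: nn.Module):
    # Hypothetical helper: collect submodules whose children are known to be
    # invoked in order (ModuleList / Sequential in Diffusers models), so each
    # one can later be split into offloadable groups.
    return [
        (name, submodule)
        for name, submodule in model.named_modules()
        if isinstance(submodule, (nn.ModuleList, nn.Sequential))
    ]
```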
buffer.data = buffer.data.to(onload_device)

def _apply_group_offloading_group_patterns(
I think these can be consolidated into a single function that uses the offload_group_pattern. If we add something like a _group_offload_modules attribute to the Model class, we can just extend it with the offload_group_patterns argument here.
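A rough sketch of the suggested consolidation, assuming offload_group_patterns is a list of submodule-name patterns and apply_fn stands in for whatever attaches the offloading hook to a matched group (both names are illustrative, not the PR's API):

```python
import re
import torch.nn as nn

def _apply_group_offloading_group_patterns(module: nn.Module, offload_group_patterns, apply_fn):
    # Illustrative only: match submodule names against user-provided patterns
    # and hand each matched group to a single offloading routine.
    for pattern in offload_group_patterns:
        group = [m for name, m in module.named_modules() if re.fullmatch(pattern, name)]
        if group:
            apply_fn(group)
```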
src/diffusers/hooks/hooks.py (outdated)
return module

class HookRegistry:
This looks good 👍🏽
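For readers following along, a simplified sketch of the hook-registry pattern under review here (not the PR's exact implementation):

```python
import torch.nn as nn

class ModelHook:
    # Minimal hook interface; subclasses customize behavior around forward.
    def pre_forward(self, module, *args, **kwargs):
        return args, kwargs

    def post_forward(self, module, output):
        return output

class HookRegistry:
    # Simplified registry: named hooks attached to one module, with the
    # module's forward wrapped exactly once.
    def __init__(self, module: nn.Module):
        self._module = module
        self._original_forward = module.forward
        self.hooks = {}
        module.forward = self._wrapped_forward

    def register_hook(self, hook: ModelHook, name: str):
        self.hooks[name] = hook

    def _wrapped_forward(self, *args, **kwargs):
        for hook in self.hooks.values():
            args, kwargs = hook.pre_forward(self._module, *args, **kwargs)
        output = self._original_forward(*args, **kwargs)
        for hook in reversed(list(self.hooks.values())):
            output = hook.post_forward(self._module, output)
        return output
```

A real registry would also need hook removal and state handling; this only shows the core forward-wrapping idea.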
* cuda stream prefetch
* remove breakpoints
Some more numbers after the latest changes:
Continuing from our internal thread, we have a positive signal that sequential CPU offloading can be done without any hit to inference time when using CUDA streams for transfers.
Not doing so will lead to erroneous computations on the GPU and cause bad results.
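A sketch of the stream-based prefetch idea, assuming a dedicated transfer stream (illustrative, not the PR's exact code): weight copies for the next group run on a side stream, and the compute stream must wait on it before using those weights.

```python
import torch

transfer_stream = torch.cuda.Stream()

def prefetch_group(params, device="cuda"):
    # Kick off HtoD copies for the next group's parameters on a side stream.
    # non_blocking=True only overlaps with compute if the CPU tensors are pinned.
    with torch.cuda.stream(transfer_stream):
        for p in params:
            p.data = p.data.to(device, non_blocking=True)

def wait_for_prefetch():
    # The compute (current) stream must wait for the transfer stream before
    # running kernels that read these weights; skipping this sync is what
    # produces the erroneous results mentioned above.
    torch.cuda.current_stream().wait_stream(transfer_stream)
```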
@DN6 I've tried addressing all the review comments. LMK what you think about the changes added. There's now a […]. I've added GPU memory tests with a dummy model. It is separate from the per-model tests because there are no memory savings for certain models and the test would fail -- the reason being that they're typically a single block and not bound by intermediate activation tensor size.
@stevhliu Could you give this a look too for the docs please? I'm not quite sure if this is the best place to mention it, but it seemed ideal. LMK if you think it should be added elsewhere.
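For illustration, such a memory test could compare peak allocated memory with and without offloading using a helper like this (the helper and the assertion are hypothetical, not the PR's test code):

```python
import torch

def peak_gpu_memory(run_fn):
    # Return peak allocated GPU memory (bytes) observed while running run_fn.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()

# A test would then assert something like:
# assert peak_gpu_memory(forward_with_offloading) < peak_gpu_memory(forward_without_offloading)
```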
Thanks, perfect place for these docs! 🔥
Very nicely done @a-r-r-o-w 🚀
@DN6 Based on our chat about handling […]
enable_model_cpu_offload onloads the entire transformer model at once. The minimum memory requirement is therefore determined by the size of the transformer; for large models, it is sometimes impossible to even fit the model in GPU memory.

enable_sequential_cpu_offload has very minimal memory requirements, but it is too slow because of the many synchronous device transfers. We can speed this up with async CUDA streams to "hide" the HtoD and DtoH transfer latency by overlapping it with computation. The CUDA streams implementation would need to come from accelerate, since we rely on it for memory management in this case.

*The benchmarks were run with a mistake in the offloading code. This caused the text encoder to be on the GPU instead of being offloaded, making the comparison unfair to those runs marked without a *.
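For context, the two existing options look roughly like this on a pipeline (the checkpoint below is just an example):

```python
import torch
from diffusers import DiffusionPipeline

# Any large pipeline works here; the checkpoint is just an example.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Option 1: offload whole model components; the memory floor is the size of
# the largest component (here, the transformer).
pipe.enable_model_cpu_offload()

# Option 2: offload leaf modules one by one; very low memory, but slow due to
# many synchronous transfers.
# pipe.enable_sequential_cpu_offload()

image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```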
Benchmark
Some goals of this PR:
* Fully compatible with torch.compile. There are a few recompiles triggered; not really sure how to get away with it.
* In a way, these changes can enable both enable_model_cpu_offload and enable_sequential_cpu_offload because of the way offload_group_patterns can be leveraged, but that is not the goal.