
State of video generation in Diffusers #2592

Merged (13 commits) on Jan 27, 2025
10 changes: 10 additions & 0 deletions _blog.yml
@@ -5296,3 +5296,13 @@
- open-source-collab
- nlp
- cv

- local: video_gen
title: "State of open video generation models in Diffusers"
author: sayakpaul
thumbnail: /blog/assets/video_gen/thumbnail.png
date: Jan 23, 2025
Member Author commented:

Need to be updated.

tags:
- diffusers
- guide
- video_gen
242 changes: 242 additions & 0 deletions video_gen.md
@@ -0,0 +1,242 @@
---
title: "State of open video generation models in Diffusers"
thumbnail: /blog/assets/video_gen/thumbnail.png
authors:
- user: sayakpaul
- user: a-r-r-o-w
- user: dn6
---

# State of open video generation models in Diffusers
Member commented:

It might be worth opening with a video from one of the open video models, maybe even drawing a comparison between where video generation models were a year or two back and where they are now!

A good example could be the Will Smith benchmark!


OpenAI’s Sora demo marked a striking advance in AI-generated video last year and gave us a glimpse of the potential capabilities of video generation models. The impact was immediate, and since that demo the video generation space has become increasingly competitive, with major players and startups producing their own highly capable models, such as Google’s Veo 2, Hailuo’s MiniMax, Runway’s Gen 3 Alpha, Kling, Pika, and Luma Labs’ Dream Machine.

Open source has also had its own surge of video generation models, with CogVideoX, Mochi-1, Hunyuan Video, Allegro, and LTX Video. Is the video community having its “Stable Diffusion moment”?

This post will provide a brief overview of the state of video generation models, where we are with respect to open video generation models, and how the Diffusers team is planning to support their adoption at scale.

Specifically, we will discuss:

- Capabilities and limitations of video generation models
- Why video generation is hard
- Open video generation models
- Licensing
- Memory requirements
- Video generation with Diffusers
- Inference and optimizations
- Fine-tuning
- Looking ahead

## Today’s Video Generation Models and their Limitations
Member commented:

Feel free to disagree, but IMO we should only keep the table here, and the limitations can potentially go toward the end of the blog post; that makes it easier to read.

Member Author commented:

I do disagree. I think it's common to start with limitations so that readers have fuller context.

Member commented:

It depends on the vibe you are going for; up to you since you're the author. To me it just feels odd to start with limitations, since even a survey paper conveys the limitations towards the end.


As of today, the models below are among the most popular ones.

| **Provider** | **Model** | **Open/Closed** |
| --- | --- | --- |
| Meta | [MovieGen](https://ai.meta.com/research/movie-gen/) | Closed (with a detailed [technical report](https://ai.meta.com/research/publications/movie-gen-a-cast-of-media-foundation-models/)) |
| OpenAI | [Sora](https://sora.com/) | Closed |
| Google | [Veo 2](https://deepmind.google/technologies/veo/veo-2/) | Closed |
| RunwayML | [Gen 3 Alpha](https://runwayml.com/research/introducing-gen-3-alpha) | Closed |
| Pika Labs | [Pika 2.0](https://pika.art/login) | Closed |
| KlingAI | [Kling](https://www.klingai.com/) | Closed |
| Hailuo | [MiniMax](https://hailuoai.video/) | Closed |
| THUDM | [CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox) | Open |
| Genmo | [Mochi-1](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) | Open |
| RhymesAI | [Allegro](https://huggingface.co/docs/diffusers/main/en/api/pipelines/allegro) | Open |
| Lightricks | [LTX Video](https://huggingface.co/docs/diffusers/main/en/api/pipelines/ltx_video) | Open |
| Tencent | [Hunyuan Video](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video) | Open |

**Limitations**: Despite the continually increasing number of video generation models, their limitations are manifold:

- **High Resource Requirements:** Producing high-quality videos requires large pretrained models, which are computationally expensive to develop and deploy. These costs arise from dataset collection, hardware requirements, extensive training iterations, and experimentation, and they make it hard to justify producing open-source and freely available models. Even though we don’t have a detailed technical report that sheds light on the training resources used, [this post](https://www.factorialfunds.com/blog/under-the-hood-how-openai-s-sora-model-works) provides some reasonable estimates.
- **Limited Generalization:** Several open models suffer from limited generalization capabilities and underperform user expectations. Models may require prompting in a certain way, or LLM-like prompts, or fail to generalize to out-of-distribution data, all of which are hurdles for widespread adoption. For example, models like LTX-Video often need to be prompted in a very detailed and specific way to obtain good-quality generations.
- **High Latency:** The high computational and memory demands of video generation result in significant generation latency. For local usage, this is often a roadblock. Most new open video models are inaccessible on community hardware without extensive memory optimizations and quantization approaches that affect both inference latency and the quality of the generated videos.

## Why is Video Generation Hard?
Member commented:

Same for this, it would be nice to keep a positive outlook going in and then ground it towards the end.

Member Author commented:

Same as above.


There are several factors we’d like to be able to control in generated videos:

- Adherence to Input Conditions (such as a text prompt, a starting image, etc.)
- Realism
- Aesthetics
- Motion Dynamics
- Spatio-Temporal Consistency and Coherence
- FPS
- Duration

With image generation models, we usually only care about the first three aspects. For video generation, however, we also have to consider motion quality, and coherence and consistency over time, potentially with multiple subjects. Finding the right balance between good data, the right inductive priors, and suitable training methodologies for these additional requirements has proved more challenging than in other modalities.

## Open Video Generation Models

Text-to-video generation models have similar components to their text-to-image counterparts:

- Text encoders for providing rich representations of the input text prompt
- A denoising network
- An encoder and decoder to convert between pixel and latent space
- A non-parametric scheduler responsible for managing all the timestep-related calculations and the denoising step

The latest generation of video models shares a core feature: the denoising network processes 3D video tokens that capture both spatial and temporal information. The video encoder-decoder system responsible for producing and decoding these tokens employs both spatial and temporal compression. While decoding the latents typically demands the most memory, these models offer frame-by-frame decoding options to reduce memory usage.
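
In Diffusers, these memory-friendly decoding paths are exposed as opt-in switches on the video VAE. A minimal sketch, assuming the CogVideoX autoencoder as the example (other video VAEs expose similar toggles):

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Load only the video VAE of CogVideoX in half precision.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
)

# Decode latents in smaller chunks (frame slices / spatial tiles) instead of
# all at once, trading a little speed for a much lower peak-memory footprint.
vae.enable_slicing()
vae.enable_tiling()
```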

Text conditioning is incorporated through either joint attention (introduced in [Stable Diffusion 3](https://arxiv.org/abs/2403.03206)) or cross-attention. T5 has emerged as the preferred text encoder across most models, with HunyuanVideo being an exception in its use of both CLIP-L and Llama 3.

The denoising network itself builds on the DiT architecture developed by [William Peebles and Saining Xie](https://arxiv.org/abs/2212.09748), while incorporating various design elements from [PixArt](https://arxiv.org/abs/2310.00426).
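
To see these pieces concretely, the components of a Diffusers video pipeline can be listed directly. A minimal sketch with HunyuanVideo (loading in `bfloat16` is just an assumption for the example):

```python
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)

# Typical entries: the text encoder(s) and tokenizer(s), the DiT-style
# `transformer` denoiser, the video `vae`, and the `scheduler`.
for name, component in pipe.components.items():
    print(name, type(component).__name__)
```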

### Licensing

The table below provides a list of the checkpoints of the most popular open video generation models, along with their licenses. Mochi-1, despite being a large and high-quality model, comes with an Apache 2.0 license!

| **Model Name** | **License** |
| --- | --- |
| [`THUDM/CogVideoX1.5-5B`](https://huggingface.co/THUDM/CogVideoX1.5-5B) | [Link](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) |
| [`THUDM/CogVideoX1.5-5B-I2V`](https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V) | [Link](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) |
| [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b) | [Link](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) |
| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | [Link](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) |
| [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b) | Apache 2.0 |
| [`genmo/mochi-1-preview`](https://huggingface.co/genmo/mochi-1-preview) | Apache 2.0 |
| [`rhymes-ai/Allegro`](https://huggingface.co/rhymes-ai/Allegro) | Apache 2.0 |
| [`tencent/HunyuanVideo`](https://huggingface.co/tencent/HunyuanVideo) | [Link](https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE) |
| [`Lightricks/LTX-Video`](https://huggingface.co/Lightricks/LTX-Video) | [Link](https://huggingface.co/Lightricks/LTX-Video/blob/main/License.txt) |

### Memory requirements
Member commented:

It might be beneficial to add inference examples for all/ some models that you mention here, to ground that diffusers is the place to go for inference.

Member commented:

Maybe even with video snippets embedded from those, so that people can visually experience them as well.

Member Author commented:

> It might be beneficial to add inference examples for all/ some models that you mention here, to ground that diffusers is the place to go for inference.

It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models. This is a TODO and will be addressed by @DN6.

Member commented:

> It will make it unnecessarily verbose. We will do some snippets but will keep it for only one model as we're already citing the docs for the other models.

Not really, you can just wrap them up into <details> so that it is collapsed by default.

Member Author commented:

It will make it a bit redundant IMO, as the code doesn't change much. So showing a single model is sufficient, I think.


The memory requirements for any model can be computed by adding the following:

- Memory required for weights
- Maximum memory required for storing intermediate activation states

Memory required by the weights can be lowered via quantization, downcasting to lower dtypes, or offloading to the CPU. Memory required for the activation states can also be lowered, but that is usually a more involved process and out of the scope of this blog post.
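
As a rough sanity check, the weight portion of this budget can be estimated from the parameter count and the storage dtype. A minimal sketch using the HunyuanVideo transformer, the same component benchmarked later in this post:

```python
import torch
from diffusers import HunyuanVideoTransformer3DModel

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Estimate weight memory: number of parameters times bytes per parameter.
num_params = sum(p.numel() for p in transformer.parameters())
bytes_per_param = torch.finfo(torch.bfloat16).bits // 8  # 2 bytes per bf16 parameter
print(f"Transformer weights alone: ~{num_params * bytes_per_param / 1024**3:.2f} GB")
```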

It is possible to run any video model with extremely low memory, but it comes at the cost of inference time. If an optimization technique slows generation beyond what a user considers reasonable, running inference with it is not practical. Diffusers provides many such optimizations that are opt-in and can be chained together.

In the table below, we provide the memory requirements for three popular video generation models with reasonable defaults:

| **Model Name** | **Memory (GB)** |
| --- | --- |
| HunyuanVideo | 60.09 |
| LTX-Video | 17.75 |
| CogVideoX (1.5 5B) | 36.51 |

These numbers were obtained with the following settings on an 80GB A100 machine (full script [here](https://gist.github.com/sayakpaul/2bc49a30cf76cea07914104d28b1fb86)):

- `torch.bfloat16` dtype
- `num_frames`: 121, `height`: 512, `width`: 768
- `max_sequence_length`: 128
- `num_inference_steps`: 50

These requirements are quite staggering, making the models difficult to run on consumer hardware. As mentioned above, with Diffusers, users can enable different optimizations to suit their needs. The following table provides the memory requirements of HunyuanVideo with sensible optimizations enabled (ones that do not compromise much on quality or inference time). We chose HunyuanVideo for this study as it is sufficiently large to show the benefits of the optimizations in a progressive manner.

| **Setting** | **Memory** |
| --- | --- |
| Base | 60.10 GB |
| VAE tiling | 43.58 GB |
| CPU offloading | 28.87 GB |
| 8-bit | 49.90 GB |
| 8-bit + CPU offloading* | 35.66 GB |
| 8-bit + VAE tiling | 36.92 GB |
| 8-bit + CPU offloading + VAE tiling | 26.18 GB |
| 4-bit | 42.96 GB |
| 4-bit + CPU offloading | 21.99 GB |
| 4-bit + VAE tiling | 26.42 GB |
| 4-bit + CPU offloading + VAE tiling | 14.15 GB |
Member commented:

@sayakpaul Have we made note of the time required for each of these methods? IMO it would be helpful for users to understand the tradeoffs that come with each and the expected slowdown

It would also set the stage to tease the new banger feature, prefetched offloading, coming soon, which uses as little memory as sequential CPU offloading (so around ~3 GB) without compromising speed. CPU RAM requirements are the same as for any other offloading method. LMK what you think

Member Author commented:

The reasons why I didn't:

  1. Video generation is time-consuming, especially HunyuanVideo. Not sure if most users care about the inference latency taking a hit because of memory optims.
  2. We don't have the other features merged yet, so I didn't feel comfortable benchmarking them.

If you feel strongly about the timing note, feel free to add the changes.

Member commented:

From the Comfy community side at least, I know that people do care about the time required and try to work with settings that reduce the overall time (lower resolution/frames + latent upscaling, sage attention, fp8 matmul, etc., because they already have support for some good memory optims). So, I think it will be beneficial to mention time here, because if we only cared about reducing memory, everyone would just default to something like sequential CPU offloading.

Could you provide me with the benchmark script from where you got the current numbers? Will run the same and measure time as well

Member Author commented:

Here:

Code:

```python
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig as BitsAndBytesConfig
import argparse
import json
import torch 

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."

def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )
    
    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    
    if not args.enable_model_cpu_offload:
        pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}.json"

    pipe = load_pipeline(args)

    _ = pipe(
        prompt, 
        height=512, 
        width=768, 
        num_frames=121, 
        generator=torch.manual_seed(0),
        num_inference_steps=50
    )

    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb
    }

    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)

    print(f"Serialized to {output_path=}")
```

Member Author commented:

I would keep the settings similar, though. If we have to reduce the number of frames, resolution, etc., I'd make a separate note and not change the settings during benchmarking.

@a-r-r-o-w (Member) commented on Jan 23, 2025:

Here's the results with time required for each method + FP8-layerwise-upcasting since the PR was merged.

| **Setting**                                        | **Memory**    | **Time** |
|:--------------------------------------------------:|:-------------:|:--------:|
| BF16 Base                                          | 60.10 GB      |  863s    |
| BF16 + CPU offloading                              | 28.87 GB      |  917s    |
| BF16 + VAE tiling                                  | 43.58 GB      |  870s    |
| 8-bit BnB                                          | 49.90 GB      |  983s    |
| 8-bit BnB + CPU offloading*                        | 35.66 GB      | 1041s    |
| 8-bit BnB + VAE tiling                             | 36.92 GB      |  997s    |
| 8-bit BnB + CPU offloading + VAE tiling            | 26.18 GB      | 1260s    |
| 4-bit BnB                                          | 42.96 GB      |  867s    |
| 4-bit BnB + CPU offloading                         | 21.99 GB      |  953s    |
| 4-bit BnB + VAE tiling                             | 26.42 GB      |  889s    |
| 4-bit BnB + CPU offloading + VAE tiling            | 14.15 GB      |  995s    |
| FP8 Upcasting                                      | 51.70 GB      |  856s    |
| FP8 Upcasting + CPU offloading                     | 21.99 GB      |  983s    |
| FP8 Upcasting + VAE tiling                         | 35.17 GB      |  867s    |
| FP8 Upcasting + CPU offloading + VAE tiling        | 20.44 GB      | 1013s    |
| BF16 + Group offload (blocks=8) + VAE tiling       | 15.67 GB      |  925s    |
| BF16 + Group offload (blocks=1) + VAE tiling       |  7.72 GB      |  881s    |
| BF16 + Group offload (leaf) + VAE tiling           |  6.66 GB      |  887s    | 
| FP8 Upcasting + Group offload (leaf) + VAE tiling  |  6.56 GB      |  885s    |

Still haven't added groupwise offloading yet, since I had another idea about optimizing it to further reduce memory. I will for sure be able to send the numbers for it later today. Will push the changes directly EOD.

Member Author commented:

Thanks Aryan!

Member commented:

Here's the updated benchmark code (did not modify the original parts and just kept to the fp8 and group offloading additions)

Code:

```python
from diffusers import HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers import BitsAndBytesConfig
import argparse
import json
import torch 
import time
from diffusers.utils import export_to_video
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils.logging import set_verbosity_debug

set_verbosity_debug()

prompt = "A cat walks on the grass, realistic. The scene resembles a real-life footage and should look as if it was shot in a sunny day."

def load_pipeline(args):
    if args.bit4_bnb:
        quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    elif args.bit8_bnb:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo",
            subfolder="transformer",
            quantization_config=quant_config,
            torch_dtype=torch.bfloat16,
        )
    else:
        transformer = HunyuanVideoTransformer3DModel.from_pretrained(
            "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
        )
    
    if args.layerwise_casting:
        transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
    
    pipe = HunyuanVideoPipeline.from_pretrained(
        "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
    )
    
    if not args.enable_model_cpu_offload:
        if args.group_offloading == "0":
            pipe = pipe.to("cuda")
    else:
        pipe.enable_model_cpu_offload()
    
    if args.vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_model_cpu_offload", type=int, choices=[0, 1])
    parser.add_argument("--vae_tiling", type=int, choices=[0, 1])
    parser.add_argument("--bit4_bnb", type=int, choices=[0, 1])
    parser.add_argument("--bit8_bnb", type=int, choices=[0, 1])
    parser.add_argument("--layerwise_casting", type=int, choices=[0, 1])
    parser.add_argument("--group_offloading", type=str, choices=["0", "1", "8", "leaf_level"])
    args = parser.parse_args()

    # Construct output path based on argument values
    output_path = f"group_offloading@{args.group_offloading}_4bit@{args.bit4_bnb}_8bit@{args.bit8_bnb}_tiling@{args.vae_tiling}_offload@{args.enable_model_cpu_offload}_layerwise@{args.layerwise_casting}.json"

    pipe = load_pipeline(args)

    if args.group_offloading != "0":
        apply_group_offloading(
            pipe.text_encoder,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.text_encoder_2,
            offload_type="leaf_level",
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        apply_group_offloading(
            pipe.transformer,
            offload_type="block_level" if args.group_offloading in ["1", "8"] else "leaf_level",
            num_blocks_per_group=8 if args.group_offloading == "8" else 1 if args.group_offloading == "1" else None,
            offload_device=torch.device("cpu"),
            onload_device=torch.device("cuda"),
            force_offload=True,
            non_blocking=True,
            use_stream=True,
        )
        pipe.vae.to("cuda")
    
        # warmup for prefetch hooks to figure out layer execution order
        _ = pipe(prompt, height=64, width=64, num_frames=9, num_inference_steps=2)

    t1 = time.time()
    video = pipe(
        prompt, 
        height=512, 
        width=768, 
        num_frames=121,
        generator=torch.manual_seed(0),
        num_inference_steps=30,
    )
    t2 = time.time()

    video = video.frames[0]
    export_to_video(video, output_path[:-5] + ".mp4", fps=30)

    memory = torch.cuda.max_memory_allocated() / (1024 ** 3)

    # Serialize memory usage info to JSON
    memory_data = {
        "prompt": prompt,
        "height": 512,
        "width": 768,
        "num_frames": 121,
        "num_inference_steps": 50,
        "gpu_memory_usage_gb": memory,
        "inference_time": round(t2 - t1, 2),
        "enable_model_cpu_offload": args.enable_model_cpu_offload,
        "vae_tiling": args.vae_tiling,
        "bit4_bnb": args.bit4_bnb,
        "bit8_bnb": args.bit8_bnb
    }

    with open(output_path, "w") as json_file:
        json.dump(memory_data, json_file, indent=4)

    print(f"Serialized to {output_path=}")
```

Member commented:

BF16, 121 frames, 512x768 resolution in under 7 GB (further reduced to under 5 GB with flash attention and an optimized feed-forward, huggingface/diffusers#10623). Did we cook or did we cook? 👨‍🍳


*8-bit models in `bitsandbytes` cannot be moved from GPU to CPU, unlike the 4-bit ones.

We used the same settings as above to obtain these numbers. Quantization was performed with the [`bitsandbytes` library](https://huggingface.co/docs/bitsandbytes/main/en/index) (Diffusers [supports three different quantization backends](https://huggingface.co/docs/diffusers/main/en/quantization/overview) as of now). Also note that due to numerical precision loss, quantization can impact the quality of the outputs, the effects of which are more prominent in videos than in images.
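
Putting these pieces together, here is a minimal sketch of the kind of configuration behind the most memory-friendly rows of the table above (4-bit quantization, CPU offloading, and VAE tiling with HunyuanVideo); treat it as illustrative rather than as the exact benchmark script:

```python
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

# Quantize the transformer to 4-bit NF4 to shrink the weight footprint.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.bfloat16
)

# Keep idle components on the CPU and decode the latents in tiles.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    "A cat walks on the grass, realistic.",
    height=512,
    width=768,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
```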

## Video Generation with Diffusers
Member commented:

More in-line with the suggestion above, I'd recommend moving this above optimisations/ memory etc.

Member Author commented:

Resolved.


<div align="center">
<iframe src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/video_gen/hunyuan-output.mp4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

There are three broad categories of generation possible when working with video models (a minimal example of the first follows the list below):

1. Text to Video
2. Image or Image Control condition + Text to Video
3. Video or Video Control condition + Text to Video
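
As a concrete example of the first category, here is a minimal text-to-video sketch with LTX-Video, one of the lighter open models listed above (the prompt and output path are illustrative):

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

prompt = "A woman walks away from a campfire on a beach at dusk, cinematic, shallow depth of field."
# LTX-Video works best with `num_frames` of the form 8k + 1 (161 frames at 24 fps is roughly 6.7 s).
video = pipe(prompt=prompt, num_frames=161, num_inference_steps=50).frames[0]

export_to_video(video, "ltx_video.mp4", fps=24)
```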

### Suite of optimizations
Member Author commented:

@DN6 if you could take care of the code, that would be helpful!


Video generation can be quite difficult on resource-constrained devices and time-consuming even on beefier GPUs. Diffusers provides a suite of utilities that help to optimize both the runtime and memory consumption of these models. These optimizations fall under the following categories:

- **Quantization**: Model weights are quantized to lower-precision data types, which lowers the VRAM requirements.
- **Offloading**: Different layers of a model can be loaded onto the GPU on the fly when required for computation and then offloaded back to the CPU. This saves a significant amount of memory during inference.
- **Chunked Inference**: By splitting inference across non-embedding dimensions of the input latent tensors, the memory overhead from intermediate activation states can be reduced. This technique is commonly seen in encoder/decoder slicing and tiling.
- **Re-use of Attention & MLP states**: For certain algorithms, the computation of some denoising steps can be skipped and past states re-used when particular conditions are satisfied, speeding up the generation process with minimal quality loss.

Note that, of the four options above, we only support the first two as of now. Support for the remaining two will be merged soon. If you’re interested in following the progress, here are the PRs:

- TODO:
- TODO:
Member commented:

We are very close to merging PAB, which should cover attention & MLP state re-use. For chunked inference, slicing/tiling/FreeNoise-split-inference are great examples already.

For offloading, we currently only have group offloading pending (which might take a while to review and merge), but the PR is 90% ready IMO, so we can mention it -- especially because it has no speed overheads while drastically reducing memory requirements.

So, IMO we should not mention these few lines ("..., we only support the first two")

@sayakpaul (Member, Author) commented on Jan 20, 2025:

Feel free to perform those changes directly here. I would go with:

> So, IMO we should not mention these few lines ("..., we only support the first two")

And make it clear what's upcoming (the ones you have opened PRs for).

Member Author commented:

@a-r-r-o-w I have taken care of the edits. LMK if that works for you.


The list of memory optimizations discussed here will keep growing and soon become non-exhaustive, so we suggest always keeping an eye on the Diffusers repository to stay updated.

We can also apply optimizations at training time. The most well-known techniques applied to video models include:

- **Timestep distillation**: This involves teaching the model to denoise the noisy latents in fewer inference steps, in a recursive fashion. For example, if a model takes 32 steps to generate good videos, it can be augmented to try to predict the final outputs in only 16 steps, 8 steps, or even 2 steps! This may be accompanied by a loss in quality depending on how few steps are used. Some examples of timestep-distilled models include [Flux.1-Schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell/) and [FastHunyuan](https://huggingface.co/FastVideo/FastHunyuan).
- **Guidance distillation**: [Classifier-Free Guidance](https://arxiv.org/abs/2207.12598) is a technique widely used in diffusion models that enhances generation quality. This, however, doubles the generation time because it involves two full forward passes through the models per inference step, followed by an interpolation step. By teaching models to predict the output of both forward passes and interpolation at the cost of one forward pass, this method can enable much faster generation. Some examples of guidance-distilled models include [Flux.1-Dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) and [HunyuanVideo](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video).
- **Architectural compression**: The model is distilled into a smaller architecture, as done in [SSD-1B](https://huggingface.co/segmind/SSD-1B).

We refer the readers to [this guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid) for a detailed take on video generation and the current possibilities in Diffusers.

### Fine-tuning

We’ve created [`finetrainers`](https://github.com/a-r-r-o-w/finetrainers) - a repository that allows you to easily fine-tune the latest generation of open video models. For example, here is how you would fine-tune CogVideoX with LoRA:

```bash
# Download a dataset
huggingface-cli download \
--repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset \
--local-dir video-dataset-disney

# Then launch training
accelerate launch train.py \
--model_name="cogvideox" --pretrained_model_name_or_path="THUDM/CogVideoX1.5-5B" \
--data_root="video-dataset-disney" \
--video_column="videos.txt" \
--caption_column="prompt.txt" \
--training_type="lora" \
--seed=42 \
--mixed_precision="bf16" \
--batch_size=1 \
--train_steps=1200 \
--rank=128 \
--lora_alpha=128 \
--target_modules to_q to_k to_v to_out.0 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing \
--checkpointing_steps 500 \
--checkpointing_limit 2 \
--enable_slicing \
--enable_tiling \
--optimizer adamw \
--lr 3e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--lr_num_cycles 1 \
--beta1 0.9 \
--beta2 0.95 \
--weight_decay 1e-4 \
--epsilon 1e-8 \
--max_grad_norm 1.0

# ...
# (Full training command removed for brevity)
```
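
Once training finishes, the resulting LoRA can typically be loaded back into the matching Diffusers pipeline for inference. A minimal sketch, where the local LoRA path, adapter name, and prompt are hypothetical:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16
).to("cuda")

# Load the LoRA produced by the finetrainers run above (path is illustrative).
pipe.load_lora_weights("path/to/saved/lora", adapter_name="disney")

video = pipe("A panda playing a guitar in a snowy forest, in the style of the training data").frames[0]
export_to_video(video, "lora_sample.mp4", fps=8)  # adjust fps to the model's native frame rate
```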

For more details, check out the repository [here](https://github.com/a-r-r-o-w/finetrainers).

## Looking ahead

It has become quite apparent that video generation models will continue to grow in 2025, and Diffusers users can expect more optimization-related goodies. Our goal is also to make video model fine-tuning easy and accessible, which is why we will continue to grow the `finetrainers` library. LoRA training is just the beginning; there’s more to come: Control LoRAs, distillation algorithms, ControlNets, adapters, and more. We would love to welcome contributions from the community as we go 🤗

We will also continue to collaborate with model publishers, fine-tuners, and anyone from the community willing to help us take the state of video generation to the next level and bring you the latest and the greatest in the domain.

## Resources

We cited a number of links throughout the post. To make sure you don’t miss out on the most important ones, we provide a list below:

- [Video generation guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/text-img2vid)
- [Quantization support in Diffusers](https://huggingface.co/docs/diffusers/main/en/quantization/overview)
- [General LoRA guide in Diffusers](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
- [Memory optimization guide for CogVideoX](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) (it applies to other video models, too)
- [`finetrainers`](https://github.com/a-r-r-o-w/finetrainers) for fine-tuning