I combined your code with diffusers stable diffusion and trained a model #10
Hi, @lxj616 |
@hxngiee I trained the model using examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint.py from thedarkzeno or patil-suraj. Because we are doing video, the dataset is loaded as (b, c, f, h, w) instead of (b, c, h, w); everything else is taken care of by the original script authors. For how to use fp16/accelerate/8-bit Adam, please see the README of the dreambooth subfolder; those options are mostly usable out of the box. If you need more explanation, I could also share my train_dreambooth.py, but it is very messy code (I didn't even rename the file, LOL) full of hardcoded hacky tricks. I guess you'll end up rewriting the original train_dreambooth_inpaint.py, and that's faster than debugging mine. |
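A minimal sketch (not lxj616's actual loader) of what "load the dataset as (b, c, f, h, w)" can look like in PyTorch; the dataset class and clip tensors here are assumptions for illustration only:

```python
# Sketch: collate per-clip tensors into a (b, c, f, h, w) video batch.
import torch
from torch.utils.data import Dataset, DataLoader

class VideoFrameDataset(Dataset):
    def __init__(self, clips):
        # `clips` is assumed to be a list of float tensors shaped (f, c, h, w)
        self.clips = clips

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        frames = self.clips[idx]              # (f, c, h, w)
        return frames.permute(1, 0, 2, 3)     # -> (c, f, h, w)

def collate_fn(batch):
    # stacking (c, f, h, w) items yields the (b, c, f, h, w) layout
    return {"pixel_values": torch.stack(batch)}

# loader = DataLoader(VideoFrameDataset(clips), batch_size=1, collate_fn=collate_fn)
```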
@lxj616 cool |
@hxngiee It's the text2image model with new temporal layers: the text2image model is stable diffusion, and the new layers need to be trained, similar to the dreambooth example, since you asked how to train a model in diffusers... We are not finetuning the text2image model; the backbone is frozen and only the new layers are trained. You may want to read train_dreambooth_inpaint.py to understand how to train this video model, but don't get the wrong idea: we are talking make-a-video, not dreambooth. |
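A minimal sketch of "backbone frozen, only the new layers are trained"; selecting the temporal layers by a "temporal" name substring is an assumption for illustration, not necessarily how the actual script identifies them:

```python
# Sketch: freeze the pretrained UNet and train only the added temporal layers.
import torch

def trainable_temporal_params(unet, keyword="temporal"):
    unet.requires_grad_(False)            # freeze the whole text2image backbone
    params = []
    for name, p in unet.named_parameters():
        if keyword in name:               # unfreeze only the new temporal layers
            p.requires_grad_(True)
            params.append(p)
    return params

# optimizer = torch.optim.AdamW(trainable_temporal_params(unet), lr=1e-5)
```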
@lxj616 Thank you for your reply. I understand what you did: to make a video, you add temporal consistency layers and train them similar to dreambooth. Pseudo3DConv and pseudo-3D attention were effective for training the video diffusion model. Thanks for sharing your findings, and I will look at the code closely! |
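For readers unfamiliar with the pseudo-3D attention mentioned here, a minimal illustrative sketch (not the repo's implementation) of a temporal attention block that mixes information across frames at each spatial location, on top of frozen per-frame spatial blocks:

```python
# Sketch: temporal self-attention over the frame axis of a (b, c, f, h, w) feature map.
import torch
import torch.nn as nn

class PseudoTemporalAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (b, c, f, h, w) feature map from the frozen spatial blocks
        b, c, f, h, w = x.shape
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)   # frames as a sequence
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)
        tokens = tokens + attended                                    # residual keeps init close to the 2D model
        return tokens.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)   # back to (b, c, f, h, w)
```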
@lxj616 nice! yea, i still need to complete https://github.com/lucidrains/classifier-free-guidance-pytorch , and then integrate this all into dalle2-pytorch should be all complete in early january. if you cannot wait, try using this |
@lxj616 cat is looking quite majestic 🤣 let's make it move in 2023 |
nice~ |
Amazing job. |
Could it be possible to do so? |
@chavinlo I dropped my messy script at https://gist.github.com/lxj616/5134368f44aca837304530695ee100ea But I bet it would be quicker to modify the original train_dreambooth.py from diffusers than to debug mine; I barely got it to run in my specific environment, and there's a 99.9% chance it's not gonna run on your system LOL |
Thanks. Could it be also possible to release the webdataset making code? |
I've read your blog about VRAM limitations. If you need more compute, I can give you an A100 to experiment. |
@chavinlo Thanks for asking, but 24GB is enough for testing if I pre-compute the embeddings and save them into a webdataset. Since you have an A100 (perhaps 40GB VRAM), you don't need my webdataset-making code; you can just load a video and VAE-encode it on the fly (which is much easier to use). My webdataset creation was actually done in a Python interactive shell and I did not save a script, because I thought it was a one-time thing per dataset. I may need to log everything down on my next attempt... |
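A minimal sketch of the on-the-fly alternative described above, assuming a (b, c, f, h, w) pixel batch scaled to [-1, 1] and the standard diffusers AutoencoderKL; the model id and the 0.18215 latent scale are the usual Stable Diffusion defaults, used here only for illustration:

```python
# Sketch: encode video frames with the SD VAE on the fly instead of pre-computing latents.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.requires_grad_(False)

def encode_video(pixel_values, vae):
    b, c, f, h, w = pixel_values.shape
    frames = pixel_values.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)   # fold frames into the batch
    with torch.no_grad():
        latents = vae.encode(frames).latent_dist.sample() * 0.18215        # per-frame VAE encoding
    # back to (b, 4, f, h/8, w/8) for the pseudo-3D UNet
    return latents.reshape(b, f, 4, h // 8, w // 8).permute(0, 2, 1, 3, 4)
```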
@lxj616 Thanks. One more question: in the preprocess function, you treat npz as if it had all the videos? Because it iterates through it, adds all the frames of npz_i into f8_list, and keeps doing that until there are no more npz_i left? Finally, does example['f8'] contain all the videos' frames, or just a single video's frames? |
@chavinlo One npz contains all the video frames of one single video; the loop is dealing with a batch, and the final example['f8'] is a batch of video frames with shape (b, c, f, h, w), where f is the frame length. |
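A minimal sketch of the preprocess logic as described (one npz per video, batched into (b, c, f, h, w)); the key names and the bytes-per-sample layout are assumptions for illustration, not the actual webdataset schema:

```python
# Sketch: turn a batch of per-video npz blobs into a (b, c, f, h, w) latent batch.
import io
import numpy as np
import torch

def preprocess(examples):
    f8_list = []
    for npz_bytes in examples["npz"]:                   # one npz per video in the batch
        with np.load(io.BytesIO(npz_bytes)) as npz_i:
            frames = torch.from_numpy(npz_i["f8"])      # all frames of one video: (f, 4, h/8, w/8)
        f8_list.append(frames.permute(1, 0, 2, 3))      # -> (c, f, h, w)
    examples["f8"] = torch.stack(f8_list)               # -> (b, c, f, h, w)
    return examples
```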
@lxj616 Hello again, I got training working with batch size 1 and 25 frames, although I had to convert the model to bfloat16 because I got OOM with fp32 (80GB+) and loss=nan with fp16. I see that you mentioned you used fp16 and 8-bit. How did you manage to use them? I can't use 8-bit with my current setup because it won't work with bf16. |
Also, bf16 uses 44GB, but with gradient checkpointing it decreases to 11GB. |
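For reference, a minimal sketch of the bf16 plus gradient-checkpointing combination discussed here, using the standard diffusers/accelerate calls; `unet`, `optimizer`, and `train_dataloader` are assumed to already exist:

```python
# Sketch: bf16 mixed precision via accelerate, plus gradient checkpointing on the UNet.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")
unet.enable_gradient_checkpointing()   # trades extra compute for a large VRAM reduction
unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)
```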
@chavinlo Hmmm... I never ran into this problem, I just use the original code and when running |
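For completeness, a minimal sketch of the 8-bit Adam setup the stock dreambooth scripts enable via bitsandbytes, as mentioned earlier in the thread; the learning rate and the parameter group are illustrative:

```python
# Sketch: 8-bit AdamW from bitsandbytes over the trainable (temporal) parameters.
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    trainable_params,        # e.g. the temporal-layer parameters from the earlier sketch
    lr=5e-6,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
    eps=1e-8,
)
```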
@lxj616 @lucidrains @Samge0 @hxngiee @chavinlo Hello, I'm starting a startup using lxj616's Make-a-stable-diffusion-video repository as one of the models for the text2video product, similar to what MidJourney does with text2image. Our long term goal is to allow anybody to create a Hollywood movie in 1 hour. If it succeeds, it could be one of the biggest companies in the world. If any of you are interested in becoming a cofounder for an equal split of the company, I've explained our short and long term plans at https://youtu.be/lbhUB1GyYZE |
Hello to you too. I don't know how to reply to you, because there are many things you might wish to dig into and learn further before boldly going on a long adventure. I saw your comment 14 days ago asking what pretrained_model_name_or_path to use, and honestly I don't think I can answer that in simple words either, for it's not as simple as you might think. However, you are welcome to ask, and please understand we cannot reply to you every time if we don't know how to respond properly, like this time, and maybe last time. |
Thank you for the reply. All logic leads me to the fact that I should learn it as well. Honestly it makes me a bit mad that I will need 6 months of every day to learn all of this, but it is what it is. I made the CNN MNIST digit recognition, so that's something.
|
Not really though |
Oh, if I'm able to start making something in a month, that would be very interesting. |
@lxj616 Can you share the prompt used for training timelapse? |
landscape cloudscape photo (for landscape videos) |
Nvidia just recently published https://arxiv.org/pdf/2304.08818.pdf and in "3.1.1 Temporal Autoencoder Finetuning" they claim to "finetune vae decoder on video data with a (patch-wise) temporal discriminator built from 3D convolutions". This could reduce flickering artifacts, as they claim. Since you are the top AI expert in the open-source community, could you make an open-source demo implementation of this, even just a few important lines? You are the only guy I know who can do this; sorry to bother you, and thanks in advance. |
+1 They also mentioned using both a larger parameter count and a temporal super-resolution based on the Stable Diffusion 2.0 super-resolution; for the latter, I think they used SDXL. |
It's interesting; they are treating diffusion models like GANs. They used a discriminator to train them. |
It's the VAE that is trained with a discriminator (which is how it is normally trained), not the diffusion model. |
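A minimal illustrative sketch of such a patch-wise temporal discriminator built from 3D convolutions, plus hinge losses for finetuning the VAE decoder against it; this follows the general idea described in the paper, not its exact architecture:

```python
# Sketch: PatchGAN-style 3D-conv discriminator over decoded (b, c, f, h, w) videos.
import torch
import torch.nn as nn

class TemporalPatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, base, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base, base * 2, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm3d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 2, base * 4, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm3d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(base * 4, 1, kernel_size=3, stride=1, padding=1),  # patch-wise real/fake logits
        )

    def forward(self, x):          # x: (b, c, f, h, w)
        return self.net(x)

# Hinge losses for the discriminator and for the VAE decoder being finetuned.
def d_loss(real_logits, fake_logits):
    return torch.relu(1.0 - real_logits).mean() + torch.relu(1.0 + fake_logits).mean()

def g_loss(fake_logits):
    return -fake_logits.mean()
```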
https://github.com/lxj616/make-a-stable-diffusion-video
I used your Pseudo3DConv and pseudo-3D attention here:
https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/resnet_pseudo3d.py#L8
https://github.com/lxj616/make-a-stable-diffusion-video/blob/main/src/diffusers/models/attention_pseudo3d.py#L432
Thank you for open-sourcing the Pseudo3D code; it seems to be working.
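For readers who don't want to open the linked files, a minimal sketch of the pseudo-3D convolution idea (a per-frame 2D conv followed by a 1D conv across frames, with the temporal conv initialized as an identity so the pretrained 2D behaviour is preserved at the start); see resnet_pseudo3d.py in the repo for the actual implementation:

```python
# Sketch: pseudo-3D convolution = factorized spatial 2D conv + temporal 1D conv.
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    def __init__(self, dim, dim_out, kernel_size=3):
        super().__init__()
        self.spatial_conv = nn.Conv2d(dim, dim_out, kernel_size, padding=kernel_size // 2)
        self.temporal_conv = nn.Conv1d(dim_out, dim_out, kernel_size, padding=kernel_size // 2)
        # identity init keeps the module equivalent to the pretrained 2D conv at the start
        nn.init.dirac_(self.temporal_conv.weight.data)
        nn.init.zeros_(self.temporal_conv.bias.data)

    def forward(self, x):
        # x: (b, c, f, h, w)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial_conv(x)                                          # per-frame 2D conv
        _, c_out, h, w = x.shape
        x = x.reshape(b, f, c_out, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c_out, f)
        x = self.temporal_conv(x)                                         # 1D conv across frames
        x = x.reshape(b, h, w, c_out, f).permute(0, 3, 4, 1, 2)
        return x                                                          # (b, c_out, f, h, w)
```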