train_dreambooth_lora_sdxl_advanced.py and train_dreambooth_lora_sdxl.py do not load previously saved checkpoints correctly #6366
Comments
Please, can you provide your code?
I can confirm this issue. I was wondering why no one else was having this problem, so I tried to test it in a Colab, since I normally use custom code. Now that I was finally able to test it, it happens there too. You don't need any custom code, just do the training as stated in the documentation and then resume:
I got the same errors as @prushik, and I can also see it in the images (training with just one image; first run, first validation): it's just obvious that it started as a clean training, even though it loads the checkpoint and the state without errors.
My code is almost entirely unchanged from the example script; the only change I have made is to the DreamBoothDataset class:
This is just to ensure that only .pngs are loaded as images and that prompts are loaded from corresponding .txt files. (Yes, I should have just used jsonl files, but my data was already formatted this way and changing the code seemed easier than figuring out how the jsonl should be formatted.)
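The actual diff is not shown above; as a hedged sketch (not the poster's real code, with a hypothetical helper name), the kind of change described could look like this:

```python
from pathlib import Path

from PIL import Image

# Hypothetical sketch, not the poster's actual diff: collect only .png files and
# read each image's prompt from a sibling .txt file with the same stem, falling
# back to the shared instance prompt when no .txt file exists.
def load_images_and_prompts(instance_data_root, instance_prompt):
    images, prompts = [], []
    for png_path in sorted(Path(instance_data_root).glob("*.png")):
        txt_path = png_path.with_suffix(".txt")
        prompt = txt_path.read_text().strip() if txt_path.exists() else instance_prompt
        images.append(Image.open(png_path).convert("RGB"))
        prompts.append(prompt)
    return images, prompts
```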
I have some more information about the issue. I took a look at the differences between the saved checkpoint-x/pytorch_lora_weights.safetensors file and the final trained pytorch_lora_weights.safetensors file, and found some discrepancies between the layer names. Since there are a lot of layers, I just chose a small subset that should be comparable. There is nothing special about the layers I chose to look at; they were picked at random and are representative of the problem with every layer in the generated file. A LoRA produced in a single training run of examples/dreambooth/train_dreambooth_lora_sdxl.py contains the following:
However, a LoRA that has been produced after being resumed from a saved checkpoint has the following layers:
Note that each layer now has both a lora_A or lora_B version and a lora_A_1 or lora_B_1 version. Now the question is where this incorrect _1 version gets created. If I compare a checkpoint-x/pytorch_lora_weights.safetensors that was saved during a first run of the training script (one that was not resumed), it does NOT contain the _1 versions of each key. However, all checkpoint-x/pytorch_lora_weights.safetensors files produced after training is resumed DO contain the _1 versions. So it looks like the issue is introduced when the checkpoint is loaded, not when it is saved. Looking through the training script, this seems to be accomplished with the line:
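(The layer lists and the line referenced above are not reproduced here. As a side note, a quick way to compare the keys of two such files is to diff them with the safetensors library; the sketch below uses placeholder paths.)

```python
from safetensors import safe_open

# Sketch with placeholder paths: list the tensor keys of two LoRA files and print
# the keys that only exist in the resumed run's output, e.g. the duplicated
# "_1" adapter entries described above.
def lora_keys(path):
    with safe_open(path, framework="pt", device="cpu") as f:
        return set(f.keys())

first_run = lora_keys("first_run/pytorch_lora_weights.safetensors")
resumed = lora_keys("resumed_run/pytorch_lora_weights.safetensors")

for key in sorted(resumed - first_run):
    print(key)
```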
@prushik if you want someone from diffusers to look at this problem, you should tag someone from the team in the issue; otherwise they probably won't see it. The load_state and save_state calls have hooks to functions in the same script; for example, the save function appeared to have an error. Edit: my diffusers repo wasn't updated; the current version has the fix, and it is also not an error but a check that allows loading PEFT models when PEFT is not installed.
I finally could look into it. I fixed it in my code, but I don't know the correct way to fix it in the official training script, since this probably needs a change in the script's load hook. The problem, as you were suspecting, is that the option to resume tries to load the LoRA model again inside that hook. What I did was to force the load of the state_dicts onto the models that are already set up. Hope it helps.
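The inline code references in the comment above were lost in formatting. As a hedged illustration of the general idea (not the poster's actual gist; the helper name and the key handling are assumptions), the resume path could write the saved weights into the adapter the script already injected instead of letting the loader create a second one:

```python
from diffusers.loaders import LoraLoaderMixin
from peft import set_peft_model_state_dict


def resume_unet_lora(unet, checkpoint_dir):
    """Hedged sketch, not the poster's gist: load the checkpoint's LoRA weights
    into the adapter already attached to `unet`, instead of calling
    load_lora_into_unet, which injects a second "default_1" adapter."""
    lora_state_dict, _ = LoraLoaderMixin.lora_state_dict(checkpoint_dir)
    unet_keys = {
        k.replace("unet.", ""): v
        for k, v in lora_state_dict.items()
        if k.startswith("unet.")
    }
    # Depending on the diffusers version, these keys may still need converting
    # from the diffusers serialization format to PEFT's naming before they match.
    set_peft_model_state_dict(unet, unet_keys, adapter_name="default")
```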
Awesome!
I saw that the checkpoint gets loaded with the adapter_name "default_1", and there is already an adapter called "default" loaded at that point; it looked like "default" didn't actually have any parameters in it. But after looking further into the diffusers code, it seemed like the "default_1" adapter_name was intentional (diffusers/src/diffusers/utils/peft_utils.py: get_adapter_name, called from diffusers/src/diffusers/loaders/lora.py: load_lora_into_unet).
I'm trying to understand this; it would be awesome to have a workaround. Is this change just in the training script, in load_model_hook? Can we just replace LoraLoaderMixin.load_lora_into_unet with peft.inject_adapter_in_model? Thank you for all your help!
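Regarding the question above: peft.inject_adapter_in_model only creates freshly initialized LoRA layers from a config; the checkpoint's weights still have to be written into that adapter afterwards, for example with set_peft_model_state_dict. A rough sketch, with the rank and target modules assumed rather than taken from the script:

```python
from peft import LoraConfig, inject_adapter_in_model, set_peft_model_state_dict


def attach_and_load_lora(unet, saved_unet_lora_state_dict):
    """Rough sketch with assumed rank and target modules: injecting an adapter
    adds empty (randomly initialized) LoRA layers; the saved weights must still
    be loaded into it for resuming to have any effect."""
    lora_config = LoraConfig(
        r=4,
        lora_alpha=4,
        init_lora_weights="gaussian",
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    )
    unet = inject_adapter_in_model(lora_config, unet, adapter_name="default")
    set_peft_model_state_dict(unet, saved_unet_lora_state_dict, adapter_name="default")
    return unet
```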
I was wrong about that; here is a gist with the fix I ended up using: https://gist.github.com/asomoza/2a7514caceffdbc28f11da5e7f74561c
Cannot comment on the advanced script. Cc: @linoytsaban (as Poli is on leave). But for the SDXL LoRA script, have you tried pulling in the latest changes? We have had issues like #6087, but we have fixed them. Could you please ensure you're using the latest version of the script?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is this still an issue? Cc: @linoytsaban
Is this still a problem? Cc: @linoytsaban
No longer an issue, I believe 👍🏻, as the changes made in #6225 were also put in place for the advanced scripts (both SDXL and the new SD 1.5 implementation).
Going to close then. But feel free to re-open.
Describe the bug
Both examples/dreambooth/train_dreambooth_lora_sdxl.py and examples/advanced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py seem to have an issue when resuming training from a previously saved checkpoint.
Training and saving checkpoints seem to work correctly; however, when resuming from a previously saved checkpoint, the following messages are produced at script startup:
Resuming from checkpoint checkpoint-10
Training appears to continue normally; however, all new checkpoints saved after this point are significantly larger than the previous checkpoints:
Once training with a resumed checkpoint is completed, there will be a large dump of layer names with a message saying that the model contains layers that do not match. (Full error message below)
To me, this looks like the checkpoint is loaded incorrectly and effectively ignored, a new adapter is trained from scratch, and then both versions, old and new, are saved in the final LoRA.
Reproduction
To reproduce this issue, follow these steps:
1. Start a training run with --checkpointing_steps (preferably set to a low number to reproduce this issue quickly).
2. After at least one checkpoint has been saved, resume training with --resume_from_checkpoint latest or --resume_from_checkpoint checkpoint-x.
Logs
My full command line with all arguments looks like this:
python train_dreambooth_lora_sdxl.py --pretrained_model_name_or_path ../../../models/colossus_v5.3 --instance_data_dir /media/nvme/datasets/combined/ --output_dir xqc --resolution 1024 --instance_prompt 'a photo of hxq' --train_text_encoder --num_train_epochs 1 --train_batch_size 1 --gradient_checkpointing --checkpointing_steps 5 --gradient_accumulation_steps 1 --learning_rate 0.0001 --resume_from_checkpoint latest
Error produced during inference with the affected lora (truncated because of length):
(xl) localhost /media/nvme/xl # uname -a
Linux localhost 6.1.9-noinitramfs #4 SMP PREEMPT_DYNAMIC Fri Feb 10 03:01:14 -00 2023 x86_64 Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz GenuineIntel GNU/Linux
(xl) localhost /media/nvme/xl # diffusers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
diffusers version: 0.25.0.dev0