DPO Training using G4dn.12xlarge instance on AWS Sagemaker. #1169
Could you reformat the code so it is properly indented and Python-formatted? It's a bit hard to read :) What distribution strategy are you using for training? ZeRO? FSDP?
Hi @lvwerra, sorry for the format of the code. As for the distribution strategy used for training, it should be the default one, because I don't think I defined it explicitly. Thank you so much for your response!! This is the formatted code used for training:
I believe adding a fix similar to jondurbin@7d431ea (from a fork of TRL) should fix the issue. @danieljohnxon, can you confirm?
Hi @younesbelkada, I am still running into the same error: "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!" I am running the formatted code shown above, where device_map = "auto".
@danieljohnxon can you try to put the ref model and the active model on different devices? e.g.:
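The example snippet referenced here did not survive extraction; below is a minimal sketch of what that suggestion could look like (the model path, 4-bit settings, and GPU indices are placeholder assumptions, not code from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Pin the trainable (active) policy entirely to GPU 0 ...
model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-llama2-13b",          # placeholder path
    quantization_config=bnb_config,
    device_map={"": 0},
)

# ... and the frozen reference model entirely to another GPU, so neither model
# is sharded across several devices the way device_map="auto" would do it.
ref_model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-llama2-13b",          # placeholder path
    quantization_config=bnb_config,
    device_map={"": 1},
)
```

This keeps each model on a single device, which avoids the cross-device tensor error that device_map="auto" sharding was triggering.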
Hi @younesbelkada, I managed to run the code without encountering the "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!" error after changing the device_map configuration as you mentioned. However, I am now encountering a new error. Here is my full code and the error logs. Am I possibly making any mistakes in my DPO training? I'd greatly appreciate your guidance! Full Code:
Latest Logs:
Hi @danieljohnxon
Hi @younesbelkada, thank you for the quick response! I changed the device_map, but I'm now facing a "PartialState() not defined" error. Do I need to define that somewhere other than importing the accelerator? I followed this link, https://stackoverflow.com/questions/76225595/nameerror-name-partialstate-is-not-defined-error-while-training-hugging-face, and tried to resolve the error by changing the transformers version to 4.28.0, but then ran into another issue, shown below. Also, does my code actually utilize all 4 GPUs for training? The G4dn.12xlarge (4x 16 GB) has 4 separate GPUs. Once again, thank you so much for your guidance and knowledge! Updated Code:
Requirement.txt
Error:
Error (after I changed the transformers version to 4.28.0):
Hi @danieljohnxon!
Hi @younesbelkada, I managed to run it but am facing another issue right now. So sorry for all the errors, and I really appreciate your time helping to resolve them! Error Logs:
Hi @danieljohnxon! The error `ImportError: Using low_cpu_mem_usage=True or a device_map requires Accelerate: pip install accelerate` means you need to install accelerate to make it work!
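As a trivial sanity check (assuming it is run in the same environment as the training job), one can verify that accelerate is importable before launching:

```python
# device_map / low_cpu_mem_usage in transformers rely on accelerate being installed.
import accelerate
import transformers

print("accelerate:", accelerate.__version__)
print("transformers:", transformers.__version__)
```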
Hi @younesbelkada, thank you once again for spotting that! I can run the code, but it seems the backend code is loading the model onto only one of my four GPUs, leading to a memory error. (This still comes back to the same error seen at the beginning.) Is there a way to utilize all the GPUs within my instance, so that I don't run into a memory error when loading the model or during training?
If you want to distribute your model you might need to use FSDP or DeepSpeed (which you can set up via `accelerate`).
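A sketch of what that could look like on the training-script side, assuming a TRL version from around the time of this thread (where `DPOTrainer` accepts `beta` and `tokenizer` directly) and placeholder model/dataset paths; the sharding itself is chosen at launch time via `accelerate config` (DeepSpeed ZeRO or FSDP) and `accelerate launch`:

```python
# train_dpo.py -- minimal sketch; launch with e.g.:
#   accelerate config            (choose DeepSpeed ZeRO or FSDP)
#   accelerate launch train_dpo.py
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "path/to/merged-llama2-13b"   # placeholder

# No device_map here: with DeepSpeed/FSDP the launcher decides how the model
# is sharded and placed across the 4 GPUs of the instance.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPOTrainer expects "prompt", "chosen" and "rejected" columns.
train_dataset = load_dataset("json", data_files="dpo_pairs.json")["train"]  # placeholder

training_args = TrainingArguments(
    output_dir="dpo-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

Whether 4-bit quantization can be combined with a given sharding strategy depends on the library versions, so this is only a starting point rather than a drop-in solution.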
Hi @lvwerra and @younesbelkada, thank you for the advice and support. I have tried running my script using the deepspeed library, but I am still encountering a memory issue. The instance I am using is g4dn.12xlarge, which has 4x 16 GB GPUs, so it should not run out of memory when loading the Llama2-13B model with QLoRA. Would you mind helping me review my code and providing some guidance? I am really lost and confused at the moment, and your help would be greatly appreciated. Thank you so much for your time and support! Notebook's code (to call the training script):
Latest Training Script:
Error Logs:
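For context on why 4x 16 GB can still be tight, here is a rough back-of-the-envelope estimate (assumptions: NF4 4-bit weights and a separate reference model held in memory for DPO; activations, LoRA gradients/optimizer states, and dequantization overhead are ignored):

```python
# Rough memory estimate for Llama2-13B DPO with 4-bit (NF4) weights.
n_params = 13e9                     # Llama2-13B parameter count
bytes_per_param_4bit = 0.5          # 4 bits per weight

policy_gb = n_params * bytes_per_param_4bit / 1e9
reference_gb = n_params * bytes_per_param_4bit / 1e9  # DPO keeps a frozen reference model too

print(f"policy ~{policy_gb:.1f} GB + reference ~{reference_gb:.1f} GB of weights alone")
# ~13 GB before activations and training overhead: too much for a single 16 GB T4
# if both copies end up on the same GPU.
```

One way TRL can avoid the second copy is passing `ref_model=None` when training with a PEFT adapter, so the frozen base weights double as the reference model; whether that applies here depends on the exact setup.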
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hello, everyone! I have fine-tuned Llama2-13B with QLoRA and merged the LoRA weights into the base model. Currently, I would like to perform DPO training on this fine-tuned model, but I'm encountering an issue when loading the model for training. Could someone help me with this? I'd really appreciate it, thank you so much!
This is my code for the DPO training:
The logs of the training job
I decided to change the device_map from "auto" to {"": Accelerator().local_process_index}. However, it then runs out of memory.
These are the logs after changing the device_map from auto to Accelerator().local_process_index
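For reference, a minimal sketch of the device_map change described above (assuming the model is loaded with `from_pretrained` and 4-bit quantization; the path is a placeholder):

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

# Each training process pins the whole model to its own GPU instead of letting
# device_map="auto" shard a single copy across all four GPUs.
device_map = {"": Accelerator().local_process_index}

model = AutoModelForCausalLM.from_pretrained(
    "path/to/merged-llama2-13b",   # placeholder path
    quantization_config=bnb_config,
    device_map=device_map,
)
```

Note that with this layout every process loads its own full copy of the quantized model, so memory can still run out on a 16 GB GPU once the reference model is loaded the same way.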