
DPO Training using G4dn.12xlarge instance on AWS Sagemaker. #1169

Closed
danieljohnxon opened this issue Jan 2, 2024 · 18 comments
Labels
🏋 DPO Related to DPO

Comments


danieljohnxon commented Jan 2, 2024

Hello, everyone! I have fine-tuned Llama-2-13B with QLoRA and merged the LoRA weights into the base model. Currently, I would like to perform DPO training on this fine-tuned model, but I'm encountering an issue when loading the model for training. Could someone help me with this? Really appreciate it and thank you guys so much!

This is my code for the DPO training:

def reinforcement_function(args):
    # 1. Load a pretrained model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map="auto",
        quantization_config=bnb_config,
    )

    model_ref = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map="auto",
        quantization_config=bnb_config,
    )

    model = create_peft_model(
        model, gradient_checkpointing=args.gradient_checkpointing, bf16=args.bf16
    )

    model_ref = create_peft_model(
        model_ref, gradient_checkpointing=args.gradient_checkpointing, bf16=args.bf16
    )

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    # 2. initialize training arguments:
    output_dir = "/tmp/llama2"
    training_args = TrainingArguments(
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        num_train_epochs=args.epochs,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
        output_dir=output_dir,
        # report_to=args.report_to,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.warmup_steps,
        optim=args.optimizer_type,
        bf16=args.bf16,
        remove_unused_columns=False,
        run_name="dpo_llama2",
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        save_strategy="no",
    )

    # Load the dataset
    train_set = load_from_disk(args.train_dataset_path)
    val_set = load_from_disk(args.val_dataset_path)

    # 3. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=args.beta,
        train_dataset=train_set,
        eval_dataset=val_set,
        tokenizer=tokenizer,
        max_length=args.max_length,
    )

    sagemaker_save_dir = "/opt/ml/model/"

    # 4. train
    dpo_trainer.train()

    # save int-4 model (weights and configuration)
    dpo_trainer.model.save_pretrained(output_dir, safe_serialization=False)

    # clear memory
    del model
    del dpo_trainer
    torch.cuda.empty_cache()

    # load PEFT model in fp16
    model = AutoPeftModelForCausalLM.from_pretrained(
        output_dir,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )

    # Merge LoRA and base model and save
    model = model.merge_and_unload()
    model.save_pretrained(
        sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
    )

    # save tokenizer for easy inference
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.save_pretrained(sagemaker_save_dir)

The logs of the training job:

2024-01-02T02:42:17.655+08:00 Login successful

2024-01-02T02:42:28.659+08:00 Loading checkpoint shards: 100%|██████████| 14/14 [00:09<00:00, 1.47it/s]

2024-01-02T02:42:40.663+08:00 Loading checkpoint shards: 100%|██████████| 14/14 [00:10<00:00, 1.34it/s]

2024-01-02T02:42:40.663+08:00 Found 7 modules to quantize: ['k_proj', 'up_proj', 'down_proj', 'o_proj', 'q_proj', 'gate_proj', 'v_proj']

2024-01-02T02:44:17.684+08:00 trainable params: 62,586,880 || all params: 6,734,566,400 || trainable%: 0.9293379303528732

2024-01-02T02:44:17.684+08:00 Found 7 modules to quantize: ['k_proj', 'up_proj', 'down_proj', 'o_proj', 'q_proj', 'gate_proj', 'v_proj']

2024-01-02T02:45:55.706+08:00 trainable params: 62,586,880 || all params: 6,734,566,400 || trainable%: 0.9293379303528732

2024-01-02T02:45:55.706+08:00 tokenizer_config.json: 100%|██████████| 776/776 [00:00<00:00, 7.15MB/s]

2024-01-02T02:45:56.706+08:00 tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 60.3MB/s]

2024-01-02T02:45:56.706+08:00 tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 5.97MB/s]

2024-01-02T02:45:57.707+08:00 special_tokens_map.json: 100%|██████████| 414/414 [00:00<00:00, 4.50MB/s]

2024-01-02T02:45:57.707+08:00 /opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py:270: UserWarning: When using DPODataCollatorWithPadding, you should set `max_prompt_length` in the DPOTrainer's init it will be set to `128` by default, but you should do it yourself in the future. warnings.warn(

2024-01-02T02:46:23.713+08:00 Map: 100%|██████████| 50/50 [00:00<00:00, 242.88 examples/s]

2024-01-02T02:46:47.718+08:00 Map: 0%| | 0/20 [00:00<?, ? examples/s]

2024-01-02T02:46:48.718+08:00 0%| | 0/6 [00:00<?, ?it/s]

2024-01-02T02:47:13.724+08:00 Traceback (most recent call last):
  File "/opt/ml/code/DPO_2.py", line 393, in <module>
    main()
  File "/opt/ml/code/DPO_2.py", line 390, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_2.py", line 361, in reinforcement_function
    dpo_trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 981, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 950, in get_batch_loss_metrics
    losses, chosen_rewards, rejected_rewards = self.dpo_loss(
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 790, in dpo_loss
    logits = pi_logratios - ref_logratios
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!

2024-01-02T02:47:14.724+08:00 0%| | 0/6 [00:25<?, ?it/s]

2024-01-02T02:47:15.724+08:00 2024-01-01 18:47:14,771 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.

2024-01-02T02:47:15.725+08:00 2024-01-01 18:47:14,771 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.

2024-01-02T02:47:15.725+08:00 2024-01-01 18:47:14,772 sagemaker-training-toolkit ERROR Reporting training FAILURE

2024-01-02T02:47:15.725+08:00 2024-01-01 18:47:14,772 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

2024-01-02T02:47:15.725+08:00 ExitCode 1

2024-01-02T02:47:15.725+08:00 ErrorMessage "RuntimeError Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! 0%| | 0/6 [00:25<?, ?it/s]"

2024-01-02T02:47:15.725+08:00 Command "/opt/conda/bin/python3.10 DPO_2.py --epochs 1 --hf_token xxxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"

2024-01-02T02:47:15.725+08:00 2024-01-01 18:47:14,772 sagemaker-training-toolkit ERROR Encountered exit_code 1

I then changed the device_map from "auto" to {"": Accelerator().local_process_index}. However, it ran out of memory.

These are the logs after changing the device_map from "auto" to {"": Accelerator().local_process_index}:

2024-01-02T03:02:12.820+08:00 Login successful

2024-01-02T03:02:23.824+08:00 Loading checkpoint shards: 100%|██████████| 14/14 [00:10<00:00, 1.39it/s]

2024-01-02T03:02:34.827+08:00 Loading checkpoint shards: 100%|██████████| 14/14 [00:10<00:00, 1.32it/s]

2024-01-02T03:02:34.827+08:00 Traceback (most recent call last):
  File "/opt/ml/code/DPO_2.py", line 393, in <module>
    main()
  File "/opt/ml/code/DPO_2.py", line 390, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_2.py", line 306, in reinforcement_function
    model = create_peft_model(
  File "/opt/ml/code/DPO_2.py", line 247, in create_peft_model
    model = prepare_model_for_kbit_training(
  File "/opt/conda/lib/python3.10/site-packages/peft/utils/other.py", line 81, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 14.58 GiB total capacity; 13.63 GiB already allocated; 455.56 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

2024-01-02T03:02:35.828+08:00 2024-01-01 19:02:35,027 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.

2024-01-02T03:02:35.828+08:00 2024-01-01 19:02:35,027 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.

2024-01-02T03:02:35.828+08:00 2024-01-01 19:02:35,028 sagemaker-training-toolkit ERROR Reporting training FAILURE

2024-01-02T03:02:35.828+08:00 2024-01-01 19:02:35,028 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

2024-01-02T03:02:35.828+08:00 ExitCode 1

2024-01-02T03:02:35.828+08:00 ErrorMessage "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 14.58 GiB total capacity; 13.63 GiB already allocated; 455.56 MiB free; 14.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

2024-01-02T03:02:35.828+08:00 Command "/opt/conda/bin/python3.10 DPO_2.py --epochs 1 --hf_token xxxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"

2024-01-02T03:02:35.828+08:00 2024-01-01 19:02:35,028 sagemaker-training-toolkit ERROR Encountered exit_code 1
@danieljohnxon (Author)

cc @younesbelkada @kashif @lvwerra


lvwerra commented Jan 4, 2024

Could you reformat the code to be properly indented and Python-formatted? It's a bit hard to read :)

What distribution strategy are you using for training? ZeRO? FSDP?

@danieljohnxon (Author)

Hi @lvwerra, sorry for the format of the code. As for the distribution strategy, it should be the default one, because I don't think I defined it anywhere. Thank you so much for your response!!

This is the formatted code used for training:

def reinforcement_function(args):
    # 1. Load a pretrained model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,
        device_map="auto",
        quantization_config=bnb_config,
    )

    model_ref = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,
        device_map="auto",
        quantization_config=bnb_config,
    )

    model = create_peft_model(
        model,
        gradient_checkpointing=args.gradient_checkpointing,
        bf16=args.bf16
    )

    model_ref = create_peft_model(
        model_ref,
        gradient_checkpointing=args.gradient_checkpointing,
        bf16=args.bf16
    )

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    # 2. initialize training arguments:
    output_dir = "/tmp/llama2"
    training_args = TrainingArguments(
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        num_train_epochs=args.epochs,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
        output_dir=output_dir,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.warmup_steps,
        optim=args.optimizer_type,
        bf16=args.bf16,
        remove_unused_columns=False,
        run_name="dpo_llama2",
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        save_strategy="no",
    )

    # Load the dataset
    train_set = load_from_disk(args.train_dataset_path)
    val_set = load_from_disk(args.val_dataset_path)

    # 3. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=args.beta,
        train_dataset=train_set,
        eval_dataset=val_set,
        tokenizer=tokenizer,
        max_length=args.max_length,
    )

    sagemaker_save_dir = "/opt/ml/model/"

    # 4. train
    dpo_trainer.train()

    # save int 4 model
    dpo_trainer.model.save_pretrained(output_dir, safe_serialization=False)

    # clear memory
    del model
    del dpo_trainer
    torch.cuda.empty_cache()

    # load PEFT model in fp16
    model = AutoPeftModelForCausalLM.from_pretrained(
        output_dir,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )

    # Merge LoRA and base model and save
    model = model.merge_and_unload()
    model.save_pretrained(
        sagemaker_save_dir,
        safe_serialization=True,
        max_shard_size="2GB",
    )

    # save tokenizer for easy inference
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.save_pretrained(sagemaker_save_dir)

@younesbelkada (Contributor)

I believe adding a fix similar to jondurbin@7d431ea (from a fork of TRL) should fix the issue, @danieljohnxon can you confirm?
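
For illustration, a minimal sketch of that kind of device fix (assuming the patch boils down to moving the reference tensors onto the policy model's device before the subtraction in dpo_loss; the variable names follow trl/trainer/dpo_trainer.py):

# inside DPOTrainer.dpo_loss: with device_map="auto" the two models can end up
# on different GPUs, so align the reference log-ratios with the policy log-ratios
ref_logratios = ref_logratios.to(pi_logratios.device)
logits = pi_logratios - ref_logratios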

@danieljohnxon (Author)

Hi @younesbelkada, I am still running into the same error "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!"

@danieljohnxon (Author)

I am running the formatted code shown above, where device_map="auto".

@younesbelkada (Contributor)

@danieljohnxon can you try to put the ref model and the active model on different devices?

e.g. device_map={"": 0} for the first model and device_map={"": 1} for the second model, as in the sketch below.
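
A minimal sketch of that suggestion (only the device_map arguments of the two from_pretrained calls change; the rest of the script stays as is):

model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    device_map={"": 0},  # active (trained) model pinned to GPU 0
    quantization_config=bnb_config,
)
model_ref = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    device_map={"": 1},  # reference model pinned to GPU 1
    quantization_config=bnb_config,
)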


danieljohnxon commented Jan 9, 2024

Hi @younesbelkada, I managed to run the code without encountering the "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3!" error after changing the device_map configuration as you mentioned. However, I am now encountering a new error. Here is my full code and the error logs. Am I possibly making any mistakes in my DPO training? I'd greatly appreciate your guidance!

Full Code:

import argparse
import os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    default_data_collator,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

import torch
import bitsandbytes as bnb
import accelerate
from accelerate import Accelerator
from huggingface_hub import login, HfFolder
from dataclasses import dataclass, field
from typing import Dict, Optional
from datasets import Dataset, load_dataset, load_from_disk
from peft import LoraConfig, AutoPeftModelForCausalLM
from trl import DPOTrainer
from trl.extras import BestOfNSampler

def parse_args():
    """Parse the arguments."""
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--epochs",
        type=int,
        default=3,
        help="Number of epochs to train for."
    )

    parser.add_argument(
        "--model_path",
        type=str,
        default=os.environ['SM_CHANNEL_MODEL'],
        help="Path to fine-tuned model."
    )

    parser.add_argument(
        "--train_dataset_path",
        type=str,
        default=os.environ['SM_CHANNEL_TRAINING'],
        help="Path to training dataset."
    )

    parser.add_argument(
        "--val_dataset_path",
        type=str,
        default=os.environ['SM_CHANNEL_VAL'],
        help="Path to validation dataset."
    )

    parser.add_argument(
        '--beta',
        type=float,
        default=0.1,
        help='The beta parameter for DPO loss'
    )

    parser.add_argument(
        '--learning_rate',
        type=float,
        default=5e-4,
        help='Optimizer learning rate'
    )

    parser.add_argument(
        '--lr_scheduler_type',
        type=str,
        default='cosine',
        help='The lr scheduler type'
    )

    parser.add_argument(
        '--warmup_steps',
        type=int,
        default=100,
        help='The number of warmup steps'
    )

    parser.add_argument(
        '--weight_decay',
        type=float,
        default=0.05,
        help='The weight decay'
    )

    parser.add_argument(
        '--optimizer_type',
        type=str,
        default='paged_adamw_32bit',
        help='The optimizer type'
    )

    parser.add_argument(
        '--per_device_train_batch_size',
        type=int,
        default=4,
        help='Train batch size per device'
    )

    parser.add_argument(
        '--per_device_eval_batch_size',
        type=int,
        default=4,
        help='Eval batch size per device'
    )

    parser.add_argument(
        '--gradient_accumulation_steps',
        type=int,
        default=8,
        help='The number of gradient accumulation steps'
    )

    parser.add_argument(
        '--gradient_checkpointing',
        type=bool,
        default=True,
        help='Whether to use gradient checkpointing'
    )

    parser.add_argument(
        '--lora_alpha',
        type=float,
        default=16,
        help='The lora alpha parameter'
    )

    parser.add_argument(
        '--lora_dropout',
        type=float,
        default=0.1,
        help='The lora dropout parameter'
    )

    parser.add_argument(
        '--lora_r',
        type=int,
        default=16,
        help='The lora r parameter'
    )

    parser.add_argument(
        '--max_length',
        type=int,
        default=760,
        help='The maximum sequence length'
    )

    parser.add_argument(
        '--max_prompt_length',
        type=int,
        default=512,
        help='The maximum prompt length'
    )

    parser.add_argument(
        "--bf16",
        type=bool,
        default=True if torch.cuda.get_device_capability()[0] == 8 else False,
        help="Whether to use bf16."
    )

    parser.add_argument(
        '--logging_steps',
        type=int,
        default=10,
        help='The logging frequency'
    )

    parser.add_argument(
        '--save_steps',
        type=int,
        default=100,
        help='The saving frequency'
    )

    parser.add_argument(
        '--log_freq',
        type=int,
        default=1,
        help='The logging frequency'
    )

    parser.add_argument(
        "--hf_token",
        type=str,
        default=HfFolder.get_token(),
        help="Path to dataset."
    )

    parser.add_argument(
        '--ignore_bias_buffers',
        action='store_true',
        default=False,
        help="Fix for DDP issues with LM bias/mask buffers - invalid scalar type, inplace operation. See https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992"
    )

    args, _ = parser.parse_known_args()

    if args.hf_token:
        print(f"Logging into the Hugging Face Hub with token {args.hf_token[:10]}...")
        login(token=args.hf_token)

    return args

def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params /= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )

def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)

def create_peft_model(model, gradient_checkpointing=True, bf16=True):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_kbit_training,
    )
    from peft.tuners.lora import LoraLayer

    # prepare int-4 model for training
    model = prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=gradient_checkpointing
    )
    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")

    peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    model = get_peft_model(model, peft_config)

    # pre-process the model by upcasting the layer norms in float 32 for
    for name, module in model.named_modules():
        if isinstance(module, LoraLayer):
            if bf16:
                module = module.to(torch.bfloat16)
        if "norm" in name:
            module = module.to(torch.float32)
        if "lm_head" in name or "embed_tokens" in name:
            if hasattr(module, "weight"):
                if bf16 and module.weight.dtype == torch.float32:
                    module = module.to(torch.bfloat16)

    model.print_trainable_parameters()
    return model

def reinforcement_function(args):
    # 1. Load a pretrained model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map={"":0}, 
        quantization_config=bnb_config,
    )
    
    model_ref = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map={"":1}, 
        quantization_config=bnb_config,
    )

    model = create_peft_model(
        model, gradient_checkpointing=args.gradient_checkpointing, bf16=args.bf16
    )
    
    model_ref = create_peft_model(
        model_ref, gradient_checkpointing=args.gradient_checkpointing, bf16=args.bf16
    )    
    
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    # 2. initialize training arguments:
    output_dir = "/tmp/llama2"
    training_args = TrainingArguments(
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        num_train_epochs=args.epochs,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
        output_dir=output_dir,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.warmup_steps,
        optim=args.optimizer_type,
        bf16=args.bf16,
        remove_unused_columns=False,
        run_name="dpo_llama2",
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        save_strategy="no",
    )

    # Load the dataset
    train_set = load_from_disk(args.train_dataset_path)
    val_set = load_from_disk(args.val_dataset_path)

    # 3. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        model_ref,
        args=training_args,
        beta=args.beta,
        train_dataset=train_set,
        eval_dataset=val_set,
        tokenizer=tokenizer,
        max_length=args.max_length,
        max_prompt_length=args.max_prompt_length,
    )

    sagemaker_save_dir = "/opt/ml/model/"
    
    # 4. train
    dpo_trainer.train()

    # save int 4 model
    dpo_trainer.model.save_pretrained(output_dir, safe_serialization=False) #Model weights, configuration

    # clear memory
    del model
    del dpo_trainer
    torch.cuda.empty_cache()

    # load PEFT model in fp16
    model = AutoPeftModelForCausalLM.from_pretrained(
        output_dir,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
    )

    # Merge LoRA and base model and save
    model = model.merge_and_unload()        
    model.save_pretrained(
        sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
    )

    # save tokenizer for easy inference
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.save_pretrained(sagemaker_save_dir)

def main():
    args = parse_args()
    reinforcement_function(args)

if __name__ == "__main__":
    main()

Latest Logs:

2024-01-09T11:56:43.380+08:00	Loading checkpoint shards: 100%|██████████| 14/14 [00:10<00:00, 1.30it/s]

2024-01-09T11:56:43.380+08:00	Found 7 modules to quantize: ['up_proj', 'v_proj', 'k_proj', 'q_proj', 'down_proj', 'gate_proj', 'o_proj']

2024-01-09T11:58:20.401+08:00	trainable params: 62,586,880 || all params: 6,734,566,400 || trainable%: 0.9293379303528732

2024-01-09T11:58:21.402+08:00	Found 7 modules to quantize: ['up_proj', 'v_proj', 'k_proj', 'q_proj', 'down_proj', 'gate_proj', 'o_proj']

2024-01-09T11:59:59.424+08:00	trainable params: 62,586,880 || all params: 6,734,566,400 || trainable%: 0.9293379303528732

2024-01-09T11:59:59.424+08:00	tokenizer_config.json: 100%|██████████| 776/776 [00:00<00:00, 6.46MB/s]

2024-01-09T11:59:59.424+08:00	tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 133MB/s]

2024-01-09T12:00:01.425+08:00	tokenizer.json: 100%|██████████| 1.84M/1.84M [00:01<00:00, 1.55MB/s]

2024-01-09T12:00:02.425+08:00	special_tokens_map.json: 100%|██████████| 414/414 [00:00<00:00, 3.56MB/s]

2024-01-09T12:00:29.431+08:00	Map: 100%|██████████| 50/50 [00:00<00:00, 241.95 examples/s]

2024-01-09T12:00:55.437+08:00	Map: 0%| | 0/20 [00:00<?, ? examples/s]

2024-01-09T12:00:55.437+08:00	Traceback (most recent call last):
  File "/opt/ml/code/DPO_2.py", line 402, in <module>
    main()
  File "/opt/ml/code/DPO_2.py", line 399, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_2.py", line 354, in reinforcement_function
    dpo_trainer = DPOTrainer(
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 391, in __init__
    self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1281, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`

2024-01-09T12:00:56.438+08:00	2024-01-09 04:00:56,311 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.

2024-01-09T12:00:56.438+08:00	2024-01-09 04:00:56,311 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.

2024-01-09T12:00:56.438+08:00	2024-01-09 04:00:56,311 sagemaker-training-toolkit ERROR Reporting training FAILURE

2024-01-09T12:00:56.438+08:00	2024-01-09 04:00:56,311 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

2024-01-09T12:00:56.438+08:00	ExitCode 1

2024-01-09T12:00:56.438+08:00	ErrorMessage "ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`"

2024-01-09T12:00:56.438+08:00	Command "/opt/conda/bin/python3.10 DPO_2.py --epochs 1 --hf_token xxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"

2024-01-09T12:00:56.438+08:00	2024-01-09 04:00:56,311 sagemaker-training-toolkit ERROR Encountered exit_code 1

@younesbelkada (Contributor)

Hi @danieljohnxon
Thanks for getting back!
Actually, you can use a much simpler solution:

1. Load only the active model with device_map={"": PartialState().process_index} and pass model_ref=None in DPOTrainer's init.
2. Pass the peft_config to DPOTrainer's init.

That way each GPU will have a copy of the active model, and the LoRA adapters will be disabled to get the reference logits; see the sketch below. Let me know if this fixes your issue.
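
A sketch of that setup (assuming the rest of your script, e.g. bnb_config, training_args, and the datasets, is unchanged):

from accelerate import PartialState

model = AutoModelForCausalLM.from_pretrained(
    args.model_path,
    # one full copy of the active model per process/GPU
    device_map={"": PartialState().process_index},
    quantization_config=bnb_config,
)

dpo_trainer = DPOTrainer(
    model,
    None,  # no separate ref model: the adapters are disabled to get the reference logits
    args=training_args,
    beta=args.beta,
    train_dataset=train_set,
    eval_dataset=val_set,
    tokenizer=tokenizer,
    max_length=args.max_length,
    max_prompt_length=args.max_prompt_length,
    peft_config=peft_config,  # DPOTrainer applies the LoRA adapter itself
)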

@danieljohnxon (Author)

Hi @younesbelkada, thank you for the quick response! I changed the device_map, but I'm facing a "PartialState() not defined" error. Do I need to define that somewhere else apart from importing the accelerator? I followed this link, https://stackoverflow.com/questions/76225595/nameerror-name-partialstate-is-not-defined-error-while-training-hugging-face, to resolve the error by pinning the transformers version to 4.28.0, but encountered another issue, as shown below.

Moreover, does my code utilize all 4 GPUs for training, given that the G4dn.12xlarge has 4 x 16 GB GPUs?

Once again, thank you so much for your guidance and knowledge!

Updated Code:

def reinforcement_function(args):
    # 1. Load a pretrained model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map={"": PartialState().process_index},
        quantization_config=bnb_config,
    )

    # get lora target modules
    modules = find_all_linear_names(model)
    print(f"Found {len(modules)} modules to quantize: {modules}")

    peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=modules,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    # 2. initialize training arguments:
    output_dir = "/tmp/llama2"
    training_args = TrainingArguments(
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        num_train_epochs=args.epochs,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
        output_dir=output_dir,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.warmup_steps,
        optim=args.optimizer_type,
        bf16=args.bf16,
        remove_unused_columns=False,
        run_name="dpo_llama2",
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        save_strategy="no",
    )

    # Load the dataset
    train_set = load_from_disk(args.train_dataset_path)
    val_set = load_from_disk(args.val_dataset_path)

    # 3. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        None,  # model_ref
        args=training_args,
        beta=args.beta,
        train_dataset=train_set,
        eval_dataset=val_set,
        tokenizer=tokenizer,
        max_length=args.max_length,
        max_prompt_length=args.max_prompt_length,
        peft_config=peft_config,
    )
....................

requirements.txt:

transformers>=4.31.0
git+https://github.com/huggingface/trl.git
peft==0.4.0
git+https://github.com/huggingface/accelerate
bitsandbytes==0.40.2
diffusers>=0.24.0
tokenizers>=0.11.1

Error:

2024-01-09T16:05:51.252+08:00	Traceback (most recent call last):
  File "/opt/ml/code/DPO_2.py", line 416, in <module>
    main()
  File "/opt/ml/code/DPO_2.py", line 413, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_2.py", line 303, in reinforcement_function
    device_map={"":PartialState().process_index},#{"":0}, #"auto", #{"": Accelerator().local_process_index},
NameError: name 'PartialState' is not defined

2024-01-09T16:05:52.252+08:00	2024-01-09 08:05:51,849 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.

2024-01-09T16:05:52.252+08:00	2024-01-09 08:05:51,849 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.

2024-01-09T16:05:52.253+08:00	2024-01-09 08:05:51,849 sagemaker-training-toolkit ERROR Reporting training FAILURE

2024-01-09T16:05:52.253+08:00	2024-01-09 08:05:51,849 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

2024-01-09T16:05:52.253+08:00	ExitCode 1

2024-01-09T16:05:52.253+08:00	ErrorMessage "NameError: name 'PartialState' is not defined"

Error (after I pinned the transformers version to 4.28.0):

INFO: pip is looking at multiple versions of trl to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 2), -r requirements.txt (line 3) and transformers==4.28.0 because these package versions have conflicting dependencies.
The conflict is caused by:
    The user requested transformers==4.28.0
    peft 0.4.0 depends on transformers
    trl 0.7.9.dev0 depends on transformers>=4.31.0
To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
[notice] A new release of pip is available: 23.1.2 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
2024-01-09 08:24:18,961 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2024-01-09 08:24:18,962 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
2024-01-09 08:24:18,962 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2024-01-09 08:24:18,962 sagemaker-training-toolkit ERROR    InstallRequirementsError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.10 -m pip install -r requirements.txt"
2024-01-09 08:24:18,962 sagemaker-training-toolkit ERROR    Encountered exit_code 1

@younesbelkada (Contributor)

Hi @danieljohnxon!
Thanks! You should import PartialState from accelerate (from accelerate import PartialState), and you can keep using the latest transformers; see the sketch below.
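
Concretely, that would mean reverting the transformers==4.28.0 pin (keeping the original transformers>=4.31.0 line in requirements.txt to avoid the resolver conflict above) and adding the import at the top of the script:

from accelerate import PartialState

device_map = {"": PartialState().process_index}  # resolves the NameError above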

@danieljohnxon (Author)

Hi @younesbelkada, I managed to run it, but I'm facing another issue right now. So sorry for all the errors, and I really appreciate your time in helping to resolve them!

Error Logs:

Login successful

config.json: 100%|██████████| 610/610 [00:00<00:00, 5.60MB/s]

Traceback (most recent call last):
  File "/opt/ml/code/DPO_2.py", line 418, in <module>
    main()
  File "/opt/ml/code/DPO_2.py", line 415, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_2.py", line 301, in reinforcement_function
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2863, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`

2024-01-09 10:18:12,907 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.

2024-01-09 10:18:12,907 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.

2024-01-09 10:18:12,907 sagemaker-training-toolkit ERROR    Reporting training FAILURE

2024-01-09 10:18:12,907 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:

ExitCode 1

ErrorMessage "│   569 │   │   raise ValueError(                                              │

│                                                                              │

│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py:2863  │

│ in from_pretrained                                                           │

│   2860 │   │   │   │   │   "DeepSpeed Zero-3 is not compatible with `low_cpu │

│   2861 │   │   │   │   )                                                     │

│   2862 │   │   │   elif not is_accelerate_available():                       │

│ ❱ 2863 │   │   │   │   raise ImportError(                                    │

│   2864 │   │   │   │   │   "Using `low_cpu_mem_usage=True` or a `device_map` │

│   2865 │   │   │   │   )                                                     │

│   2866                                                                       │

╰──────────────────────────────────────────────────────────────────────────────╯

ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires

Accelerate: `pip install accelerate`"

Command "/opt/conda/bin/python3.10 DPO_2.py --epochs 1 --hf_token xxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"

2024-01-09 10:18:12,907 sagemaker-training-toolkit ERROR    Encountered exit_code 1

 

2024-01-09 10:18:31 Uploading - Uploading generated training model

2024-01-09 10:18:31 Failed - Training job failed

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Cell In[48], line 2
      1 # Starting the train job with our uploaded datasets as input
----> 2 huggingface_estimator.fit({'training': training_dpo_input_path, 'val': val_dpo_input_path, 'model': TUNED_MODEL_PATH}, wait=True)

File /opt/conda/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
    342         return context
    344     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/sagemaker/estimator.py:1341, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
   1339 self.jobs.append(self.latest_training_job)
   1340 if wait:
-> 1341     self.latest_training_job.wait(logs=logs)

File /opt/conda/lib/python3.10/site-packages/sagemaker/estimator.py:2677, in _TrainingJob.wait(self, logs)
   2675 # If logs are requested, call logs_for_jobs.
   2676 if logs != "None":
-> 2677     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2678 else:
   2679     self.sagemaker_session.wait_for_job(self.job_name)

File /opt/conda/lib/python3.10/site-packages/sagemaker/session.py:5506, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
   5485 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
   5486     """Display logs for a given training job, optionally tailing them until job is complete.
   (...)
   5504         exceptions.UnexpectedStatusException: If waiting and the training job fails.
   5505     """
-> 5506     _logs_for_job(self, job_name, wait, poll, log_type, timeout)

File /opt/conda/lib/python3.10/site-packages/sagemaker/session.py:7634, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
   7631             last_profiler_rule_statuses = profiler_rule_statuses
   7633 if wait:
-> 7634     _check_job_status(job_name, description, "TrainingJobStatus")
   7635     if dot:
   7636         print()

File /opt/conda/lib/python3.10/site-packages/sagemaker/session.py:7687, in _check_job_status(job, desc, status_key_name)
   7681 if "CapacityError" in str(reason):
   7682     raise exceptions.CapacityError(
   7683         message=message,
   7684         allowed_statuses=["Completed", "Stopped"],
   7685         actual_status=status,
   7686     )
-> 7687 raise exceptions.UnexpectedStatusException(
   7688     message=message,
   7689     allowed_statuses=["Completed", "Stopped"],
   7690     actual_status=status,
   7691 )

UnexpectedStatusException: Error for Training job huggingface-DPO-2024-01-09-10-11-57-2024-01-09-10-11-58-403: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`", exit code: 1

@younesbelkada (Contributor)

Hi @danieljohnxon !
As the error states:

ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires

Accelerate: `pip install accelerate`"

You need to install accelerate to make it work!
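
Since this runs as a SageMaker training job, one way to make sure it is actually present (assuming the git install of accelerate did not end up in the container for some reason) is to pin a released version in requirements.txt instead, e.g.:

accelerate>=0.25.0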


danieljohnxon commented Jan 10, 2024

Hi @younesbelkada, thank you once again for spotting that! I can run the code now, but it seems the backend is loading the model onto only one of my four GPUs, leading to a memory error (this comes back to the same OOM error as at the beginning).

May I know if there is a way to utilize all the GPUs within my instance, so that I don't run into memory errors when loading the model or during training?

2024-01-10T15:40:26.567+08:00                Map: 54%|█████▍ | 27/50 [00:00<00:00, 259.41 examples/s]

2024-01-10T15:40:26.567+08:00                Map: 0%| | 0/20 [00:00<?, ? examples/s]

2024-01-10T15:40:26.567+08:00                You are using 8-bit optimizers with a version of `bitsandbytes` < 0.41.1. It is recommended to update your version as a major bug has been fixed in 8-bit optimizers.

2024-01-10T15:40:26.567+08:00                0%| | 0/1 [00:00<?, ?it/s]

2024-01-10T15:40:27.567+08:00                NCCL version 2.16.2+cuda11.8

 

2024-01-10T15:41:22.578+08:00                Traceback (most recent call last): File "/opt/ml/code/DPO_2.py", line 417, in <module>

 

2024-01-10T15:41:22.578+08:00                main() File "/opt/ml/code/DPO_2.py", line 414, in main

 

2024-01-10T15:41:22.578+08:00                reinforcement_function(args) File "/opt/ml/code/DPO_2.py", line 385, in reinforcement_function

 

2024-01-10T15:41:22.578+08:00                dpo_trainer.train() File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train

 

2024-01-10T15:41:22.578+08:00                return inner_training_loop( File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop

 

2024-01-10T15:41:22.578+08:00                tr_loss_step = self.training_step(model, inputs) File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step

 

2024-01-10T15:41:22.578+08:00                loss = self.compute_loss(model, inputs) File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1053, in compute_loss

 

2024-01-10T15:41:22.578+08:00                loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train") File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 994, in get_batch_loss_metrics

 

2024-01-10T15:41:22.578+08:00                ) = self.concatenated_forward(model, batch) File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 964, in concatenated_forward

 

2024-01-10T15:41:22.578+08:00                all_logps = self.get_batch_logps( File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 927, in get_batch_logps

 

2024-01-10T15:41:22.578+08:00                per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)

 

2024-01-10T15:41:22.578+08:00                torch.cuda

 

2024-01-10T15:41:22.578+08:00                .OutOfMemoryError: CUDA out of memory. Tried to allocate 624.00 MiB (GPU 0; 14.58 GiB total capacity; 12.63 GiB already allocated; 199.56 MiB free; 14.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

 

2024-01-10T15:41:23.578+08:00                0%| | 0/1 [00:56<?, ?it/s]

 

2024-01-10T15:41:24.579+08:00                2024-01-10 07:41:23,759 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.

 

2024-01-10T15:41:24.579+08:00                2024-01-10 07:41:23,759 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.

 

2024-01-10T15:41:24.579+08:00                2024-01-10 07:41:23,759 sagemaker-training-toolkit ERROR Reporting training FAILURE

 

2024-01-10T15:41:24.579+08:00                2024-01-10 07:41:23,760 sagemaker-training-toolkit ERROR ExecuteUserScriptError:

 

2024-01-10T15:41:24.579+08:00                ExitCode 1

 

2024-01-10T15:41:24.579+08:00                ErrorMessage ".OutOfMemoryError: CUDA out of memory. Tried to allocate 624.00 MiB (GPU 0; 14.58 GiB total capacity; 12.63 GiB already allocated; 199.56 MiB free; 14.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF 0%| | 0/1 [00:56<?, ?it/s]"

 

2024-01-10T15:41:24.579+08:00                Command "/opt/conda/bin/python3.10 DPO_2.py --epochs 1 --hf_token xxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"

 

2024-01-10T15:41:24.579+08:00                2024-01-10 07:41:23,760 sagemaker-training-toolkit ERROR Encountered exit_code 1
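
For the loading side specifically, a `device_map` can shard the quantized weights across all four GPUs instead of piling them onto GPU 0. A minimal sketch, assuming the same `args.model_path` and `bnb_config` as in the script above (the per-GPU memory caps are illustrative):

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        quantization_config=bnb_config,
        device_map="auto",                          # let accelerate split layers across devices
        max_memory={i: "13GiB" for i in range(4)},  # leave headroom on each 16 GB T4
    )

Note this is model parallelism within a single process: it helps at load time, but each GPU still has to hold the activations for its layers during training.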

@lvwerra
Member

lvwerra commented Jan 11, 2024

If you want to distribute your model you might need to use FSDP or DeepSpeed (which you can do via accelerate). See the docs: https://huggingface.co/docs/accelerate/v0.26.0/en/usage_guides/deepspeed#deepspeed
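
For reference, a minimal accelerate config along the lines of that doc page might look like the following (a sketch with a hypothetical file name; values are illustrative for the four GPUs of a g4dn.12xlarge, and fp16 is used because its T4 GPUs do not support bf16):

    # ds_zero2.yaml
    compute_environment: LOCAL_MACHINE
    distributed_type: DEEPSPEED
    deepspeed_config:
      zero_stage: 2
      offload_optimizer_device: cpu
      gradient_accumulation_steps: 1
    mixed_precision: fp16
    num_machines: 1
    num_processes: 4

    # launched with:
    #   accelerate launch --config_file ds_zero2.yaml your_training_script.py ...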

@danieljohnxon
Author

danieljohnxon commented Jan 12, 2024

Hi @lvwerra and @younesbelkada, thank you for the advice and support. I have tried running my script with the deepspeed library, but I am still hitting a memory issue. The instance I am using is a g4dn.12xlarge, which has 4x 16 GB GPUs, so it should not run out of memory when loading the Llama2-13B model with QLoRA.

Would you mind helping me review my code and providing some guidance? I am really lost at the moment, and your help would be greatly appreciated. Thank you so much for your time and support!

Notebook code (which submits the training job):

# Hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 1,                               # number of epochs
    'learning_rate': 3e-5,                     # learning rate used during training
    'per_device_train_batch_size': 1,          # batch size for training
    'per_device_eval_batch_size': 1,           # batch size for evaluation
    'hf_token': "xxx",                         # huggingface token to access llama 2
}

job_name = f'huggingface-DPO-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

metric_definitions = [
    # ...........
    # ............
]

huggingface_estimator = HuggingFace(
    entry_point='DPO_development.py',          # reinforcement learning script
    source_dir='script_DPO',                   # directory that includes all the files needed for training
    instance_type='ml.g4dn.12xlarge',          # instance type used for the training job
    instance_count=1,                          # the number of instances used for training
    base_job_name=job_name,                    # the name of the training job
    role=aws_role,                             # IAM role used in the training job to access AWS resources, e.g., S3
    volume_size=300,                           # the size of the EBS volume in GB
    transformers_version='4.28',               # the transformers version used in the training job
    pytorch_version='2.0',                     # the PyTorch version used in the training job
    py_version='py310',                        # the Python version used in the training job
    max_run=432000,                            # maximum runtime for the training job, in seconds
    metric_definitions=metric_definitions,     # regexes to map metrics
    hyperparameters=hyperparameters,           # the hyperparameters passed to the training job
    environment={"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)

huggingface_estimator.fit({'training': training_dpo_input_path, 'val': val_dpo_input_path}, wait=True)

Latest Training Script:

................
................

def reinforcement_function(args):
    deepspeed_config = {
        "train_batch_size": "auto",  # Adjust based on your model, GPUs, and memory
        "train_micro_batch_size_per_gpu": "auto",  # Adjust for optimal performance
        "gradient_accumulation_steps": "auto",
        "optimizer": {
            "type": "AdamW",  # Specify your desired optimizer
            "params": {
                "lr": "auto",  # Set learning rate here
                "betas": [0.9, 0.999],
                "eps": 1e-8
            }
        },
        "fp16": {
            "enabled": "auto",
            "loss_scale": 0,
            "initial_scale_power": 16
        },
        "bf16": {
            "enabled": "auto"
        },
        "zero_optimization": {
            "stage": 2,
            "offload_param": {
                "device": "cpu"
            },
            "offload_optimizer": {
                "device": "cpu"
            }
        }
    }

    # 1. Load a pretrained model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_cache=False if args.gradient_checkpointing else True,  # this is needed for gradient checkpointing
        device_map={"": Accelerator().local_process_index},  # "auto",#{"":PartialState().process_index},
        quantization_config=bnb_config,
    )

    peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        target_modules=['gate_proj', 'q_proj', 'o_proj', 'k_proj', 'v_proj', 'up_proj', 'down_proj'],
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
    tokenizer.pad_token = tokenizer.eos_token

    # 2. initialize training arguments:
    output_dir = "/tmp/llama2"
    training_args = TrainingArguments(
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        num_train_epochs=args.epochs,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=args.gradient_checkpointing,
        learning_rate=args.learning_rate,
        evaluation_strategy="epoch",
        output_dir=output_dir,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.warmup_steps,
        optim=args.optimizer_type,
        bf16=args.bf16,
        remove_unused_columns=False,
        run_name="dpo_llama2",
        # logging strategies
        logging_dir=f"{output_dir}/logs",
        logging_strategy="steps",
        logging_steps=10,
        save_strategy="no",
        deepspeed=deepspeed_config,
    )

    # Load the dataset
    train_set = load_from_disk(args.train_dataset_path)
    val_set = load_from_disk(args.val_dataset_path)

    # 3. initialize the DPO trainer
    dpo_trainer = DPOTrainer(
        model,
        None,  # model_ref
        args=training_args,
        beta=args.beta,
        train_dataset=train_set,
        eval_dataset=val_set,
        tokenizer=tokenizer,
        max_length=args.max_length,
        max_prompt_length=args.max_prompt_length,
        peft_config=peft_config,
        precompute_ref_log_probs=True,
    )

    sagemaker_save_dir = "/opt/ml/model/"

    # 4. train
    dpo_trainer.train()

    # save int 4 model
    dpo_trainer.model.save_pretrained(output_dir, safe_serialization=False)  # Model weights, configuration
............
............
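
(For context on the trainer call above: when a peft_config is passed, DPOTrainer accepts None for the reference model and falls back to the base model with the adapters disabled, and precompute_ref_log_probs=True computes the reference log-probs in one pass up front, so a second copy of the 13B model never has to stay resident.)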

Error Logs:

Traceback (most recent call last):
  File "/opt/ml/code/DPO_development.py", line 431, in <module>
    main()
  File "/opt/ml/code/DPO_development.py", line 428, in main
    reinforcement_function(args)
  File "/opt/ml/code/DPO_development.py", line 399, in reinforcement_function
    dpo_trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1053, in compute_loss
    loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 994, in get_batch_loss_metrics
    ) = self.concatenated_forward(model, batch)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 958, in concatenated_forward
    all_logits = model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
torch.cuda.OutOfMemoryError: Caught OutOfMemoryError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1073, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 103, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1181, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1058, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 810, in forward
    hidden_states = self.mlp(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 268, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/lora/bnb.py", line 288, in forward
    result = self.base_layer(x, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 256, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 577, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 14.58 GiB total capacity; 12.86 GiB already allocated; 193.56 MiB free; 14.17 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/1 [00:40<?, ?it/s]
sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
sagemaker-training-toolkit INFO     Done waiting for a return code. Received 1 from exiting process.
sagemaker-training-toolkit ERROR    Reporting training FAILURE
sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage "torch.cuda.OutOfMemoryError: Caught OutOfMemoryError in replica 0 on device 0. ..." (same traceback and OutOfMemoryError message as above)
Command "/opt/conda/bin/python3.10 DPO_development.py --epochs 1 --hf_token xxx --learning_rate 3e-05 --per_device_eval_batch_size 1 --per_device_train_batch_size 1"
sagemaker-training-toolkit ERROR    Encountered exit_code 1

@danieljohnxon
Author

Based on the Hugging Face documentation, it seems like I need to run the script with the DeepSpeed launcher. However, that does not seem possible with my setup, since SageMaker invokes the entry point directly with python. Does anyone have advice on this?
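
One pattern that works around this (a sketch, with hypothetical file names) is to make the SageMaker entry point a tiny wrapper that re-invokes the real training script under `accelerate launch`, since SageMaker only ever runs `python <entry_point> <hyperparameters>`:

    # launcher.py -- hypothetical wrapper used as the SageMaker entry_point
    import subprocess
    import sys

    if __name__ == "__main__":
        subprocess.run(
            [
                "accelerate", "launch",
                "--config_file", "ds_zero2.yaml",  # e.g. the accelerate/DeepSpeed config sketched earlier
                "DPO_development.py",
            ]
            + sys.argv[1:],  # forward the hyperparameters SageMaker passes on the command line
            check=True,
        )

With entry_point='launcher.py' in the estimator, each of the four GPUs gets its own process instead of the single-process DataParallel replication visible in the traceback above.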


@kashif kashif added the 🏋 DPO Related to DPO label Jan 30, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this as completed Mar 3, 2024