Gradient lost when running hidden_states = hidden_states.to(torch.float32) #6675
Labels: solved

Comments

hanlinxuy added the bug and pending labels on Jan 16, 2025
I reproduced this with the llamafactory main branch without any modification.

System Info
Reproduction

Just modify the code in the transformers lib to add the prints:

```python
class Qwen2RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        Qwen2RMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        print("norm0", hidden_states.requires_grad, hidden_states.mean())
        # cast to float32 -- the step after which requires_grad appears to drop
        hidden_states = hidden_states.to(torch.float32)
        print("norm1", hidden_states.requires_grad, hidden_states.mean())
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        print("norm2", hidden_states.requires_grad, hidden_states.mean())
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        import ipdb; ipdb.set_trace()
        return self.weight * hidden_states.to(input_dtype)
```

Then run:

```
WANDB_DISABLED=true NCCL_SOCKET_IFNAME=eth0 FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train qwen2_full_pt.yaml
```

qwen2_full_pt.yaml:

```yaml
### model
model_name_or_path: ./Qwen2.5-1.5B-Instruct
trust_remote_code: true
flash_attn: disabled
### method
stage: pt
finetuning_type: full
do_train: true
deepspeed: examples/deepspeed/ds_z1_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset: wiki_demo
template: qwen
cutoff_len: 2048
max_steps: 10000
overwrite_cache: true
preprocessing_num_workers: 16
print_param_status: true
### output
output_dir: saves/qwen2-1b5/full/pt
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
pure_bf16: true
### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 32
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000
### eval
val_size: 1
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 100
include_tokens_per_second: true
```

Others

The log shows that after hidden_states = hidden_states.to(torch.float32), the gradient is lost (requires_grad is reported as False).
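Not from the thread, but a minimal standalone check that may help narrow this down: in plain PyTorch, Tensor.to(torch.float32) is itself an autograd-tracked op and preserves gradient flow whenever grad mode is enabled; the cast result only reports requires_grad=False when it is created inside a no-grad region. The snippet below is a self-contained sketch, independent of LLaMA-Factory:

```python
import torch

# Standalone check: does .to(torch.float32) drop gradient tracking by itself?
x = torch.randn(4, 8, dtype=torch.bfloat16, requires_grad=True)

y = x.to(torch.float32)
print("tracked cast:", y.requires_grad)       # True: the cast is part of the graph
y.pow(2).mean().backward()
print("grad reached x:", x.grad is not None)  # True: gradients flow through the cast

# The same cast performed while grad mode is disabled yields an untracked tensor,
# even though the input tensor still reports requires_grad=True.
with torch.no_grad():
    z = x.to(torch.float32)
    print("input:", x.requires_grad, "cast result:", z.requires_grad)  # True False
```

If norm0 prints True but norm1 prints False, the cast result was most likely created while grad mode was disabled, which points at the surrounding training machinery (see the note on gradient checkpointing further down) rather than at the cast itself.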
hiyouga added the solved label and removed the bug and pending labels on Jan 17, 2025
Reminder

System Info

llamafactory version: 0.9.2.dev0

Reproduction

I modified Qwen to run some experiments in the style of PowerInfer-2 (a.k.a. Turbo Sparse), but strangely the gradient of input_layernorm appears broken. I cannot understand why hidden_states = hidden_states.to(torch.float32) would break the gradient property. Can anyone help?

Others

No response
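The thread does not record the resolution, so the following is only a hypothesis: the print pattern above is exactly what re-entrant gradient checkpointing produces. With use_reentrant=True, the initial forward of each checkpointed block runs under no-grad, so every tensor created inside the block (including the float32 cast) reports requires_grad=False; the block is then recomputed with grad enabled during backward, and the parameters still receive gradients. A minimal sketch with a toy RMSNorm-like module (TinyNorm is a hypothetical name, not LLaMA-Factory or transformers code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyNorm(nn.Module):
    """Toy stand-in for an RMSNorm layer, with the same kind of debug print."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        # Prints False during the checkpointed (no-grad) forward,
        # then True when the block is recomputed during backward.
        print("inside forward: requires_grad =", hidden_states.requires_grad)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        return self.weight * hidden_states.to(input_dtype)


norm = TinyNorm(8)
x = torch.randn(2, 8, dtype=torch.bfloat16, requires_grad=True)

out = checkpoint(norm, x, use_reentrant=True)  # forward pass runs under no-grad
out.float().sum().backward()                   # block is recomputed here with grad enabled

print("weight received a gradient:", norm.weight.grad is not None)  # True
```

Under that hypothesis, the requires_grad=False seen at norm1 is expected during the first pass and does not by itself mean the loss gradient is broken; checking parameter .grad values after backward, or doing a sanity run with gradient checkpointing disabled, distinguishes a real autograd break from this benign print.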