When learning with Transformer, loss becomes nan after backpropagation. #37

sooftware · 2020-07-23T16:38:04Z

Currently, Seq2seq and Transformer have two models implemented, and after backpropagation when learning with Transformer, the phenomenon of loss becoming nan continues. I have tried debugging, but I have not yet confirmed which part is wrong. If you have had a similar experience or have any guesses, I would appreciate it if you could help me.

affjljoo3581 · 2020-09-13T00:06:25Z

Applying attention masks with -np.inf may lead nan in both outputs and gradients (of course). You can simply test with:

model = SpeechTransformer(
    num_classes=10, d_model=64, input_dim=80, d_ff=256,
    num_encoder_layers=3, num_decoder_layers=3)

inputs = torch.rand((32, 16, 80), dtype=torch.float)
targets = torch.randint(0, 10, (32, 16), dtype=torch.long)
lengths = torch.empty((32,), dtype=torch.long).fill_(80)

with torch.no_grad():
    predicted = model(inputs, lengths, targets)
    print(np.isnan(predicted.numpy()).any())

Using -np.inf in masked_fill_:

True

Using -1e9 in masked_fill_:

False

The implementations of Transformer model usually choose the constant (and bounded) values instead of unusual ones (e.g. np.inf, float('inf')).

Sometimes you can see some with -np.inf, but note that they are written for evaluation, not training. Of course there is no problem with using -np.inf in inference.

sooftware · 2020-09-13T06:51:41Z

I never thought there was a bug in that part! Thank you, I'll try it out!

sooftware · 2020-09-13T07:22:43Z

After experimenting, loss becomes nan again. There was this problem, but there seems to be another problem.

sooftware · 2020-09-13T07:54:49Z

And if you refer to this repo, it works normally when it is -np.inf. Further checks are likely to be needed on that part.

affjljoo3581 · 2020-09-13T09:41:15Z

Let's check why this repository works well with -np.inf. First, create two dummy tensors as follows:

pred = np.random.rand(64, 32, 1024)
pred = np.where(pred < 0.999999, pred, np.nan)
pred = torch.tensor(pred, dtype=torch.float)

target = np.random.randint(0, 1024, (64, 32), dtype=np.long)
target = np.where(np.any(np.isnan(pred), axis=-1), 0, target)
target = torch.tensor(target, dtype=torch.long)

pred tensor contains nan in random positions. Of course, target contains the information to ignore the nan elements. Let's compare the losses from two criterions (LabelSmoothedCrossEntropyLoss and cal_loss in this repository).

LabelSmoothedCrossEntropyLoss

loss = LabelSmoothedCrossEntropyLoss(
    num_classes=1024,
    ignore_index=0,
    smoothing=0.1,
    architecture='transformer',
    reduction='mean')
print(loss(pred.view(-1, pred.size(-1)), target.view(-1)))

Output: tensor(nan)

cal_loss in kaituoxu/Speech-Transformer

print(cal_loss(pred.view(-1, pred.size(2)), target.view(-1),
               smoothing=0.1))

Output: tensor(6.9713)

Why is it happend? Actually, they both work well without label-smoothing. The problem is in reducing the loss tensor.

LabelSmoothedCrossEntropyLoss

# ...
with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    label_smoothed[target == self.ignore_index, :] = 0
return self.reduction_method(-label_smoothed * logit)
# ...

cal_loss in kaituoxu/Speech-Transformer

# ...
non_pad_mask = gold.ne(IGNORE_ID)
n_word = non_pad_mask.sum().item()
loss = -(one_hot * log_prb).sum(dim=1)
loss = loss.masked_select(non_pad_mask).sum() / n_word
# ...

While your code reduces the smoothed logits, cal_loss uses maksed_select to exclude nan values (precisely, IGNORE_ID elements). label_smooth contains non-zero nan weights (that is, nan values should be multiplied with non-zero weights in label_smooth) and it may lead nan of total loss.

So if you want to use -np.inf in attention masks, you should change the code as below:

with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    # label_smoothed[target == self.ignore_index, :] = 0

    score = (-label_smoothed * logit).sum(1)
    score = score.masked_select(target != self.ignore_index)
return self.reduction_method(score)

Output: tensor(6.9704)

sooftware · 2020-09-13T09:54:28Z

oh thanks to let me know.
But loss is still nan.
I'd appreciate it if you could tell me where you can guess.

affjljoo3581 · 2020-09-13T10:51:07Z

I've never seen RampUpLR scheduling before. Can you explain the concept of ramp up lr decay?

sooftware · 2020-09-13T10:55:37Z

Never mind.
I checked that it had nothing to do with turning it off and turning it on.

affjljoo3581 · 2020-09-13T11:09:09Z

No. Basically Transformer model with post-LN needs learning rate warm-up. You need to consider that. I don't have any dataset of this project so I cannot test your code accurately. When does the loss diverge? Can you show me the training logs in detail?

sooftware · 2020-09-13T11:11:57Z

Can you come to gitter and talk to me in real time?

sooftware · 2021-01-04T16:15:17Z

마스킹에 문제있는 것을 확인 => 디버깅중

resolved issue #37

sooftware added the help wanted Extra attention is needed label Jul 23, 2020

sooftware pinned this issue Jul 23, 2020

This was referenced Sep 12, 2020

Change to use valid number for masking #47

Closed

Change to use valid number for masking #48

Merged

sooftware closed this as completed in #48 Sep 13, 2020

sooftware reopened this Sep 13, 2020

sooftware added a commit that referenced this issue Jan 4, 2021

resolved issue #37

a3edd92

sooftware mentioned this issue Jan 4, 2021

resolved issue #37 #75

Merged

sooftware added a commit that referenced this issue Jan 4, 2021

Merge pull request #75 from sooftware/transformer-debug

dcd2a6a

resolved issue #37

sooftware closed this as completed Jan 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When learning with Transformer, loss becomes nan after backpropagation. #37

When learning with Transformer, loss becomes nan after backpropagation. #37

sooftware commented Jul 23, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Jan 4, 2021

When learning with Transformer, loss becomes nan after backpropagation. #37

When learning with Transformer, loss becomes nan after backpropagation. #37

Comments

sooftware commented Jul 23, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

affjljoo3581 commented Sep 13, 2020

sooftware commented Sep 13, 2020

sooftware commented Jan 4, 2021