In the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", the learning rate depends on the batch size. However, I find that the learning rate is also related to the model size: for a larger BERT model, the learning rate should be smaller to keep training stable, e.g. BERT-Tiny can use a much larger learning rate than BERT-Large. Is this right?
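For concreteness, the batch-size dependence I am referring to is the square-root scaling heuristic discussed in the paper. A minimal sketch of that rule (the base values below are hypothetical, not the paper's exact hyperparameters, and there is no model-size term here, which is exactly my question):

```python
import math


def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Square-root learning-rate scaling, as commonly used with LAMB.

    Scales a reference learning rate (tuned at base_batch_size) to a new
    batch size. Note that nothing here accounts for model size.
    """
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Hypothetical example: a reference lr of 1e-3 tuned at batch size 512,
# scaled up to a batch size of 32768.
print(scaled_lr(base_lr=1e-3, base_batch_size=512, batch_size=32768))  # ~8e-3
```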