In the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes", the learning rate depends on the batch size. However, I find that the learning rate is also related to the model size: for a larger BERT model, the learning rate should be smaller to keep training stable, e.g. BERT-Tiny can use a much larger learning rate than BERT-Large. Is this right?
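For concreteness, the batch-size dependence I am referring to is the square-root scaling heuristic discussed in the paper. A minimal sketch of that rule (the base values below are hypothetical, not the paper's exact hyperparameters, and there is no model-size term here, which is exactly my question):

```python
import math


def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Square-root learning-rate scaling, as commonly used with LAMB.

    Scales a reference learning rate (tuned at base_batch_size) to a new
    batch size. Note that nothing here accounts for model size.
    """
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Hypothetical example: a reference lr of 1e-3 tuned at batch size 512,
# scaled up to a batch size of 32768.
print(scaled_lr(base_lr=1e-3, base_batch_size=512, batch_size=32768))  # ~8e-3
```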