Help needed for fine-tuning training params #780

sbmaruf · 2021-04-07T05:22:33Z

I am fine-tuning T5 on downstream tasks in PyTorch using the Hugging Face library.
I could not find a few details about the training setup. I apologize if this information is already somewhere in this repo.

1. What optimizer did the original T5 use for fine-tuning?
2. Is there any weight decay in training? If I use weight decay, should I exclude the "layer_norm" and "bias" variables?
3. What is the learning rate schedule for training? I am trying linear_schedule_with_warmup, but I could also reimplement the one used in this repository if I knew what it is. --> UPD: I see that fine-tuning uses constant_0_001.gin.
4. In fine-tuning T5, did you see any effect of batch size? For other LMs this is a crucial parameter. --> UPD: tokens_per_batch=1048576
5. What is a good (estimated) learning rate for fine-tuning? --> UPD: got it, constant_0_001.gin.

UPD: Already got most of the info in the README.md.

craffel · 2021-04-07T14:10:44Z

Maintainer

Hi, for the remaining questions:

Optimizer: Adafactor.
Weight decay: No.
Batch size: it makes almost no difference; what matters is the total number of tokens seen over the course of fine-tuning.
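
For readers using the Hugging Face library mentioned in the question, the settings above might be sketched as follows. This is only an illustration under assumptions, not the repository's own configuration: it assumes the `Adafactor` class from `transformers.optimization`, takes the constant learning rate of 0.001 from the `constant_0_001.gin` file named in the question, and `build_optimizer` is a hypothetical helper name.

```python
# Assumed settings: Adafactor, constant learning rate 0.001, no weight decay.
# relative_step and scale_parameter are disabled so that the explicit,
# constant lr is used instead of Adafactor's internal relative schedule.
ADAFACTOR_KWARGS = {
    "lr": 1e-3,                # from constant_0_001.gin, as named in the question
    "weight_decay": 0.0,       # "No" weight decay, per the answer
    "relative_step": False,    # required when passing an explicit lr
    "scale_parameter": False,  # keep lr independent of parameter scale
    "warmup_init": False,      # only meaningful with relative_step=True
}


def build_optimizer(model):
    """Hypothetical helper: build an Adafactor optimizer for a PyTorch model."""
    # Imported lazily so the settings above can be inspected without
    # having transformers installed.
    from transformers.optimization import Adafactor

    return Adafactor(model.parameters(), **ADAFACTOR_KWARGS)
```

With a constant learning rate no scheduler object is needed; `optimizer.step()` alone is enough in the training loop.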

3 replies

sbmaruf Apr 7, 2021
Author

"what matters is the total number of tokens seen over the course of fine-tuning."
--> Do you mean the tokens_per_batch argument?

In that case, how should I calculate tokens_per_batch? Let's say my maximum source input sequence length is 512 and I have 3 samples containing 128, 256, and 64 tokens respectively. We pad each sample to length 512. Is tokens_per_batch then 512*3=1536 or 128+256+64=448?
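
The two candidate counts in this question can be written out as a small sketch. It only reproduces the arithmetic from the example; it does not establish which of the two counts `tokens_per_batch` refers to in this repository. The variable names are illustrative.

```python
max_length = 512                 # maximum source sequence length
sample_lengths = [128, 256, 64]  # real (unpadded) tokens per sample

# Count 1: every sample padded to max_length.
padded_tokens = max_length * len(sample_lengths)

# Count 2: only the real tokens, ignoring padding.
real_tokens = sum(sample_lengths)

# Share of the padded batch that is padding.
padding_fraction = 1 - real_tokens / padded_tokens

print(padded_tokens)               # 1536
print(real_tokens)                 # 448
print(round(padding_fraction, 3))  # 0.708
```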

craffel Apr 7, 2021
Maintainer

You should choose tokens_per_batch according to what fits in the memory of your accelerator (GPU/TPU). What matters is the total number of tokens seen, i.e. num_steps*tokens_per_batch. So training for 128 steps with a batch size of 512 tokens should result in similar performance to training for 32 steps with a batch size of 2048 tokens, since both see 65,536 tokens in total.
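
The relationship above can be checked with a short sketch. The function names `total_tokens` and `steps_for_budget` are illustrative, not part of the repository.

```python
def total_tokens(num_steps: int, tokens_per_batch: int) -> int:
    """Total number of tokens seen during fine-tuning."""
    return num_steps * tokens_per_batch


def steps_for_budget(token_budget: int, tokens_per_batch: int) -> int:
    """Steps needed to see at least token_budget tokens (ceiling division)."""
    return -(-token_budget // tokens_per_batch)


# The two configurations from the reply see the same number of tokens.
budget = total_tokens(128, 512)
print(budget)                          # 65536
print(total_tokens(32, 2048))          # 65536

# If only 1024 tokens per batch fit in memory, adjust the step count instead.
print(steps_for_budget(budget, 1024))  # 64
```

Fixing a token budget first and then deriving the step count from whatever `tokens_per_batch` fits in memory keeps runs on different hardware comparable.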

sbmaruf Apr 7, 2021
Author

Thank you for the reply. Now it's clear.
