-
I trained with 1000 conversational examples, and the loss became very small, but the results were poor. I then took 10 examples for training, and after about 10 epochs the results were very good. What is the reason for this, and how can I solve it? My config looks like this:
CFG = {
    'vocab_size': 21128,
    'ctx_len': 512,
    'embed_dim': 768,
    'n_heads': 12,
    'n_layers': 12,
    'drop_rate': 0.1
}
-
Hi @qibin0506, let's see what's going on here... first, a few questions:
1. Does this mean you only had 10 training examples? In that case, when you say the loss was very small, do you mean the training or the validation loss? One possible explanation is that with only 10 training examples, your model was overfitting to the training data a lot.
2. It seems you have decreased the vocabulary size. Did you still use the GPT-2 tokenizer? If you change the vocabulary, you'd have to create a new tokenizer and pretrain the model with it before you do the finetuning. This could be another explanation for the bad performance; see the sketch below for a quick way to check whether the tokenizer and the config are consistent.
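A minimal sketch of that sanity check, assuming the standard tiktoken GPT-2 tokenizer (the CFG values are just copied from your question; if you used a different tokenizer, substitute it here):

```python
import tiktoken

CFG = {
    'vocab_size': 21128,   # value from the config in the question
    'ctx_len': 512,
    'embed_dim': 768,
    'n_heads': 12,
    'n_layers': 12,
    'drop_rate': 0.1,
}

# The GPT-2 BPE tokenizer has 50,257 tokens (including <|endoftext|>).
tok = tiktoken.get_encoding("gpt2")
print("tokenizer vocab size:", tok.n_vocab)        # 50257
print("model vocab size:   ", CFG['vocab_size'])   # 21128

# If these numbers don't match, token IDs from the tokenizer can exceed the
# embedding table (index error), or large parts of the embedding matrix are
# never trained -- either way the tokenizer and the model are out of sync.
assert tok.n_vocab == CFG['vocab_size'], (
    "vocab_size in the model config must match the tokenizer's vocabulary"
)
```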
-
Thank you for your reply, @rasbt.
Do you have a portion of the data set aside for validation? What do the training and validation loss curves look like?
I think in this case a dataset of 10k conversational examples may be too small for pretraining, unless you used a different dataset for the pretraining?
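If it helps, here is a minimal sketch of the kind of check I mean, in plain PyTorch; `model` and the `dataset` of (input, target) tensor pairs are hypothetical placeholders for whatever you are actually using, so adapt it to your own training loop:

```python
import torch
import torch.nn.functional as F

# Hypothetical: `dataset` is a list of (input_ids, target_ids) tensor pairs
# built from the conversational data; `model` is the GPT model from the config.
def split_dataset(dataset, val_fraction=0.1):
    n_val = max(1, int(len(dataset) * val_fraction))
    return dataset[n_val:], dataset[:n_val]   # train split, validation split

@torch.no_grad()
def mean_loss(model, data, device="cpu"):
    model.eval()
    total = 0.0
    for input_ids, target_ids in data:
        logits = model(input_ids.to(device))             # (batch, seq, vocab)
        total += F.cross_entropy(
            logits.flatten(0, 1), target_ids.to(device).flatten()
        ).item()
    model.train()
    return total / len(data)

# Inside the training loop, after each epoch:
# train_loss = mean_loss(model, train_data)
# val_loss   = mean_loss(model, val_data)
# A training loss that keeps falling while the validation loss stalls or
# rises is the classic signature of overfitting to a tiny dataset.
```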