Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core / SFTTrainer] Fix breaking change #1229

Merged
merged 7 commits into from
Jan 17, 2024
Merged

Conversation

younesbelkada
Copy link
Contributor

What does this PR do?

fixes a breaking change on a recent PR: #1188
Fixes: #1216

In fact, if one uses the default data collator + remove_unused_columns=False, the training script fails for some datasets such as imdb as the collator will try to encode all the columns, including "text". The fix should be to force-set remove_unused_columns=True, only in the case users do not pass a data_collator for non-packing datasets and explain to the users who they should proceed in case they decided to pass remove_unused_columns=False

cc @lvwerra

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada younesbelkada requested a review from lvwerra January 15, 2024 14:11
@xkszltl
Copy link

xkszltl commented Jan 15, 2024

Thanks for the action.
Not sure about details yet, just a general comment:

only in the case users do not pass a data_collator for non-packing datasets

Long special condition may not be a good design and may not be accurate either.

@younesbelkada
Copy link
Contributor Author

younesbelkada commented Jan 15, 2024

I agree, the thing is that :
1- in previous releases it was possible to pass remove_unused_columns=False to trainingarguments + non-packed dataset without any problem - however we were silently ignoring the unusued columns - and #1188 breaks it but while preserving the "right" behaviour
2- passing remove_unused_columns=False + default data collator will most likely fail before the default datacollator will try to tokenize text columns by default, so it'll always fail in that scenario. IMO it should be bad intent and we should leave the support of that feature only for users that they know what they're doing, by subclassing the default collator and passing it to SFTTrainer
What do you think?

@xkszltl
Copy link

xkszltl commented Jan 15, 2024

  1. Wait, maybe I didn't got it correctly in the first place? My intention is to tokenize the text column so that it can be trained with, is that supposed to be the case?

@younesbelkada
Copy link
Contributor Author

tokenizing the text column should be already handled properly by the SFTTrainer, when you pass dataset_text_field="text" it will tokenize the text field and remove the text column after tokenizing it

@xkszltl
Copy link

xkszltl commented Jan 15, 2024

(I could be using it in the totally wrong way because it was just used as a mitigation of the other issue I mentioned.)

Copy link
Member

@lvwerra lvwerra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing @younesbelkada! Left a comment regarding the warning logic.

Comment on lines 253 to 257
warnings.warn(
"You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
"inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
)
args.remove_unused_columns = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is the right approach here, potentially overwriting a user choice (args.remove_unused_columns = False) or warning when no warning is needed (if there are no extra columns).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can check if extra columns exist and suggesting setting removed unused columns to True if that's the case?

@younesbelkada younesbelkada requested a review from lvwerra January 15, 2024 16:40
Comment on lines 444 to 448
warnings.warn(
"You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
"inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
)
remove_unused_columns = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we shouldn't overwrite if the user provided remove_unused_colums=False and not always yield a warning because there might not be an issue at all.

@younesbelkada younesbelkada requested a review from lvwerra January 17, 2024 08:54
@younesbelkada younesbelkada merged commit bcccdeb into main Jan 17, 2024
9 checks passed
@younesbelkada younesbelkada deleted the fix-breaking-change branch January 17, 2024 13:45
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* fix breaking change

* revert

* fix

* final fix

* fix

* fix tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to load data in trl 0.7.8/0.7.9.
4 participants