[core / SFTTrainer] Fix breaking change #1229
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the action.
Long special condition may not be a good design and may not be accurate either.
I agree, the thing is that:
tokenizing the
(I could be using it in the totally wrong way because it was just used as a mitigation of the other issue I mentioned.)
Thanks for fixing @younesbelkada! Left a comment regarding the warning logic.
trl/trainer/sft_trainer.py
Outdated
warnings.warn(
    "You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
    "inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
)
args.remove_unused_columns = True
I am not sure this is the right approach here, potentially overwriting a user choice (args.remove_unused_columns = False) or warning when no warning is needed (if there are no extra columns).
Maybe we can check whether extra columns exist and suggest setting remove_unused_columns to True if that's the case?
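A minimal sketch of that suggestion, not the code that was merged: warn only when extra columns are actually present, and leave the user's setting untouched. The helper names (`find_extra_columns`, `warn_on_extra_columns`) and the default list of model input names are hypothetical and chosen for illustration only.

```python
import warnings


def find_extra_columns(column_names, model_input_names):
    """Return the dataset columns that are not recognised model inputs."""
    allowed = set(model_input_names)
    return sorted(name for name in column_names if name not in allowed)


def warn_on_extra_columns(
    column_names,
    remove_unused_columns,
    # Assumed set of inputs the default collator can handle (illustrative).
    model_input_names=("input_ids", "attention_mask", "labels"),
):
    """Warn only if remove_unused_columns=False AND extra columns exist.

    The user's value is returned unchanged; nothing is overwritten.
    """
    if remove_unused_columns:
        return remove_unused_columns
    extra = find_extra_columns(column_names, model_input_names)
    if extra:
        warnings.warn(
            f"The dataset has extra columns {extra} that the default collator "
            "may fail to encode. Consider setting `remove_unused_columns=True` "
            "or passing a custom data collator."
        )
    return remove_unused_columns
```

With this shape, a dataset that only contains model inputs produces no warning at all, which addresses the "warning when no warning is needed" concern.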
trl/trainer/sft_trainer.py
Outdated
warnings.warn(
    "You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
    "inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
)
remove_unused_columns = True
I still think we shouldn't overwrite if the user provided remove_unused_columns=False, and we shouldn't always yield a warning, because there might not be an issue at all.
* fix breaking change
* revert
* fix
* final fix
* fix
* fix tests
What does this PR do?
Fixes a breaking change introduced by a recent PR: #1188
Fixes: #1216
In fact, if one uses the default data collator with remove_unused_columns=False, the training script fails for some datasets such as imdb, because the collator will try to encode all the columns, including "text". The fix is to force-set remove_unused_columns=True only when users do not pass a data_collator for non-packed datasets, and to explain to users how they should proceed if they decide to pass remove_unused_columns=False.
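The rule described above can be sketched roughly as follows. This is an illustration of the stated condition only, not the actual `SFTTrainer` code; the function name `resolve_remove_unused_columns` and its signature are hypothetical.

```python
import warnings


def resolve_remove_unused_columns(remove_unused_columns, data_collator, packing):
    """Hypothetical helper mirroring the rule in this PR description.

    The flag is forced to True only when all three conditions hold:
    no custom data collator was passed, the dataset is not packed,
    and the user asked for remove_unused_columns=False.
    """
    uses_default_collator = data_collator is None
    if uses_default_collator and not packing and not remove_unused_columns:
        warnings.warn(
            "`remove_unused_columns=False` was passed on a non-packed dataset "
            "with the default collator; forcing it to True. Pass your own "
            "data collator if you need access to the other columns."
        )
        return True
    # In every other case the user's choice is respected.
    return remove_unused_columns
```

Under this sketch, passing any custom collator or using a packed dataset keeps `remove_unused_columns=False` as the user set it.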
cc @lvwerra