[`core` / SFTTrainer] Fix breaking change #1229

younesbelkada · 2024-01-15T13:44:51Z

What does this PR do?

Fixes a breaking change introduced by a recent PR: #1188
Fixes: #1216

In fact, if one uses the default data collator with remove_unused_columns=False, the training script fails for some datasets such as imdb, because the collator will try to encode all the columns, including "text". The fix should be to force-set remove_unused_columns=True, only in the case users do not pass a data_collator for non-packing datasets, and to explain to users how they should proceed if they decide to pass remove_unused_columns=False.

cc @lvwerra

HuggingFaceDocBuilderDev · 2024-01-15T13:49:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

xkszltl · 2024-01-15T14:22:26Z

Thanks for the action.
Not sure about details yet, just a general comment:

> only in the case users do not pass a data_collator for non-packing datasets

A long special condition may not be good design, and may not be accurate either.

younesbelkada · 2024-01-15T14:31:46Z

I agree, the thing is that:
1- In previous releases it was possible to pass remove_unused_columns=False to TrainingArguments with a non-packed dataset without any problem; however, we were silently ignoring the unused columns. #1188 breaks that, while preserving the "right" behaviour.
2- Passing remove_unused_columns=False with the default data collator will most likely fail, because the default data collator will try to tokenize text columns by default, so it'll always fail in that scenario. IMO this is unlikely to be what users intend, and we should support that feature only for users who know what they're doing, by subclassing the default collator and passing it to SFTTrainer.
What do you think?
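To make point 2 concrete, here is a rough, self-contained sketch of the failure mode. It is not the real collator: `toy_collate` is a hypothetical stand-in that, like the scenario described above, tries to batch every column it receives and can only handle lists of integer ids.

```python
def toy_collate(features):
    """Batch every column of every example; only lists of ints are supported.

    Hypothetical stand-in for a default collator, for illustration only.
    """
    batch = {}
    for key in features[0]:
        column = [example[key] for example in features]
        for value in column:
            is_id_list = isinstance(value, list) and all(
                isinstance(item, int) for item in value
            )
            if not is_id_list:
                raise ValueError(f"Unable to batch column {key!r}")
        batch[key] = column
    return batch


# Unused columns removed: only token ids reach the collator.
tokenized_only = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]

# Unused columns kept: the raw "text" strings reach the collator too.
with_raw_text = [
    {"input_ids": [1, 2, 3], "text": "great movie"},
    {"input_ids": [4, 5, 6], "text": "bad movie"},
]

print(toy_collate(tokenized_only))
try:
    toy_collate(with_raw_text)
except ValueError as error:
    print("collation failed:", error)
```

A custom collator that drops or handles the extra columns itself would avoid the error, which is the "subclass the default collator" route mentioned above.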

xkszltl · 2024-01-15T14:41:21Z

Wait, maybe I didn't get it correctly in the first place? My intention is to tokenize the text column so that it can be used for training; is that supposed to be the case?

younesbelkada · 2024-01-15T14:43:04Z

Tokenizing the text column should already be handled properly by the SFTTrainer: when you pass dataset_text_field="text", it will tokenize the text field and remove the text column after tokenizing it.
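As a toy illustration of that behaviour (not the actual SFTTrainer code), the sketch below tokenizes the chosen text field and, by default, keeps only the token ids. `toy_tokenize` and `prepare_rows` are hypothetical names, and the "tokenizer" just emits one fake id per word.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one fake id per word.
    return [len(word) for word in text.split()]


def prepare_rows(rows, dataset_text_field, remove_unused_columns=True):
    """Tokenize one text field per row and optionally drop the other columns."""
    prepared = []
    for row in rows:
        tokenized = {"input_ids": toy_tokenize(row[dataset_text_field])}
        if not remove_unused_columns:
            # Keep the original columns (including the raw text) next to the ids.
            tokenized = {**row, **tokenized}
        prepared.append(tokenized)
    return prepared


rows = [{"text": "a great movie", "label": 1}]
print(prepare_rows(rows, "text"))
print(prepare_rows(rows, "text", remove_unused_columns=False))
```

With the default setting only `input_ids` survives; with `remove_unused_columns=False` the raw `text` and `label` columns stay in each row, which is the situation discussed in this thread.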

xkszltl · 2024-01-15T14:44:38Z

(I could be using it in the totally wrong way because it was just used as a mitigation of the other issue I mentioned.)

lvwerra

Thanks for fixing @younesbelkada! Left a comment regarding the warning logic.

lvwerra · 2024-01-15T15:00:51Z

trl/trainer/sft_trainer.py

```diff
+                warnings.warn(
+                    "You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
+                    "inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
+                )
+                args.remove_unused_columns = True
```


I am not sure this is the right approach here, potentially overwriting a user choice (args.remove_unused_columns = False) or warning when no warning is needed (if there are no extra columns).

Maybe we can check whether extra columns exist and, if so, suggest setting `remove_unused_columns=True`?
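A minimal sketch of that idea, under stated assumptions: `SIGNATURE_COLUMNS` and `warn_on_extra_columns` are hypothetical names, and the set of columns the default collator can handle is assumed here for illustration. The check warns only when extra columns would actually reach the collator, and it never overwrites the user's setting.

```python
import warnings

# Assumed set of columns the default collator can handle; illustrative only.
SIGNATURE_COLUMNS = ("input_ids", "labels", "attention_mask")


def warn_on_extra_columns(column_names, remove_unused_columns):
    """Return the extra columns; warn only if they would reach the collator."""
    extra_columns = sorted(set(column_names) - set(SIGNATURE_COLUMNS))
    if not remove_unused_columns and extra_columns:
        warnings.warn(
            f"`remove_unused_columns=False` leaves extra columns {extra_columns} "
            "in the dataset; the default collator may fail on them. Consider "
            "`remove_unused_columns=True` or a custom data collator."
        )
    # The user's setting is never overwritten here.
    return extra_columns
```

For example, `warn_on_extra_columns(["input_ids", "text"], remove_unused_columns=False)` emits a warning, while a dataset that only has `input_ids` and `attention_mask` passes silently.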

lvwerra · 2024-01-16T15:34:53Z

trl/trainer/sft_trainer.py

```diff
+            warnings.warn(
+                "You passed `remove_unused_columns=False` on a non-packed dataset while using `DataCollatorForLanguageModeling`. This might create some issues with the default collator. If you want to "
+                "inspect dataset other columns, you can subclass `DataCollatorForLanguageModeling` and create your own data collator in order to inspect the unused dataset columns."
+            )
+            remove_unused_columns = True
```


I still think we shouldn't overwrite the value if the user provided remove_unused_columns=False, and we shouldn't always emit a warning, because there might not be an issue at all.

* fix breaking change * revert * fix * final fix * fix * fix tests

younesbelkada added 2 commits January 15, 2024 13:42

fix breaking change

7494b6c

revert

bedaff3

younesbelkada mentioned this pull request Jan 15, 2024

Failed to load data in trl 0.7.8/0.7.9. #1216

Closed

younesbelkada requested a review from lvwerra January 15, 2024 14:11

lvwerra reviewed Jan 15, 2024

fix

e35225a

younesbelkada requested a review from lvwerra January 15, 2024 16:40

lvwerra reviewed Jan 16, 2024

final fix

3f1dc3c

younesbelkada requested a review from lvwerra January 17, 2024 08:54

younesbelkada added 3 commits January 17, 2024 09:01

fix

ec1b957

Merge remote-tracking branch 'origin/main' into fix-breaking-change

9f0acd2

fix tests

68849f0

lvwerra approved these changes Jan 17, 2024

younesbelkada merged commit bcccdeb into main Jan 17, 2024
9 checks passed

younesbelkada deleted the fix-breaking-change branch January 17, 2024 13:45

lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024

[core / SFTTrainer] Fix breaking change (huggingface#1229)

cdb768f

* fix breaking change * revert * fix * final fix * fix * fix tests
