
Add support for ChatML dataset format in SFTTrainer #1208

Merged
14 commits merged into huggingface:main on Jan 12, 2024

Conversation

philschmid
Contributor

@philschmid philschmid commented Jan 9, 2024

What does this PR do?

This PR adds support for standardized dataset formats to be automatically formatted for training in the SFTTrainer, using apply_chat_template from transformers.
This allows users to pass a dataset to the SFTTrainer without needing a formatting_func. Example below:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("philschmid/dolly-15k-oai-style", split="train")

# model and training_args are assumed to be defined as usual
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    packing=True,
)
```

In its init method, the SFTTrainer tries to find the correct formatting function based on the dataset structure. Currently supported formats are:
- ChatML with [{"role": str, "content": str}]
- instruction with [{"prompt": str, "completion": str}]

Based on the dataset, it returns a callable that uses the model's tokenizer and its corresponding apply_chat_template method. This allows continued fine-tuning of, e.g., Llama2-chat or other models that already have a defined chat format.

The nice part is that you can also use the "extras" outside of the SFTTrainer, e.g. if you want to format DPO datasets with the same methods.
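The format detection described above can be sketched as follows. This is a minimal illustration of the idea, not the actual trl implementation; the `messages` column name and the `guess_format` helper are assumptions for this example.

```python
# Minimal sketch of the format-detection idea: inspect the first example's
# structure and decide which formatting path applies.
# Hypothetical helper, not the real trl code.

def guess_format(example: dict) -> str:
    """Classify an example as 'chatml', 'instruction', or 'unknown'."""
    messages = example.get("messages")
    if isinstance(messages, list) and messages and set(messages[0]) == {"role", "content"}:
        return "chatml"
    if {"prompt", "completion"} <= set(example):
        return "instruction"
    return "unknown"

print(guess_format({"messages": [{"role": "user", "content": "Hi"}]}))  # chatml
print(guess_format({"prompt": "Q?", "completion": "A."}))               # instruction
```

A real implementation would also validate every row (not just the first) or rely on the dataset's feature schema rather than a sample.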

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@younesbelkada younesbelkada left a comment


Amazing addition, thanks! I went through the PR and it looks great to me!
Can you add a few lines in the documentation to explain what it does under the hood? I think the doc section should live inside the SFTTrainer docs - wdyt?

trl/extras/dataset_formatting.py (outdated, resolved)
trl/extras/dataset_formatting.py (resolved)
@philschmid
Contributor Author

> Can you add a few lines in the documentation to explain what it does under the hood? I think the doc section should live inside the SFTTrainer docs - wdyt?

I haven't worked on the documentation yet, since I wanted to see what you think first. Will work on the docs next.

Contributor

@younesbelkada younesbelkada left a comment


Looking great to me, thanks! I just left one question about the documentation.

```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing. The [`SFTTrainer`] will then format the dataset for you using the defined format from the model's tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method.
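To make the prompt/completion case concrete, converting such a record into the chat-message list that a chat template consumes can be sketched like this. The `to_messages` helper is illustrative only, not part of trl.

```python
# Illustrative conversion of an instruction-style record into the
# [{"role": ..., "content": ...}] message list that chat templates consume.
# Hypothetical helper for this example only.

def to_messages(record: dict) -> list:
    return [
        {"role": "user", "content": record["prompt"]},
        {"role": "assistant", "content": record["completion"]},
    ]

msgs = to_messages({"prompt": "What is 2+2?", "completion": "4"})
print(msgs[0]["role"])  # user
```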
Contributor


Maybe worth adding a line saying that the user needs to make sure the tokenizer supports apply_chat_template - otherwise it'll fail, I think, no?

Contributor Author

@philschmid philschmid Jan 10, 2024


It is always supported. If there is no template defined, it falls back to a default template, which is the ChatML format from OAI. cc @Rocketknight1 to confirm

Contributor


ok perfect then!

Member


Yep, the default format for tokenizers with no chat_template or class-level default_chat_template is ChatML.
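For reference, the ChatML layout that this default template produces can be sketched by hand. This is a simplified rendering for illustration; in practice the template is applied via the tokenizer's apply_chat_template.

```python
# Hand-rolled rendering of the ChatML format for illustration only;
# in practice you would call tokenizer.apply_chat_template instead.

def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(render_chatml([{"role": "user", "content": "Hello!"}]))
# <|im_start|>user
# Hello!<|im_end|>
```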

Member

@lvwerra lvwerra left a comment


Thanks a lot @philschmid! Overall very clean, just left a few small nits.

tests/test_sft_trainer.py (resolved)
docs/source/sft_trainer.mdx (resolved)
Contributor

@younesbelkada younesbelkada left a comment


Still looks really good to me! Is it ok to merge, @philschmid?

@younesbelkada younesbelkada merged commit 776939d into huggingface:main Jan 12, 2024
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* Add support for ChatML dataset format in
SFTTrainer

* fix formatting

* fix tests

* more comment

* fix intent

* fix doc string

* Update dataset_formatting.py

* Update dataset_formatting.py

* add documentation

* Update sft_trainer.mdx

* add leonardos comment and more tests

* added more tests and fixed batching

* style

* comment in
5 participants