Add support for ChatML dataset format in SFTTrainer #1208
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Amazing addition, thanks! I went through the PR and it looks great to me!
Can you add a few lines in the documentation to explain what it does under the hood? I think the doc section should live inside the SFTTrainer docs - wdyt?
I didn't work on the documentation yet, since I wanted to see what you think. Will work on the docs next.
Looking great to me, thanks! I just left one question for the documentation.
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing. The [`SFTTrainer`] will then format the dataset for you using the defined format from the model's tokenizer with the [apply_chat_template](https://huggingface.co/docs/transformers/main/en/chat_templating#templates-for-chat-models) method.
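For illustration, records in the two supported formats might look like the following in Python. This is a sketch: the message contents are placeholders, and the conversational records are assumed to live under a `messages` column.

```python
# Conversational (ChatML-style) record: a list of role/content turns.
chatml_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]
}

# Instruction record: a single prompt/completion pair.
instruction_record = {
    "prompt": "What is the capital of France?",
    "completion": "Paris.",
}
```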
Maybe worth adding a line saying that the user needs to make sure the tokenizer supports apply_chat_template, otherwise it'll fail I think, no?
It is always supported. If there is no template defined, it falls back to a default template, which is the ChatML format from OpenAI. cc @Rocketknight1 to confirm
ok perfect then!
Yep, the default format for tokenizers with no chat_template or class-level default_chat_template is ChatML.
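As a sanity check on what that fallback produces, here is a minimal sketch of the ChatML layout. `render_chatml` is an illustrative helper, not part of transformers; the real fallback is a Jinja template on the tokenizer.

```python
def render_chatml(messages):
    # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>, one per line.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

text = render_chatml([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
# text == "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\nHello!<|im_end|>\n"
```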
Thanks a lot @philschmid! Overall very clean, just left a few small nits.
Still looks really good to me! Is it ok for merge @philschmid ?
* Add support for ChatML dataset format in SFTTrainer
* fix formatting
* fix tests
* more comment
* fix intent
* fix doc string
* Update dataset_formatting.py
* Update dataset_formatting.py
* add documentation
* Update sft_trainer.mdx
* add leonardos comment and more tests
* added more tests and fixed batching
* style
* comment in
What does this PR do?
This PR adds support for a standardized dataset to be automatically formatted for training in the SFTTrainer, using the apply_chat_template method from transformers. This allows users to pass the dataset to the SFTTrainer without the need for a formatting_func. Example below.

In the init method the SFTTrainer tries to find the correct formatting function based on the dataset structure. Currently supported datasets are:
- ChatML with [{"role": str, "content": str}]
- instruction with [{"prompt": str, "completion": str}]

Based on the dataset, it returns a callable which uses the tokenizer of the model and the corresponding apply_chat_template method. This allows continued fine-tuning for, e.g., Llama2-chat or other models which already have a defined format.

The nice part about it is that you can use the "extras" outside of the SFTTrainer, e.g. if you want to format DPO datasets with these methods.
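To illustrate the detection step described above, here is a hypothetical sketch of how an init-time check could dispatch on column names. The function name and column names are assumptions for illustration, not the actual TRL internals.

```python
def guess_dataset_format(column_names):
    # Conversational data: a "messages" column holding role/content dicts.
    if "messages" in column_names:
        return "chatml"
    # Instruction data: parallel "prompt" and "completion" columns.
    if {"prompt", "completion"} <= set(column_names):
        return "instruction"
    # Unknown structure: the user must pass an explicit formatting_func.
    return None
```

The returned label would then select a callable that feeds the records through the tokenizer's apply_chat_template method.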