[feat] Trainer with prompts and prompt masking #2964

Open · wants to merge 32 commits into master from trainer-with-prompt-masking
Conversation

@ArthurCamara (Contributor) commented Sep 27, 2024

Pull Request overview

  • Adds support for including prompts in the Trainer class
  • Supports masking the prompts in the Pooling module when training

Details

Currently, the encode method of SentenceTransformer supports adding prompts (or instructions) to the sentences dynamically by passing either prompt or prompt_name. However, as mentioned in #2945, this is not supported during training, which uses the forward method instead.

This PR implements similar functionality in the Trainer by adding a prompt parameter that can take three forms (illustrated in the sketch after this list):

  • str: The prompt will be prepended to all sentences in the dataset.
  • dict[str, str]: If the keys are column names, the respective prompt is prepended to every value of that column. If the training data is a dictionary of datasets and the keys are dataset names, the prompt is prepended to all columns of the respective dataset.
  • dict[str, dict[str, str]]: Same as above, but the first-level keys are dataset names and the second-level keys are column names.
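
For illustration, a minimal sketch of the three forms, assuming the prompt parameter name proposed in this PR (a rename to prompts is discussed below) and placeholder model, dataset, and column names:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "anchor": ["what is a sentence embedding?"],
    "positive": ["A dense vector representation of a sentence."],
})

# 1. str: one prompt, prepended to every column of every dataset
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
    prompt="Represent this sentence: ",
)

# 2. dict[str, str] keyed by column name: one prompt per column
prompt = {"anchor": "query: ", "positive": "passage: "}

# 3. dict[str, dict[str, str]]: dataset name -> column name -> prompt
prompt = {"nq": {"anchor": "query: ", "positive": "passage: "}}
```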

As the prompts can be dynamic (changing for each dataset and column), they are injected into the sentences by the get_train_dataloader, get_eval_dataloader, and get_test_dataloader methods, which call add_prompts_to_dataset to resolve, for each dataset and column, which prompt to inject.
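
A rough sketch of what the column-level injection could look like, assuming the per-dataset resolution has already produced a flat column-to-prompt mapping (an illustration, not the PR's exact implementation):

```python
from datasets import Dataset

def add_prompts_to_dataset(dataset: Dataset, column_prompts: dict[str, str]) -> Dataset:
    """Prepend the resolved prompt to every value of each prompted column."""
    def prepend(batch: dict) -> dict:
        for column, prompt in column_prompts.items():
            if column in batch:
                batch[column] = [prompt + text for text in batch[column]]
        return batch
    return dataset.map(prepend, batched=True)
```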

Finally, add_prompts_to_dataset also adds <column_name>_prompt_length columns that, when passed to the Pooling module with include_prompt=False, mask the instructions properly as well. (Currently this is only set explicitly for Instructor models, but the user can enable it by calling model.set_pooling_include_prompt(include_prompt=False).)
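
To illustrate what the prompt lengths enable, here is a simplified mean-pooling sketch (not the actual Pooling code) that zeroes the attention mask over the prompt tokens before averaging:

```python
import torch

def mean_pool_excluding_prompt(
    token_embeddings: torch.Tensor,  # (batch, seq_len, dim)
    attention_mask: torch.Tensor,    # (batch, seq_len)
    prompt_length: torch.Tensor,     # (batch,), from <column_name>_prompt_length
) -> torch.Tensor:
    mask = attention_mask.clone()
    for i, length in enumerate(prompt_length.tolist()):
        mask[i, :length] = 0  # ignore the prompt/instruction tokens
    mask = mask.unsqueeze(-1).to(token_embeddings.dtype)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```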

@ArthurCamara changed the title from "Trainer with prompt masking" to "[feat] Trainer with prompts and prompt masking" on Sep 27, 2024
@tomaarsen force-pushed the trainer-with-prompt-masking branch 2 times, most recently from 354cb65 to bf9eb80 on September 30, 2024 15:36
@tomaarsen (Collaborator)

Hello!

Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here.
I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:

  1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).
  2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:
    • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
    • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there are multiple datasets.

Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56

The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.
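
As a rough sketch of this suggestion (not the actual SentenceTransformerDataCollator), the collator could resolve the prompt per (dataset, column) pair and prepend it while building the batch, with prompts assumed to be the nested dataset-to-column-to-prompt mapping:

```python
def collate_with_prompts(
    features: list[dict], prompts: dict[str, dict[str, str]]
) -> dict[str, list]:
    batch = {}
    # "dataset_name" is assumed to be a per-row column when training on multiple datasets
    dataset_name = features[0].get("dataset_name", "")
    for column in features[0]:
        values = [row[column] for row in features]
        if isinstance(values[0], str) and column != "dataset_name":
            prompt = prompts.get(dataset_name, {}).get(column, "")
            values = [prompt + value for value in values]  # prepend "on the fly"
        batch[column] = values
    return batch
```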

I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in encode etc.

I'm curious to hear your thoughts on this.

  • Tom Aarsen

@ArthurCamara (Contributor, Author)

> Hello!
>
> Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here. I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:
>
>   1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).
>   2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:
>     • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
>     • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there are multiple datasets.
>
> Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56
>
> The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.

This was one of the things I was considering: changing the Collator instead of the dataset itself. But I previously had issues with Accelerator and DDP when the data was not exclusively tensors (i.e., contained strings), though I think we can work around that within the collator. I will give it a shot and let you know.

> I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in encode etc.

Agreed. =)

> I'm curious to hear your thoughts on this.

@ArthurCamara (Contributor, Author) commented Oct 3, 2024

@tomaarsen There you go. I've moved the prompting logic into the data_collator. The prompt_length column will only be added if the Pooling module has include_prompt set to False.

What needed to be kept in the Trainer, however, is the logic that decides whether or not to call add_dataset_name_column. To clean up the code a bit, I've moved all of these checks into the maybe_add_dataset_name_column method (sketched below). Let me know if you think this is a good idea.
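
A hypothetical sketch of such a check (the helper and parameter names are assumptions for illustration, not the PR's exact code): a dataset_name column is only needed when the prompts are keyed per dataset, so the collator can resolve the right prompt for each row.

```python
from datasets import Dataset

def maybe_add_dataset_name_column(
    dataset_dict: dict[str, Dataset], prompt
) -> dict[str, Dataset]:
    # Only add a per-row "dataset_name" column when prompts are keyed by dataset name
    if isinstance(prompt, dict) and set(prompt) & set(dataset_dict):
        return {
            name: dataset.add_column("dataset_name", [name] * len(dataset))
            for name, dataset in dataset_dict.items()
        }
    return dataset_dict
```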

@ShengYun-Peng

Hi @ArthurCamara @tomaarsen, thank you both for the PR, and sorry for the late reply! I have implemented a local trainer that supports the prompt logic, following the Instructor code. It's a bit different from the logic of the current PR, and I would like to hear your thoughts on it.

The idea is to change only the dataset and the Transformer module of the Sentence Transformer model. The dataset part is similar to yours: the prompt/instruction and the user sentence are loaded separately. However, the Transformer tokenizer computes an instruction mask based on the tokenized length of the prompt/instruction, so the input dict contains three items: input_ids, attention_mask, and instruction_mask. After the Transformer finishes the forward pass, I update the attention_mask with the element-wise product of the attention_mask and the instruction_mask, so the instruction is excluded from the final embedding. Of course, if the user wants to include the instruction in the final embedding, we can add a flag in the argument.
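
A minimal sketch of this approach, assuming a BERT-style tokenizer (the function name and model are placeholders): the instruction mask is built at tokenization time and multiplied into the attention mask after the forward pass.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_with_instruction_mask(instruction: str, sentence: str) -> dict:
    encoded = tokenizer(instruction + sentence, return_tensors="pt")
    # Length of the instruction in tokens, excluding special tokens
    instruction_length = len(tokenizer(instruction, add_special_tokens=False)["input_ids"])
    instruction_mask = torch.ones_like(encoded["attention_mask"])
    instruction_mask[:, 1 : 1 + instruction_length] = 0  # keep [CLS], mask the instruction
    encoded["instruction_mask"] = instruction_mask
    return encoded

features = tokenize_with_instruction_mask(
    "Embed the semantic meaning of the sentence: ", "A plane is taking off."
)
# After the Transformer forward pass, exclude the instruction from pooling:
features["attention_mask"] = features["attention_mask"] * features["instruction_mask"]
```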

I'm sure your code will also work. At first glance, it seems the logic above is implemented in the Pooling rather than the Transformer layer of the Sentence Transformer. It would be great if you could share your high-level design and thought process. Thanks!

  • Anthony Peng

@ShengYun-Peng commented Oct 4, 2024

Thus, in my implementation, each sentence in a dataset is a pair of strings. Take STS as an example: one input sample was previously {"sentence1": "A plane is taking off.", "sentence2": "An air plane is taking off.", "score": 1}. Now the input is {"sentence1": ["Embed the semantic meaning of the sentence: ", "A plane is taking off."], "sentence2": ["Embed the semantic meaning of the sentence: ", "An air plane is taking off."], "score": 1}. This breaks model card creation, since a sequence object in the dataset has no .values().
