[feat] Trainer with prompts and prompt masking #2964

Open · wants to merge 32 commits into master from trainer-with-prompt-masking
Conversation

@ArthurCamara (Contributor) commented Sep 27, 2024

Pull Request overview

  • Adds support for including prompts in the Trainer class
  • Supports masking the prompts in the Pooling module when training

Details

Currently, the encode method of SentenceTransformer supports adding prompts (or instructions) to the sentences dynamically by passing either prompt or prompt_name. However, as mentioned in #2945, this is not supported during training, which uses the forward method instead.

This PR implements similar functionality in the Trainer by adding a prompt parameter that can take three forms (illustrated in the sketch after this list):

  • str: The prompt will be prepended to all sentences in the dataset.
  • dict[str, str]: If the keys are column names, the respective prompt is prepended to every value of that column. If the training data is a dictionary of datasets and the keys are dataset names, the prompt is prepended to all columns of the respective dataset.
  • dict[str, dict[str, str]]: Same as above, but the first-level keys are dataset names and the second-level keys are column names.
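
For illustration, a minimal sketch of the three forms, assuming the prompt parameter name proposed in this PR (a rename to prompts is discussed below) and placeholder model, dataset, and column names:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "anchor": ["what is a sentence embedding?"],
    "positive": ["A dense vector representation of a sentence."],
})

# 1. str: one prompt, prepended to every column of every dataset
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
    prompt="Represent this sentence: ",
)

# 2. dict[str, str] keyed by column name: one prompt per column
prompt = {"anchor": "query: ", "positive": "passage: "}

# 3. dict[str, dict[str, str]]: dataset name -> column name -> prompt
prompt = {"nq": {"anchor": "query: ", "positive": "passage: "}}
```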

As the prompts can be dynamic (changing for each dataset and column), they are injected into the sentences by the get_train_dataloader, get_eval_dataloader, and get_test_dataloader methods, which call add_prompts_to_dataset to resolve, for each dataset and column, which prompt to inject.
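
A rough sketch of what the column-level injection could look like, assuming the per-dataset resolution has already produced a flat column-to-prompt mapping (an illustration, not the PR's exact implementation):

```python
from datasets import Dataset

def add_prompts_to_dataset(dataset: Dataset, column_prompts: dict[str, str]) -> Dataset:
    """Prepend the resolved prompt to every value of each prompted column."""
    def prepend(batch: dict) -> dict:
        for column, prompt in column_prompts.items():
            if column in batch:
                batch[column] = [prompt + text for text in batch[column]]
        return batch
    return dataset.map(prepend, batched=True)
```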

Finally, add_prompts_to_dataset also adds <column_name>_prompt_length columns that, when passed to the Pooling module with include_prompt=False, mask the instructions properly as well. (Currently this is only set explicitly for Instructor models, but the user can enable it by calling model.set_pooling_include_prompt(include_prompt=False).)
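
To illustrate what the prompt lengths enable, here is a simplified mean-pooling sketch (not the actual Pooling code) that zeroes the attention mask over the prompt tokens before averaging:

```python
import torch

def mean_pool_excluding_prompt(
    token_embeddings: torch.Tensor,  # (batch, seq_len, dim)
    attention_mask: torch.Tensor,    # (batch, seq_len)
    prompt_length: torch.Tensor,     # (batch,), from <column_name>_prompt_length
) -> torch.Tensor:
    mask = attention_mask.clone()
    for i, length in enumerate(prompt_length.tolist()):
        mask[i, :length] = 0  # ignore the prompt/instruction tokens
    mask = mask.unsqueeze(-1).to(token_embeddings.dtype)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```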

@ArthurCamara changed the title from "Trainer with prompt masking" to "[feat] Trainer with prompts and prompt masking" on Sep 27, 2024
@tomaarsen force-pushed the trainer-with-prompt-masking branch 2 times, most recently from 354cb65 to bf9eb80 on September 30, 2024 15:36
@tomaarsen (Collaborator)

Hello!

Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here.
I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:

  1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).
  2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:
    • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
    • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there are multiple datasets.

Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56

The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.
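
As a rough sketch of this suggestion (not the actual SentenceTransformerDataCollator), the collator could resolve the prompt per (dataset, column) pair and prepend it while building the batch, with prompts assumed to be the nested dataset-to-column-to-prompt mapping:

```python
def collate_with_prompts(
    features: list[dict], prompts: dict[str, dict[str, str]]
) -> dict[str, list]:
    batch = {}
    # "dataset_name" is assumed to be a per-row column when training on multiple datasets
    dataset_name = features[0].get("dataset_name", "")
    for column in features[0]:
        values = [row[column] for row in features]
        if isinstance(values[0], str) and column != "dataset_name":
            prompt = prompts.get(dataset_name, {}).get(column, "")
            values = [prompt + value for value in values]  # prepend "on the fly"
        batch[column] = values
    return batch
```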

I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in encode etc.

I'm curious to hear your thoughts on this.

  • Tom Aarsen

@ArthurCamara (Contributor, Author)

> Hello!
>
> Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here. I have a few hesitations with the current approach, although I do quite like the idea of being able to specify prompts to use during training apart from manually adding them in your dataset(s). My current hesitations:
>
>   1. Adding a column to the entire dataset before training will be incompatible with datasets IterableDataset (i.e., load_dataset("...", streaming=True)).
>   2. I'm in theory okay with adding a ..._prompt_length column: I recognize that it's crucial to get this information if include_prompt is False in the Pooling. However, I have two notes:
>     • Could we e.g. only add the information if the Pooling module (if it exists) actually has include_prompt=False?
>     • Could we perhaps not add an entire column, but instead create a nested dictionary with dataset names mapping to column names mapping to prompt lengths? Dataset names should be a column if there are multiple datasets.
>
> Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56
>
> The data collator knows the dataset name, the column name (see the snippet), and should then be able to use that information to "on the fly" prepend the prompts. In a perfect world we could even only tokenize the prompts once, but that gets complicated with padding and truncation, so it's better to keep it simpler.

This was one of the things I was considering: changing the Collator instead of the dataset itself. But I previously had issues with Accelerator and DDP when the data was not exclusively tensors (i.e., contained strings), though I think we can work around that within the collator. I will give it a shot and let you know.

> I also like your idea that prompt can be multiple things: a single prompt, a prompt per column, or a prompt per column per dataset. I do think prompts is a bit better though, because that's what we use in encode etc.

Agreed. =)

> I'm curious to hear your thoughts on this.

@ArthurCamara (Contributor, Author) commented Oct 3, 2024

@tomaarsen There you go. I've moved the prompting logic into the data_collator. The prompt_length column will only be added if the Pooling module has include_prompt set to False.

What needed to be kept in the Trainer, however, is the logic that decides whether or not to call add_dataset_name_column. To clean up the code a bit, I've moved all of these checks into the maybe_add_dataset_name_column method (sketched below). Let me know if you think this is a good idea.
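
A hypothetical sketch of such a check (the helper and parameter names are assumptions for illustration, not the PR's exact code): a dataset_name column is only needed when the prompts are keyed per dataset, so the collator can resolve the right prompt for each row.

```python
from datasets import Dataset

def maybe_add_dataset_name_column(
    dataset_dict: dict[str, Dataset], prompt
) -> dict[str, Dataset]:
    # Only add a per-row "dataset_name" column when prompts are keyed by dataset name
    if isinstance(prompt, dict) and set(prompt) & set(dataset_dict):
        return {
            name: dataset.add_column("dataset_name", [name] * len(dataset))
            for name, dataset in dataset_dict.items()
        }
    return dataset_dict
```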

@ShengYun-Peng

Hi @ArthurCamara @tomaarsen, thank you both for the PR, and sorry for the late reply! I have implemented a local trainer that supports the prompt logic, following the Instructor code. It's a bit different from the logic of the current PR, and I would like to hear your thoughts on it.

The idea is to change only the dataset and the Transformer module of the Sentence Transformer model. The dataset part is similar to yours: the prompt/instruction and the user sentence are loaded separately. However, the Transformer tokenizer computes an instruction mask based on the tokenized length of the prompt/instruction, so the input dict contains three items: input_ids, attention_mask, and instruction_mask. After the Transformer finishes the forward pass, I update the attention_mask with the element-wise product of the attention_mask and the instruction_mask, so the instruction is excluded from the final embedding. Of course, if the user wants to include the instruction in the final embedding, we can add a flag in the argument.
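
A minimal sketch of this approach, assuming a BERT-style tokenizer (the function name and model are placeholders): the instruction mask is built at tokenization time and multiplied into the attention mask after the forward pass.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_with_instruction_mask(instruction: str, sentence: str) -> dict:
    encoded = tokenizer(instruction + sentence, return_tensors="pt")
    # Length of the instruction in tokens, excluding special tokens
    instruction_length = len(tokenizer(instruction, add_special_tokens=False)["input_ids"])
    instruction_mask = torch.ones_like(encoded["attention_mask"])
    instruction_mask[:, 1 : 1 + instruction_length] = 0  # keep [CLS], mask the instruction
    encoded["instruction_mask"] = instruction_mask
    return encoded

features = tokenize_with_instruction_mask(
    "Embed the semantic meaning of the sentence: ", "A plane is taking off."
)
# After the Transformer forward pass, exclude the instruction from pooling:
features["attention_mask"] = features["attention_mask"] * features["instruction_mask"]
```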

I'm sure your code will also work. At first glance, it seems the logic above is implemented in the Pooling rather than the Transformer layer of the Sentence Transformer. It would be great if you could share your high-level design and thought process. Thanks!

  • Anthony Peng

@ShengYun-Peng commented Oct 4, 2024

Thus, in my implementation, each sentence in a dataset is a pair of strings. Take STS as an example: one input sample was previously {"sentence1": "A plane is taking off.", "sentence2": "An air plane is taking off.", "score": 1}. Now the input is {"sentence1": ["Embed the semantic meaning of the sentence: ", "A plane is taking off."], "sentence2": ["Embed the semantic meaning of the sentence: ", "An air plane is taking off."], "score": 1}. This breaks model card creation, since a sequence object in the dataset has no .values().
