[feat] Trainer with prompts and prompt masking #2964
base: master
Conversation
Force-pushed from 354cb65 to bf9eb80
Hello! Thanks for this PR. I rebased it to get rid of the leftover commits that aren't necessary here.
Could we perhaps add the prompts (and prompt lengths) in the data collator? E.g. right here: https://github.com/ArthurCamara/sentence-transformers/blob/bf9eb803ce2dda26a8ef903c33d80cd1fcb55a3d/sentence_transformers/data_collator.py#L50-L56 The data collator knows the dataset name and the column name (see the snippet), and should then be able to use that information to prepend the prompts "on the fly". In a perfect world we could even tokenize the prompts only once, but that gets complicated with padding and truncation, so it's better to keep it simpler. I also like your idea; I'm curious to hear your thoughts on this.
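A rough sketch of what prepending prompts inside the collator could look like. This is not the actual `SentenceTransformerDataCollator`; the `tokenize_fn` and `prompts` names are assumptions for illustration only:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PromptedCollator:
    # Stand-in for the model's tokenization step (an assumption, not the real API)
    tokenize_fn: Callable
    # Maps column name -> prompt string; illustrative only
    prompts: dict = field(default_factory=dict)

    def __call__(self, features: list[dict]) -> dict:
        batch = {}
        for column in features[0]:
            prompt = self.prompts.get(column, "")
            # Prepend the prompt "on the fly", just before tokenization
            texts = [prompt + row[column] for row in features]
            batch[column] = self.tokenize_fn(texts)
        return batch
```

Because the prompt is attached at collation time, the underlying dataset stays untouched and different prompts per column (or per dataset) only require a different `prompts` mapping.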
This was one of the things I was considering: changing the collator instead of the dataset itself. But I previously had issues with Accelerate and DDP when the data was not exclusively tensors (i.e., contained strings). I think we can work around that within the collator, though. I'll give it a shot and let you know.
Agreed. =)
Force-pushed from 86dd847 to bf9eb80
@tomaarsen There you go. I've changed the prompting logic to the collator, as suggested.
Hi @ArthurCamara @tomaarsen, thank you both for the PR, and sorry for the late reply!
I have implemented a local trainer that supports the prompt logic following the Instructor code. It's a bit different from the logic of the current PR, and I would like to hear your thoughts on it.
The idea is to change only the dataset and the Transformer of the Sentence Transformer model. The dataset part is similar to yours: the prompt/instruction and the user sentence are loaded separately. However, the Transformer tokenizer computes an instruction mask based on the tokenized length of the prompt/instruction, so the input dict contains three items: input_ids, attention_mask, and instruction_mask. After the Transformer finishes the forward pass, I update the attention_mask with the element-wise product of the attention_mask and the instruction_mask, so the instruction is excluded from the final embedding. Of course, if the user wants to include the instruction in the final embedding, we can add a flag in the arguments.
I'm sure your code will also work. From a first glance, it seems like the above logic is implemented in the Pooling rather than the Transformer layer of the Sentence Transformer. It would be great if you could share your high-level design and thought process. Thanks!
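A minimal sketch (plain Python, no framework, scalar "embeddings" for brevity) of the instruction-mask idea described above: the prompt tokens stay visible to the Transformer's attention, but the pooling mask is the element-wise product of the attention mask and the instruction mask, so prompt tokens are excluded from the final embedding. All names here are illustrative, not the actual implementation:

```python
def pool_excluding_instruction(token_embeddings, attention_mask, instruction_mask):
    # A token contributes only if it is a real (non-padding) token AND
    # not part of the instruction: element-wise product of the two masks.
    pooling_mask = [a * i for a, i in zip(attention_mask, instruction_mask)]
    kept = [e for e, m in zip(token_embeddings, pooling_mask) if m]
    # Mean pooling over the kept tokens only
    return sum(kept) / max(len(kept), 1)
```

With real tensors the same idea is `attention_mask * instruction_mask` applied before masked mean pooling; the toy version above just makes the masking arithmetic explicit.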
Thus, in my implementation, each sentence in a dataset is a pair of strings. Taking STS as an example, one input sample was […]
Pull Request overview
- Adds a prompt parameter to the Trainer class
- Masks the prompts in Pooling when training.

Details
Currently, the `encode` method of `SentenceTransformer` supports adding prompts (or instructions) dynamically to the sentences by passing either `prompt` or `prompt_name`. However, this is not supported when training, as mentioned in #2945, since training uses the `forward` method instead.

This PR implements similar functionality in the `Trainer`, by adding a `prompt` parameter that can be:
- `str`: The prompt will be prepended to all sentences in the dataset.
- `dict[str, str]`: If the keys are column names, the prompt is prepended to the respective column. If the training dataset is a dictionary of datasets and the dictionary keys are names of the datasets, the prompt is prepended to all columns of the respective dataset.
- `dict[str, dict[str, str]]`: Same as above, but assumes the first level is the dataset name and the second level is the column names.

As the prompts can be dynamic (changing for each dataset and column), they are injected into the sentences by the `get_train|eval|test_dataloader` methods, which call `add_prompts_to_dataset` to resolve, for each dataset and column, which prompt to inject.

Finally, `add_prompts_to_dataset` also adds `<column_name>_prompt_length` columns that, when passed to the `Pooling` module with `include_prompt=False`, will mask the instructions properly as well. (Currently this is only explicit for `Instructor` models, but it can be set by the user by calling `model.set_pooling_include_prompt(include_prompt=False)`.)
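A small sketch of how a `<column_name>_prompt_length` column could drive the masking when `include_prompt=False`, per the description above. The function name and the simple "zero the first N positions" rule are assumptions for illustration (the real `Pooling` module may, for instance, keep special tokens):

```python
def build_pooling_mask(attention_mask, prompt_length):
    # Zero out the first `prompt_length` token positions so the prompt
    # does not contribute to the pooled embedding; keep the original
    # attention mask bits for the remaining (non-prompt) tokens.
    return [0 if idx < prompt_length else bit
            for idx, bit in enumerate(attention_mask)]
```

Since the prompt length is computed per column when the prompts are injected, each column's embedding can be pooled over only its own non-prompt tokens.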