QOL: Fail during preprocessing if max sequence lengths are shorter than the prompt template. #3719
Conversation
Args:
    prompt_template: The prompt template for the model. Applicable only to LLMs.
@justinxzhao One thing to call out is that prompt_template isn't actually only applicable to LLMs - it is also applicable to ECD text features as part of preprocessing. As a result, I think there are two places you want to check for the existence of the prompt template:
- If prompt is specified at the top level of the ModelConfig object, then use it from there -> LLM model type.
- If prompt is specified at the feature-specific preprocessing level, then use it from there -> applies to both LLM and ECD model types.
I think the part that's missing today in case 2 is that we don't have an existing check to raise a warning for LLM model types if the prompt is specified both at the top-level config and at the feature-specific level under preprocessing, so that could be worth adding as part of this PR since it should be quick and won't balloon the scope too much.
Personally, I would prefer to keep this PR limited to only checking the top-level prompt template for LLM model types and leave all handling/checking of prompt templates for ECD to a separate PR, since there are other complexities that would be good to align on beforehand.
Sure! Let's just throw in a TODO for it somewhere in the comments.
Generally LGTM, left a few comments that may warrant a discussion and a couple of minor nits - happy to chat about any of them.
The main ones are:
- How we're calculating the number of tokens in the prompt: I feel it should be the number of tokens in the prompt template without the placeholder tokens, as opposed to the current implementation, which includes the placeholder tokens in the template's token count.
- Accounting for the fact that prompt templates also live in ECD models under text feature preprocessing.
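For concreteness, a minimal sketch of the suggested counting approach, assuming a Hugging Face-style tokenizer with an encode() method; the placeholder regex and function name are illustrative, not the PR's actual implementation:

```python
import re


def count_template_tokens(template: str, tokenizer) -> int:
    """Count tokens contributed by the template text itself, excluding {placeholder} slots.

    Placeholders such as "{__sample__}" are stripped before tokenizing, since their
    content is filled in per example and shouldn't count against the fixed template budget.
    """
    template_without_placeholders = re.sub(r"\{.*?\}", "", template)
    return len(tokenizer.encode(template_without_placeholders))
```

Stripping the placeholders first means the count reflects only the fixed template text, so per-example content isn't double-counted against the sequence-length budget.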
LGTM! 🚢
During preprocessing, raise an error if the max sequence length the model can support would truncate all unique info in the training example.
Cases considered:
Case 1: input max sequence length < |prompt template|
- TextFeature.get_feature_meta() performs two main steps.
- prompt_template_num_tokens is computed as part of vocabulary creation (step a).
- The prompt template is passed through to TextFeature.get_feature_meta(), which required changing BaseFeature.get_feature_meta() and all subclasses to accept the new argument.
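A rough sketch of the Case 1 check, with illustrative names (the exact signature and message in the PR may differ):

```python
def check_prompt_fits_feature_max_sequence_length(
    feature_name: str,
    max_sequence_length: int,
    prompt_template_num_tokens: int,
) -> None:
    """Illustrative Case 1 check: fail fast if the feature's max sequence length
    cannot hold the prompt template, since every example would then be truncated
    down to the template with no unique content left."""
    if prompt_template_num_tokens > max_sequence_length:
        raise ValueError(
            f"The prompt template uses {prompt_template_num_tokens} tokens, but "
            f"max_sequence_length for feature '{feature_name}' is {max_sequence_length}. "
            "Increase max_sequence_length or shorten the prompt template."
        )
```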
Case 2: global max sequence length < |prompt template|
- prompt_template_num_tokens is included in the return of TextFeature.get_feature_meta() so that it can be referenced in build_dataset().
- A ValueError is raised if the number of tokens in the prompt template is larger than the global_max_sequence_length.
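And a similar sketch of the Case 2 check as it might look inside build_dataset(), assuming the feature metadata dict carries the new prompt_template_num_tokens entry; names are illustrative:

```python
def check_prompt_fits_global_max_sequence_length(
    feature_metadata: dict,
    global_max_sequence_length: int,
) -> None:
    """Illustrative Case 2 check: compare the template size recorded in the feature
    metadata (returned by TextFeature.get_feature_meta()) against the global cap."""
    prompt_template_num_tokens = feature_metadata.get("prompt_template_num_tokens", 0)
    if prompt_template_num_tokens > global_max_sequence_length:
        raise ValueError(
            f"The prompt template uses {prompt_template_num_tokens} tokens, which is "
            f"larger than global_max_sequence_length ({global_max_sequence_length}), "
            "so all unique information in each training example would be truncated away."
        )
```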