
QOL: Fail during preprocessing if max sequence lengths are shorter than the prompt template. #3719

Merged
merged 13 commits into master from check_prompt_token_lengths on Oct 27, 2023

Conversation

@justinxzhao (Contributor) commented Oct 11, 2023

During preprocessing, raise an error if the max sequence length the model can support would truncate all of the unique information in a training example.

Cases considered:

  1. input max sequence length < |prompt template|
  2. global max sequence length < |prompt template|

Case 1: input max sequence length < |prompt template|

  • (refactored) During preprocessing, TextFeature.get_feature_meta() performs two main steps.
    • a) vocabulary creation: we tokenize the column of data to determine the vocabulary, take note of the true max sequence length of the column, and set vocabulary-index maps.
    • b) max sequence length reconciliation: we apply a simple policy to reconcile the final max sequence length that should be used.
      • If the max sequence length is explicitly specified, we use the minimum of the true maximum sequence length and the explicitly specified value; if the explicitly specified value is less than the true maximum sequence length, we log a warning.
      • If the max sequence length is not specified, we use the true maximum sequence length as the feature's max sequence length.
  • (new) Compute prompt_template_num_tokens as part of vocabulary creation (step a).
  • (new) Add the model configuration, which contains the prompt template, as a new argument to TextFeature.get_feature_meta().
    • This requires making a change to BaseFeature.get_feature_meta() and all subclasses to accept the new argument.
  • (new) During max sequence length reconciliation (step b), a ValueError is raised if the final max sequence length for the input feature is smaller than the number of tokens in the prompt template (a sketch of this logic follows this list).
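To make the reconciliation policy and the new check concrete, here is a minimal sketch of the step (b) logic; the helper name and signature are illustrative only and are not the exact code in this PR:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def reconcile_max_sequence_length(
    true_max_len: int,
    specified_max_len: Optional[int],
    prompt_template_num_tokens: int,
) -> int:
    """Pick the final max sequence length for a text input feature and validate it
    against the prompt template length (illustrative helper, not Ludwig's API)."""
    if specified_max_len is not None:
        if specified_max_len < true_max_len:
            logger.warning(
                f"max_sequence_length={specified_max_len} is smaller than the longest "
                f"observed sequence ({true_max_len}); longer examples will be truncated."
            )
        final_max_len = min(true_max_len, specified_max_len)
    else:
        final_max_len = true_max_len

    # New check: fail fast if the prompt template alone already exceeds the budget.
    if final_max_len < prompt_template_num_tokens:
        raise ValueError(
            f"The max sequence length ({final_max_len}) is shorter than the prompt "
            f"template ({prompt_template_num_tokens} tokens), so all unique information "
            "in each training example would be truncated away."
        )
    return final_max_len
```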

Case 2: global max sequence length < |prompt template|

  • (new) prompt_template_num_tokens is included in the return value of TextFeature.get_feature_meta() so that it can be referenced in build_dataset().
  • A ValueError is raised if the number of tokens in the prompt template is larger than global_max_sequence_length (see the sketch below).
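The global check behaves roughly like the following sketch (the function name and parameters are illustrative, not the exact build_dataset() internals):

```python
from typing import Optional


def check_global_max_sequence_length(
    global_max_sequence_length: Optional[int],
    prompt_template_num_tokens: int,
) -> None:
    # Illustrative version of the check described above.
    if (
        global_max_sequence_length is not None
        and prompt_template_num_tokens > global_max_sequence_length
    ):
        raise ValueError(
            f"global_max_sequence_length ({global_max_sequence_length}) is smaller than "
            f"the prompt template ({prompt_template_num_tokens} tokens), leaving no room "
            "for the example-specific parts of the input."
        )
```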

@github-actions bot commented Oct 11, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   22m 36s ⏱️ + 1m 52s
12 tests ±0    9 ✔️ ±0    3 💤 ±0  0 ±0 
60 runs  ±0  42 ✔️ ±0  18 💤 ±0  0 ±0 

Results for commit 33c0dfc. ± Comparison against base commit f98b8f6.

♻️ This comment has been updated with latest results.

@justinxzhao marked this pull request as ready for review October 12, 2023 13:54

On the docstring for prompt_template:

    Args:
        prompt_template: The prompt template for the model. Applicable only to LLMs.
Contributor

@justinxzhao One thing to call out is that prompt_template isn't actually only applicable to LLMs - it is also applicable to ECD text features as part of preprocessing. As a result, I think there are two places you want to check for the existence of the prompt template:

  1. If prompt is specified at the top level of the ModelConfig object, then use it from there -> LLM model type
  2. If prompt is specified at the feature-specific preprocessing level, then use it from there -> Applies to both LLM and ECD model types.

I think the part that's missing today in case 2 is that we don't have an existing check that raises a warning for LLM model types when the prompt is specified both at the top level of the config and at the feature-specific level under preprocessing. That could be worth adding as part of this PR since it should be quick and won't balloon the scope too much.
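For illustration, the lookup could go something like this sketch (the config paths and helper name here are assumptions for the sake of the example, not Ludwig's real schema):

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def resolve_prompt_template(config: Dict[str, Any], text_feature: Dict[str, Any]) -> Optional[str]:
    # Hypothetical helper: the exact config keys are assumed, not Ludwig's actual schema.
    top_level = (config.get("prompt") or {}).get("template")  # LLM model type
    feature_level = ((text_feature.get("preprocessing") or {}).get("prompt") or {}).get("template")

    # The missing check described above: warn when an LLM config sets the template
    # in both places, and prefer the top-level template in that case.
    if config.get("model_type") == "llm" and top_level and feature_level:
        logger.warning(
            "Prompt template set at both the top level and the feature level; using the top-level one."
        )

    return top_level or feature_level
```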

Contributor Author

Personally, I would prefer to keep this PR limited to checking only the top-level prompt template for LLM model types and leave all handling/checking of prompt templates for ECD to a separate PR, since there are other complexities that would be good to align on beforehand.

Contributor

Sure! Let's just maybe throw in a to-do for it somewhere in the comments

@arnavgarg1 (Contributor) left a comment

Generally LGTM, left a few comments that may warrant a discussion and a couple of minor nits - happy to chat about any of them.

The main ones are:

  1. How we're calculating the number of tokens in the prompt: I feel it should be the number of tokens in the prompt template without the placeholder tokens, as opposed to the current implementation, which includes the placeholder tokens as part of the template's token count (sketched below).
  2. Accounting for the fact that prompt templates also live in ECD models under text feature preprocessing
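Something like the following sketch captures what I mean by excluding the placeholders from the count; the regex and tokenizer interface (any tokenizer exposing a tokenize() method) are just assumptions for illustration:

```python
import re

# Curly-brace placeholder slots like {input} that get filled with example-specific content.
PLACEHOLDER_PATTERN = re.compile(r"\{\w+\}")


def count_template_tokens(template: str, tokenizer) -> int:
    # Strip the placeholder slots first so only the fixed template text is counted.
    template_without_placeholders = PLACEHOLDER_PATTERN.sub("", template)
    return len(tokenizer.tokenize(template_without_placeholders))
```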

@arnavgarg1 (Contributor) left a comment

LGTM! 🚢

@justinxzhao merged commit a0c42d8 into master Oct 27, 2023
18 checks passed
@justinxzhao deleted the check_prompt_token_lengths branch October 27, 2023 22:04