
QOL: Fail during preprocessing if max sequence lengths are shorter than the prompt template. #3719

Merged
merged 13 commits into master from check_prompt_token_lengths on Oct 27, 2023

Conversation

@justinxzhao (Contributor) commented Oct 11, 2023

During preprocessing, raise an error if the max sequence length the model can support would truncate all of the unique information in a training example.

Cases considered:

  1. input max sequence length < |prompt template|
  2. global max sequence length < |prompt template|

Case 1: input max sequence length < |prompt template|

  • (refactored) During preprocessing, TextFeature.get_feature_meta() performs two main steps.
    • a) vocabulary creation: we tokenize the column of data to determine the vocabulary, take note of the true max sequence length of the column, and set vocabulary-index maps.
    • b) max sequence length reconciliation: we apply a simple policy to reconcile the final max sequence length that should be used.
      • If the max sequence length is explicitly specified, we use the minimum of the true maximum sequence length and the explicitly specified value; if the explicitly specified value is less than the true maximum sequence length, we log a warning.
      • If the max sequence length is not specified, we use the true maximum sequence length as the feature's max sequence length.
  • (new) Compute prompt_template_num_tokens as part of vocabulary creation (step a).
  • (new) Add the model configuration, which contains the prompt template, as a new argument to TextFeature.get_feature_meta().
    • This requires making a change to BaseFeature.get_feature_meta() and all subclasses to accept the new argument.
  • (new) During max sequence length reconciliation (step b), a ValueError is raised if the final max sequence length for the input feature is smaller than the number of tokens in the prompt template (a sketch of this logic follows this list).
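To make the reconciliation policy and the new check concrete, here is a minimal sketch of the step (b) logic; the helper name and signature are illustrative only and are not the exact code in this PR:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def reconcile_max_sequence_length(
    true_max_len: int,
    specified_max_len: Optional[int],
    prompt_template_num_tokens: int,
) -> int:
    """Pick the final max sequence length for a text input feature and validate it
    against the prompt template length (illustrative helper, not Ludwig's API)."""
    if specified_max_len is not None:
        if specified_max_len < true_max_len:
            logger.warning(
                f"max_sequence_length={specified_max_len} is smaller than the longest "
                f"observed sequence ({true_max_len}); longer examples will be truncated."
            )
        final_max_len = min(true_max_len, specified_max_len)
    else:
        final_max_len = true_max_len

    # New check: fail fast if the prompt template alone already exceeds the budget.
    if final_max_len < prompt_template_num_tokens:
        raise ValueError(
            f"The max sequence length ({final_max_len}) is shorter than the prompt "
            f"template ({prompt_template_num_tokens} tokens), so all unique information "
            "in each training example would be truncated away."
        )
    return final_max_len
```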

Case 2: global max sequence length < |prompt template|

  • (new) prompt_template_num_tokens is included in the return value of TextFeature.get_feature_meta() so that it can be referenced in build_dataset().
  • A ValueError is raised if the number of tokens in the prompt template is larger than global_max_sequence_length (see the sketch below).
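The global check behaves roughly like the following sketch (the function name and parameters are illustrative, not the exact build_dataset() internals):

```python
from typing import Optional


def check_global_max_sequence_length(
    global_max_sequence_length: Optional[int],
    prompt_template_num_tokens: int,
) -> None:
    # Illustrative version of the check described above.
    if (
        global_max_sequence_length is not None
        and prompt_template_num_tokens > global_max_sequence_length
    ):
        raise ValueError(
            f"global_max_sequence_length ({global_max_sequence_length}) is smaller than "
            f"the prompt template ({prompt_template_num_tokens} tokens), leaving no room "
            "for the example-specific parts of the input."
        )
```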

@github-actions bot commented Oct 11, 2023

Unit Test Results

  6 files  ±0    6 suites  ±0   22m 36s ⏱️ + 1m 52s
12 tests ±0    9 ✔️ ±0    3 💤 ±0  0 ±0 
60 runs  ±0  42 ✔️ ±0  18 💤 ±0  0 ±0 

Results for commit 33c0dfc. ± Comparison against base commit f98b8f6.

♻️ This comment has been updated with latest results.

@justinxzhao marked this pull request as ready for review October 12, 2023 13:54

On the docstring for prompt_template:

    Args:
        prompt_template: The prompt template for the model. Applicable only to LLMs.
Contributor

@justinxzhao One thing to call out is that prompt_template isn't actually only applicable to LLMs - it is also applicable to ECD text features as part of preprocessing. As a result, I think there are two places you want to check for the existence of the prompt template:

  1. If prompt is specified at the top level of the ModelConfig object, then use it from there -> LLM model type
  2. If prompt is specified at the feature-specific preprocessing level, then use it from there -> Applies to both LLM and ECD model types.

I think the part that's missing today in case 2 is that we don't have an existing check that raises a warning for LLM model types when the prompt is specified both at the top level of the config and at the feature-specific level under preprocessing. That could be worth adding as part of this PR since it should be quick and won't balloon the scope too much.
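For illustration, the lookup could go something like this sketch (the config paths and helper name here are assumptions for the sake of the example, not Ludwig's real schema):

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def resolve_prompt_template(config: Dict[str, Any], text_feature: Dict[str, Any]) -> Optional[str]:
    # Hypothetical helper: the exact config keys are assumed, not Ludwig's actual schema.
    top_level = (config.get("prompt") or {}).get("template")  # LLM model type
    feature_level = ((text_feature.get("preprocessing") or {}).get("prompt") or {}).get("template")

    # The missing check described above: warn when an LLM config sets the template
    # in both places, and prefer the top-level template in that case.
    if config.get("model_type") == "llm" and top_level and feature_level:
        logger.warning(
            "Prompt template set at both the top level and the feature level; using the top-level one."
        )

    return top_level or feature_level
```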

Contributor Author

Personally, I would prefer to keep this PR limited to checking only the top-level prompt template for LLM model types and leave all handling/checking of prompt templates for ECD to a separate PR, since there are other complexities that would be good to align on beforehand.

Contributor

Sure! Let's just maybe throw in a to-do for it somewhere in the comments

@arnavgarg1 (Contributor) left a comment

Generally LGTM, left a few comments that may warrant a discussion and a couple of minor nits - happy to chat about any of them.

The main ones are:

  1. How we're calculating the number of tokens in the prompt: I feel it should be the number of tokens in the prompt template without the placeholder tokens, as opposed to the current implementation, which includes the placeholder tokens as part of the template's token count (sketched below).
  2. Accounting for the fact that prompt templates also live in ECD models under text feature preprocessing
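Something like the following sketch captures what I mean by excluding the placeholders from the count; the regex and tokenizer interface (any tokenizer exposing a tokenize() method) are just assumptions for illustration:

```python
import re

# Curly-brace placeholder slots like {input} that get filled with example-specific content.
PLACEHOLDER_PATTERN = re.compile(r"\{\w+\}")


def count_template_tokens(template: str, tokenizer) -> int:
    # Strip the placeholder slots first so only the fixed template text is counted.
    template_without_placeholders = PLACEHOLDER_PATTERN.sub("", template)
    return len(tokenizer.tokenize(template_without_placeholders))
```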

@arnavgarg1 (Contributor) left a comment

LGTM! 🚢

@justinxzhao merged commit a0c42d8 into master Oct 27, 2023
18 checks passed
@justinxzhao deleted the check_prompt_token_lengths branch October 27, 2023 22:04