
Tokenizers tokenizer #1261

Draft · wants to merge 5 commits into base: main

Conversation

@gabe-l-hart (Contributor) commented Oct 3, 2024

Dependencies

This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:

Issues

Closes #1251

Description

This PR adds partial support for models that use the tokenizers library (as opposed to tiktoken or sentencepiece) for tokenization. It only addresses support in the python runner, and it does so by adding a new class to the tokenizer module that simply wraps tokenizers.
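For context, the core of the approach is small; something like this minimal sketch (hypothetical class and method names, not the PR's actual code, and assuming a standard HF tokenizer.json artifact):

```python
from typing import List

from tokenizers import Tokenizer as HFTokenizer


class TokenizersTokenizer:
    """Thin wrapper exposing an HF tokenizers tokenizer behind a
    runner-style encode/decode interface."""

    def __init__(self, tokenizer_path: str):
        # tokenizers loads the standard HF tokenizer.json artifact directly
        self._tokenizer = HFTokenizer.from_file(tokenizer_path)

    def encode(self, text: str) -> List[int]:
        # Skip the tokenizer's own special-token insertion; runners
        # typically add BOS/EOS themselves
        return self._tokenizer.encode(text, add_special_tokens=False).ids

    def decode(self, ids: List[int]) -> str:
        return self._tokenizer.decode(ids)
```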

Discussion

I'm not sure this is the correct direction for solving this, since the tokenizers library is not (to the best of my knowledge) portable to the various export formats yet. There are two main challenges to extending tokenizer support beyond simply wrapping tokenizers:

Pre-tokenizers

For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, I can't immediately tell whether this kind of multi-pass splitting can be merged into a single regex. For other tokenizers, a single regex is used, but it is a different expression from any of those currently implemented in tiktoken.
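To make the multi-pass idea concrete, here is a toy sketch (made-up patterns, not any real model's pre-tokenizer) of regexes applied in sequence, each pass re-splitting the pieces produced by the previous one:

```python
import re
from typing import List


def multi_pass_split(text: str, patterns: List[str]) -> List[str]:
    """Apply each regex in turn, re-splitting the previous pass's pieces."""
    pieces = [text]
    for pattern in patterns:
        next_pieces = []
        for piece in pieces:
            # findall keeps every matched span; pre-tokenizer patterns are
            # written so that the matches cover the whole string
            next_pieces.extend(re.findall(pattern, piece))
        pieces = next_pieces
    return pieces


# First split whitespace from non-whitespace, then split digits from
# non-digits within each piece:
print(multi_pass_split("hello  world42", [r"\S+|\s+", r"\d+|\D+"]))
# ['hello', '  ', 'world', '42']
```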

From my investigation, I think there are a few candidate paths forward:

  1. Provide a c++ implementation of the various tokenization routines from tokenizers in a separate implementation of the Tokenizer class.
  2. Extend the existing c++ TikToken class to support multiple regexes in the pre-tokenizer.
    • This would also require making the set of patterns configurable, either by serializing them into the tokenizer.model artifact or by passing them as arguments at instantiation time.

NOTE: The corresponding tokenization in llama.cpp lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's GGUF file directly, so at load time, the correct tokenizer is found based on that value.
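A rough sketch of that dispatch pattern (the registry entries below are illustrative placeholders, not llama.cpp's actual tables):

```python
from typing import Dict, List

# Maps a tokenizer-type string (stored in the model artifact; GGUF
# metadata in llama.cpp's case) to a known set of pre-tokenizer regexes.
# The entries here are placeholders, not real model patterns.
PRE_TOKENIZER_REGISTRY: Dict[str, List[str]] = {
    "single-pass-example": [r"\S+|\s+"],
    "two-pass-example": [r"\S+|\s+", r"\d+|\D+"],
}


def get_pre_tokenizer_patterns(tokenizer_type: str) -> List[str]:
    # tokenizer_type is read from the serialized model config at load time
    if tokenizer_type not in PRE_TOKENIZER_REGISTRY:
        raise ValueError(f"Unknown pre-tokenizer type: {tokenizer_type}")
    return PRE_TOKENIZER_REGISTRY[tokenizer_type]
```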

Special Tokens

Even models that use a single regex (even the llama regex) may use different special tokens for special functionality (chat templates, FIM, tool calling, other custom prompting). Since only the vocab is stored in tokenizer.model, there is currently no way to record the special tokens in the serialized artifact (similar to the need for pre-tokenizer configuration).
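One illustrative shape for serializing this metadata alongside the vocab (field names and token strings are hypothetical examples, not a proposed format):

```python
import json

# Hypothetical companion config; tokenizer.model itself has no slot
# for any of this today.
tokenizer_config = {
    "pre_tokenizer_type": "two-pass-example",  # ties into the registry sketch above
    "special_tokens": {
        # example values only, not any particular model's tokens
        "bos": "<|begin_of_text|>",
        "eos": "<|end_of_text|>",
        "fim_prefix": "<fim_prefix>",
    },
}

with open("tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f, indent=2)
```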


pytorch-bot bot commented Oct 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1261

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 3, 2024
@gabe-l-hart gabe-l-hart force-pushed the TokenizersTokenizer-1251 branch 6 times, most recently from 7b580df to f2cba4c on October 9, 2024
…support

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…tokenizers

This allows for all HF tokenizers to be supported in the python layer. It
will need significant work to offer similar compatibility at the c++ layer.

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…kenizer

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>