## Dependencies

This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
## Issues

Closes #1251
## Description

This PR adds partial support for models that use the `tokenizers` library (as opposed to `tiktoken` or `sentencepiece`) for tokenization. This PR only addresses support in the `python` runner, and it does so by creating a new class in the `tokenizer` module that simply wraps `tokenizers`.
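As a sketch of the wrapping approach, a minimal adapter around the `tokenizers` API might look like the following. The class name, method signatures, and the tiny in-memory vocab are hypothetical, for illustration only, and are not the actual code in this PR:

```python
# Hypothetical sketch of wrapping the HuggingFace `tokenizers` library
# behind a minimal encode/decode interface; the class name and method
# signatures are illustrative, not the actual code added in this PR.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace


class TokenizersWrapper:
    """Thin adapter that delegates to a `tokenizers.Tokenizer`."""

    def __init__(self, hf_tokenizer: Tokenizer):
        self._t = hf_tokenizer

    @classmethod
    def from_file(cls, path: str) -> "TokenizersWrapper":
        # Load a standard tokenizer.json produced by `tokenizers`.
        return cls(Tokenizer.from_file(path))

    def encode(self, text: str) -> list[int]:
        return self._t.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._t.decode(ids)


# Tiny in-memory tokenizer so the example is self-contained.
hf = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
hf.pre_tokenizer = Whitespace()

wrapper = TokenizersWrapper(hf)
ids = wrapper.encode("hello world")
```

The appeal of this shape is that all of the pre-tokenization and special-token logic stays inside `tokenizers`; the cost, as discussed below, is that none of it is available to the exported/`c++` paths.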
## Discussion

I'm not sure this is the correct direction to go for solving this, since the `tokenizers` library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending tokenizer support beyond simply wrapping `tokenizers`:

### Pre-tokenizers
For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me if it's possible to merge this kind of multi-pass splitting into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in `tiktoken`.

From my investigation, I think there are a few candidate paths forward:

- A `c++` implementation of the various tokenization routines from `tokenizers` in a separate implementation of the `Tokenizer` class.
- Extending the `c++` `TikToken` class to support multiple regexes in the pre-tokenizer.
- Storing the pre-tokenizer configuration in the `tokenizer.model` artifact, or somehow making these tokenizer arguments an argument at instantiation time.

NOTE: The corresponding tokenization in `llama.cpp` lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's `GGUF` file directly, so at load time, the correct tokenizer is found based on that value.
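As a toy illustration of the multi-pass splitting problem described above, the patterns here are hypothetical stand-ins (real pre-tokenizers use far more elaborate Unicode-aware expressions than stdlib `re` supports), but they show why a sequence of regexes is not trivially one regex:

```python
import re

# Hypothetical two-pass pre-tokenizer: each regex further splits the
# pieces produced by the previous pass.
PASSES = [
    re.compile(r"(\d+)"),  # pass 1: isolate runs of digits
    re.compile(r"(\s+)"),  # pass 2: split the remainder on whitespace
]

def pre_tokenize(text: str) -> list[str]:
    pieces = [text]
    for pattern in PASSES:
        next_pieces: list[str] = []
        for piece in pieces:
            # re.split with a capturing group keeps the delimiters as pieces.
            next_pieces.extend(p for p in pattern.split(piece) if p)
        pieces = next_pieces
    return pieces

print(pre_tokenize("abc123 def"))  # ['abc', '123', ' ', 'def']
```

A `TikToken`-style implementation that accepts a list of patterns could run exactly this loop; collapsing the passes into a single expression would instead require proving the combined pattern produces the same splits for all inputs.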
### Special Tokens

Even for models that use a single regex (and even the `llama` regex), models may use different special tokens for special functionality (chat templates, FIM, tool calling, other custom prompting). In the `tokenizer.model` artifact, only the vocab is stored, so there is not currently any way to note the special tokens in serialization (similar to the need for configuration of pre-tokenizers).
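For illustration, here is a sketch of why special tokens need explicit support in serialization. The token strings and ids are hypothetical; the point is that if the artifact doesn't mark which strings are special, the runtime can't match them as atomic units before ordinary tokenization runs:

```python
import re

# Hypothetical FIM special tokens and ids, for illustration only.
SPECIAL_TOKENS = {"<fim_prefix>": 1, "<fim_suffix>": 2, "<fim_middle>": 3}

# One alternation over all special tokens, longest first, so overlapping
# token strings match greedily as whole units.
_SPECIAL_RE = re.compile(
    "("
    + "|".join(re.escape(t) for t in sorted(SPECIAL_TOKENS, key=len, reverse=True))
    + ")"
)

def split_on_special(text: str) -> list[str]:
    # Special tokens become standalone pieces mapped straight to their ids;
    # everything between them goes through ordinary tokenization.
    return [p for p in _SPECIAL_RE.split(text) if p]

pieces = split_on_special("<fim_prefix>a = <fim_suffix>\n<fim_middle>1")
```

Without the `SPECIAL_TOKENS` table being carried in the serialized artifact, a string like `<fim_prefix>` would simply be tokenized as ordinary text, which is why the special-token set (like the pre-tokenizer config) needs a home in `tokenizer.model` or at instantiation time.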