adding tokenizer configuration #19
We can use https://github.com/zurawiki/tiktoken-rs since it's just a wrapper. For more tokenizer support, @usamoi do you have any suggestions?
Can you fix the DCO check by following the instructions?
Signed-off-by: jwnz <[email protected]>
Force-pushed from 4244656 to a6c9462.
Signed off the commits. Apologies for the inconvenience.
Will this be added? Being able to adjust the tokenizer is critical for multilingual support.
We'll definitely merge this PR. However, the maintainer is on vacation now and will be back on 10/8.
Description
This PR aims to add some better functionality for configuration of the tokenizer (#5).
Todo
Notes
1. Currently, once a HuggingFace tokenizer has been initialized, it cannot be changed. For example, running
SELECT tokenize('i have an apple', 'hf', 'sentence-transformers/LaBSE');
and then
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
would still use the LaBSE tokenizer for the second call, because it has already been initialized. We need a better way to handle this.
2. There is no official Rust crate for tiktoken, so this PR uses tiktoken-rs.
Usage
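Regarding note 1 above: one possible direction, offered only as a sketch and not as what this PR implements, is to cache tokenizers by model name instead of keeping a single global instance. The names `LoadedTokenizer`, `TokenizerCache`, and `get_or_load` below are hypothetical, and the loader closure stands in for whatever call actually builds the HuggingFace tokenizer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a loaded tokenizer. A real one would hold the
/// vocabulary and model-specific state for `model`.
#[derive(Debug)]
pub struct LoadedTokenizer {
    pub model: String,
}

/// Cache keyed by model name, so that a second model name does not silently
/// reuse whichever tokenizer happened to be initialized first.
#[derive(Default)]
pub struct TokenizerCache {
    inner: Mutex<HashMap<String, Arc<LoadedTokenizer>>>,
}

impl TokenizerCache {
    /// Returns the cached tokenizer for `model`, calling `load` only on the
    /// first request for that name. The lock is held while loading, which
    /// keeps the sketch simple but serializes concurrent loads.
    pub fn get_or_load<F>(&self, model: &str, load: F) -> Arc<LoadedTokenizer>
    where
        F: FnOnce(&str) -> LoadedTokenizer,
    {
        let mut map = self.inner.lock().unwrap();
        if let Some(tok) = map.get(model) {
            return Arc::clone(tok);
        }
        let tok = Arc::new(load(model));
        map.insert(model.to_string(), Arc::clone(&tok));
        tok
    }
}
```

With a cache like this, the two `tokenize` calls from note 1 would each resolve to their own tokenizer, and repeated calls with the same model name would reuse the already loaded instance. An unbounded map can grow with every distinct model name, so a real implementation might want an eviction policy.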