adding tokenizer configuration #19

Merged: 6 commits, Oct 8, 2024

@jwnz (Contributor) commented Sep 23, 2024

Description

This PR adds better support for configuring the tokenizer (#5).

Todo

  • Allow selection of a HuggingFace tokenizer, downloading the model from the hub
  • Jieba tokenizer (Chinese)
  • tiktoken
  • tiniestsegmenter (Japanese) [optional]
  • Allow switching of HF tokenizers
  • Add tests for tokenizing

Notes

1. Currently, once a HuggingFace tokenizer has been initialized, it cannot be changed. For example, running SELECT tokenize('i have an apple', 'hf', 'sentence-transformers/LaBSE'); after SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base'); would still use the t5-base tokenizer, because it has already been initialized. This needs a better way of handling.

2. There is no official Rust crate for tiktoken, so this PR uses tiktoken-rs.
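
The behaviour described in note 1 is what one would expect if the tokenizer lives in a single slot that is filled on first use. One possible way around it, offered as a sketch and not as what this PR implements, is to cache loaded tokenizers by model name. `LoadedTokenizer`, `TokenizerCache`, and `get_or_load` are hypothetical names, and the loader closure stands in for whatever actually downloads the model from the hub.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a loaded tokenizer. A real implementation
/// would hold the HuggingFace tokenizer object here.
#[derive(Debug)]
pub struct LoadedTokenizer {
    pub model: String,
}

/// Cache keyed by model name, so requesting a second model loads it
/// instead of reusing whichever model happened to be loaded first.
pub struct TokenizerCache {
    loaded: Mutex<HashMap<String, Arc<LoadedTokenizer>>>,
}

impl TokenizerCache {
    pub fn new() -> Self {
        Self {
            loaded: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the tokenizer for `model`, calling `load` only on a cache miss.
    pub fn get_or_load<F>(&self, model: &str, load: F) -> Arc<LoadedTokenizer>
    where
        F: FnOnce(&str) -> LoadedTokenizer,
    {
        let mut loaded = self.loaded.lock().unwrap();
        if let Some(existing) = loaded.get(model) {
            return Arc::clone(existing);
        }
        let tokenizer = Arc::new(load(model));
        loaded.insert(model.to_string(), Arc::clone(&tokenizer));
        tokenizer
    }
}
```

With a cache like this, each model name resolves to its own tokenizer, and repeated calls for the same model reuse the instance that is already loaded. The trade-off is memory: every model requested during the session stays resident unless an eviction policy is added.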

Usage

SELECT tokenize('i have an apple', 'hf', 'google-bert/bert-base-uncased');
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
SELECT tokenize('i have an apple', 'tiktoken', 'gpt2');
SELECT tokenize('i have an apple', 'tiniestsegmenter', '');
SELECT tokenize('i have an apple', 'jieba', '');
SELECT tokenize('i have an apple', 'ws', '');
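
As a rough illustration of how the second and third arguments above might be interpreted, the sketch below maps them to a tokenizer kind. It is not taken from the PR's code: `TokenizerKind`, `parse_kind`, and `tokenize_whitespace` are hypothetical names, and reading 'ws' as plain whitespace splitting is an assumption based on the name alone.

```rust
/// Tokenizer kinds named in the usage examples. The enum is hypothetical.
#[derive(Debug, PartialEq)]
pub enum TokenizerKind {
    HuggingFace(String),
    Tiktoken(String),
    TiniestSegmenter,
    Jieba,
    Whitespace,
}

/// Maps the `kind` and `model` arguments to a tokenizer kind.
/// The model argument is ignored for kinds that do not need one.
pub fn parse_kind(kind: &str, model: &str) -> Result<TokenizerKind, String> {
    match kind {
        "hf" => Ok(TokenizerKind::HuggingFace(model.to_string())),
        "tiktoken" => Ok(TokenizerKind::Tiktoken(model.to_string())),
        "tiniestsegmenter" => Ok(TokenizerKind::TiniestSegmenter),
        "jieba" => Ok(TokenizerKind::Jieba),
        "ws" => Ok(TokenizerKind::Whitespace),
        other => Err(format!("unknown tokenizer: {other}")),
    }
}

/// Assumed behaviour of 'ws': split on runs of whitespace.
pub fn tokenize_whitespace(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}
```

Returning an error for an unrecognised kind, as opposed to falling back to a default tokenizer, keeps a typo in the SQL call from silently changing the tokenization.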

@VoVAllen (Member) commented:

We can use https://github.com/zurawiki/tiktoken-rs since it's just a wrapper. For more tokenizer support, @usamoi, do you have any suggestions?

@jwnz marked this pull request as ready for review September 26, 2024 16:21

@VoVAllen (Member) commented:

Can you fix the DCO check by following the instructions?

@jwnz force-pushed the add-tokenizer-configuration branch from 4244656 to a6c9462 September 29, 2024 13:26

@jwnz (Contributor, Author) commented Sep 29, 2024

Signed off the commits. Apologies for the inconvenience.

@RobertHH-IS commented:

Will this be added? Being able to adjust the tokenizer is critical for multilingual support.

@jwnz changed the title from "[WIP] adding tokenizer configuration" to "adding tokenizer configuration" Oct 1, 2024

@VoVAllen (Member) commented Oct 2, 2024

We'll definitely merge this PR. However, the maintainer is on vacation now and will be back on 10/8.

@usamoi merged commit 9a7fcae into tensorchord:main Oct 8, 2024
7 checks passed