adding tokenizer configuration #19

jwnz · 2024-09-23T15:50:19Z

Description

This PR adds more flexible configuration of the tokenizer (#5).

Todo

Allow selection of a Hugging Face tokenizer; the model is downloaded from the Hub.
Jieba tokenizer (Chinese)
tiktoken
tiniestsegmenter (Japanese) [optional]
Allow switching between HF tokenizers
Add tests for tokenization

Notes

1. Currently, once a Hugging Face tokenizer has been initialized, it cannot be changed. For example, running SELECT tokenize('i have an apple', 'hf', 'sentence-transformers/LaBSE'); after SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base'); would still use the t5-base tokenizer, because it has already been initialized. We need a better way to handle this.
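
One possible direction for the problem in note 1, sketched under assumptions and not taken from this PR's code: keep a cache keyed by model name instead of a single global tokenizer, so a second model name triggers a second load rather than reusing the first. `TokenizerCache`, `LoadedTokenizer`, and `get_or_load` are hypothetical names, and the actual download from the Hub is replaced by a placeholder.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a loaded tokenizer. A real implementation
/// would hold the tokenizer object itself; here we only record the name.
#[derive(Debug)]
pub struct LoadedTokenizer {
    pub model: String,
}

/// Hypothetical cache: one tokenizer per model name instead of one global.
#[derive(Default)]
pub struct TokenizerCache {
    entries: Mutex<HashMap<String, Arc<LoadedTokenizer>>>,
    loads: Mutex<usize>,
}

impl TokenizerCache {
    /// Return the cached tokenizer for `model`, loading it on first use.
    pub fn get_or_load(&self, model: &str) -> Arc<LoadedTokenizer> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(tok) = entries.get(model) {
            return Arc::clone(tok);
        }
        // Placeholder for the expensive step (assumed: fetching the
        // tokenizer files from the Hub and parsing them).
        *self.loads.lock().unwrap() += 1;
        let tok = Arc::new(LoadedTokenizer {
            model: model.to_string(),
        });
        entries.insert(model.to_string(), Arc::clone(&tok));
        tok
    }

    /// Number of times the placeholder load step has run.
    pub fn load_count(&self) -> usize {
        *self.loads.lock().unwrap()
    }
}
```

With this shape, asking for 'google-t5/t5-base' and then 'sentence-transformers/LaBSE' yields two distinct entries, and asking for t5-base again reuses the first one without reloading. The trade-off is memory: every model requested stays resident unless an eviction policy is added.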

~~2. There is no official Rust crate for tiktoken.~~ Used tiktoken-rs

Usage

SELECT tokenize('i have an apple', 'hf', 'google-bert/bert-base-uncased');
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
SELECT tokenize('i have an apple', 'tiktoken', 'gpt2');
SELECT tokenize('i have an apple', 'tiniestsegmenter', '');
SELECT tokenize('i have an apple', 'jieba', '');
SELECT tokenize('i have an apple', 'ws', '');
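
The second argument in the calls above selects the tokenizer backend. As a rough illustration only, and not the PR's actual implementation, the selector strings could be mapped to an enum before dispatching. `TokenizerKind` and `whitespace_tokenize` are hypothetical names, and treating 'ws' as a plain split on whitespace (no lowercasing or stemming) is an assumption.

```rust
use std::str::FromStr;

/// Hypothetical enum covering the selector strings shown in the usage examples.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TokenizerKind {
    HuggingFace,
    Tiktoken,
    TiniestSegmenter,
    Jieba,
    Whitespace,
}

impl FromStr for TokenizerKind {
    type Err = String;

    /// Map the SQL-level selector string to a backend, rejecting unknown names.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "hf" => Ok(Self::HuggingFace),
            "tiktoken" => Ok(Self::Tiktoken),
            "tiniestsegmenter" => Ok(Self::TiniestSegmenter),
            "jieba" => Ok(Self::Jieba),
            "ws" => Ok(Self::Whitespace),
            other => Err(format!("unknown tokenizer: {other}")),
        }
    }
}

/// Assumed behavior for 'ws': split on Unicode whitespace, nothing else.
pub fn whitespace_tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}
```

Parsing the selector once up front means a misspelled backend name fails with a clear error instead of silently falling through to a default.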

VoVAllen · 2024-09-25T08:07:46Z

We can use https://github.com/zurawiki/tiktoken-rs since it's just a wrapper. For more tokenizer support, @usamoi do you have any suggestions?

VoVAllen · 2024-09-29T13:03:08Z

Can you fix the DCO check by following the instructions?

Signed-off-by: jwnz <[email protected]>

jwnz · 2024-09-29T13:29:51Z

Signed off the commits. Apologies for the inconvenience.

Signed-off-by: jwnz <[email protected]>

RobertHH-IS · 2024-10-01T15:58:06Z

Will this be added? Critical to be able to adjust the tokenizer for multilingual support.

VoVAllen · 2024-10-02T08:51:48Z

We'll definitely merge this PR. However, the maintainer is on vacation right now and will be back on 10/8.

jwnz marked this pull request as ready for review September 26, 2024 16:21

jwnz added 5 commits September 29, 2024 22:25

adding dynamic tokenizer configuration

fa7aa01

Signed-off-by: jwnz <[email protected]>

adding whitespace and japanese tokenizers, and tests

d2ffee1

Signed-off-by: jwnz <[email protected]>

Allow switching of hf tokenizers

1c9a8c7

Signed-off-by: jwnz <[email protected]>

cleanup & remove uneccesary match in tokenize method

c06f2ab

Signed-off-by: jwnz <[email protected]>

adding tiktoken tokenizer

a6c9462

Signed-off-by: jwnz <[email protected]>

jwnz force-pushed the add-tokenizer-configuration branch from 4244656 to a6c9462 Compare September 29, 2024 13:26

Formatting Cargo.toml & updating Cargo.lock to match

277fdc7

Signed-off-by: jwnz <[email protected]>

jwnz changed the title ~~[WIP] adding tokenizer configuration~~ adding tokenizer configuration Oct 1, 2024

usamoi approved these changes Oct 8, 2024

usamoi merged commit 9a7fcae into tensorchord:main Oct 8, 2024
7 checks passed
