I'm curious how much raw text was used to train the SentencePiece tokenizer models. The SentencePiece trainer seems to scale poorly to dataset sizes of tens or hundreds of gigabytes: I'm seeing that training a unigram tokenizer model on all of English Wikipedia can use as much as 500GB of RAM and proceeds very slowly in a single thread. C4 is larger by more than an order of magnitude. It's also unclear whether there's any real benefit to using the full corpus compared to just subsampling the data. Were the T5 tokenizers trained on some smaller subset of C4? Did you request a VM with terabytes of RAM and manage to use nearly all of the data for tokenizer training? Or is there another option I'm missing? I would greatly appreciate it if anyone familiar with T5 could answer these questions. Thank you!
We first ran an Apache Beam (i.e., DataFlow) job to count the individual tokens and fed those counts to the sentencepiece binary instead of the raw files themselves. We have not open-sourced that beam job, but it is something we could look into.
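For reference, here is a minimal, single-machine sketch of that counting approach, not the actual (unreleased) Beam job. It assumes a local `corpus.txt` and relies on SentencePiece's TSV input format (`--input_format=tsv`), where each line is a `text<TAB>frequency` pair, so training memory scales with the number of distinct tokens rather than with the raw corpus size:

```python
# Sketch only: the real pipeline did this aggregation as a distributed
# Apache Beam (Dataflow) job; a Counter stands in for it here.
import collections

import sentencepiece as spm

# Step 1: count whitespace-delimited tokens over the corpus.
# ("corpus.txt" is a hypothetical local stand-in for C4.)
counts = collections.Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Step 2: write the counts as TSV, one "token<TAB>frequency" pair per line.
with open("token_counts.tsv", "w", encoding="utf-8") as f:
    for token, count in counts.items():
        f.write(f"{token}\t{count}\n")

# Step 3: train on the weighted counts instead of the raw text files.
spm.SentencePieceTrainer.train(
    input="token_counts.tsv",
    input_format="tsv",  # treat each line as (text, frequency)
    model_prefix="unigram",
    model_type="unigram",
    vocab_size=32000,
)
```

The same idea works with the `spm_train` binary directly via `--input=token_counts.tsv --input_format=tsv`; the exact flags used for the T5 tokenizers are an assumption here.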