I'm curious how much raw text was used to train the SentencePiece tokenizer models. The SentencePiece trainer seems to scale poorly to dataset sizes of tens or hundreds of gigabytes: I'm seeing that training a unigram tokenizer model on all of English Wikipedia can use as much as 500GB of RAM and proceeds very slowly in a single thread. C4 is larger by more than an order of magnitude. It's also unclear whether there's any real benefit to using the full corpus compared to just subsampling the data. Were the T5 tokenizers trained on some smaller subset of C4? Did you request a VM with terabytes of RAM and manage to use nearly all of the data for tokenizer training? Or is there another option I'm missing? I would greatly appreciate it if anyone familiar with T5 could answer these questions. Thank you!
We first ran an Apache Beam (i.e., DataFlow) job to count the individual tokens and fed those counts to the sentencepiece binary instead of the raw files themselves. We have not open-sourced that beam job, but it is something we could look into.
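For reference, here is a minimal, single-machine sketch of that counting approach, not the actual (unreleased) Beam job. It assumes a local `corpus.txt` and relies on SentencePiece's TSV input format (`--input_format=tsv`), where each line is a `text<TAB>frequency` pair, so training memory scales with the number of distinct tokens rather than with the raw corpus size:

```python
# Sketch only: the real pipeline did this aggregation as a distributed
# Apache Beam (Dataflow) job; a Counter stands in for it here.
import collections

import sentencepiece as spm

# Step 1: count whitespace-delimited tokens over the corpus.
# ("corpus.txt" is a hypothetical local stand-in for C4.)
counts = collections.Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Step 2: write the counts as TSV, one "token<TAB>frequency" pair per line.
with open("token_counts.tsv", "w", encoding="utf-8") as f:
    for token, count in counts.items():
        f.write(f"{token}\t{count}\n")

# Step 3: train on the weighted counts instead of the raw text files.
spm.SentencePieceTrainer.train(
    input="token_counts.tsv",
    input_format="tsv",  # treat each line as (text, frequency)
    model_prefix="unigram",
    model_type="unigram",
    vocab_size=32000,
)
```

The same idea works with the `spm_train` binary directly via `--input=token_counts.tsv --input_format=tsv`; the exact flags used for the T5 tokenizers are an assumption here.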