
How much data was used to train the tokenizers? #769

Answered by adarob
nikitakit asked this question in Q&A

We first ran an Apache Beam (i.e., DataFlow) job to count the individual tokens and fed those counts to the sentencepiece binary instead of the raw files themselves. We have not open-sourced that beam job, but it is something we could look into.
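For illustration, here is a minimal sketch of what a Beam job like that might look like. The actual T5 pipeline has not been open-sourced, so the file paths, whitespace tokenization, and option values below are assumptions, not the real implementation:

```python
import re

import apache_beam as beam


def tokens(line):
    # Naive whitespace tokenization; the preprocessing actually used for the
    # T5 vocabularies is not public, so this is only a placeholder.
    return re.findall(r"\S+", line)


def run(input_pattern="gs://my-bucket/corpus/*.txt",       # hypothetical paths
        output_prefix="gs://my-bucket/token_counts/part"):
    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.io.ReadFromText(input_pattern)
         | "Tokenize" >> beam.FlatMap(tokens)
         # Count occurrences of each distinct token across the corpus.
         | "Count" >> beam.combiners.Count.PerElement()
         # Emit one "token<TAB>count" line per token.
         | "Format" >> beam.MapTuple(lambda token, count: f"{token}\t{count}")
         | "Write" >> beam.io.WriteToText(output_prefix, file_name_suffix=".tsv"))


if __name__ == "__main__":
    run()
```

SentencePiece can consume frequency counts in this tab-separated form via its `input_format=tsv` option, which is presumably how the counts were fed to the trainer instead of raw text, e.g.:

```python
import sentencepiece as spm

# vocab_size and model_prefix here are illustrative values only.
spm.SentencePieceTrainer.train(
    input="token_counts.tsv", input_format="tsv",
    model_prefix="sp", vocab_size=32000)
```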
