I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure of the best way to preserve the original alignment information.
Perhaps each alignment could be remapped as the sentences are merged, or perhaps it is enough to count the tokens of the earlier segments and apply that offset to the original alignments, as in the sketch below.
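For illustration, here is a minimal Python sketch of that offset approach (`merge_pair` is a hypothetical helper, not OpusTrainer code), assuming Pharaoh-style alignment pairs and whitespace token counts:

```python
# Minimal sketch of the offset approach (not OpusTrainer code):
# alignments of the appended pair are shifted by the token counts
# of the first pair. This is only correct if the counts come from
# the same tokenization the alignments were computed over.
def merge_pair(src1, trg1, align1, src2, trg2, align2):
    src_off = len(src1.split())  # whitespace token count, source side
    trg_off = len(trg1.split())  # whitespace token count, target side
    merged_align = list(align1) + [(s + src_off, t + trg_off) for s, t in align2]
    return f'{src1} {src2}', f'{trg1} {trg2}', merged_align

# '0-0 1-1' merged with '0-0' becomes '0-0 1-1 2-2'
print(merge_pair('hello world', 'hallo welt', [(0, 0), (1, 1)],
                 'bye', 'tschüss', [(0, 0)]))
```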
The merge sentences modifier uses whitespace tokenization:
OpusTrainer/src/opustrainer/modifiers/merge.py
Lines 12 to 17 in 9ec77d3
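To see why whitespace splitting is the weak point, note that an unsegmented sentence collapses into a single "token":

```python
# Illustration of the problem (not the merge.py code itself): whitespace
# splitting sees an unsegmented Chinese sentence as one token, so any
# alignment offsets computed from its length will be wrong.
print('你好 世界'.split())  # ['你好', '世界'] -> 2 tokens
print('你好世界'.split())   # ['你好世界']     -> 1 "token" for 2 words
```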
The modifier then counts those whitespace tokens to offset the alignments:
OpusTrainer/src/opustrainer/modifiers/merge.py
Lines 28 to 31 in 9ec77d3
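The same offsets also go wrong under subword tokenization: the whitespace token count of the first segment generally differs from its subword token count, so every alignment index of the appended segment is shifted by the wrong amount. A small illustration (the subword segmentation shown is hypothetical):

```python
# Whitespace offsets vs. subword offsets (illustration only):
first = 'Hello world'
ws_offset = len(first.split())       # 2 whitespace tokens
subwords = ['▁Hel', 'lo', '▁world']  # hypothetical subword segmentation
sw_offset = len(subwords)            # 3 subword tokens
# Offsetting the second segment's alignments by 2 instead of 3
# misaligns every pair that follows.
print(ws_offset, sw_offset)
```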
However, this is not correct for non-whitespace-segmented languages, nor for training with subword tokenization. The fix would be to allow a tokenizer configuration that produces the correct tokenization, such as a SentencePiece tokenizer.
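As a rough sketch of that fix, the offsets could be computed with the tokenizer the training run actually uses; with the SentencePiece Python bindings this might look like the following (`spm.model` is a placeholder path, and `token_count` is a hypothetical helper, not an OpusTrainer API):

```python
import sentencepiece as spm

# Sketch only: count tokens with the tokenizer the training run uses.
sp = spm.SentencePieceProcessor(model_file='spm.model')

def token_count(text: str) -> int:
    # Subword token count, so alignment offsets match what the
    # trainer actually sees instead of the whitespace word count.
    return len(sp.encode(text, out_type=str))
```

This would also cover unsegmented scripts, since SentencePiece operates on raw text rather than on whitespace-delimited words.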