
Merge sentences produces incorrect alignments when used with SentencePiece #53

Open
gregtatum opened this issue Mar 1, 2024 · 1 comment

Comments

@gregtatum
Contributor

The merge-sentences modifier uses whitespace tokenization:

```python
rows = [line.split('\t') for line in inputs]
input_tokens = (
    [row[0].split() for row in rows],  # src
    [row[1].split() for row in rows],  # trg
)
```

It then counts the tokens to compute offsets for the alignments:

```python
offsets = tuple(
    list(accumulate((len(sentence) for sentence in side), initial=0))
    for side in input_tokens
)
```

However, for non-whitespace-segmented languages, and for training that uses subword tokenization, this is not correct. The fix here would be to provide a tokenizer configuration that could produce the correct tokenization, such as a SentencePiece tokenizer.
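A minimal sketch of the failure mode, using a toy character-level splitter as a stand-in for SentencePiece (the function and data are hypothetical, chosen to mimic a non-whitespace-segmented language like Japanese):

```python
def toy_subword_tokenize(text):
    # Hypothetical stand-in for SentencePiece: split into characters,
    # as a rough proxy for subword segmentation of Japanese text.
    return [ch for ch in text if not ch.isspace()]

first = "猫が座った"     # whitespace split sees 1 "token"
second = "マットの上に"  # whitespace split sees 1 "token"

# Offset computed by whitespace counting, as in the modifier:
whitespace_offset = len(first.split())             # 1

# Offset actually needed once the text is subword-tokenized:
subword_offset = len(toy_subword_tokenize(first))  # 5

assert whitespace_offset != subword_offset
# An alignment index into the merged sentence shifted by 1 instead of 5
# points at the wrong subword token.
```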

@gregtatum
Contributor Author

I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure what would be the best way to preserve the original alignment information.

Perhaps each alignment could be remapped along the way, or maybe it is safe to assume that counting the tokens of the original alignments and applying the offsets would already generate a correct result.
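A minimal sketch of the first option, remapping each alignment pair by the merged offsets (function name, signature, and data are hypothetical, not the actual modifier API):

```python
from itertools import accumulate

def merge_with_alignments(src_sents, trg_sents, alignments):
    """Merge tokenized sentence pairs and remap their word alignments.

    src_sents / trg_sents: lists of token lists, one per sentence.
    alignments: per-pair lists of (src_idx, trg_idx) tuples.
    """
    src_offsets = list(accumulate((len(s) for s in src_sents), initial=0))
    trg_offsets = list(accumulate((len(t) for t in trg_sents), initial=0))
    # Shift each pair's alignment indices by the tokens that precede it.
    merged_alignments = [
        (s + src_offsets[i], t + trg_offsets[i])
        for i, pairs in enumerate(alignments)
        for s, t in pairs
    ]
    merged_src = [tok for sent in src_sents for tok in sent]
    merged_trg = [tok for sent in trg_sents for tok in sent]
    return merged_src, merged_trg, merged_alignments

src, trg, aln = merge_with_alignments(
    [["the", "cat"], ["sat"]],
    [["die", "Katze"], ["saß"]],
    [[(0, 0), (1, 1)], [(0, 0)]],
)
print(aln)  # [(0, 0), (1, 1), (2, 2)]
```

The key point is that the remapping is only correct if the token lists fed in here use the same tokenization the trainer applies later; with SentencePiece, that would mean remapping against subword tokens rather than whitespace tokens.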
