
Merge sentences produces incorrect alignments when used with SentencePiece #53

Open
gregtatum opened this issue Mar 1, 2024 · 1 comment

Comments

@gregtatum
Contributor

The merge-sentences modifier uses whitespace tokenization:

```python
rows = [line.split('\t') for line in inputs]
input_tokens = (
    [row[0].split() for row in rows],  # src
    [row[1].split() for row in rows],  # trg
)
```

It then counts the tokens to compute offsets for the alignments:

```python
offsets = tuple(
    list(accumulate((len(sentence) for sentence in side), initial=0))
    for side in input_tokens
)
```

However, for non-whitespace-segmented languages, and for training that uses subword tokenization, this is not correct. The fix here would be to provide a tokenizer configuration that could produce the correct tokenization, such as a SentencePiece tokenizer.
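A minimal sketch of the failure mode, using a toy character-level splitter as a stand-in for SentencePiece (the function and data are hypothetical, chosen to mimic a non-whitespace-segmented language like Japanese):

```python
def toy_subword_tokenize(text):
    # Hypothetical stand-in for SentencePiece: split into characters,
    # as a rough proxy for subword segmentation of Japanese text.
    return [ch for ch in text if not ch.isspace()]

first = "猫が座った"     # whitespace split sees 1 "token"
second = "マットの上に"  # whitespace split sees 1 "token"

# Offset computed by whitespace counting, as in the modifier:
whitespace_offset = len(first.split())             # 1

# Offset actually needed once the text is subword-tokenized:
subword_offset = len(toy_subword_tokenize(first))  # 5

assert whitespace_offset != subword_offset
# An alignment index into the merged sentence shifted by 1 instead of 5
# points at the wrong subword token.
```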

@gregtatum
Contributor Author

I read up on #38, which states that part of the design is to augment based on whitespace splitting. I'm unsure what would be the best way to preserve the original alignment information.

Perhaps each alignment could be remapped along the way, or maybe it is safe to assume that counting the tokens of the original alignments and applying the offsets would already generate a correct result.
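A minimal sketch of the first option, remapping each alignment pair by the merged offsets (function name, signature, and data are hypothetical, not the actual modifier API):

```python
from itertools import accumulate

def merge_with_alignments(src_sents, trg_sents, alignments):
    """Merge tokenized sentence pairs and remap their word alignments.

    src_sents / trg_sents: lists of token lists, one per sentence.
    alignments: per-pair lists of (src_idx, trg_idx) tuples.
    """
    src_offsets = list(accumulate((len(s) for s in src_sents), initial=0))
    trg_offsets = list(accumulate((len(t) for t in trg_sents), initial=0))
    # Shift each pair's alignment indices by the tokens that precede it.
    merged_alignments = [
        (s + src_offsets[i], t + trg_offsets[i])
        for i, pairs in enumerate(alignments)
        for s, t in pairs
    ]
    merged_src = [tok for sent in src_sents for tok in sent]
    merged_trg = [tok for sent in trg_sents for tok in sent]
    return merged_src, merged_trg, merged_alignments

src, trg, aln = merge_with_alignments(
    [["the", "cat"], ["sat"]],
    [["die", "Katze"], ["saß"]],
    [[(0, 0), (1, 1)], [(0, 0)]],
)
print(aln)  # [(0, 0), (1, 1), (2, 2)]
```

The key point is that the remapping is only correct if the token lists fed in here use the same tokenization the trainer applies later; with SentencePiece, that would mean remapping against subword tokens rather than whitespace tokens.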
