Refactor filters as transformers & scorers #130

jelmervdl · 2023-10-31T12:23:52Z

In hindsight, OpusFilter had the right idea here.

OpusCleaner right now has filters, which take lines on their stdin and produce lines on their stdout. This model is really simple, and matches what you normally do in bash scripts. But it also means you can't assume (or validate) anything about the output.

In practice, filters can do a combination of three things:

1. Transform lines: the number of lines in input and output is the same, but the content of each individual line might have been altered. Think transl(iter)ation, or fixing orthographic errors.
2. Remove lines: short lines, whitespace, low scoring lines, etc.
3. Add lines: uncommon, but think of a sentence splitter. bifixer can do this for example.

Filters that do 1 can be executed in a parallel, streaming fashion, so that's nice. You can also use tools like col.py to write them so they only operate on a single column (so you don't have to do column parsing in your filter 🎉)
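To make the type 1 contract concrete, here is a minimal sketch of a transformer applied to a single column of tab-separated input. It is not how col.py is implemented; `apply_to_column` and `normalize_whitespace` are hypothetical names made up for illustration, and tab-separated columns are an assumption.

```python
from typing import Callable, Iterable, Iterator


def apply_to_column(
    lines: Iterable[str],
    column: int,
    transform: Callable[[str], str],
    delimiter: str = "\t",
) -> Iterator[str]:
    """Apply `transform` to one column of each line, one line at a time.

    Hypothetical helper: exactly one output line per input line, so the
    type 1 contract (same number of lines in and out) holds by construction.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields[column] = transform(fields[column])
        yield delimiter.join(fields)


def normalize_whitespace(text: str) -> str:
    """Example transformer: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


rows = ["Hello   world\tHallo  wereld", "  foo\tbar  baz"]
# Only column 1 is touched; column 0 passes through unchanged.
result = list(apply_to_column(rows, 1, normalize_whitespace))
```

Because each line is handled independently, a wrapper like this could in principle be split across workers without changing the result.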

Filters that do 2 can be rewritten to work like 1, except that instead of altering each line they output a score for it. Thresholding can then be done using threshold.py, which can keep a score cache, and the frontend could present a histogram to make it easier to pick a threshold. For filters that just remove empty lines, this could be a simple 0 for empty, 1 for non-empty. For language filters you can use prediction scores. Bicleaner also works with scores like this.
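A rough sketch of the scorer-plus-threshold split could look like the following. This is not threshold.py's actual interface; `score_non_empty`, `threshold`, and the dict-based cache are assumptions made up for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def score_non_empty(line: str) -> float:
    """Hypothetical scorer: 0 for empty or whitespace-only lines, 1 otherwise."""
    return 1.0 if line.strip() else 0.0


def threshold(
    lines: Iterable[str],
    scorer: Callable[[str], float],
    minimum: float,
    cache: Optional[Dict[str, float]] = None,
) -> Iterator[str]:
    """Keep lines whose score is at least `minimum`.

    Scores are looked up in `cache` first (if one is given), so trying a
    different `minimum` later does not need to re-run the scorer.
    """
    for line in lines:
        if cache is not None and line in cache:
            score = cache[line]
        else:
            score = scorer(line)
            if cache is not None:
                cache[line] = score
        if score >= minimum:
            yield line


score_cache: Dict[str, float] = {}
sample = ["a", "", "  ", "b"]
kept = list(threshold(sample, score_non_empty, 0.5, score_cache))
```

The scorer itself behaves like a type 1 filter (one score per input line), and the cached scores are the kind of data a frontend could use to draw a histogram.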

Additionally, if the frontend knows a filter is a type 1 or type 2, it can make better choices in how to present the diff.

Filters that do 3 are a bit of a pain, but also uncommon. I think we can keep a sort of legacy support for these types of filters.

The type 3 category can also cover filters that do too much (i.e. both transform and filter). And we can discourage these filters by not giving them any of the fancy interface or performance benefits.


jelmervdl added the enhancement New feature or request label Oct 31, 2023

jelmervdl mentioned this issue Oct 31, 2023

Whitespace normalization filter #128

Merged

jelmervdl mentioned this issue Jan 7, 2024

Support monolingual datasets #141

Open

4 tasks
