Bad behavior in wikiextractor #200

HarikalarKutusu · 2023-07-21T23:58:52Z

Antik Yunanca Grekçe: matesis kelimesi matematik kelimesinin köküdür ve bilirim anlamına gelmektedir.

This is the related source:

Antik Yunanca ''{{dil|grc|matesis}}'' kelimesi matematik kelimesinin köküdür ve ''bilirim'' anlamına gelmektedir.

And this is what is extracted (from text/AA/wiki_00 file):

Antik Yunanca ' kelimesi matematik kelimesinin köküdür ve \"bilirim\" anlamına gelmektedir.

Somehow a stray ' is introduced and the Greek word is dropped. The sentence no longer makes sense, yet apart from the ' character it looks valid.
Because the Greek word is removed as well, we cannot blacklist it either.
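A minimal sketch of two possible workarounds for the symptom above, not something taken from cv-sentence-extractor or wikiextractor. Both function names are hypothetical. The first expands `{{dil|code|text}}` to its text argument before extraction (the real template may also render a language label such as "Grekçe:", which this ignores); the second flags extracted sentences that contain a lone apostrophe.

```python
import re

# Hypothetical pattern: {{dil|<language code>|<text>}}, capturing <text>.
# Assumes the simple three-part form without nested templates.
DIL_TEMPLATE = re.compile(r"\{\{\s*dil\s*\|[^|{}]*\|([^|{}]*)\}\}", re.IGNORECASE)

# A lone apostrophe with whitespace (or a line boundary) on both sides.
DANGLING_QUOTE = re.compile(r"(?:^|\s)'(?:\s|$)")


def expand_dil_templates(wikitext: str) -> str:
    """Replace each {{dil|code|text}} with its text argument."""
    return DIL_TEMPLATE.sub(lambda m: m.group(1).strip(), wikitext)


def has_dangling_quote(sentence: str) -> bool:
    """Return True if the sentence contains an isolated apostrophe."""
    return DANGLING_QUOTE.search(sentence) is not None
```

Apostrophes inside words, as in Turkish suffixes like "Ali'nin", are not flagged, because they are not surrounded by whitespace.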

I'm not sure how many such occurrences would end up in the random 3-sentence selection, but a fix would be good.

PS: I'm aware this is NOT a cv-sentence-extractor issue, but the workflow includes wikiextractor, so...


MichaelKohler · 2023-07-22T09:03:55Z

This was filed previously as #72, which also links to an issue on the WikiExtractor repo.

HarikalarKutusu · 2023-07-22T18:25:21Z

@MichaelKohler, what is the reason for git checkout e4abb4cbd019b0257824ee47c23dd163919b731b for wikiextractor?

There has been a fairly recent commit for template handling in that repo. I'm not sure whether it is relevant, but as far as I can tell this issue is also related to templates.

(I scanned the closed issues, but only by title; I should have looked deeper.)

MichaelKohler · 2023-08-08T21:52:15Z

Sorry for the delay here. Here is the history of the Workflow file before it became wiki.sh: https://github.com/common-voice/cv-sentence-extractor/commits/1106b6851e4f725b3cbea3fc9128ae0af6dcb388/scripts/wiki-extraction.sh . From what I can see, this pinned commit got introduced in 4f12023.

Before that, we had ef4c47f which tried to fix an issue we ran into because of a new version. This didn't seem to work out nicely and therefore I pinned the version to at least keep the pipeline going as needed.

It is quite possible that newer versions bring improvements here, but I'd guess it would take some work to get the pipeline running with them again.

Alternatively, it seems there are also Cirrus dumps, which claim to include already expanded templates. That might be worth a try as well, though I can't say if there are other issues with that of course.
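A rough sketch of how such a dump might be read, assuming the usual layout of alternating JSON lines (a bulk `index` metadata line followed by a page document with `title` and `text` fields). The function name and the file name in the usage comment are hypothetical, and the field names should be checked against an actual dump.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_cirrus_pages(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs, skipping the index metadata lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Metadata lines look like {"index": {...}} and carry no page text.
        if "index" in record and "text" not in record:
            continue
        text = record.get("text")
        if text:
            yield record.get("title", ""), text


# Usage with a gzipped dump (hypothetical file name):
#   import gzip
#   with gzip.open("trwiki-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
#       for title, text in iter_cirrus_pages(f):
#           ...
```

Because the `text` field is assumed to hold already rendered text, no template expansion step would be needed on this path.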

MichaelKohler added the enhancement (New feature or request) and extract-improvements labels on Jul 22, 2023
