Imbalanced quotation mark in Mozilla Common Voice Japanese Dataset #2321

Open. calcoloergosum wants to merge 1 commit into main.

@calcoloergosum calcoloergosum commented Dec 11, 2022

Summary

The Mozilla Common Voice 11.0 Japanese dataset contains an unbalanced quotation mark that makes bin/import_cv2.py crash with a _csv.Error.

Reproduction

$ python bin/import_cv2.py cv-corpus-11.0-2022-09-21/ja/ --validate_label_locale $SOMETHING
...
Loading TSV file:  /mnt/ntfs/dev/voice/ja_simple/validated.tsv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 255, in <module>
    main()
  File "bin/import_cv2.py", line 250, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 196, in _preprocess_data
    set_samples = _maybe_convert_set(
  File "bin/import_cv2.py", line 130, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
...

Why does it happen?

cv-corpus-11.0-2022-09-21/ja/validated.tsv contains 4 lines whose sentence field begins with a double quotation mark, which can confuse the csv module's quote handling.

$ cat ../common-voice-filter/cv-corpus-11.0-2022-09-21/ja/validated.tsv | grep '     "'
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8        common_voice_ja_26130815.mp3    "では、危険だということですか?"と彼は武者震いをしながら言った。 2       0       twenties        male            ja
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8        common_voice_ja_26134634.mp3    "ローデシアから来たのを覚えているだろう」「なんてことだ、殺人犯め!」と彼は声を詰まらせた。       2       0       twenties        male            ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa        common_voice_ja_26015806.mp3    "パン・アム・クリッパーコネクション" バナーのもと、定期通勤サービスを運営していた。      2       0       fourties        female          ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa        common_voice_ja_26127330.mp3    "もちろん違います。"ドロシーは答えました。 "私は何をすべきか?"  2       0       fourties        female          ja

Note that in the second line the quotation mark is unbalanced: the sentence opens with " but closes with 」. I assume this comes from how Japanese text is typed. Japanese usually uses 「」 rather than "", converting between the two is a manual step, and here the conversion was apparently left incomplete.

Meanwhile, Python's csv module uses the double quotation mark as its default quote character. A field that starts with " is therefore treated as quoted, and the reader keeps consuming input, including tabs and newlines, until the next quotation mark appears. The next occurrence is on line 31236 (3712 lines later), so the accumulated field exceeds the size limit, hence the error message: _csv.Error: field larger than field limit (131072)
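The behaviour described above can be reproduced on a tiny input. The following is a minimal sketch, not code from bin/import_cv2.py: the three-row sample and the helper name read_default are made up for illustration. It shows the default quote handling swallowing the row that follows an unbalanced quotation mark.

```python
import csv
import io

# Hypothetical three-row TSV (not taken from the dataset). The sentence in
# the second row opens with " but closes with 」, so the quote never ends.
SAMPLE = (
    "id1\ta.mp3\tfirst sentence\n"
    'id2\tb.mp3\t"unbalanced」 sentence\n'
    "id3\tc.mp3\tthird sentence\n"
)


def read_default(text):
    """Parse tab-separated text with the csv module's default quote handling."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = read_default(SAMPLE)

# Only two rows come back: the third line was absorbed into the
# "quoted" sentence field of the second row.
print(len(rows))
print(repr(rows[1][2]))
```

On the real file the same thing happens over thousands of lines, so the field grows past the csv module's field size limit before a closing quotation mark is found.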

Fix

Do not use the default quote character. In fact, disable quote handling entirely when parsing the TSV files.
The Common Voice ToolBox package does the same.
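One way to express this with the standard library is csv.QUOTE_NONE. The following is a sketch under assumptions, not the exact change in this pull request: the helper name read_tsv_without_quoting and the sample rows are hypothetical, and whether the script uses csv.reader or csv.DictReader is not shown in the traceback above.

```python
import csv
import io

# Hypothetical sample with a header row; the second data row contains the
# same kind of unbalanced quotation mark as the dataset.
SAMPLE = (
    "client_id\tpath\tsentence\n"
    "id1\ta.mp3\tfirst sentence\n"
    'id2\tb.mp3\t"unbalanced」 sentence\n'
    "id3\tc.mp3\tthird sentence\n"
)


def read_tsv_without_quoting(handle):
    """Read a TSV file, treating quotation marks as ordinary characters."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv_without_quoting(io.StringIO(SAMPLE))

# All three data rows survive, and the quotation mark stays in the sentence.
print(len(rows))
print(rows[1]["sentence"])
```

With quoting disabled, every line maps to exactly one row, so a stray quotation mark can no longer merge rows. The trade-off is that a genuinely quoted field keeps its surrounding quotation marks as literal text.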


CLAassistant commented Dec 11, 2022

CLA assistant check
All committers have signed the CLA.

@calcoloergosum changed the title from "tsv file quotation handling behavior - no more error on common voice ja" to "Imbalanced quotation mark in Mozilla Common Voice" on Dec 11, 2022
@calcoloergosum changed the title from "Imbalanced quotation mark in Mozilla Common Voice" to "Imbalanced quotation mark in Mozilla Common Voice Japanese Dataset" on Dec 11, 2022