Numbered Pinyin issues encountered in CEDICT #29

bai-yi-bai · 2020-10-10T14:52:07Z

Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.

Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. Many of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave the rest up to discussion.

Issues

Numbered Pinyin do not convert to Accented
Accented pinyin which do not convert to zhuyin fuhao
Already noted in issue 27
Taiwanese pronunciation exceptions

Numbered Pinyin do not convert to Accented Pinyin

More than 2000 entries in CEDICT contain 'u:' combinations, and 'yo1' and 'yo5' account for a combined 5 more items. Dragonmapper cannot convert any of these from numbered pinyin to accented pinyin. I found it necessary to loop through the following replacements in this order:

'u:4', 'ǜ'
'u:3', 'ǚ'
'u:2', 'ǘ'
'u:1', 'ǖ'
'u:', 'ü'
'yo1', 'yō'
'yo5', 'yo'

These items raise 'ValueError: Not a valid syllable:' exceptions.
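
The ordered replacements above can be applied as a pre-processing pass before the string reaches Dragonmapper. This is only a minimal sketch: the name `preprocess_numbered` is hypothetical, and it performs nothing beyond the substitutions listed in this comment. Whether Dragonmapper then accepts every resulting string is not checked here.

```python
# Ordered (old, new) pairs taken from the list above. The toned 'u:N' forms
# must come before the bare 'u:' rule, otherwise 'u:3' would become 'ü3'.
REPLACEMENTS = [
    ("u:4", "ǜ"),
    ("u:3", "ǚ"),
    ("u:2", "ǘ"),
    ("u:1", "ǖ"),
    ("u:", "ü"),
    ("yo1", "yō"),
    ("yo5", "yo"),
]


def preprocess_numbered(pinyin: str) -> str:
    """Apply the replacements in order (hypothetical helper)."""
    for old, new in REPLACEMENTS:
        pinyin = pinyin.replace(old, new)
    return pinyin


# 'u:3' is rewritten; 'ren2' is left for the normal numbered-pinyin conversion.
print(preprocess_numbered("nu:3 ren2"))
```

Plain substring replacement is enough here because none of the listed patterns occurs inside another valid numbered syllable, but a per-syllable pass would be the more cautious design.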

Accented pinyin which do not convert to zhuyin fuhao

I also encountered the following items which do not convert correctly:

'ó':'ㄛˊ' # 哦哦 [o2] /oh (interjection indicating doubt or surprise)/
'ò':'ㄛˋ' # 哦哦 [o4] /oh (interjection indicating that one has just learned sth)/
'ō':'ㄛ'
'ǒ':'ㄛˇ'
'yō':'ㄧㄛ'
'yo':'ㄧㄛ˙'
'dia3':'ㄉㄧㄚˇ' # diǎ 嗲嗲 [dia3] /coy/childish/
'm2':'ㄇˊ'
'm4':'ㄇˋ'
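
One way to use the table above is as an override dictionary that is consulted before the normal conversion. The sketch below assumes a per-syllable workflow; `ZHUYIN_OVERRIDES` and `syllable_to_zhuyin` are hypothetical names, and the converter is passed in as a parameter so the example does not depend on any particular Dragonmapper call. The keys mix accented and numbered forms exactly as the list does.

```python
import unicodedata
from typing import Callable

# Syllables from the list above, keyed as they appear there.
_RAW_OVERRIDES = {
    "ó": "ㄛˊ",
    "ò": "ㄛˋ",
    "ō": "ㄛ",
    "ǒ": "ㄛˇ",
    "yō": "ㄧㄛ",
    "yo": "ㄧㄛ˙",
    "dia3": "ㄉㄧㄚˇ",
    "m2": "ㄇˊ",
    "m4": "ㄇˋ",
}
# Normalise keys to NFC so lookups do not depend on how the letters were typed.
ZHUYIN_OVERRIDES = {
    unicodedata.normalize("NFC", key): value for key, value in _RAW_OVERRIDES.items()
}


def syllable_to_zhuyin(syllable: str, convert: Callable[[str], str]) -> str:
    """Return the override if one exists, otherwise defer to `convert`."""
    key = unicodedata.normalize("NFC", syllable)
    if key in ZHUYIN_OVERRIDES:
        return ZHUYIN_OVERRIDES[key]
    return convert(syllable)
```

With Dragonmapper installed, `convert` would presumably be `dragonmapper.transcriptions.to_zhuyin`, the function used elsewhere in this thread.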

Already noted in issue 27

#27

'tēi':'ㄊㄨㄟ' # Workaround for 忒忒 [tei1] /(dialect) too/very/also pr. [tui1]/
'eng1':'ㄥ' # Work around for ēng 鞥鞥 [eng1] /reins/

Taiwanese Pronunciation Exceptions

I found it necessary to skip items which contained the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper.
dragonmapper.hanzi.to_zhuyin('goá')
Results in a 'ValueError: Not a valid syllable: o5'
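
Skipping these entries can be done with a small filter in the parser. This is a sketch under the assumption that the pinyin field is split on whitespace; `SKIP_READINGS` and `has_skipped_reading` are hypothetical names, and the set only contains the readings listed above.

```python
# Readings listed above that were skipped rather than converted.
SKIP_READINGS = {"khè", "goá", "khàu", "ô", "yai2"}


def has_skipped_reading(pinyin_field: str) -> bool:
    """True if any whitespace-separated token is one of the skipped readings."""
    return any(token.lower() in SKIP_READINGS for token in pinyin_field.split())
```

A parser loop could then `continue` past any entry for which `has_skipped_reading(...)` is true before handing the rest to Dragonmapper.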


bai-yi-bai · 2022-12-13T14:51:42Z

I'm returning two years later to provide some additional help.
The best advice I can give is to test your input before trying to run dragonmapper on it.

Always Use `.lower()` Before Calling `.to_zhuyin()`

Use .lower() to avoid errors like this:
ValueError: Not a valid syllable: Ān

Fix Incorrect Pinyin Vowel/Tone Characters

There are a lot of strange encodings out there. Dragonmapper doesn't work with these characters. Sanitize the input.

yourstring.replace('á','á').replace('ǎ','ǎ').replace('ē','ē').replace('é', 'é').replace('ī','ī').replace('ǐ','ǐ').replace('ì','ì').replace('ò','ò').replace('ū','ū').replace( 'ǔ','ǔ').replace('ù', 'ù')
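
If the "strange" characters are decomposed sequences (a base letter followed by a combining tone mark) that merely look like the precomposed letters, Unicode NFC normalization may be able to replace the whole chain. That is an assumption, since the two forms are visually identical and cannot be told apart by eye. The sketch below combines normalization with the `.lower()` advice; `sanitize_pinyin` is a hypothetical helper name.

```python
import unicodedata


def sanitize_pinyin(text: str) -> str:
    """NFC-normalise and lowercase a pinyin string (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).lower()


decomposed = "A\u0304n"  # 'A' + COMBINING MACRON + 'n', renders like 'Ān'
cleaned = sanitize_pinyin(decomposed)
# The cleaned string is one code point shorter because the mark was composed.
print(cleaned, len(decomposed), len(cleaned))
```

If the offending characters turn out to be other look-alikes rather than combining sequences, normalization will not touch them and an explicit `.replace()` chain is still needed.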

Edge Case: Handling Single Latin Consonants in Strings with `to_zhuyin()`

Let's say you have strings containing single latin consonants:

the_pinyin_input = 'X fēn zhī Y'
the_pinyin_input = 'X guāng'

Calling dragonmapper.transcriptions.to_zhuyin(the_pinyin_input) results in ValueError: String is not a valid Chinese transcription.

It is possible to .split() these and run try/except blocks on them, but there might be a better test available:

if 1 in [len(x) for x in the_pinyin_input.split()]:
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))

This splits the string into a list ['X', 'fēn', 'zhī', 'Y'] and tests the length of each element. The second list comprehension only converts the longer elements. Unfortunately, this test also matches a lot of strings with non-letter characters, such as '/' and '.'.

Here's a slightly better alternative, testing for consonants in a string.

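import string
consonants = set(string.ascii_letters) - set('aeiouAEIOU')
if any(x in consonants for x in the_pinyin_input.split()):
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))

This uses the same list comprehension. The consonants are built as a set difference because str.strip('aeiu') would only trim the ends of the alphabet string, and a substring test against what remains would also match ordinary syllables such as 的 (de, ㄉㄜ˙). Upper-case letters are included so that tokens like 'X' and 'Y' match without lowercasing first.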

Hope this helps some people in the future.
