Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.
Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. Many of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave them up for discussion.
Issues
Numbered Pinyin do not convert to Accented Pinyin
Accented pinyin which do not convert to zhuyin fuhao
Already noted in issue 27
Taiwanese pronunciation exceptions
Numbered Pinyin do not convert to Accented Pinyin
More than 2000 entries in CEDICT contain 'u:' combinations, and 'yo1' and 'yo5' appear in a combined 5 entries. Dragonmapper cannot convert these items from numbered pinyin to accented pinyin. I found it necessary to loop through the following replacements in this order:
'u:4', 'ǜ'
'u:3', 'ǚ'
'u:2', 'ǘ'
'u:1', 'ǖ'
'u:', 'ü'
'yo1', 'yō'
'yo5', 'yo'
Without these replacements, the items raise 'ValueError: Not a valid syllable:' exceptions.
Accented pinyin which do not convert to zhuyin fuhao
I also encountered the following items which do not convert correctly:
'tēi':'ㄊㄨㄟ' # Workaround for 忒 忒 [tei1] /(dialect) too/very/also pr. [tui1]/
'eng1':'ㄥ' # Work around for ēng 鞥 鞥 [eng1] /reins/
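One way to apply these two workarounds is a small override table that is checked before the normal conversion. This is only a sketch: to_zhuyin_with_overrides is a hypothetical helper, and the converter is passed in as an argument (in practice it would presumably be dragonmapper.transcriptions.to_zhuyin) so the example runs without the library installed.

```python
# Overrides copied from the two workarounds above; every other syllable is
# delegated to whatever converter is passed in.
ZHUYIN_OVERRIDES = {
    "tēi": "ㄊㄨㄟ",
    "eng1": "ㄥ",
}


def to_zhuyin_with_overrides(syllable, convert):
    """Return the override if one exists, else call convert(syllable).

    `convert` is assumed to be a callable such as
    dragonmapper.transcriptions.to_zhuyin (an assumption of this sketch).
    """
    if syllable in ZHUYIN_OVERRIDES:
        return ZHUYIN_OVERRIDES[syllable]
    return convert(syllable)
```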
Taiwanese Pronunciation Exceptions
I found it necessary to skip items which contained the Taiwanese pronunciations ['khè', 'goá', 'khàu', 'ô', 'yai2']. I'm not sure anything can be done about this with Dragonmapper. For example, calling dragonmapper.hanzi.to_zhuyin('goá') results in a 'ValueError: Not a valid syllable: o5'.
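The skip itself can be a simple membership test run before conversion. The sketch below assumes the pinyin field is a space-separated string of syllables, as in CEDICT entries; has_unsupported_syllable is a hypothetical name, and the set contains only the five readings listed above.

```python
# The five readings listed above that could not be converted.
UNSUPPORTED = {"khè", "goá", "khàu", "ô", "yai2"}


def has_unsupported_syllable(pinyin):
    """True if any space-separated syllable is on the skip list."""
    return any(token in UNSUPPORTED for token in pinyin.split())


entries = ["ni3 hao3", "yai2"]
convertible = [e for e in entries if not has_unsupported_syllable(e)]
print(convertible)
```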
I'm returning two years later to provide some additional help.
The best advice I can give is to test your input before trying to run dragonmapper on it.
Always Use .lower() Before Calling .to_zhuyin()
Use .lower() to avoid errors like this: ValueError: Not a valid syllable: Ān
Fix Incorrect Pinyin Vowel/Tone Characters
There are a lot of strange encodings out there. Dragonmapper doesn't work with these characters. Sanitize the input.
Calling dragonmapper.transcriptions.to_zhuyin(the_pinyin_input) results in ValueError: String is not a valid Chinese transcription.
It is possible to .split() these and run try/except blocks on them, but there might be a better test available:
if 1 in [len(x) for x in the_pinyin_input.split()]:
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))
This splits the string into a list such as ['X', 'fēn', 'zhī', 'Y'] and tests the length of each element. The second list comprehension only converts the longer elements. Unfortunately, this test also fires on a lot of strings that contain stand-alone non-letter characters, such as / and .
Here's a slightly better alternative, testing for consonants in a string.
import string
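if any(x in string.ascii_lowercase.strip('aeiu') for x in the_pinyin_input.split()):
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))
This uses the same list comprehension. Unfortunately, this if statement also runs on 的 (de, ㄉㄜ˙): .strip('aeiu') only removes the leading 'a' from the alphabet, and x in some_string is a substring test, so any token that is a run of consecutive letters, such as 'de', matches.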
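The try/except approach mentioned earlier can be sketched per token: attempt the conversion on each whitespace-separated piece and keep the original piece when it fails. convert_tokens is a hypothetical helper; the converter is passed in (assumed to be something like dragonmapper.transcriptions.to_zhuyin, which raises ValueError on invalid input according to the errors quoted above).

```python
def convert_tokens(text, convert):
    """Convert each whitespace-separated token, keeping tokens that fail.

    `convert` is assumed to raise ValueError for input it cannot handle.
    Each token is lowercased before conversion; the original token is
    kept unchanged if the conversion fails.
    """
    converted = []
    for token in text.split():
        try:
            converted.append(convert(token.lower()))
        except ValueError:
            converted.append(token)
    return " ".join(converted)
```

This avoids guessing in advance which tokens are placeholders or punctuation, at the cost of silently passing through anything the converter rejects, so it may be worth logging the failures while cleaning the data.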