Numbered Pinyin issues encountered in CEDICT #29

Open
bai-yi-bai opened this issue Oct 10, 2020 · 1 comment

Comments

@bai-yi-bai

Hi, Dragonmapper is an awesome library. I am using it (0.2.6) for many projects, which use CEDICT as a data source for further text processing. I found problems with numbered pinyin, accented pinyin, and zhuyin fuhao transcriptions.

Before I begin, I want to note that I am not a Mandarin expert, so I don't know whether my suggestions are the correct ones. Many of my suggested clean-up edits to CEDICT have been accepted. However, since CEDICT is not in a standard format like .csv, I had to build my own parser, read the data line by line, and .split() it to feed Dragonmapper. I'm not sure whether every issue I've discovered should be solved by Dragonmapper; I will simply present the problems I needed to work around and leave the rest up to discussion.

Issues

  1. Numbered Pinyin do not convert to Accented
  2. Accented pinyin which do not convert to zhuyin fuhao
  3. Already noted in issue 27
  4. Taiwanese pronunciation exceptions

Numbered Pinyin do not convert to Accented Pinyin

More than 2000 entries in CEDICT contain 'u:' combinations, and 'yo1' and 'yo5' together account for another 5 entries. Dragonmapper cannot convert any of these from numbered pinyin to accented pinyin. I found it necessary to loop through the following replacements in this order:

  • 'u:4', 'ǜ'
  • 'u:3', 'ǚ'
  • 'u:2', 'ǘ'
  • 'u:1', 'ǖ'
  • 'u:', 'ü'
  • 'yo1', 'yō'
  • 'yo5', 'yo'

These items raise 'ValueError: Not a valid syllable:' exceptions.

Accented pinyin which do not convert to zhuyin fuhao

I also encountered the following items which do not convert correctly:

  • 'ó':'ㄛˊ' # 哦 哦 [o2] /oh (interjection indicating doubt or surprise)/
  • 'ò':'ㄛˋ' # 哦 哦 [o4] /oh (interjection indicating that one has just learned sth)/
  • 'ō':'ㄛ'
  • 'ǒ':'ㄛˇ'
  • 'yō':'ㄧㄛ'
  • 'yo':'ㄧㄛ˙'
  • 'dia3':'ㄉㄧㄚˇ' # diǎ 嗲 嗲 [dia3] /coy/childish/
  • 'm2':'ㄇˊ'
  • 'm4':'ㄇˋ'
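
One possible way to work around these items is a small lookup table that is checked before the converter is called. The sketch below is an assumption about how such a workaround might look, not Dragonmapper's behaviour: the names FALLBACK_ZHUYIN and to_zhuyin_with_fallback are hypothetical, the table simply copies the pairs listed above, and the converter is passed in as an argument (in practice that would presumably be dragonmapper.transcriptions.to_zhuyin).

```python
# Syllables from the list above, mapped to the zhuyin given there.
FALLBACK_ZHUYIN = {
    'ó': 'ㄛˊ',
    'ò': 'ㄛˋ',
    'ō': 'ㄛ',
    'ǒ': 'ㄛˇ',
    'yō': 'ㄧㄛ',
    'yo': 'ㄧㄛ˙',
    'dia3': 'ㄉㄧㄚˇ',
    'm2': 'ㄇˊ',
    'm4': 'ㄇˋ',
}


def to_zhuyin_with_fallback(syllable, convert):
    """Return the table entry if there is one, else call convert(syllable)."""
    if syllable in FALLBACK_ZHUYIN:
        return FALLBACK_ZHUYIN[syllable]
    return convert(syllable)
```

Passing the converter in keeps the table independent of the library version, so the entries can be deleted one by one if a later release handles them.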

Already noted in issue 27

#27

  • 'tēi':'ㄊㄨㄟ' # Workaround for 忒 忒 [tei1] /(dialect) too/very/also pr. [tui1]/
  • 'eng1':'ㄥ' # Work around for ēng 鞥 鞥 [eng1] /reins/

Taiwanese Pronunciation Exceptions

I found it necessary to skip items which contained Taiwanese pronunciations of ['khè' ,'goá' ,'khàu' ,'ô' ,'yai2'] . I'm not sure anything can be done about this with Dragonmapper.
dragonmapper.hanzi.to_zhuyin('goá')
Results in a 'ValueError: Not a valid syllable: o5'

@bai-yi-bai

I'm returning two years later to provide some additional help.
The best advice I can give is to test your input before trying to run dragonmapper on it.

Always Use .lower() Before Calling .to_zhuyin()

Use .lower() to avoid errors like this:
ValueError: Not a valid syllable: Ān

Fix Incorrect Pinyin Vowel/Tone Characters

There are a lot of strange encodings out there. Dragonmapper doesn't work with these characters. Sanitize the input.

yourstring.replace('á','á').replace('ǎ','ǎ').replace('ē','ē').replace('é', 'é').replace('ī','ī').replace('ǐ','ǐ').replace('ì','ì').replace('ò','ò').replace('ū','ū').replace( 'ǔ','ǔ').replace('ù', 'ù')
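
If the odd characters are decomposed sequences (a plain vowel followed by a combining tone mark), Python's standard unicodedata module can compose them in one step instead of a long .replace() chain. This is a sketch under that assumption; the function name clean_pinyin is hypothetical, and lookalike characters that are genuinely different letters (for example a breve instead of a caron) would not be fixed by it.

```python
import unicodedata


def clean_pinyin(text):
    """Compose combining tone marks into single characters, then lowercase."""
    # NFC turns e.g. 'a' + U+0301 (combining acute) into the single character 'á'.
    composed = unicodedata.normalize('NFC', text)
    # Lowercasing also covers the advice from the previous section.
    return composed.lower()
```

Calling clean_pinyin() on every string before handing it to Dragonmapper covers both the capitalisation and the composition step in one place.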

Edge Case: Handling Single Latin Consonants in Strings with to_zhuyin()

Let's say you have strings containing single Latin consonants:

the_pinyin_input = 'X fēn zhī Y'
the_pinyin_input = 'X guāng'

Calling dragonmapper.transcriptions.to_zhuyin(the_pinyin_input) results in ValueError: String is not a valid Chinese transcription.

It is possible to .split() these and run try/except blocks on them, but there might be a better test available:

if 1 in [len(x) for x in the_pinyin_input.split()]:
    print(' '.join([dragonmapper.transcriptions.to_zhuyin(x) if len(x) > 1 else x for x in the_pinyin_input.split()]))

This splits the string into a list ['X', 'fēn', 'zhī', 'Y'] and tests the length of each element. The second list comprehension converts only the longer elements. Unfortunately, this test also fires on many strings that contain single non-letter characters, such as '/' and '.'.

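Here's a slightly better alternative, testing whether any token is a single Latin consonant.

import string
# Single ASCII letters that are not vowels, in either case.
consonants = set(string.ascii_letters) - set('aeiouAEIOU')
tokens = the_pinyin_input.split()
if any(x in consonants for x in tokens):
    print(' '.join([x if x in consonants else dragonmapper.transcriptions.to_zhuyin(x) for x in tokens]))

This uses the same kind of list comprehension. Note that string.ascii_lowercase.strip('aeiu') only trims characters from the ends of the alphabet, and that testing a token against a string is a substring test. A check written that way unfortunately also runs on 的 (de, ㄉㄜ˙), because 'de' is a substring of the alphabet. Set membership avoids both problems.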
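
The .split() plus try/except route mentioned earlier can be sketched as follows. It is a rough illustration, not a tested recipe: the name convert_tokens is hypothetical, and the converter is passed in as an argument (for instance dragonmapper.transcriptions.to_zhuyin), on the assumption that it raises ValueError for tokens it cannot handle, as in the errors quoted above.

```python
def convert_tokens(text, convert):
    """Convert each whitespace-separated token, keeping any that fail as-is."""
    converted = []
    for token in text.split():
        try:
            converted.append(convert(token))
        except ValueError:
            # Not a valid transcription (e.g. 'X' or '/'): leave it untouched.
            converted.append(token)
    return ' '.join(converted)
```

This avoids guessing in advance which tokens are safe, at the cost of silently passing through anything the converter rejects, including genuine typos.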

Hope this helps some people in the future.
