Jyutping to IPA support #44
Comments
Hello, thank you for reaching out here! Coincidentally, a while ago other colleagues also asked about a Jyutping-to-IPA conversion function. I then managed to put together a draft implementation, which can be installed from its branch:

pip install git+https://github.com/jacksonllee/pycantonese.git@jyutping-to-ipa

Sample usage:

>>> import pycantonese
>>> pycantonese.jyutping_to_ipa('gwong2dung1waa2')  # 廣東話 Cantonese
['kʷɔŋ25', 'tʊŋ55', 'waː25']
>>> pycantonese.jyutping_to_ipa('gwong2dung1waa2', as_list=False)
'kʷɔŋ25 tʊŋ55 waː25'

For details such as Jyutping-to-IPA mapping tables, customization, and documentation notes, please see the source code of the branch: https://github.com/jacksonllee/pycantonese/compare/jyutping-to-ipa. Hope this helps!
Hello! Many thanks for your reply. It really helps a lot in preparing the dataset! The added function jyutping_to_ipa() works most of the time, but we just encountered this error while parsing the Wikipedia zh-yue dataset. I wonder if you have any insights into the issue:

File ~/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pycantonese/jyutping/ipa.py:133, in jyutping_to_ipa(jp_str, as_list, onsets, nuclei, codas, tones)
75 def jyutping_to_ipa(
76 jp_str: str,
77 as_list: bool = True,
(...)
82 tones: Optional[Dict[str, str]] = None,
83 ) -> Union[List[str], str]:
84 """Convert Jyutping romanization into IPA.
85
86 The Jyutping-to-IPA mapping is based on Matthews and Yip (2011: 461-463).
(...)
131 ['tsʰi˥']
132 """
--> 133 jp_parsed_list = parse_jyutping(jp_str)
134 ipa_list = []
136 for jp_parsed in jp_parsed_list:
File ~/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pycantonese/jyutping/parse_jyutping.py:168, in parse_jyutping(jp_str)
165 onset = cv
167 if onset not in ONSETS:
--> 168 raise ValueError("onset error -- " + repr(jp))
170 jp_parsed_list.append(Jyutping(onset, nucleus, coda, tone))
172 return jp_parsed_list
ValueError: onset error -- 'vi1'

The text contains some Japanese characters; not sure if that's the cause of the error. But for unrecognised characters, the method usually just returns an empty string, so it might not be caused by unrecognised characters? Thank you again for your kind help!
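For anyone hitting the same ValueError on a large corpus, one possible workaround is to catch it per input and keep going, so that the failing strings can be inspected afterwards. The following is only a minimal sketch: `convert_skipping_errors` is a hypothetical helper (not part of pycantonese), and `convert` stands for whatever conversion callable you use.

```python
from typing import Callable, Iterable, List, Tuple


def convert_skipping_errors(
    jp_strings: Iterable[str],
    convert: Callable[[str], str],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Convert each Jyutping string, collecting failures instead of stopping.

    Hypothetical helper for illustration; `convert` is any callable that
    raises ValueError on input it cannot parse.
    """
    converted: List[str] = []
    failures: List[Tuple[str, str]] = []
    for jp in jp_strings:
        try:
            converted.append(convert(jp))
        except ValueError as exc:
            # Record the offending input and the parser's message for later review.
            failures.append((jp, str(exc)))
    return converted, failures
```

With the branch above installed, `convert` could be something like `lambda jp: pycantonese.jyutping_to_ipa(jp, as_list=False)`; the `failures` list then shows which strings (such as 'vi1') the parser rejected.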
The stack trace shows that the error was |
Hi Jackson! Just wondering what the numbers after the IPA transcription represent?
The numbers represent tone using the Chao tone letters. For instance, "55" means the high-level tone, i.e., tone 1 in Jyutping.
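If the tone digits need to be handled separately from the segments, they can be split off the end of each IPA syllable. This is a rough sketch, assuming the output format shown in the sample usage (segments followed by ASCII digits); `split_tone` and `CHAO_TO_JYUTPING_TONE` are hypothetical names, and the table lists only the two correspondences that appear in this thread.

```python
import re
from typing import Tuple

# Only the correspondences confirmed in this thread; the rest are left out on purpose.
CHAO_TO_JYUTPING_TONE = {"55": "1", "25": "2"}

# Segments (lazy match) followed by one or more trailing ASCII digits.
_TRAILING_DIGITS = re.compile(r"^(.*?)([0-9]+)$")


def split_tone(ipa_syllable: str) -> Tuple[str, str]:
    """Split an IPA syllable such as 'tʊŋ55' into (segments, tone digits)."""
    match = _TRAILING_DIGITS.match(ipa_syllable)
    if match is None:
        # No trailing digits, e.g. when tone letters are used instead.
        return ipa_syllable, ""
    return match.group(1), match.group(2)


segments, tone = split_tone("tʊŋ55")
print(segments, tone, CHAO_TO_JYUTPING_TONE.get(tone))
```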
Thanks for your reply! I am currently doing something similar to rjrobben, but trying to map articulatory features to each phoneme. For Jyutping, there is the |
If I understand what you're trying to do, it can be done in a two-step process: (1) use I should be making a new release of pycantonese soon, so that folks who'd like to use the new |
Vowel length is not contrastive for Cantonese (except for the borderline case between Jyutping "aa" and "a", for which my mappings use [aː] and [ɐ], respectively, to capture differences of both vowel length and quality). For basic/canonical/regular IPA transcription, vowel length shouldn't be or at least doesn't need to be part of it. My choice of the exact symbols for Jyutping-to-IPA conversion is based on Matthews and Yip (2011), already documented here. If for whatever reason (e.g., if the transcription you need isn't "basic/canonical/regular" but for, say, showing a specific speaker's speech features) you want to override any of the pre-defined mappings, then |
Thank you very much!
Feature you are interested in and your specific question(s):
Is there any method that converts Jyutping to IPA? I know there's a Jyutping-to-TIPA method now; it would be great to also have Jyutping to IPA.
What you are trying to accomplish with this feature or functionality:
I am currently helping to prepare the data for training the Cantonese part of a multilingual PL-BERT for the open-source StyleTTS2 model. We need a grapheme-to-phoneme library for the zh-yue/zh language using the Wikipedia dataset.
We have yet to find a g2p library of good enough quality that fits into the StyleTTS2 format; we tried espeak-ng and some deep-learning libraries. So we are attempting to use the pycantonese characters_to_jyutping method, then convert the result with jyutping_to_ipa.
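The two-step idea described above could look roughly like the sketch below. It is not the library's own pipeline: `text_to_ipa` is a hypothetical helper, and it assumes that the first step yields (word, Jyutping) pairs in which the Jyutping may be None when no romanization is available.

```python
from typing import Callable, List, Optional, Tuple


def text_to_ipa(
    text: str,
    to_jyutping: Callable[[str], List[Tuple[str, Optional[str]]]],
    to_ipa: Callable[[str], str],
) -> List[Tuple[str, Optional[str]]]:
    """Characters -> Jyutping -> IPA, keeping each word next to its result.

    Hypothetical helper; both conversion steps are passed in as callables.
    """
    result: List[Tuple[str, Optional[str]]] = []
    for word, jp in to_jyutping(text):
        # Assumption: a missing romanization is signalled by None (or an empty string).
        ipa = to_ipa(jp) if jp else None
        result.append((word, ipa))
    return result
```

With pycantonese installed, the callables might be `pycantonese.characters_to_jyutping` and `lambda jp: pycantonese.jyutping_to_ipa(jp, as_list=False)`; the exact return shape of the first function is an assumption here and worth checking against the library's documentation.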
Additional context: