
Speech+Transcript conditioned phoneme recognition as an alternative to G2P #29

Open
vishalbhavani opened this issue Nov 2, 2021 · 1 comment

@vishalbhavani

Hi @wookladin,
While creating the training data, G2P produces phonemes based on how each word is supposed to be pronounced, but the audio may be pronounced slightly differently due to accent variation. I understand that you used a proprietary G2P for better results, but G2P models only use transcript information.

  1. A phoneme recognizer conditioned on both the speech and the transcript should give better results, shouldn't it? (A minimal sketch of this idea follows this list.)
  2. Phoneme error rates are still high for current ASR acoustic models. Typically, the acoustic model predicts somewhat inaccurate phonemes and the language model recovers the transcript from them, but here we want to improve the accuracy of the phonemes themselves given the audio and the transcript. I couldn't find any literature on that. Any leads/ideas?
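
To make question 1 concrete, here is a minimal PyTorch sketch of what such a speech+transcript conditioned phoneme recognizer could look like: the canonical (G2P) phonemes of the transcript are embedded, the audio frames cross-attend to them, and a CTC head predicts the phonemes actually spoken. Everything here (module names, sizes, the overall architecture) is an illustrative assumption, not something from this repository:

```python
# Hypothetical sketch: a phoneme recognizer conditioned on both audio and
# transcript. The transcript's canonical G2P phonemes act as a reference the
# audio encoder can "check" the signal against. Not part of this repo.
import torch
import torch.nn as nn

class ConditionedPhonemeRecognizer(nn.Module):
    def __init__(self, n_phonemes, feat_dim=80, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(feat_dim, d_model)
        self.text_embed = nn.Embedding(n_phonemes, d_model)
        # Audio frames attend over the canonical phoneme sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.ctc_head = nn.Linear(d_model, n_phonemes + 1)  # +1 for the CTC blank

    def forward(self, audio_feats, canonical_phonemes):
        # audio_feats: (B, T, feat_dim); canonical_phonemes: (B, L) phoneme ids
        a = self.audio_proj(audio_feats)
        t = self.text_embed(canonical_phonemes)
        attended, _ = self.cross_attn(a, t, t)      # audio queries, text keys/values
        h = self.encoder(a + attended)
        # Per-frame log-probs over phonemes, ready for nn.CTCLoss against the
        # phonemes actually spoken.
        return self.ctc_head(h).log_softmax(-1)
```

Trained with CTC against human-verified phoneme labels (or bootstrapped from G2P output), the cross-attention would let the model prefer the pronunciation variant the audio actually supports rather than the dictionary default.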
@wookladin
Contributor

Hi, sorry for the late response.

  1. Yes, it would be better to have a phoneme recognizer that takes both the speech and the transcript as input, but I've never seen such an approach.
  2. I think it could work better if the ASR language model and the model that uses the ASR output to reconstruct the original speech were trained jointly. However, that training process would be unstable; using RL might help in this case (a rough sketch of this idea follows).
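
One way to read the RL suggestion in point 2 is as a REINFORCE-style objective: the phoneme predictor samples a discrete phoneme sequence (the "action"), a reconstruction model rebuilds the speech from it, and the negative reconstruction loss serves as the reward. The sketch below is a hypothetical illustration of that reading; `phoneme_predictor` and `reconstructor` are placeholder modules, not part of this repo:

```python
# Hypothetical REINFORCE sketch of jointly training a phoneme predictor and a
# speech reconstructor. Sampling discrete phonemes blocks backprop, so the
# predictor is updated via policy gradient instead. Placeholder modules only.
import torch
import torch.nn.functional as F

def joint_step(phoneme_predictor, reconstructor, audio_feats, optimizer):
    # Per-frame logits over the phoneme inventory: (B, T, n_phonemes)
    logits = phoneme_predictor(audio_feats)
    dist = torch.distributions.Categorical(logits=logits)
    phonemes = dist.sample()                        # (B, T), non-differentiable
    log_prob = dist.log_prob(phonemes).sum(dim=1)   # (B,)

    # Reconstruct the original speech features from the sampled phonemes.
    recon = reconstructor(phonemes)                 # assumed (B, T, feat_dim)
    recon_loss = F.l1_loss(recon, audio_feats, reduction="none").mean(dim=(1, 2))

    # REINFORCE: reward = -reconstruction loss, with a batch-mean baseline
    # to reduce gradient variance.
    reward = -recon_loss.detach()
    policy_loss = (-(reward - reward.mean()) * log_prob).mean()

    loss = policy_loss + recon_loss.mean()          # update both models jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The variance of this estimator is exactly why the training would be unstable, which matches the caveat above; a learned baseline or reward normalization would likely be needed in practice.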
