phone boundary between continuous vowels #5

Open
Liujingxiu23 opened this issue Aug 30, 2022 · 9 comments

@Liujingxiu23

Liujingxiu23 commented Aug 30, 2022

@petronny Hi, I have successfully trained the model on a Chinese dataset.
But I have run into a problem: the boundaries between consecutive vowels are not as accurate as those of other phones. For example, in "我安心的点点头", the phone boundary between “我” and “安” (“o3” and "an1") is wrong. This kind of problem happens frequently.

For syllables like "yun1" (云), I can split them into "y vn1", where “y” has a certain duration; for "wu2" (无), I can split them into "w u2", where "w" has a duration. But for some vowel-initial syllables, for example "安/an" and "阿/a", there is really no consonant at all.

Have you seen problems like this? If so, how did you solve them?

@petronny
Member

Well, it would not be a surprise to me if NeuFA (or any other FA model) predicts some insane boundaries.

As the paper says, the 50 ms tolerance accuracy of NeuFA is 95% at word level.
That seems high. But in practice, take a sentence with 20 phonemes as an example. If each boundary is treated as independent, the probability that at least one phoneme has a predicted boundary more than 50 ms away from the ground truth is 1 - 0.95^20 = 64.15%. Similarly, the probability that at least one phoneme has a predicted boundary more than 100 ms away from the ground truth is 1 - 0.98^20 = 33.24%.
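
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation, assuming every boundary is independently within tolerance with the stated accuracy; the function name `prob_any_boundary_off` is made up for illustration and is not part of NeuFA.

```python
def prob_any_boundary_off(accuracy: float, n_phonemes: int) -> float:
    """Probability that at least one of n boundaries misses the tolerance.

    Assumes each boundary is independently within tolerance with
    probability `accuracy` (a simplification, not a measured property).
    """
    return 1.0 - accuracy ** n_phonemes


# Accuracy values taken from the comment above: 0.95 at 50 ms, 0.98 at 100 ms.
for tolerance_ms, accuracy in [(50, 0.95), (100, 0.98)]:
    p = prob_any_boundary_off(accuracy, 20)
    print(f"{tolerance_ms} ms tolerance, 20 phonemes: {p:.2%}")
```

With these inputs the two results round to 64.15% and 33.24%, matching the figures quoted above.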

Also, NeuFA currently doesn't restrict the predicted boundaries to be nonoverlapping (we are working on this in NeuFA 2),
which makes the situation even worse.

So my opinion is that NeuFA is not ready for production environments yet.
But NeuFA can be used as a "soft" FA model that extracts the attention weights between the text and speech to map the information between them. This is exactly why we proposed NeuFA and how we use it in our other research.

Hope this will answer your question.

@Liujingxiu23
Author

Liujingxiu23 commented Aug 30, 2022

@petronny Thank you for your reply!

  1. The missing "nonoverlapping" constraint and the "fixed thred=0.5" make the boundaries not very clear, and the results are hard to use even though most of the results are really good.
  2. Can you share the code to "extract the attention weights between the text and speech to map the information between them"?

@petronny
Member

> the results are hard to use even though most of the results are really good.

I agree with that. We are working on the nonoverlapping issue.

> Can you share the code to "extract the attention weights between the text and speech to map the information between them"?

See https://github.com/thuhcsi/NeuFA/blob/master/inference.py#L112. I mainly use the attention weights from the ASR direction.

@Liujingxiu23
Author

Got it! Thank you again! @petronny

@Liujingxiu23
Author

Liujingxiu23 commented Sep 6, 2022

I tried w_tts and w_asr at phone level, but both results are bad, because the result for the first phone "silence" of each sentence differs greatly from the ground truth. I do not know why. Then I tried weight = boundary_left - boundary_right for each phone (the weight values are 1 in the middle of the phone and about 0 at its borders) and used the functions in "https://github.com/as-ideas/DeepForcedAligner/blob/main/dfa/duration_extraction.py" to extract durations. This way I get a continuous, non-overlapping alignment.
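
The idea can be sketched roughly as follows. This is not the DeepForcedAligner routine linked above and not NeuFA code: it swaps in a plain dynamic-programming search for the best monotonic path. It also assumes that `boundary_left` and `boundary_right` are arrays of shape (phones, frames) that rise from about 0 to about 1 at each phone's start and end respectively. The function name `monotonic_durations` and the toy values are made up for illustration.

```python
import numpy as np


def monotonic_durations(weight):
    """Assign every frame to exactly one phone along a monotonic path.

    weight: array of shape (n_phones, n_frames), larger means a better match.
    Returns the number of frames per phone; each phone gets at least one.
    """
    n_phones, n_frames = weight.shape
    assert n_frames >= n_phones, "need at least one frame per phone"
    score = np.full((n_phones, n_frames), -np.inf)
    stay = np.zeros((n_phones, n_frames), dtype=bool)  # True: same phone as previous frame
    score[0, 0] = weight[0, 0]
    for t in range(1, n_frames):
        for i in range(min(t + 1, n_phones)):
            same = score[i, t - 1]
            prev = score[i - 1, t - 1] if i > 0 else -np.inf
            stay[i, t] = same >= prev
            score[i, t] = max(same, prev) + weight[i, t]
    # Backtrack from the last phone at the last frame.
    durations = np.zeros(n_phones, dtype=int)
    i = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if t > 0 and not stay[i, t]:
            i -= 1
    return durations


# Toy curves for 3 phones over 6 frames (hypothetical values).
boundary_left = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.1, 0.6, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.4, 1.0, 1.0],
])
boundary_right = np.array([
    [0.0, 0.0, 0.1, 0.8, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
])
weight = boundary_left - boundary_right  # about 1 inside a phone, about 0 outside
durations = monotonic_durations(weight)
print(durations, durations.sum())
```

Thresholding these toy weights at 0.5 would place frame 2 in both the first and the second phone. The path search instead gives each frame to a single phone, so the durations always sum to the number of frames and the resulting boundaries cannot overlap or go backwards.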

@panxin801

Well, in fact I have run into a similar problem. In my experiment, the alignment is not even monotonic, which means the end time of a word can be earlier than its start time. I think this makes this great work hard to use in real scenarios. Your idea may work, I think. Thank you.

@panxin801

A bad case looks like this:

intervals [18]:
    xmin = 7.36
    xmax = 7.48
    text = "the"
intervals [19]:
    xmin = 7.48
    xmax = 7.26 # watch here
    text = "assassination"
intervals [20]:
    xmin = 7.71
    xmax = 7.93
    text = "of"
intervals [21]:
    xmin = 7.95
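
A quick way to spot such cases automatically is to scan the intervals for ones that end before they start or that begin before the previous one ends. The sketch below assumes the intervals have already been read into (xmin, xmax, text) tuples; the helper name `find_bad_intervals` is hypothetical, and only the three complete intervals from the excerpt above are used.

```python
def find_bad_intervals(intervals):
    """Return (index, text, reason) for intervals that break monotonicity.

    intervals: list of (xmin, xmax, text) tuples in tier order.
    """
    problems = []
    prev_xmax = None
    for index, (xmin, xmax, text) in enumerate(intervals):
        if xmax < xmin:
            problems.append((index, text, "ends before it starts"))
        if prev_xmax is not None and xmin < prev_xmax:
            problems.append((index, text, "starts before the previous interval ends"))
        prev_xmax = xmax
    return problems


# The three complete intervals from the excerpt above.
intervals = [
    (7.36, 7.48, "the"),
    (7.48, 7.26, "assassination"),
    (7.71, 7.93, "of"),
]
problems = find_bad_intervals(intervals)
for index, text, reason in problems:
    print(f"interval {index} ({text!r}): {reason}")
```

On this excerpt only "assassination" is flagged, because its xmax (7.26) is smaller than its xmin (7.48).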

@Liujingxiu23
Author

@panxin801 Maybe you can try weight/attn = boundary_left - boundary_right.
@petronny I also found that, in terms of overall statistics, NeuFA is much better than MFA in my experiment on a Chinese dataset, as the paper showed. But in some cases the phone boundaries deviate very far from the ground truth. Such "very large errors", for example larger than 5 frames, happen more often than with MFA.
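
A lower average error and a higher rate of large errors can coexist, which is why it may help to report both. The sketch below uses made-up toy numbers (not measurements from NeuFA or MFA) and a hypothetical helper `boundary_error_stats`; the 5-frame cutoff follows the comment above.

```python
import numpy as np


def boundary_error_stats(pred_frames, ref_frames, large_error_frames=5):
    """Mean absolute boundary error and the share of errors above a cutoff."""
    errors = np.abs(np.asarray(pred_frames) - np.asarray(ref_frames))
    return {
        "mean_error": float(errors.mean()),
        "large_error_rate": float((errors > large_error_frames).mean()),
    }


# Made-up boundary positions in frames, for illustration only.
ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
aligner_a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 112]  # exact except one outlier
aligner_b = [12, 18, 32, 38, 52, 58, 72, 78, 92, 98]   # always 2 frames off

stats_a = boundary_error_stats(aligner_a, ref)
stats_b = boundary_error_stats(aligner_b, ref)
print("A:", stats_a)
print("B:", stats_b)
```

In this toy case aligner A has the smaller mean error but is the only one with an error above 5 frames, the same pattern as described above.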

@panxin801

@Liujingxiu23 Yeah, I have reached the same conclusion as you: the Chinese results are better than the English ones on average. And thank you for your advice.
