Details on TTS evaluation? #42

Open
Btlmd opened this issue Sep 14, 2024 · 3 comments

Comments

Btlmd commented Sep 14, 2024

Hello! Thanks for your wonderful work. While trying to reproduce your results on the TTS task, I'm wondering if you could provide more details about its evaluation, especially:

  • How many samples, and which ones, are used from the VCTK dataset
  • Which ASR model is used to transcribe the generated speech
  • How the WER is calculated, and what kind of text normalization is applied before the calculation (a sketch of my current approach follows this list)
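
For concreteness, here is how I am computing WER at the moment. This is only a sketch under my own assumptions (Whisper-medium via the transformers ASR pipeline, jiwer for scoring, and a simple lowercase/punctuation-stripping normalization); I don't know whether it matches the paper's setup:

```python
import re

import jiwer
from transformers import pipeline

# Whisper-medium via the transformers ASR pipeline (my assumption about how
# the transcription step is run).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")

def normalize(text: str) -> str:
    # Assumed normalization: lowercase, strip punctuation, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

def tts_wer(generated_wav_path: str, reference_text: str) -> float:
    # Transcribe the generated speech, then score it against the input text.
    hypothesis = asr(generated_wav_path)["text"]
    return jiwer.wer(normalize(reference_text), normalize(hypothesis))
```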

Thanks!

@JunZhan2000 (Collaborator)

Please refer to Appendix C.

jee019 commented Oct 11, 2024

Hi, I have a few questions about zero-shot TTS evaluation using the VCTK dataset.

1. Evaluation Methodology:

We randomly select a 3-second clip from each speaker as the vocal prompt along with a separate text as input.

In the paper, particularly in Appendix C, the evaluation process seems open to interpretation. Could you please provide a more detailed description of how the evaluation was conducted?
Additionally, did you use all audio files from the VCTK dataset for the evaluation, and did you cut off entries longer than 3 seconds? (My current prompt-selection code is sketched at the end of this comment.)

2. Dataset Usage:
I am curious about the specific version of the VCTK dataset used in your study. Did you only utilize audio files from the "mic2" recordings in the VCTK 0.92 version?
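
For context, this is roughly how I am drawing prompts at the moment. It is only a sketch; the soundfile-based loading and the VCTK 0.92 "mic2" file layout in the example path are my own assumptions about the intended setup:

```python
import random

import soundfile as sf

def random_3s_prompt(wav_path: str, seconds: float = 3.0):
    """Return a random `seconds`-long clip (and its sample rate) from one utterance."""
    audio, sr = sf.read(wav_path)
    prompt_len = int(seconds * sr)
    if len(audio) <= prompt_len:
        # Utterance shorter than the prompt length: use it whole.
        return audio, sr
    start = random.randrange(len(audio) - prompt_len)
    return audio[start:start + prompt_len], sr

# Example path (my guess at the VCTK 0.92 / mic2 layout):
# prompt, sr = random_3s_prompt(
#     "VCTK-Corpus-0.92/wav48_silence_trimmed/p225/p225_001_mic2.flac")
```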

jingfanke commented Jan 27, 2025

Hi,

I'm also currently trying to reproduce your results on the TTS task. My reproduction on the VCTK 0.92 dataset yielded a WER of around 27 and a speaker similarity score of approximately 71.

Here’s a summary of my reproduction setup:

  • I randomly selected 2 audio samples from each speaker in the VCTK dataset, ensuring that one of them was longer than 3 seconds; from that one I randomly clipped a 3-second segment to use as the prompt.
  • Following Appendix C of the paper, I used the Whisper-medium model for ASR and the WavLM-TDNN code to compute the similarity between the model-generated audio and the human-spoken audio (a sketch of my similarity computation follows this list).
  • For each speaker, I evaluated TTS performance on these 2 audio files and averaged the scores across all speakers to obtain the final benchmark numbers.
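
For reference, here is a simplified sketch of the similarity computation. In this sketch I use the Hugging Face WavLM x-vector checkpoint (microsoft/wavlm-base-plus-sv) rather than the exact WavLM-TDNN code, so the checkpoint name, the 16 kHz input, and the cosine-similarity scoring are assumptions on my part:

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# Stand-in for the WavLM-TDNN verification model (my assumption): the Hugging
# Face WavLM x-vector checkpoint fine-tuned for speaker verification.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def speaker_similarity(generated_wav: str, reference_wav: str) -> float:
    # Assumes both files are already mono and resampled to 16 kHz
    # (VCTK audio is 48 kHz, so resampling is needed beforehand).
    waves = [sf.read(path)[0] for path in (generated_wav, reference_wav)]
    inputs = extractor(waves, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model(**inputs).embeddings
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```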

Could you please provide more details regarding the evaluation process for the TTS task? Thank you!
