
Questions on training details and enhancement process #19

Open · TianyuCao opened this issue Jun 30, 2024 · 3 comments

TianyuCao commented Jun 30, 2024

Hi,

Thanks for your great work. I have several questions and hope you can clarify them.

For the StoRM model in the paper, what batch size was used (I saw 8 by default in the code)? And for how many epochs was it trained? I also saw the early-stopping setting in the code, but I wonder whether it was trained until the maximum of 1000 epochs or stopped by early stopping after a patience of 50.

Besides, I saw that "For training, sequences of 256 STFT frames (≈2 s) are randomly extracted from the full-length utterances". In that case, at enhancement time, does the model segment the whole input into several 2 s chunks, enhance each chunk, and finally concatenate them into the output? Or does it enhance the whole utterance at once?

Also, I just generated the data based on your code and used the WSJ0+CHiME3 checkpoint to denoise it. However, the pretrained checkpoint gives lower results than those reported in the article. I wonder whether the default parameters in the code are exactly the same as those used to obtain the paper's results, for both data generation (create_data.py) and model training.

Sorry for so many questions, and thanks in advance for your clarifications.

@MichaelChen147

Good question! I also want to know this.
I am a second-year graduate student, also researching speech enhancement based on diffusion models.
Can we have a talk?
Hello

@jmlemercier
Member

Hi @TianyuCao, thank you for your interest!

The training was interrupted by the early-stopping criterion in the code; we ran the experiments as in the current version of the code. I recall early stopping was triggered most of the time after 200-300 epochs. The batch size was 8 per GPU × 4 GPUs, i.e. an effective batch size of 32.
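
In code, that setup roughly corresponds to the following PyTorch Lightning sketch (the framework this repository is built on). The monitored metric name is an illustrative assumption, not the repository's exact key:

```python
# Hedged sketch of the training setup described above, using PyTorch Lightning.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="valid_loss",  # assumed name of the validation metric
    mode="min",
    patience=50,           # stop after 50 epochs without improvement
)

trainer = pl.Trainer(
    max_epochs=1000,       # hard upper bound; early stopping usually fires at ~200-300
    accelerator="gpu",
    devices=4,             # 4 GPUs x per-GPU batch size 8 = effective batch size 32
    strategy="ddp",
    callbacks=[early_stopping],
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule as in the training script
```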
For inference, the sequences are not segmented but fed to the model in their entirety. Since NCSN++ is CNN- and Transformer-based, it can operate on arbitrary lengths (with, of course, a limited effective receptive field). This means that the longer your sequences, the larger the memory burden on your GPU, but it should not affect performance itself.
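
To make the contrast concrete, here is a minimal, illustrative sketch of training-time cropping versus full-length inference; the function name and the `model.enhance` call are hypothetical, not the repository's actual API:

```python
import torch

def random_crop(spec: torch.Tensor, num_frames: int = 256) -> torch.Tensor:
    """Training only: randomly extract 256 STFT frames (~2 s) from a
    spectrogram of shape (..., freq, time)."""
    total = spec.shape[-1]
    if total <= num_frames:
        return spec  # utterance shorter than the crop; use it as-is
    start = torch.randint(0, total - num_frames + 1, (1,)).item()
    return spec[..., start:start + num_frames]

# Inference: no segmentation. The full-length spectrogram goes through
# the network in a single call; only GPU memory grows with length.
# enhanced = model.enhance(full_length_spec)  # hypothetical API
```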

Regarding your reported results: could you please share the figures you obtained and the parameters you used for inference (and also which results in the paper you are comparing against, i.e. which line in which table)?

Thank you

@jmlemercier
Member

> Good question! I also want to know this. I am a second-year graduate student, also researching speech enhancement based on diffusion models. Can we have a talk? Hello

I am unable to have a general research discussion at the moment, but please feel free to ask any question related to this repository (i.e. code issues or unexpected behaviour) by raising an issue.
