
Questions on training details and enhancement process #19

Open · TianyuCao opened this issue Jun 30, 2024 · 3 comments

TianyuCao commented Jun 30, 2024

Hi,

Thanks for your great work. I have several questions and hope you can clarify them.

For the StoRM model in the paper, what batch size was used (I saw 8 by default in the code)? And for how many epochs was it trained? I also saw the early-stopping setting in the code, but I wonder whether it was trained until the maximum of 1000 epochs or stopped by early stopping after a patience of 50.

Besides, I saw that "For training, sequences of 256 STFT frames (≈2 s) are randomly extracted from the full-length utterances". In that case, at enhancement time, does the model segment the whole input into several 2 s chunks, enhance each chunk, and finally concatenate them into the output? Or does it enhance the whole utterance at once?

Also, I just generated the data based on your code and used the WSJ0+CHiME3 checkpoint to denoise it. However, the pretrained checkpoint gives lower results than those reported in the article. I wonder whether the default parameters in the code are exactly the same as those used to obtain the paper's results, for both data generation (create_data.py) and model training.

Sorry for so many questions, and thanks in advance for your clarifications.

@MichaelChen147

Good question! I also want to know this.
I am a second-year graduate student, also researching speech enhancement based on diffusion models.
Can we have a talk?
Hello

@jmlemercier
Member

Hi @TianyuCao, thank you for your interest!

The training was interrupted by the early-stopping criterion in the code; we ran the experiments as in the current version of the code. I recall early stopping was triggered most of the time after 200-300 epochs. The batch size was 8 per GPU × 4 GPUs, i.e. an effective batch size of 32.
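
In code, that setup roughly corresponds to the following PyTorch Lightning sketch (the framework this repository is built on). The monitored metric name is an illustrative assumption, not the repository's exact key:

```python
# Hedged sketch of the training setup described above, using PyTorch Lightning.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="valid_loss",  # assumed name of the validation metric
    mode="min",
    patience=50,           # stop after 50 epochs without improvement
)

trainer = pl.Trainer(
    max_epochs=1000,       # hard upper bound; early stopping usually fires at ~200-300
    accelerator="gpu",
    devices=4,             # 4 GPUs x per-GPU batch size 8 = effective batch size 32
    strategy="ddp",
    callbacks=[early_stopping],
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule as in the training script
```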
For inference, the sequences are not segmented but fed to the model in their entirety. Since NCSN++ is CNN- and Transformer-based, it can operate on arbitrary lengths (with, of course, a limited effective receptive field). This means that the longer your sequences, the larger the memory burden on your GPU, but it should not affect performance itself.
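
To make the contrast concrete, here is a minimal, illustrative sketch of training-time cropping versus full-length inference; the function name and the `model.enhance` call are hypothetical, not the repository's actual API:

```python
import torch

def random_crop(spec: torch.Tensor, num_frames: int = 256) -> torch.Tensor:
    """Training only: randomly extract 256 STFT frames (~2 s) from a
    spectrogram of shape (..., freq, time)."""
    total = spec.shape[-1]
    if total <= num_frames:
        return spec  # utterance shorter than the crop; use it as-is
    start = torch.randint(0, total - num_frames + 1, (1,)).item()
    return spec[..., start:start + num_frames]

# Inference: no segmentation. The full-length spectrogram goes through
# the network in a single call; only GPU memory grows with length.
# enhanced = model.enhance(full_length_spec)  # hypothetical API
```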

Regarding your reported results: could you please share the figures you obtained and the parameters you used for inference (and also which results in the paper you are comparing against, i.e. which line in which table)?

Thank you

@jmlemercier
Member

> Good question! I also want to know this. I am a second-year graduate student, also researching speech enhancement based on diffusion models. Can we have a talk? Hello

I am unable to have a general research discussion at the moment, but please feel free to ask any question related to this repository (i.e. code issues or unexpected behaviour) by raising an issue.
