
About usage #17

Open
MonolithFoundation opened this issue Jan 14, 2025 · 6 comments

Comments

@MonolithFoundation

Hello, I'd like to use this model in a scenario where the audio has already been segmented by VAD, but a segment may contain more than one speaker. Is this model able to handle that?

In more detail, something like:

---
   ---

These are different speakers, but VAD cannot separate them. I need their independent voices.

If it can, is there a simple code snippet I can reference to do it?

@DiLiangWU
Member

You can obtain the predicted diarization results by following these steps:

  • Prepare the wav.scp file, like:

mix001 /path/to/audio/file1.wav
mix002 /path/to/audio/file2.wav
mix003 /path/to/audio/file3.wav

  • Comment out self.segments, self.utt2spk, self.reco2dur, and self.spk2utt in kaldi_data.py.
  • Change datasets.diarization_dataset to datasets.diarization_dataset_predict.
  • Change trainer.test(spk_dia_main) to trainer.predict(spk_dia_main).
  • Run: python train_dia.py --configs conf/xxx_infer.yaml --gpus YOUR_DEVICE_ID, --test_from_folder YOUR_CKPT_SAVE_DIR
  • Generate the speech activity probabilities:

cd visualize
python gen_h5_output.py
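
The wav.scp preparation in the first step can be scripted. A minimal sketch (the `write_wav_scp` helper name is illustrative; the mixNNN IDs match the example above):

```python
import os

def write_wav_scp(wav_dir, scp_path):
    """Write a Kaldi-style wav.scp mapping utterance IDs to WAV paths."""
    # Sort for deterministic utterance numbering.
    wavs = sorted(f for f in os.listdir(wav_dir) if f.endswith(".wav"))
    with open(scp_path, "w") as out:
        for i, name in enumerate(wavs, start=1):
            utt_id = "mix%03d" % i  # e.g. mix001, mix002, ...
            out.write("%s %s\n" % (utt_id, os.path.join(wav_dir, name)))
```
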

Then you can obtain the decision results by applying a threshold, as in the image below, for single-speaker speech extraction.

[Image: speech activity probabilities with decision threshold]
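
The thresholding step can be sketched as below. The (num_frames, num_speakers) probability layout and the frame shift are assumptions about what gen_h5_output.py writes, so adapt them to the actual h5 contents:

```python
def probs_to_segments(probs, threshold=0.5, frame_shift=0.01):
    """Convert per-frame speech activity probabilities to time segments.

    probs: sequence of per-frame probability lists, shape (num_frames, num_speakers).
    Returns a list of (speaker_index, start_sec, end_sec) tuples.
    """
    segments = []
    if not probs:
        return segments
    num_spk = len(probs[0])
    for spk in range(num_spk):
        start = None
        for t, frame in enumerate(probs):
            active = frame[spk] >= threshold
            if active and start is None:
                start = t  # segment opens
            elif not active and start is not None:
                segments.append((spk, start * frame_shift, t * frame_shift))
                start = None  # segment closes
        if start is not None:  # segment still open at the end
            segments.append((spk, start * frame_shift, len(probs) * frame_shift))
    return segments

# Two speakers overlapping in frames 2-3 (frame_shift=1 for readability):
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.9], [0.6, 0.8], [0.1, 0.9]]
# probs_to_segments(probs, frame_shift=1) → [(0, 0, 4), (1, 2, 5)]
```
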

@MonolithFoundation
Author

Hi, what if spk1 and spk2 overlap?

I just want code that takes an audio file in and outputs timestamp results.

@DiLiangWU
Member

DiLiangWU commented Jan 21, 2025

Hi, what if spk1 and spk2 overlap?

Sorry, I misunderstood and thought your input was non-overlapping multi-speaker speech. FS-EEND can naturally handle overlapping speech. An example of the output is shown in the figure below.

[Image: example diarization output with overlapping speakers]

The code for receiving a WAV file and outputting a Rich Transcription Time Marked (RTTM) file has been updated. You can run inference with the command below (first modify val_data_dir in conf/xxx_infer.yaml to point to your own WAV directory):
python train_dia_pred.py --configs conf/xxx_infer.yaml --gpus YOUR_DEVICE_ID, --test_from_folder YOUR_CKPT_SAVE_DIR
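
To turn the RTTM output into timestamps, a minimal parser like the following works. The SPEAKER line layout (onset and duration in fields 4 and 5, speaker name in field 8) is the standard Rich Transcription format; the example labels are made up:

```python
def parse_rttm(lines):
    """Parse RTTM SPEAKER lines into {speaker: [(start_sec, end_sec), ...]}."""
    result = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip comments and non-SPEAKER records
        onset, dur = float(fields[3]), float(fields[4])
        spk = fields[7]
        result.setdefault(spk, []).append((onset, onset + dur))
    return result

rttm = [
    "SPEAKER mix001 1 0.00 1.50 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER mix001 1 1.25 2.00 <NA> <NA> spk2 <NA> <NA>",  # overlaps spk1
]
# parse_rttm(rttm) → {'spk1': [(0.0, 1.5)], 'spk2': [(1.25, 3.25)]}
```

Note that overlapping speakers simply yield overlapping intervals, as in the example above.
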

@MonolithFoundation
Author

I'm wondering if there is a function as simple as possible to do this, for example dia_pred(audio_path), which returns a dict of timestamps.

I looked at the train_dia_pred code; it is way too complicated and coupled with all kinds of training code.

Would you consider making simple inference-only code that users can easily use out of the box?

@DiLiangWU
Member

Sure, I understand your point. Thank you for the suggestion. We will simplify the inference code and update it in the repo.

@MonolithFoundation
Author

MonolithFoundation commented Jan 21, 2025

Thank you so much for the consideration! Hoping for a strong base diarization model with overlap handling that can be used with ease.
