Do something more for the silence mic hallucinating part #20

Closed
Ughuuu opened this issue Dec 4, 2023 · 8 comments

Comments

@Ughuuu
Collaborator

Ughuuu commented Dec 4, 2023

It's not too bad as is. Maybe we just expose the VAD threshold option and experiment with it.

@fire
Member

fire commented Jan 16, 2024

@aiaimimi0920 Is the silence mic hallucination better now?

@aiaimimi0920
Contributor

No. Now, when you are not talking, it automatically adds "Thank you" or "you",
as in this issue: ggerganov/whisper.cpp#1592

Maybe I can use the method mentioned in that issue to solve it.

@fire
Member

fire commented Jan 23, 2024

Is this fixed by the latest changes?

@aiaimimi0920
Contributor

Not resolved.

  1. I just added a blocklist of phrases such as "thanks" and "thank you". But with other languages there is still a high probability of hallucinated text, such as “xx字幕” ("xx subtitles") or “谢谢你” ("thank you").

  2. When I check whether the audio is pure silence in "add buffer", I test whether the energy is below a certain value. If your environment is very quiet, it won't generate hallucinated text.
    But if your environment is somewhat noisy, for example with the sound of a fan, the audio is still judged to be valid, enters the inference stage, and then produces hallucinated text.
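To make the two points above concrete, here is a minimal Python sketch, not the repository's actual code. The names `BLOCKLIST`, `is_blocked`, and `is_silent`, the exact-match rule, and the `0.01` threshold are all assumptions for illustration. It shows an English-only blocklist missing a Chinese hallucination, and a fixed energy gate letting steady fan-like noise through.

```python
import math
import re

# Hypothetical blocklist; the thread only mentions "thanks"-style phrases.
BLOCKLIST = {"thank you", "thanks", "you"}


def rms(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples, threshold=0.01):
    """Fixed energy gate: True when the chunk is quieter than the threshold."""
    return rms(samples) < threshold


def normalize(text):
    """Lowercase and drop punctuation so 'Thank you.' matches 'thank you'."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()


def is_blocked(text):
    """Exact match on the whole transcript, so longer sentences survive."""
    return normalize(text) in BLOCKLIST


# Synthetic stand-ins: a near-silent room and a steady fan-like hum.
quiet_room = [0.001 * math.sin(i / 10) for i in range(1600)]
fan_noise = [0.05 * math.sin(i / 10) for i in range(1600)]

print(is_silent(quiet_room))   # gate drops the chunk before inference
print(is_silent(fan_noise))    # gate lets the chunk through to inference
print(is_blocked("Thank you."))
print(is_blocked("谢谢你"))     # not covered by an English-only blocklist
```

Matching the whole normalized transcript rather than substrings is one way to limit the risk of discarding real speech, at the cost of missing hallucinations with slightly different wording.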

Possible solutions:

  1. Enlarge the blocklist. However, I worry that it would also filter out valid text.
  2. Before inference, check whether the audio contains only environmental noise. If there is no actual human voice, filter it out directly.
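One rough way to approach the second option without a neural model is to compare each chunk against a running estimate of the background level instead of a fixed value. The sketch below is an assumption-laden illustration: `NoiseFloorGate`, `margin`, and `smoothing` are hypothetical names and values, and this is an energy heuristic, not a real speech detector, so loud non-speech sounds would still pass.

```python
import math


def rms(samples):
    """Root-mean-square energy of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class NoiseFloorGate:
    """Accept a chunk only when it is clearly louder than recent background."""

    def __init__(self, margin=3.0, smoothing=0.9, min_floor=1e-4):
        self.margin = margin        # required ratio of chunk level to floor
        self.smoothing = smoothing  # closer to 1.0 means slower adaptation
        self.min_floor = min_floor  # keeps the floor from collapsing to zero
        self.floor = None           # learned from the first chunk

    def accept(self, samples):
        level = rms(samples)
        if self.floor is None:
            # The first chunk only calibrates the background estimate.
            self.floor = max(level, self.min_floor)
            return False
        if level > self.floor * self.margin:
            # Loud chunks are not folded into the background estimate.
            return True
        blended = self.smoothing * self.floor + (1 - self.smoothing) * level
        self.floor = max(blended, self.min_floor)
        return False


# Synthetic stand-ins: a steady hum and a much louder burst.
fan_noise = [0.05 * math.sin(i / 10) for i in range(1600)]
loud_burst = [0.3 * math.sin(i / 10) for i in range(1600)]

gate = NoiseFloorGate()
for name, chunk in [("fan", fan_noise), ("fan", fan_noise),
                    ("burst", loud_burst), ("fan", fan_noise)]:
    print(name, gate.accept(chunk))
```

Because the floor adapts to the room, a constant fan no longer counts as valid audio, which a single fixed threshold cannot achieve across quiet and noisy environments.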

@fire
Member

fire commented Jan 23, 2024

I've seen solutions like adding a voice activity detector such as https://github.com/snakers4/silero-vad, but it seems like more work to use than ggml-whisper.

@aiaimimi0920
Contributor

It is a PyTorch model, so we may need a C++ implementation. Or could we use iree.gd?

@fire
Member

fire commented Jan 23, 2024

I can't estimate how long it will take to combine silero-vad with Whisper, but it should be feasible inside iree.gd.

@Ughuuu
Collaborator Author

Ughuuu commented Feb 22, 2024

Exposed the VAD threshold option. Also made it so that if it's hallucinating, only a maximum of 4 tokens are taken and no more (as they could be legitimate characters).
Anything else should fall outside of this repo (e.g. processing of text with iree.gd). We can open another issue for that after iree.gd is released.
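As a rough illustration of the token cap described above, here is a short Python sketch under stated assumptions. The comment does not say how a segment is judged to be hallucinated, so the `speech_prob` input, the comparison against `vad_threshold`, and the function name `filter_segment` are hypothetical. Only the limit of 4 tokens comes from the thread.

```python
def filter_segment(tokens, speech_prob, vad_threshold=0.6, max_tokens=4):
    """Return the tokens to keep for one transcribed segment.

    speech_prob is an assumed per-segment estimate that speech is present.
    Segments below vad_threshold are treated as suspected hallucinations
    and truncated instead of discarded, since they may hold real text.
    """
    suspected = speech_prob < vad_threshold
    if suspected:
        return list(tokens[:max_tokens])
    return list(tokens)


segment = ["Thank", " you", " for", " watching", "!"]
print(filter_segment(segment, speech_prob=0.2))  # truncated to 4 tokens
print(filter_segment(segment, speech_prob=0.9))  # kept in full
```

Truncating rather than dropping the segment matches the stated reasoning that a few of the tokens could be legitimate.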

@Ughuuu Ughuuu closed this as completed Feb 22, 2024