Investigate WhisperX performance #64

alundgard · 2024-12-09T20:08:23Z

At AI4LAM in June we discussed Whisper's tendency to hallucinate during silence and non-speech sound. University of Oslo noted they used WhisperX to fix this issue in their Autotekst application.
The WhisperX documentation suggests it combines Voice Activity Detection (VAD) with a phoneme model.
To test this, we'd like to run a few of our hallucination-prone test items through WhisperX and observe the output.
- https://argo-qa.stanford.edu/view/druid:jz734cm7143 (tt618qz3245_sl): hallucinations during intro/outro music and silence
- https://argo-qa.stanford.edu/view/druid:dg444xm0599 (fb204cb6192_sl): hallucinations during instrumental music; also contains sung lyrics
- https://argo-qa.stanford.edu/view/druid:fk250vc8974 (LLSF_7_20121209): has the "BF-WATCH TV" hallucination in English
- https://argo-qa.stanford.edu/view/druid:bz245jm8076 (bw689yg2740_sl): has a couple of stretches of hallucination over difficult-to-hear audio
- https://argo-qa.stanford.edu/view/druid:mc135dt6327 (bs744dg5568_sl): has the "Amara.org" hallucination in primarily non-English (Japanese in this case) audio
Question: should we try WhisperX with its out-of-the-box parameters, or reuse our Whisper pilot parameters, in particular model_size=large and condition_on_previous_text=False?
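
For reference, a minimal sketch of what the pilot settings might look like through the openai-whisper Python API. This assumes the pilot ran openai-whisper directly; the constant and function names are illustrative, and `model_size=large` is taken to mean the model name passed to `whisper.load_model`. Whether WhisperX accepts the same option in the same way is not covered here.

```python
# Options carried over from the Whisper pilot, as named in this issue.
# PILOT_MODEL and PILOT_OPTIONS are illustrative names, not project settings.
PILOT_MODEL = "large"
PILOT_OPTIONS = {"condition_on_previous_text": False}


def transcribe_with_pilot_options(audio_path, model_name=PILOT_MODEL):
    """Transcribe one file with openai-whisper using the pilot options."""
    # Imported lazily so the constants above can be inspected without
    # openai-whisper (and its model weights) being installed.
    import whisper

    model = whisper.load_model(model_name)
    # With condition_on_previous_text=False, each window is decoded without
    # the previous window's text as a prompt, which makes the model less
    # prone to getting stuck repeating the same phrase.
    return model.transcribe(audio_path, **PILOT_OPTIONS)
```

Calling `transcribe_with_pilot_options("tt618qz3245_sl.mp4")` (file name and extension assumed) returns a dict whose `"segments"` list holds the timed text.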


dnoneill · 2025-01-08T23:08:11Z

alundgard · 2025-01-15T19:43:10Z

Comparing SDR Whisper to default WhisperX small

Druid: jz734cm7143, File: tt618qz3245_sl

SDR Whisper output: Hallucinations during music/noise at 00:00, 17:40, 23:22, 24:23 — the "BF-WATCH TV 2021" hallucination.
Default WhisperX output: No apparent hallucinations at the above times.

Druid: dg444xm0599, File: fb204cb6192_sl

SDR Whisper output: Repeated "Thank you" hallucinations during the first 11 minutes of instrumental music. Semi-accurate transcription of sung lyrics ("Daisy, daisy...") and spoken numbers ("One million nine hundred..."). Repeated "I don't know" hallucinations during the last 7 minutes of instrumental music.
Default WhisperX output: Mostly-accurate "Music playing" caption during the first 11 minutes of instrumental music. Mostly-accurate transcription of sung lyrics ("Daisy, daisy...") and spoken numbers ("One million nine hundred..."). Repeated "So, you" hallucinations during the last 7 minutes of instrumental music.

Druid: mc135dt6327, File: bs744dg5568_sl

SDR Whisper output: Japanese-language output, which we are unable to read. (Was Japanese specified as the transcription language? Answer: Yes.) There appears to be repeated hallucination during the last 8 minutes of silence (starting around 52:00).
Default WhisperX output: English-language translation; unable to evaluate accuracy. (Was English specified as the translation language? Answer: No. No language was specified, and there is no speech in the first 30 seconds for WhisperX's language auto-detection to work from.) No hallucination during the last 8 minutes of silence (the last VTT caption is at 51:52).
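
The checks above were done by reading the output. As a rough aid, the sketch below parses a VTT file, flags runs of consecutive identical captions (the pattern seen with "Thank you" and "I don't know"), and reports when the last caption ends. The function names and the `min_repeats` threshold are assumptions chosen for illustration, and a repeated caption is only a hint of hallucination, not proof.

```python
import re

# Matches a WebVTT cue timing line; the hours field is optional.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) cues."""
    cues = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_TIMING.match(lines[i].strip())
        i += 1
        if not match:
            continue  # header, blank line, or cue identifier
        groups = match.groups()
        start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
        body = []
        # The cue text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            body.append(lines[i].strip())
            i += 1
        cues.append((start, end, " ".join(body)))
    return cues


def repeated_runs(cues, min_repeats=3):
    """Find runs of at least min_repeats consecutive cues with the same text.

    Returns (run_start_seconds, run_end_seconds, text, count) tuples.
    """
    runs = []
    run_start = 0
    for i in range(1, len(cues) + 1):
        run_ended = i == len(cues) or cues[i][2].lower() != cues[run_start][2].lower()
        if run_ended:
            if i - run_start >= min_repeats:
                runs.append((cues[run_start][0], cues[i - 1][1], cues[run_start][2], i - run_start))
            run_start = i
    return runs


def last_caption_end(cues):
    """Return the end time in seconds of the latest cue, or None if there are none."""
    return max((end for _, end, _ in cues), default=None)
```

Running `repeated_runs(parse_vtt(open(path).read()))` on each tool's output for the same file gives a quick side-by-side of where repetition occurs, and `last_caption_end` shows whether captions continue into trailing silence.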

alundgard · 2025-01-17T14:31:02Z

Comparing default Whisper small to default WhisperX small

Druid: jz734cm7143, File: tt618qz3245_sl

Default Whisper small output: Fewer hallucinations during music compared to SDR Whisper. Still some hallucinations during music/noise at 23:58. Note: No "BF-WATCH TV 2021" hallucinations.
Default WhisperX output: No apparent hallucinations at the above times.

Druid: dg444xm0599, File: fb204cb6192_sl

Default Whisper small output: Very significant hallucinations throughout ("I don't know what I'm doing" and "I'm sorry"). Almost unusable. Does not capture any of the sung lyrics ("Daisy, daisy ..."). Captures some of the spoken numbers ("One million nine hundred...").
Default WhisperX output: Mostly-accurate "Music playing" caption during the first 11 minutes of instrumental music. Mostly-accurate transcription of sung lyrics ("Daisy, daisy...") and spoken numbers ("One million nine hundred..."). Repeated "So, you" hallucinations during the last 7 minutes of instrumental music.

Druid: mc135dt6327, File: bs744dg5568_sl

Default Whisper small output: Very significant hallucinations throughout ("..." and "Please subscribe to my channel"). Hallucination during the last 8 minutes of silence. Unable to evaluate translation accuracy.
Default WhisperX output: English-language translation; unable to evaluate accuracy. (Was English specified as the translation language? Answer: No. No language was specified, and there is no speech in the first 30 seconds for WhisperX's language auto-detection to work from.) No hallucination during the last 8 minutes of silence (the last VTT caption is at 51:52).
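
To spot-check the same moments across tools (for example 00:00, 17:40, 23:22, and 24:23 for tt618qz3245_sl), a small helper like the one below could list the captions each output has near a given timestamp. It is a sketch only: it assumes cues are already available as `(start_seconds, end_seconds, text)` tuples, and the 30-second `tolerance` is an arbitrary illustrative default.

```python
def to_seconds(stamp):
    """Convert an 'MM:SS' or 'HH:MM:SS' timestamp to whole seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def cues_near(cues, stamps, tolerance=30):
    """Map each timestamp to the cues overlapping [t - tolerance, t + tolerance].

    cues is a list of (start_seconds, end_seconds, text) tuples.
    """
    hits = {}
    for stamp in stamps:
        t = to_seconds(stamp)
        window_start, window_end = t - tolerance, t + tolerance
        hits[stamp] = [
            cue for cue in cues
            if cue[0] <= window_end and cue[1] >= window_start
        ]
    return hits
```

An empty list for a timestamp means that output has no caption there, which for music or silence is the desired result.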

dnoneill self-assigned this Jan 8, 2025