
Decreasing Speed and Delayed Confirmation in Stream Transcription Over Time #198

Open
gavin1818 opened this issue Aug 27, 2024 · 7 comments


@gavin1818

I’ve been using WhisperKit for real-time stream transcription in a project, and I’ve noticed that as time progresses, particularly after 20-30 minutes of continuous use, the transcription speed begins to decrease noticeably. Additionally, the transcript seems to remain unconfirmed for an extended period. During this time, the same text is repeated for a long duration within the unconfirmed segment, which results in the latest transcript not being transferred in a timely manner. This causes a significant gap between the audio and the corresponding transcription.

I’m aware that this issue might be challenging to resolve quickly, but I’m curious about the potential causes. Could this be related to Model-Level Issues, Decoder-Level Issues or others?

I would appreciate any insights into which areas might be the most likely cause of the issue. If there are specific parts of the code or certain tools I should use to investigate these potential causes further, I’d be grateful for the guidance.

Thanks

@atiorh
Contributor

atiorh commented Aug 28, 2024

@gavin1818 Thanks for the report! Are you able to share the input file that reproduces this?

@ZachNagengast
Contributor

@gavin1818 Could you also provide a little info on the model, device, and OS you're running on? There could be a number of different potential issues depending on that.

@vojto

vojto commented Sep 4, 2024

The realtime demo seems to be running Whisper over the full recorded file, and using seeking to transcribe the most recent bit.

Would that be slowing things down? Would it make sense to cut it off, accepting that we would lose some of the context?

@ZachNagengast
Contributor

For realtime, it looks for a specific number of confirmations before declaring that some span of audio has been transcribed without any cutoff in the middle of words. That is currently set to 2 segments, which does end up being slower, because we still need to fix the timestamp filter to get smaller segments. You can configure it to use only 1 segment for confirmation, which might help. We also have a task to drop audio from the buffer after transcription is completed, but that wouldn't impact performance.
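The confirmation idea above can be sketched independently of WhisperKit: the trailing N segments of each hypothesis are held back as unconfirmed (they may still change as more audio arrives), and everything before them is emitted as stable. This is a minimal Python illustration, not WhisperKit's actual implementation; the parameter name `required_for_confirmation` is hypothetical.

```python
def confirm_segments(hypothesis, required_for_confirmation=2):
    """Split a list of transcribed segments into (confirmed, pending).

    The trailing `required_for_confirmation` segments may still change on
    the next decoding pass, so they remain unconfirmed; everything earlier
    is treated as stable and can be emitted to the user immediately.
    """
    if len(hypothesis) <= required_for_confirmation:
        return [], list(hypothesis)
    cut = len(hypothesis) - required_for_confirmation
    return hypothesis[:cut], hypothesis[cut:]

# Lowering the threshold from 2 to 1 confirms text one segment sooner,
# at the cost of a higher chance that a confirmed segment was wrong.
confirmed, pending = confirm_segments(["seg1", "seg2", "seg3"], 2)
```

With a threshold of 2, only `seg1` is confirmed here; with a threshold of 1, both `seg1` and `seg2` would be, which is why lowering the setting reduces the lag described in this issue.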

@ppcfan

ppcfan commented Oct 19, 2024

> For realtime it is looking for a specific amount of confirmations to say that it effectively transcribed some set of audio without any cutoff in the middle of words. That is set to 2 segments currently which does end up being slower because we need to fix the timestamp filter to get smaller segments. You can configure it to only use 1 segment confirmation which might help. We also have a task to drop audio after transcribe is completed, but that wouldn't impact performance.

@ZachNagengast Do you mean that the current word-level timestamp might be inaccurate? Because if it were accurate, then cutting the buffer based on the end time of a word shouldn’t result in splitting a word in the middle, right?

@ZachNagengast
Contributor

Word-level timestamps have to be accurate for realtime (eager) mode to work, but cutting the buffer changes the start time of the audio. Since we are using seeking (aka clipTimestamps) to decode the next segment of realtime audio as it comes in from the microphone, the seek value would also have to be reset whenever we drop audio from the buffer, so that clipTimestamps still seeks to the point in the buffer that has yet to be transcribed. None of that is super complicated to build into the system, so we're keeping all these kinds of things in mind for when we take eager mode out of the experimental stage.
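The trim-and-rebase step described above can be sketched as follows. This is a conceptual Python illustration, not WhisperKit code: when audio up to the last confirmed timestamp is dropped, the seek offset (which is expressed relative to the start of the buffer) must be shifted back by the same amount, or subsequent clip-based decoding would point at the wrong samples.

```python
def trim_buffer(samples, seek, confirmed_end, sample_rate=16000):
    """Drop audio up to the last confirmed timestamp and rebase the seek.

    `samples` is the raw audio buffer; `seek` and `confirmed_end` are in
    seconds relative to the current start of the buffer. Returns the
    trimmed buffer and the rebased seek time, so decoding resumes at the
    first untranscribed sample.
    """
    cut = int(confirmed_end * sample_rate)
    trimmed = samples[cut:]
    # Timestamps are now relative to confirmed_end, so shift the seek
    # value back by the same amount (never below zero).
    new_seek = max(0.0, seek - confirmed_end)
    return trimmed, new_seek

# Example: a 2-second buffer at 16 kHz, with 1.0 s confirmed and the
# decoder previously seeked to 1.5 s.
buffer = list(range(32000))
trimmed, new_seek = trim_buffer(buffer, seek=1.5, confirmed_end=1.0)
```

Here the trimmed buffer is 1 second long and the seek is rebased to 0.5 s, so decoding continues exactly where it left off. Without the rebase, the old 1.5 s seek would skip past audio that has not been transcribed yet, which is the consistency issue the comment above is pointing at.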

@ppcfan

ppcfan commented Oct 19, 2024

> Word-level timestamps have to be accurate for realtime (eager) mode to work, but cutting the buffer changes the start time of the audio, and since we are using seeking (aka clipTimestamps) to decode the next segment of real time audio as it comes in from the microphone, the seek value would also have to be reset if we're dropping audio from the buffer, in order for the clipTimestamps to correctly seek to the point in the buffer that has yet to be transcribed. None of that is super complicated to build into the system, so we're keeping all these kinds of things in mind for when we take eager mode out of the experimental stage.

Thank you for your reply
