
Decreasing Speed and Delayed Confirmation in Stream Transcription Over Time #198

Open
gavin1818 opened this issue Aug 27, 2024 · 7 comments


@gavin1818

I’ve been using WhisperKit for real-time stream transcription in a project, and I’ve noticed that as time progresses, particularly after 20-30 minutes of continuous use, the transcription speed begins to decrease noticeably. Additionally, the transcript seems to remain unconfirmed for an extended period. During this time, the same text is repeated for a long duration within the unconfirmed segment, which results in the latest transcript not being transferred in a timely manner. This causes a significant gap between the audio and the corresponding transcription.

I’m aware that this issue might be challenging to resolve quickly, but I’m curious about the potential causes. Could this be related to Model-Level Issues, Decoder-Level Issues or others?

I would appreciate any insights into which areas might be the most likely cause of the issue. If there are specific parts of the code or certain tools I should use to investigate these potential causes further, I’d be grateful for the guidance.

Thanks

@atiorh
Contributor

atiorh commented Aug 28, 2024

@gavin1818 Thanks for the report! Are you able to share the input file that reproduces this?

@ZachNagengast
Contributor

@gavin1818 Could you also provide a little info on the model, device, and OS you're running on? There could be a number of different potential issues depending on that.

@vojto

vojto commented Sep 4, 2024

The realtime demo seems to be running Whisper over the full recorded file, and using seeking to transcribe the most recent bit.

Would that be slowing things down? Would it make sense to cut it off, accepting that we would lose some of the context?

@ZachNagengast
Contributor

For realtime, it looks for a specific number of confirmations before declaring that some span of audio has been transcribed without any cutoff in the middle of words. That is currently set to 2 segments, which does end up being slower, because we still need to fix the timestamp filter to get smaller segments. You can configure it to use only 1 segment for confirmation, which might help. We also have a task to drop audio from the buffer after transcription is completed, but that wouldn't impact performance.
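The confirmation idea above can be sketched independently of WhisperKit: the trailing N segments of each hypothesis are held back as unconfirmed (they may still change as more audio arrives), and everything before them is emitted as stable. This is a minimal Python illustration, not WhisperKit's actual implementation; the parameter name `required_for_confirmation` is hypothetical.

```python
def confirm_segments(hypothesis, required_for_confirmation=2):
    """Split a list of transcribed segments into (confirmed, pending).

    The trailing `required_for_confirmation` segments may still change on
    the next decoding pass, so they remain unconfirmed; everything earlier
    is treated as stable and can be emitted to the user immediately.
    """
    if len(hypothesis) <= required_for_confirmation:
        return [], list(hypothesis)
    cut = len(hypothesis) - required_for_confirmation
    return hypothesis[:cut], hypothesis[cut:]

# Lowering the threshold from 2 to 1 confirms text one segment sooner,
# at the cost of a higher chance that a confirmed segment was wrong.
confirmed, pending = confirm_segments(["seg1", "seg2", "seg3"], 2)
```

With a threshold of 2, only `seg1` is confirmed here; with a threshold of 1, both `seg1` and `seg2` would be, which is why lowering the setting reduces the lag described in this issue.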

@ppcfan

ppcfan commented Oct 19, 2024

> For realtime it is looking for a specific amount of confirmations to say that it effectively transcribed some set of audio without any cutoff in the middle of words. That is set to 2 segments currently which does end up being slower because we need to fix the timestamp filter to get smaller segments. You can configure it to only use 1 segment confirmation which might help. We also have a task to drop audio after transcribe is completed, but that wouldn't impact performance.

@ZachNagengast Do you mean that the current word-level timestamp might be inaccurate? Because if it were accurate, then cutting the buffer based on the end time of a word shouldn’t result in splitting a word in the middle, right?

@ZachNagengast
Contributor

Word-level timestamps have to be accurate for realtime (eager) mode to work, but cutting the buffer changes the start time of the audio. Since we are using seeking (aka clipTimestamps) to decode the next segment of realtime audio as it comes in from the microphone, the seek value would also have to be reset whenever we drop audio from the buffer, so that clipTimestamps still seeks to the point in the buffer that has yet to be transcribed. None of that is super complicated to build into the system, so we're keeping all these kinds of things in mind for when we take eager mode out of the experimental stage.
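The trim-and-rebase step described above can be sketched as follows. This is a conceptual Python illustration, not WhisperKit code: when audio up to the last confirmed timestamp is dropped, the seek offset (which is expressed relative to the start of the buffer) must be shifted back by the same amount, or subsequent clip-based decoding would point at the wrong samples.

```python
def trim_buffer(samples, seek, confirmed_end, sample_rate=16000):
    """Drop audio up to the last confirmed timestamp and rebase the seek.

    `samples` is the raw audio buffer; `seek` and `confirmed_end` are in
    seconds relative to the current start of the buffer. Returns the
    trimmed buffer and the rebased seek time, so decoding resumes at the
    first untranscribed sample.
    """
    cut = int(confirmed_end * sample_rate)
    trimmed = samples[cut:]
    # Timestamps are now relative to confirmed_end, so shift the seek
    # value back by the same amount (never below zero).
    new_seek = max(0.0, seek - confirmed_end)
    return trimmed, new_seek

# Example: a 2-second buffer at 16 kHz, with 1.0 s confirmed and the
# decoder previously seeked to 1.5 s.
buffer = list(range(32000))
trimmed, new_seek = trim_buffer(buffer, seek=1.5, confirmed_end=1.0)
```

Here the trimmed buffer is 1 second long and the seek is rebased to 0.5 s, so decoding continues exactly where it left off. Without the rebase, the old 1.5 s seek would skip past audio that has not been transcribed yet, which is the consistency issue the comment above is pointing at.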

@ppcfan

ppcfan commented Oct 19, 2024

> Word-level timestamps have to be accurate for realtime (eager) mode to work, but cutting the buffer changes the start time of the audio, and since we are using seeking (aka clipTimestamps) to decode the next segment of real time audio as it comes in from the microphone, the seek value would also have to be reset if we're dropping audio from the buffer, in order for the clipTimestamps to correctly seek to the point in the buffer that has yet to be transcribed. None of that is super complicated to build into the system, so we're keeping all these kinds of things in mind for when we take eager mode out of the experimental stage.

Thank you for your reply
