Whisper streaming - Buffer audio & incorrect finals #49
Comments
I'm not sure I understand that part: you are talking about a "transcription crash," whereas from your error it seems to be a timeout. When I tried to reproduce by sending data and then pausing, I got a timeout too. Maybe you are saying that we should not have a timeout while connected to the websocket? Or maybe I'm missing something.
The way we consume streaming assumes that the audio stream may be processed using a client-side VAD. During periods of silence, no data is transmitted over the established websocket connection. However, as long as the socket connection remains active, the ASR system must gracefully wait for incoming data without timing out (which, in this situation, is effectively a crash).

Note: I have combined the two issues (the buffer/flush mechanism and the broken partials/finals logic) into a single implementation recommendation. A potential solution could involve using a circular buffer to handle audio data effectively, addressing both problems. Functionally, pauses in data transmission could be used to signal the "end of active speech," triggering a "final" transcription. Additionally, implementing a custom VAD over the buffered audio could further smooth the handling of partials and finals.
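The circular-buffer idea above could be sketched as follows. This is an illustrative example only, not the project's actual code; `AudioRingBuffer`, `push`, and `flush` are hypothetical names, and a real implementation would buffer raw PCM frames from the websocket.

```python
# Hypothetical sketch of a circular (ring) buffer for streamed audio.
# Old frames are evicted automatically once capacity is reached; `flush`
# would be called when a pause signals the end of active speech.
from collections import deque

class AudioRingBuffer:
    """Fixed-capacity ring buffer for raw audio frames."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)  # oldest frames drop automatically

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)

    def flush(self) -> bytes:
        """Return the buffered audio and clear the buffer (e.g., on silence)."""
        data = b"".join(self.frames)
        self.frames.clear()
        return data

buf = AudioRingBuffer(max_frames=4)
for chunk in (b"ab", b"cd", b"ef", b"gh", b"ij"):  # 5 pushes into capacity 4
    buf.push(chunk)
assert buf.flush() == b"cdefghij"  # the oldest frame (b"ab") was evicted
```

Using `deque(maxlen=...)` keeps the sketch simple; a production buffer would more likely be a preallocated byte array sized in samples, but the flush-on-pause behavior is the same.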
Signed-off-by: AudranBert <[email protected]>
Whisper Streaming - Buffer Audio & Incorrect Finals
Description
While testing streaming functionality using Whisper, we have identified two principal issues:
1. Missing Buffer/Flush Mechanism
When audio is filtered locally via VAD (Voice Activity Detection) and sent through the WebSocket, the implementation appears to lack a proper buffer/flush mechanism. If the audio stream is interrupted (e.g., silence, or a pause in incoming WebSocket audio), the transcription crashes.
Error Logs:
Below is a sample traceback illustrating the crash caused by this issue:
Expected Behavior:
Implement a buffer/flush mechanism that ensures smooth continuation or graceful handling of audio stream interruptions. As long as the websocket remains connected, the ASR should wait for incoming audio packets instead of timing out.
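The expected server-side behavior could be sketched like this: wait indefinitely for audio while the connection is open, and treat only an actual disconnect as the end of the stream. This is a minimal sketch, not the project's implementation; `recv` is a stand-in for the websocket receive call, and a `None` return models a clean disconnect.

```python
# Sketch: consume audio without a read timeout, so client-side VAD silence
# (no data sent) is not treated as an error. Names are illustrative.
import asyncio

async def consume_audio(recv, on_chunk):
    """Keep awaiting chunks; a None return models the socket closing."""
    while True:
        chunk = await recv()   # no timeout: silence is not an error
        if chunk is None:      # connection actually closed
            break
        on_chunk(chunk)

# Demo with an in-memory queue standing in for the websocket connection.
async def _demo():
    q: asyncio.Queue = asyncio.Queue()
    received = []
    await q.put(b"speech")
    await q.put(None)          # simulate the client disconnecting
    await consume_audio(q.get, received.append)
    return received

assert asyncio.run(_demo()) == [b"speech"]
```

If an idle timeout is still desired for resource cleanup, it would belong at the connection level (dropping truly dead sockets), not in the transcription loop.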
2. Broken "Partials" and "Finals" Logic
The current implementation of the "partials/finals" transcription logic is problematic. It appears to emit a "final" for every validated token, instead of grouping and correcting the transcription until a pause or silence is detected.
Current Behavior:
The system generates "final" outputs prematurely. Example:
Expected Behavior:
- Transcriptions should remain in the "partial" state during active speech.
- Upon detecting a silence (pause), the finalized and corrected transcription should be flushed as a "final."
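The silence-triggered flush described above could be sketched as a small state machine. This is an illustrative example, not the project's code; `transcribe_stream` and `is_silence` are hypothetical names, and `is_silence` stands in for a real VAD decision per chunk.

```python
# Sketch: emit "partial" events while speech is active, and one corrected
# "final" per utterance, flushed only when a silence gap is detected.
def transcribe_stream(chunks, is_silence):
    """Yield (kind, text) events from a stream of decoded text chunks."""
    hypothesis = []
    for chunk in chunks:
        if is_silence(chunk):
            if hypothesis:
                yield ("final", " ".join(hypothesis))  # flush on pause
                hypothesis = []
        else:
            hypothesis.append(chunk)                   # refine current utterance
            yield ("partial", " ".join(hypothesis))
    if hypothesis:                                     # end of stream flush
        yield ("final", " ".join(hypothesis))

events = list(transcribe_stream(
    ["hello", "world", "", "again"], is_silence=lambda c: c == ""))
# Partials during speech, then exactly one final per pause-delimited utterance.
```

The key property is that a "final" is produced once per utterance (on silence), rather than once per validated token.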
Desired output: