
Whisper streaming - Buffer audio & incorrect finals #49

Open
damienlaine opened this issue Nov 22, 2024 · 3 comments · Fixed by #59

Comments


damienlaine commented Nov 22, 2024

Whisper Streaming - Buffer Audio & Incorrect Finals

Description

While testing streaming functionality using Whisper, we have identified two principal issues:


1. Missing Buffer/Flush Mechanism

When audio is filtered locally via VAD (Voice Activity Detection) and sent over a WebSocket, the implementation appears to lack a proper buffer/flush mechanism. If the audio stream is interrupted (e.g., by silence or a pause in incoming WebSocket audio), the transcription crashes.

Error Logs:

Below is a sample traceback illustrating the crash caused by this issue:

[2024-11-22 16:07:43,234 __stt__] INFO: Received config: {'sample_rate': 16000}
[2024-11-22 16:07:43,234 __stt__] INFO: Using ctranslate2 for decoding
[2024-11-22 16:07:43,234 __stt__] INFO: Starting transcription ...
[2024-11-22 16:08:43,962 websockets.server] ERROR: connection handler failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/protocol.py", line 963, in transfer_data
    message = await self.read_message()
  File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/protocol.py", line 1033, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/protocol.py", line 1108, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/protocol.py", line 1165, in read_frame
    frame = await Frame.read(
  File "/usr/local/lib/python3.8/dist-packages/websockets/legacy/framing.py", line 98, in read
    data = await reader(length)
  File "/usr/lib/python3.8/asyncio/streams.py", line 723, in readexactly
    await self._wait_for_data('readexactly')
  File "/usr/lib/python3.8/asyncio/streams.py", line 517, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

Expected Behavior:

Implement a buffer/flush mechanism that ensures smooth continuation or graceful handling of audio stream interruptions. As long as the websocket remains connected, the ASR should await incoming audio packets without timing out.
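The expected "wait without timing out" behaviour can be sketched with a plain asyncio queue standing in for the websocket feed. This is only an illustration, not the actual server code: `asr_consumer`, `demo`, and the chunk sizes are all invented names/values.

```python
import asyncio

async def asr_consumer(queue: asyncio.Queue, results: list):
    """Hypothetical ASR loop: blocks on the queue with no timeout, so a
    long client-side VAD silence never crashes the transcription task."""
    while True:
        chunk = await queue.get()       # waits indefinitely, no timeout
        if chunk is None:               # sentinel: the socket actually closed
            break
        results.append(len(chunk))      # stand-in for feeding the decoder

async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    consumer = asyncio.create_task(asr_consumer(queue, results))
    await queue.put(b"\x00" * 3200)     # 100 ms of 16 kHz 16-bit audio
    await asyncio.sleep(0.2)            # simulated VAD silence: no data sent
    await queue.put(b"\x00" * 3200)     # speech resumes after the pause
    await queue.put(None)               # client disconnects
    await consumer
    return results

print(asyncio.run(demo()))              # [3200, 3200]
```

The point is that the consumer survives the 200 ms gap untouched; only the `None` sentinel (an actual disconnect) ends the loop.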

2. Broken "Partials" and "Finals" Logic

The current implementation of the "partials/finals" transcription logic is problematic. It produces a "final" for every validated token instead of grouping and correcting the transcription until a pause or silence is detected.

Current Behavior:

The system generates "final" outputs prematurely. Example:

[screenshot: premature "final" outputs emitted per token]

Expected Behavior:

Transcriptions should remain in the "partial" state during active speech.
Upon detecting a silence (pause), the finalized and corrected transcription should flush to a "final."

Desired output:

[screenshot: desired partial/final output]
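The expected behaviour described above can be sketched as a small event loop over decoder/VAD frames. Everything here is illustrative: `stream_transcripts`, the `(text_so_far, is_silent)` frame shape, and the `SILENCE_FRAMES` threshold are invented for the sketch, not taken from the codebase.

```python
# Assumed threshold: this many consecutive silent frames end a speech segment.
SILENCE_FRAMES = 3

def stream_transcripts(frames):
    """frames: iterable of (text_so_far, is_silent) pairs.
    Emits ("partial", text) during active speech and a single corrected
    ("final", text) only once a silence is detected."""
    events, silent_run, current = [], 0, ""
    for text, is_silent in frames:
        if is_silent:
            silent_run += 1
            if silent_run >= SILENCE_FRAMES and current:
                events.append(("final", current))   # flush corrected segment
                current = ""
        else:
            silent_run = 0
            current = text              # decoder may rewrite earlier words
            events.append(("partial", current))
    if current:
        events.append(("final", current))           # flush at end of stream
    return events
```

For example, two speech frames `("hello", …)` and `("hello world", …)` followed by three silent frames would yield two partials and then one final, instead of a final per token.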
@damienlaine damienlaine added enhancement New feature or request 🪲BUG labels Nov 22, 2024
@damienlaine
Member Author

@Jeronymous @AudranBert

@damienlaine damienlaine moved this to Considering in LinTO Roadmap Nov 24, 2024

AudranBert commented Dec 5, 2024

1. Missing Buffer/Flush Mechanism

I'm not sure I understand that part: you describe a "transcription crash", but from your error it looks like a timeout. When I tried to reproduce it by sending data and then pausing, I hit a timeout too. Are you saying we should not time out while still connected to the websocket? Or maybe I'm missing something.


damienlaine commented Dec 6, 2024

> 1. Missing Buffer/Flush Mechanism
>
> I'm not sure I understand that part: you describe a "transcription crash", but from your error it looks like a timeout. When I tried to reproduce it by sending data and then pausing, I hit a timeout too. Are you saying we should not time out while still connected to the websocket? Or maybe I'm missing something.

The way we consume streaming assumes that the audio stream may be filtered by a client-side VAD. During periods of silence, no data is transmitted over the established websocket connection. However, as long as the socket connection remains active, the ASR system must wait gracefully for incoming data without timing out (a timeout amounts to a crash in this situation).

Note: I have combined the two issues (buffer/flush mechanism and broken partials/finals logic) into a single implementation recommendation. A potential solution could use a circular buffer to handle the audio data, addressing both problems. Functionally, pauses in data transmission could signal the "end of active speech" and trigger a "final" transcription. Additionally, running a custom VAD on the buffered audio could further smooth the handling of partials and finals.
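As a rough illustration of the circular-buffer idea above (not the actual implementation: `AudioRingBuffer`, its method names, and the thresholds are all invented; `collections.deque` with `maxlen` stands in for the ring):

```python
from collections import deque

class AudioRingBuffer:
    """Illustrative circular buffer: keeps the most recent audio chunks for
    the decoder and reports transmission pauses that could be used to
    trigger a "final" transcription."""

    def __init__(self, max_chunks: int = 100, pause_chunks: int = 5):
        self.chunks = deque(maxlen=max_chunks)  # oldest audio falls off the end
        self.pause_chunks = pause_chunks        # assumed pause threshold
        self.missed = 0                         # frame periods without data

    def push(self, chunk: bytes):
        """Called whenever a websocket audio packet arrives."""
        self.chunks.append(chunk)
        self.missed = 0

    def tick(self) -> bool:
        """Called once per frame period even when no data arrives.
        Returns True when the pause is long enough to end active speech."""
        self.missed += 1
        return self.missed >= self.pause_chunks

    def flush(self) -> bytes:
        """Drain the buffered audio for a final decoding pass."""
        audio = b"".join(self.chunks)
        self.chunks.clear()
        return audio
```

A driver loop would `push` while packets arrive, call `tick` on a timer during silence, and `flush` the segment for the corrected "final" once `tick` reports a pause.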

AudranBert added a commit that referenced this issue Dec 9, 2024
@AudranBert AudranBert self-assigned this Jan 6, 2025
@AudranBert AudranBert linked a pull request Jan 8, 2025 that will close this issue
@AudranBert AudranBert moved this from Investigating to In preview / Experimental in LinTO Roadmap Feb 3, 2025