Whisper streaming - Buffer audio & incorrect finals #49
Comments
I'm not sure I understand that part: you are talking about a "transcription crash," whereas from your error it seems to be a timeout. When I tried to reproduce by sending data and then pausing, I got a timeout too. Maybe you are saying that we should not have a timeout while connected to the websocket? Or maybe I'm missing something.
The way we consume streaming assumes that the audio stream may be processed using a client-side VAD. During periods of silence, no data is transmitted over the established websocket connection. However, as long as the socket connection remains active, the ASR system must gracefully wait for incoming data without timing out (which, in this situation, is effectively a crash).

Note: I have combined the two issues (the buffer/flush mechanism and the broken partials/finals logic) into a single implementation recommendation. A potential solution could involve using a circular buffer to handle audio data effectively, addressing both problems. Functionally, pauses in data transmission could be used to signal the "end of active speech," triggering a "final" transcription. Additionally, implementing a custom VAD over the buffered audio could further smooth the handling of partials and finals.
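The circular-buffer idea above could be sketched as follows. This is an illustrative example only, not the project's actual code; `AudioRingBuffer`, `push`, and `flush` are hypothetical names, and a real implementation would buffer raw PCM frames from the websocket.

```python
# Hypothetical sketch of a circular (ring) buffer for streamed audio.
# Old frames are evicted automatically once capacity is reached; `flush`
# would be called when a pause signals the end of active speech.
from collections import deque

class AudioRingBuffer:
    """Fixed-capacity ring buffer for raw audio frames."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)  # oldest frames drop automatically

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)

    def flush(self) -> bytes:
        """Return the buffered audio and clear the buffer (e.g., on silence)."""
        data = b"".join(self.frames)
        self.frames.clear()
        return data

buf = AudioRingBuffer(max_frames=4)
for chunk in (b"ab", b"cd", b"ef", b"gh", b"ij"):  # 5 pushes into capacity 4
    buf.push(chunk)
assert buf.flush() == b"cdefghij"  # the oldest frame (b"ab") was evicted
```

Using `deque(maxlen=...)` keeps the sketch simple; a production buffer would more likely be a preallocated byte array sized in samples, but the flush-on-pause behavior is the same.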
Signed-off-by: AudranBert <[email protected]>
Whisper Streaming - Buffer Audio & Incorrect Finals
Description
While testing streaming functionality using Whisper, we have identified two principal issues:
1. Missing Buffer/Flush Mechanism
When audio is filtered locally via VAD (Voice Activity Detection) and sent through the WebSocket, the implementation appears to lack a proper buffer/flush mechanism. If the audio stream is interrupted (e.g., silence, or a pause in incoming WebSocket audio), the transcription crashes.
Error Logs:
Below is a sample traceback illustrating the crash caused by this issue:
Expected Behavior:
Implement a buffer/flush mechanism that ensures smooth continuation or graceful handling of audio stream interruptions. As long as the websocket remains connected, the ASR should wait for incoming audio packets instead of timing out.
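The expected server-side behavior could be sketched like this: wait indefinitely for audio while the connection is open, and treat only an actual disconnect as the end of the stream. This is a minimal sketch, not the project's implementation; `recv` is a stand-in for the websocket receive call, and a `None` return models a clean disconnect.

```python
# Sketch: consume audio without a read timeout, so client-side VAD silence
# (no data sent) is not treated as an error. Names are illustrative.
import asyncio

async def consume_audio(recv, on_chunk):
    """Keep awaiting chunks; a None return models the socket closing."""
    while True:
        chunk = await recv()   # no timeout: silence is not an error
        if chunk is None:      # connection actually closed
            break
        on_chunk(chunk)

# Demo with an in-memory queue standing in for the websocket connection.
async def _demo():
    q: asyncio.Queue = asyncio.Queue()
    received = []
    await q.put(b"speech")
    await q.put(None)          # simulate the client disconnecting
    await consume_audio(q.get, received.append)
    return received

assert asyncio.run(_demo()) == [b"speech"]
```

If an idle timeout is still desired for resource cleanup, it would belong at the connection level (dropping truly dead sockets), not in the transcription loop.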
2. Broken "Partials" and "Finals" Logic
The current implementation of the "partials/finals" transcription logic is problematic. It appears to emit a "final" for every validated token, instead of grouping and correcting the transcription until a pause or silence is detected.
Current Behavior:
The system generates "final" outputs prematurely. Example:
Expected Behavior:
- Transcriptions should remain in the "partial" state during active speech.
- Upon detecting a silence (pause), the finalized and corrected transcription should be flushed as a "final."
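The silence-triggered flush described above could be sketched as a small state machine. This is an illustrative example, not the project's code; `transcribe_stream` and `is_silence` are hypothetical names, and `is_silence` stands in for a real VAD decision per chunk.

```python
# Sketch: emit "partial" events while speech is active, and one corrected
# "final" per utterance, flushed only when a silence gap is detected.
def transcribe_stream(chunks, is_silence):
    """Yield (kind, text) events from a stream of decoded text chunks."""
    hypothesis = []
    for chunk in chunks:
        if is_silence(chunk):
            if hypothesis:
                yield ("final", " ".join(hypothesis))  # flush on pause
                hypothesis = []
        else:
            hypothesis.append(chunk)                   # refine current utterance
            yield ("partial", " ".join(hypothesis))
    if hypothesis:                                     # end of stream flush
        yield ("final", " ".join(hypothesis))

events = list(transcribe_stream(
    ["hello", "world", "", "again"], is_silence=lambda c: c == ""))
# Partials during speech, then exactly one final per pause-delimited utterance.
```

The key property is that a "final" is produced once per utterance (on silence), rather than once per validated token.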
Desired output: