This directory contains a service that receives audio data over a websocket and sends the transcription result, produced by the CoquiSTT speech-to-text engine, back to the client. The service can also receive RTP packets and extract their payload (transcription of the payload is a work in progress). The websocket server code in this project is a modified version of an existing GitHub project.
Server configuration is specified in the `application.conf` file.
- Clone the repository with `git clone`
- Download and install FFmpeg
- Download the acoustic model and language model files for CoquiSTT and place them in the cloned repository
- Create a venv using `python -m venv venv`
- Activate the venv using `venv\Scripts\activate` (Windows) or `source venv/bin/activate` (Linux)
- Run `pip install -r requirements.txt`
- Run `python -m coqui_server.app`
A sample client script is provided, which can be run by executing the following:

```
coqui_server\client.py 2830-3980-0043.wav
```

`2830-3980-0043.wav` can be replaced with the path to the audio file to be transcribed.
The websocket client-server request-response process looks like the following (a minimal client sketch is shown after this list):
- Client opens websocket W to server
- Client sends binary audio data via W
- Server responds with the transcribed text via W once the transcription process is complete; the response is in JSON format
- Server closes W
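
For illustration, the exchange above can be driven from Python roughly as sketched below. This is a minimal sketch rather than the bundled client: the server URI and port and the `text` field of the JSON response are assumptions, so check `application.conf` and `coqui_server/client.py` for the actual values.

```python
# Minimal sketch of the websocket exchange, using the third-party "websockets"
# package. The URI/port and the JSON field name "text" are assumptions.
import asyncio
import json
import sys

import websockets


async def transcribe(path: str, uri: str = "ws://localhost:8080") -> str:
    with open(path, "rb") as f:
        audio = f.read()                        # raw audio bytes to transcribe
    async with websockets.connect(uri) as ws:   # client opens websocket W
        await ws.send(audio)                    # client sends binary audio via W
        reply = await ws.recv()                 # server replies once transcription is done
    return json.loads(reply).get("text", "")    # assumed field name in the JSON response


if __name__ == "__main__":
    print(asyncio.run(transcribe(sys.argv[1])))
```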
The time t taken by the transcription process depends on several factors, such as the duration of the audio and how busy the service is. Under normal circumstances, t is roughly equal to the duration of the provided audio.
Because this service uses websockets, it is currently not possible to interact with it using HTTP clients that do not support websockets, such as `curl`.
The server can also accept RTP packets. Upon receiving an RTP packet, the server decodes it to obtain the payload. The payload is then passed to `webrtcvad`, and the voiced audio frames are sent to CoquiSTT for transcription. The transcription functionality is still a work in progress.
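
To make this pipeline concrete, below is a rough sketch of the payload-extraction and VAD step. The frame size, sample rate, and the assumption that the RTP payload is already 16-bit linear PCM are illustrative only (the real service may first decode a codec such as G.711); the header layout follows RFC 3550 and the `webrtcvad` calls use its public API.

```python
# Rough sketch: strip the RTP header, then keep only frames that webrtcvad
# classifies as speech. Frame length, sample rate and the PCM assumption are
# illustrative, not necessarily what the real server uses.
import struct

import webrtcvad

SAMPLE_RATE = 8000   # assumed; webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 20        # webrtcvad accepts only 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

vad = webrtcvad.Vad(2)   # aggressiveness 0 (least) to 3 (most)


def rtp_payload(packet: bytes) -> bytes:
    """Strip the RTP header (RFC 3550) and return the payload bytes."""
    first = packet[0]
    csrc_count = first & 0x0F
    has_extension = bool(first & 0x10)
    offset = 12 + 4 * csrc_count                  # fixed header + CSRC identifiers
    if has_extension:
        # extension header: 16-bit profile, 16-bit length in 32-bit words
        ext_words = struct.unpack_from("!H", packet, offset + 2)[0]
        offset += 4 + 4 * ext_words
    return packet[offset:]


def voiced_frames(pcm: bytes):
    """Yield only the frames that webrtcvad classifies as speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame   # in the real service these frames would go to CoquiSTT
```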