LinTO-STT-Kaldi is an API for Automatic Speech Recognition (ASR) based on models trained with Kaldi.
LinTO-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.
It can be used for offline or real-time transcription.
To run the transcription models you'll need:
- At least 7GB of disk space to build the docker image.
- Up to 7GB of RAM depending on the model used.
- One CPU per worker. Inference time scales with CPU performance.
If not done already, download and unzip the model folders into a directory accessible from the docker container.
LinTO-STT-Kaldi accepts two kinds of ASR models:
- LinTO Acoustic and Language models.
- Vosk models (all in one).
We provide home-curated models (v2) on dl.linto.ai. Alternatively, you can use the Vosk models available here.
If you want text with uppercase letters and punctuation, you can specify a recasepunc model. Some recasepunc models trained on Common Crawl are available on recasepunc for the following languages:
- French
- English
- Italian
- Chinese
The transcription service requires Docker to be up and running.
In task mode, the only entry point to the STT service is tasks posted on a message broker. Supported message brokers are RabbitMQ, Redis, and Amazon SQS. In addition, to prevent large audio files from transiting through the message broker, the STT-Worker uses a shared storage folder (SHARED_FOLDER).
1- The first step is to build or pull the image:
```bash
git clone https://github.com/linto-ai/linto-stt.git
cd linto-stt
docker build . -f kaldi/Dockerfile -t linto-stt-kaldi:latest
```
or
```bash
docker pull lintoai/linto-stt-kaldi
```
2- Download the models
Have the acoustic and language models ready at AM_PATH and LM_PATH if you are using LinTO models. If you are using a Vosk model, have it ready at MODEL_PATH.
3- Fill the .env file
An example of .env file is provided in kaldi/.envdefault.
PARAMETER | DESCRIPTION | EXAMPLE |
---|---|---|
SERVICE_MODE | STT serving mode (see Serving mode) | http \| task \| websocket |
MODEL_TYPE | Type of STT model used | lin \| vosk |
ENABLE_STREAMING | In http serving mode, enables the /streaming websocket route | true \| false |
SERVICE_NAME | In task mode, the queue's name for task processing | my-stt |
SERVICE_BROKER | In task mode, the URL of the message broker | redis://my-broker:6379 |
BROKER_PASS | In task mode, the broker password | my-password |
STREAMING_PORT | In websocket mode, the listening port for incoming WS connections | 80 |
CONCURRENCY | Maximum number of parallel requests | >1 |
PUNCTUATION_MODEL | Path to a recasepunc model, to recover punctuation and uppercase letters in streaming | /opt/PUNCT |
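For reference, a minimal .env for the http mode with a LinTO model could look like this (values are illustrative; kaldi/.envdefault remains the authoritative template):

```bash
SERVICE_MODE=http
MODEL_TYPE=lin
ENABLE_STREAMING=false
CONCURRENCY=2
```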
STT can be used in three ways:
- Through an HTTP API using the http mode.
- Through a message broker using the task mode.
- Through a websocket server using the websocket mode.
The mode is specified using the .env value or environment variable SERVICE_MODE.

```bash
SERVICE_MODE=http
```
The HTTP serving mode deploys an HTTP server and a swagger-ui to allow transcription requests on a dedicated route.
The SERVICE_MODE value in the .env file should be set to http.
```bash
docker run --rm \
-p HOST_SERVING_PORT:80 \
-v AM_PATH:/opt/AM \
-v LM_PATH:/opt/LM \
--env-file .env \
linto-stt-kaldi:latest
```
If you have a recasepunc model to recover punctuation marks, you can add the following options:
```bash
-v <</path/to/recasepunc/model/folder>>:/opt/PUNCT \
--env PUNCTUATION_MODEL=/opt/PUNCT
```
This will run a container providing an HTTP API bound to the HOST_SERVING_PORT port on the host.
Parameters:
Variables | Description | Example |
---|---|---|
HOST_SERVING_PORT | Host serving port | 80 |
AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 |
LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 |
MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model |
The TASK serving mode connects a Celery worker to a message broker.
The SERVICE_MODE value in the .env file should be set to task.
You need a message broker up and running at MY_SERVICE_BROKER.
```bash
docker run --rm \
-v AM_PATH:/opt/AM \
-v LM_PATH:/opt/LM \
-v SHARED_AUDIO_FOLDER:/opt/audio \
--env-file .env \
linto-stt-kaldi:latest
```
Parameters:
Variables | Description | Example |
---|---|---|
AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 |
LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 |
MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model |
SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/shared/folder |
The websocket serving mode deploys a streaming transcription service only.
The SERVICE_MODE value in the .env file should be set to websocket.
Usage is the same as the /streaming route of the HTTP API.
Returns the state of the API.
- Method: GET
- Response: "1" if the healthcheck passes.
Transcription API
- Method: POST
- Response content: text/plain or application/json
- File: A WAV file, 16-bit, 16kHz
Returns the transcribed text using "text/plain", or a JSON object when using "application/json", structured as follows:
```json
{
  "text" : "This is the transcription",
  "words" : [
    {"word": "This", "start": 0.123, "end": 0.453, "conf": 0.9},
    ...
  ],
  "confidence-score": 0.879
}
```
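For example, a request can be sent from Python with the requests library; the host, port, and file name below are placeholders:

```python
import requests

# Placeholder URL; replace with your HOST_SERVING_PORT binding.
url = "http://localhost:8080/transcribe"

with open("audio.wav", "rb") as f:  # a 16-bit 16kHz WAV file
    response = requests.post(
        url,
        headers={"accept": "application/json"},
        files={"file": ("audio.wav", f, "audio/x-wav")},
    )

result = response.json()
print(result["text"])              # full transcription
for w in result.get("words", []):  # per-word timestamps and confidence
    print(w["word"], w["start"], w["end"], w["conf"])
```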
The /streaming route is accessible if the ENABLE_STREAMING environment variable is set to true.
The route accepts websocket connections. Exchanges are structured as follows:
1- Client sends a JSON message {"config": {"sample_rate": 16000}}.
2- Client sends an audio chunk (go to 3-) or {"eof": 1} (go to 5-).
3- Server sends either a partial result {"partial": "this is a "} or a final result {"text": "this is a transcription"}.
4- Back to 2-.
5- Server sends a final result and closes the connection.
The connection will be closed and the worker freed if no chunk is received for 10 seconds.
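A minimal streaming client following this exchange, using the Python websockets package, might look like this; the URI, port, and chunk size are assumptions to adapt to your deployment:

```python
import asyncio
import json
import wave

import websockets


async def stream(path, uri="ws://localhost:80/streaming"):
    async with websockets.connect(uri) as ws:
        # 1- Send the configuration message.
        await ws.send(json.dumps({"config": {"sample_rate": 16000}}))
        with wave.open(path, "rb") as f:  # expects 16kHz 16-bit mono audio
            while chunk := f.readframes(4000):
                # 2- Send an audio chunk, 3- read a partial or final result.
                await ws.send(chunk)
                print(json.loads(await ws.recv()))
        # 2- Send end-of-stream, 5- read the final result.
        await ws.send(json.dumps({"eof": 1}))
        print(json.loads(await ws.recv()))


asyncio.run(stream("audio.wav"))
```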
The /docs route offers an OpenAPI/Swagger interface.
STT-Worker accepts requests with the following arguments:
`file_path: str, with_metadata: bool`
- file_path: The location of the file within the shared folder: /.../SHARED_FOLDER/{file_path}
- with_metadata: If True, word timestamps and confidence scores are computed and returned. If False, those fields are left empty.
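As an illustration, a task could be submitted with a Celery client along these lines; the task name transcribe_task, the broker URLs, and the queue name are assumptions to check against your deployment (the queue must match SERVICE_NAME):

```python
from celery import Celery

# Broker/backend URLs are illustrative; match your SERVICE_BROKER setting.
client = Celery(broker="redis://my-broker:6379/0", backend="redis://my-broker:6379/1")

# "transcribe_task" is an assumed task name; the audio file must already
# be present in the SHARED_FOLDER before the task is posted.
result = client.send_task(
    "transcribe_task",
    args=["myfile.wav", True],  # file_path (relative to shared folder), with_metadata
    queue="my-stt",             # must match SERVICE_NAME
)
print(result.get(timeout=300))  # JSON object as described below
```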
On a successful transcription, the returned object is a JSON object structured as follows:
```json
{
  "text" : "this is the transcription as text",
  "words": [
    {
      "word" : "this",
      "start": 0.0,
      "end": 0.124,
      "conf": 1.0
    },
    ...
  ],
  "confidence-score": 1.0
}
```
- The text field contains the raw transcription.
- The words field contains each word with its timestamps and individual confidence. (Empty if with_metadata=False)
- The confidence-score field contains the overall confidence for the transcription. (0.0 if with_metadata=False)
You can test your HTTP API using curl:
```bash
curl -X POST "http://YOUR_SERVICE:YOUR_PORT/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@YOUR_FILE;type=audio/x-wav"
```
This project is developed under the AGPLv3 License (see LICENSE).