Releases: Wordcab/wordcab-transcribe
v0.5.3
This release introduces an engine system to swap Whisper engines between `faster-whisper` and `TensorRT-LLM`.
API
- MIT License!
- Added the ability to swap the Whisper "engine" from the default faster-whisper to TensorRT-LLM, which is much faster. #285
- Added support for distil models like `distil-large-v2` and `distil-large-v3`. These work with the TensorRT-LLM engine.
- Added a `batch_size` parameter for the endpoints. It doesn't do anything yet, but the TensorRT-LLM engine supports batch processing of files, and the idea is to add this feature along with dynamic batching.
- Overall tighter control over dependencies, and various dependency updates.
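The engine swap described above can be sketched as a small registry that maps an engine name to a backend class. This is a hypothetical illustration; the class and function names (`get_engine`, `FasterWhisperEngine`, `TensorRTLLMEngine`) are not the project's actual API.

```python
# Hypothetical engine registry sketch; real names in wordcab-transcribe differ.

class FasterWhisperEngine:
    name = "faster-whisper"

class TensorRTLLMEngine:
    name = "tensorrt-llm"

_ENGINES = {
    "faster-whisper": FasterWhisperEngine,
    "tensorrt-llm": TensorRTLLMEngine,
}

def get_engine(engine_name: str):
    """Return the engine class registered under `engine_name`."""
    try:
        return _ENGINES[engine_name]
    except KeyError:
        raise ValueError(
            f"Unknown engine {engine_name!r}; expected one of {sorted(_ENGINES)}"
        ) from None
```

Registering engines by name keeps the swap a one-line configuration change rather than a code change.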
Diarization
- Started work implementing Nvidia NeMo's new long-form diarization class. Currently it still consumes too much memory.
Documentation
Thanks to contributor @aleksandr-smechov, to the NeMo team for their work, and to the WhisperS2T project for the initial code for the TensorRT-LLM backend (and, by extension, TensorRT-LLM's Whisper example).
v0.5.2
This release introduces several features to allow remote execution and single-service deployment.
API
- Added the possibility to choose `RemoteExecution` or `LocalExecution` for transcription and diarization services #258
- Implemented single-service deployment with the new `only_transcription` and `only_diarization` asr types #261
- Added new endpoints to manage remote execution servers #263
- Allow the user to auto-switch between local and remote execution for all services #266
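The auto-switch between local and remote execution can be sketched as a small dispatcher that prefers a configured remote server and otherwise falls back to the local machine. The names here (`pick_execution`, `gpu_index`, `url`) are illustrative assumptions, not the project's actual settings.

```python
# Hypothetical local/remote execution dispatch sketch; the project's actual
# service classes and configuration fields may differ.
from dataclasses import dataclass

@dataclass
class LocalExecution:
    index: int  # e.g. a local GPU index

@dataclass
class RemoteExecution:
    url: str  # a remote transcription or diarization server

def pick_execution(remote_urls=None, gpu_index: int = 0):
    """Prefer a remote server when one is configured, else run locally."""
    if remote_urls:
        return RemoteExecution(url=remote_urls[0])
    return LocalExecution(index=gpu_index)
```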
Diarization
- Adjusted VAD speech padding and updated the diarization logic #271
Bug Fixes
Documentation
- Added new documentation to the project via `mkdocs-material` and GitHub pages #269
Thanks to contributors @aleksandr-smechov @chainyo
v0.5.1
v0.5.0
This release is a significant change from `poetry` to `hatch`, with many improvements to CI, tests, local development, and dependency handling.
API
- Added a warmup for inference #201
- Added `repetition_penalty` parameter #207
- Added `num_speakers` parameter #195
- Improved the `time_and_tell` function #213
- Updated the API schemas #188
- Added transcription parameters for control #213
Transcription
- Added `bfloat16` to compute types #209
Diarization
- Added empty audio catch during diarization #223 #225
- Reimplemented the entire diarization module to skip NeMo module installation #186 #202
CI
- Added concurrency on CI tests #191
Contributors:
@aleksandr-smechov @chainyo
v0.4.0
This release includes a lot of improvements and a new License, starting with v0.4.0 of wordcab-transcribe (inspired by the HFOIL).
The new License WTLv0.1
The new License prevents anyone from using this project, from v0.4.0 (included) onward, to sell a self-hosted version of this software without an agreement from Wordcab.
But you can still use the project for research, personal use, or even as a backend tool for your projects.
API
- Fixed `CortexResponse` for Svix size limit #101
- Made `alignment` non-critical if the process fails #105
- Added multi-GPU support for transcription, alignment, and diarization #114
- Added the `audio_duration` (in seconds) in the API response #127
- Added a catch for invalid or empty audio file #128
- Added a log about the number of detected and used GPUs at launch #138
- Updated pydantic to v2 #157
- Added an audio file global download queue #168
- Added the new WTL v0.1 License #177 #183 #184
Transcription
- Added the `vocab` feature #124
- Added an `internal_vad` parameter that helps with empty utterances #142 #173
- Added a new fallback for empty segments during transcription #149
- Added the `float32` compute type for the transcription model #157
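The "fallback for empty segments" idea above can be sketched as a filter that drops segments whose text is blank after stripping, so downstream steps never see empty utterances. The function name and segment shape are illustrative assumptions, not the project's internals.

```python
# Hypothetical empty-segment filter sketch; segment fields are assumed to be
# dicts with a "text" key, which may not match the project's data model.

def drop_empty_segments(segments):
    """Keep only segments that contain actual text after stripping whitespace."""
    return [seg for seg in segments if seg.get("text", "").strip()]
```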
Diarization
- Decomposed the diarization process into sub-modules and optimized diarization inference #180
Alignment
- Added new `cs`, `in`, `sl`, and `th` alignment models #164
Post-processing
Instructions
- Improvement of the contributions instructions #131
Deploy
- Update error payload for Svix in cortex endpoint #118
- Docker image updated to `cuda:11.7.1` #133
- Update Svix payload in cortex endpoint #144
- Add a configuration file using Nginx for custom deploy #146
Need improvements / Not fully working
- Added the possibility to use extra transcription models for specific languages #110
Contributors:
@chainyo @aleksandr-smechov @jissagn
v0.3.1
TL;DR: Transcription is now on steroids: 2x faster than the current faster-whisper implementation.
API
- Add `time_and_tell` decorator on specific functions to time individual processes when `debug=True` #77
- Add a `LoggingMiddleware` when `debug=True` #77
- Add a fallback for `dual_channel` if the audio file is not stereo #87
Transcription
- Add quality metrics for the batch process and fallback if the quality is under defined thresholds #89
- Implement `word_timestamps` for the batch process #91
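The quality-metric fallback mentioned above can be sketched as a threshold check: if any batched segment scores below a confidence bar, re-run on the slower sequential path. The threshold value and the `avg_logprob` field name are illustrative assumptions, not the project's actual metrics.

```python
# Hypothetical batch-quality check sketch; the real thresholds and metric
# names in wordcab-transcribe may differ.

AVG_LOGPROB_THRESHOLD = -1.0  # illustrative cutoff, not the project's value

def needs_fallback(batch_results, threshold=AVG_LOGPROB_THRESHOLD):
    """Return True when any batched segment scores below the threshold."""
    return any(r["avg_logprob"] < threshold for r in batch_results)
```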
Post-processing
- Fix timestamps format during the post-processing step #86
v0.3.0
Documentation
- Improve `.env` readability for an easier API configuration #52
- Add README instructions for profiling container #72
API
- Add authentication when the API is not in debug mode #56
- Fix the audio file endpoint inputs #59
- All submitted files are converted into `.wav` 16kHz for consistency #60
- Reworked and more coherent Request/Response models for the API endpoints #60
- Streamline the post-process functions (with or without alignment/diarization) #63
- Simplify timestamps conversion in outputs #63
- Fix blocking non-async functions #67
- Huge API rework for handling concurrent requests better #71
- Fix Exception/Error returns through the API -> raised errors should be more transparent for the user #72
- VAD now uses the ONNX and faster-whisper implementation #72
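The 16 kHz `.wav` conversion mentioned above is typically done with ffmpeg. Below is a minimal sketch that only builds the command line (the `-y`, `-i`, and `-ar` flags are standard ffmpeg options); the project's actual implementation may use different flags or a library.

```python
# Minimal sketch of an ffmpeg command for 16 kHz WAV conversion; assumes
# ffmpeg is installed. The helper name is illustrative, not the project's.
def ffmpeg_to_wav_cmd(src: str, dst: str):
    """Build an ffmpeg command that writes a 16 kHz PCM WAV file."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output if it exists
        "-i", src,       # input file (any container/codec ffmpeg supports)
        "-ar", "16000",  # resample to 16 kHz
        dst,
    ]

# e.g. subprocess.run(ffmpeg_to_wav_cmd("in.mp3", "out.wav"), check=True)
```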
AI models
- Add `alignment` (from `whisperX`) as a new possible step #51
- Fix alignment for `fr`, `de`, `es`, and `it` models #59
- Add dual_channel transcription process for stereo audio file #60
- Add the choice to use `diarization` or not #63
- Implement Batch request process for transcription #72
Deploy
- Docker is aligned with the local setup now #55
- Improve Dockerfile and commands to use cache for models #55
Contributors:
@aleksandr-smechov @chainyo
v0.2.0
- Replace diarization with NVIDIA NeMo asr toolkit
- Update config.py and add validators for necessary config settings
- Update the Docker image with the latest from NVIDIA
- Fix dependencies and versions
- Fix the Python version to 3.9 locally and on Docker
- New available timestamps format: `ms`. Now the user can choose between `hms`, `s` (default) and `ms`.
- Remove unused `num_speakers` parameter.
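The three timestamp formats above can be illustrated with a small conversion helper; the function name is an assumption for this sketch, not the project's API.

```python
# Hypothetical sketch of the `hms` / `s` / `ms` timestamp formats.

def format_timestamp(seconds: float, fmt: str = "s"):
    """Render a timestamp in `hms`, `s` (default), or `ms`."""
    if fmt == "s":
        return seconds
    if fmt == "ms":
        return int(seconds * 1000)
    if fmt == "hms":
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}"
    raise ValueError(f"Unknown timestamp format: {fmt!r}")
```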