Releases: OpenNMT/CTranslate2
CTranslate2 3.9.1
Fixes and improvements
- Fix missing alignments in the `Whisper.align` result due to a bug in the DTW implementation
- Fix error when converting a Whisper model from a path
CTranslate2 3.9.0
New features
- Support BLOOM language models
- Add method `Whisper.align` to return the text/audio alignment and implement word-level timestamps
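A minimal sketch of the new alignment method. The model directory `whisper-tiny-ct2` is a hypothetical converted model, and `mel`, `text_tokens`, and `num_frames` are assumed to come from the usual Whisper preprocessing (log-Mel spectrogram and tokenized transcription); the exact argument names follow the 3.9-era Python API as documented.

```python
import ctranslate2

# Hypothetical inputs: "whisper-tiny-ct2" is a converted Whisper model
# directory; `mel` is a ctranslate2.StorageView holding the log-Mel
# spectrogram; `text_tokens` is the tokenized transcription; `num_frames`
# is the number of audio frames before padding.
model = ctranslate2.models.Whisper("whisper-tiny-ct2")

results = model.align(
    mel,
    start_sequence=[50258, 50259, 50359],  # e.g. <|startoftranscript|><|en|><|transcribe|>
    text_tokens=[text_tokens],
    num_frames=[num_frames],
)

# Each result exposes token-level text/audio alignments that can be
# post-processed into word-level timestamps.
print(results[0].alignments)
```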
Fixes and improvements
- Do not force `intra_threads` to 1 when loading a model on the GPU, as some ops may still run on the CPU
- Disable multithreading when copying a batch of small arrays
CTranslate2 3.8.0
New features
- Experimental support of AVX512 in manually vectorized functions: this code path is not enabled by default but can be enabled by setting the environment variable `CT2_FORCE_CPU_ISA=AVX512`
- Add Transformers converter option `copy_files` to copy any files from the Hugging Face model to the converted model directory
- Expose some Whisper parameters:
  - `max_initial_timestamp_index`
  - `suppress_blank`
  - `suppress_tokens`
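The two opt-in features above can be sketched from the command line. The model name and file names below are illustrative; `--copy_files` takes the names of files to carry over from the Hugging Face repository.

```shell
# Convert a Hugging Face model and copy its tokenizer file into the
# output directory so the converted model is self-contained:
ct2-transformers-converter --model openai/whisper-tiny \
    --output_dir whisper-tiny-ct2 \
    --copy_files tokenizer.json

# Opt in to the experimental AVX512 code path at run time
# (run_inference.py is a hypothetical script):
CT2_FORCE_CPU_ISA=AVX512 python run_inference.py
```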
Fixes and improvements
- Reduce conversion time for large models by skipping some weights comparisons
- Reduce maximum memory usage when converting Transformers models with `--quantization float16`
- Set FP32 compute type for FP16 convolutions to match the PyTorch behavior and accuracy
- Update oneDNN to 3.0.1
CTranslate2 3.7.0
Changes
- Rename the "float" compute type to "float32" for clarity. "float" is still accepted for backward compatibility.
New features
- Add the environment variable `CT2_CUDA_TRUE_FP16_GEMM`. This flag is enabled by default so that FP16 GEMMs run in full FP16. When disabled, the compute type of FP16 GEMMs is set to FP32, which is the default behavior of PyTorch and TensorFlow.
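A sketch of toggling the flag, assuming the usual CTranslate2 convention that a boolean environment variable is disabled with the value `0` (the script name is hypothetical):

```shell
# Disable true-FP16 GEMMs to reproduce the PyTorch/TensorFlow default
# (FP16 storage with FP32 accumulation), trading some speed for precision:
CT2_CUDA_TRUE_FP16_GEMM=0 python run_inference.py
```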
Fixes and improvements
- Improve the numerical precision of Whisper models running in FP16 by setting the FP32 compute type for GEMMs (same behavior as PyTorch)
- Improve support for running the Whisper models with INT16 quantization
- Ensure the Whisper decoding does not continue past `max_length`, which could previously happen when the prompt was longer than `max_length / 2`
- Include the EOS score in the score returned by Whisper during greedy search
CTranslate2 3.6.0
New features
- Build the Windows Python wheels with cuDNN to enable GPU execution of Whisper models
- Add the model attribute `Whisper.is_multilingual`
Fixes and improvements
- Reduce the beam search memory usage by not duplicating the decoder states that are the same in each beam (e.g. the projected memory keys and values)
- Optimize the dot product attention during beam search by moving the query beam dimension to the time dimension
- Fix support of English-only Whisper models
- Include the prefix tokens (if they exist) in the output of `Whisper.generate`
- Log a warning when the model weights are implicitly converted to another type
CTranslate2 3.5.1
Fixes and improvements
- Whisper: fix an incorrect timestamp rule that prevented timestamps from being generated in pairs
- Whisper: ignore the EOS token when applying the length penalty to match the original implementation
CTranslate2 3.5.0
New features
- Add a patience factor for beam search to continue decoding until `beam_size * patience` hypotheses are finished, as described in Kasai et al. 2022
- Implement all GELU variants and select them accordingly when converting models:
  - Tanh approximation (already implemented)
  - Sigmoid approximation
  - Reference implementation based on the CDF
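The patience factor plugs into the existing decoding options. A minimal sketch, assuming a converted translation model in a hypothetical directory `ende-ct2` and SentencePiece-style input tokens:

```python
import ctranslate2

# "ende-ct2" is a hypothetical converted model directory.
translator = ctranslate2.Translator("ende-ct2")

results = translator.translate_batch(
    [["▁Hello", "▁world", "!"]],
    beam_size=4,
    patience=2,  # keep decoding until beam_size * patience = 8 hypotheses are finished
)
print(results[0].hypotheses[0])
```

With the default `patience=1`, decoding stops as soon as `beam_size` hypotheses are finished; larger values explore more candidates at extra cost.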
Fixes and improvements
- Fix incorrect outputs of T5 models due to a bug in the CUDA kernel of the RMS normalization
- Raise an error if the Whisper input shape is incorrect
- Optimize the transposition operator used in the multi-head attention when running on GPU
- Remove the upper limit in `python_requires` to facilitate the package installation with tools like Poetry and PDM
CTranslate2 3.4.0
Fixes and improvements
- Fix incorrect vocabulary in M2M100 models after conversion with `transformers>=4.24`
- Fix incorrect model outputs when executing with very large batch sizes on GPU
- Fix memory error in biased decoding: the vector of divergence was read and updated past its length
- Allow setting `prefix_bias_beta` > 0 with `beam_size` == 1
- Prevent timestamps from decreasing during Whisper generation
- Make some error messages more helpful when implementing a custom converter
CTranslate2 3.3.0
New features
- Support T5 models, including the variants T5v1.1 and mT5
- Support loading the model files from memory:
  - Python: see the `files` argument in the constructor of classes loading models
  - C++: see the `models::ModelMemoryReader` class
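A sketch of in-memory loading on the Python side, useful when model files live in an archive or object store rather than on disk. The model directory name is hypothetical, and the `files` argument is assumed to accept a mapping from file name to a binary file-like object:

```python
import os

import ctranslate2

# "ende-ct2" is a hypothetical converted model directory; here we read its
# files into file objects up front instead of letting the library open them.
model_dir = "ende-ct2"
files = {
    name: open(os.path.join(model_dir, name), "rb")
    for name in os.listdir(model_dir)
}

# The model is then constructed from the in-memory file objects.
translator = ctranslate2.Translator("ende-ct2", files=files)
```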
Fixes and improvements
- Improve the quantization accuracy of OPT models by applying the SmoothQuant technique during conversion (pre-computed activation scales should be passed to the converter option `--activation_scales`)
- Fix conversion of BART-like models from Hugging Face that use a different number of encoder and decoder layers
- Fix compilation when no BLAS CPU backend is selected
- Remove no longer relevant CMake warning when the project is compiled without oneDNN
- Update oneDNN to 3.0
- Update oneMKL to 2023.0
CTranslate2 3.2.0
New features
- Add decoding option `suppress_sequences` to prevent specific sequences of tokens from being generated
- Add decoding option `end_token` to stop the decoding on a different token than the model EOS token
- Allow returning multiple random hypotheses from greedy search + random sampling when setting `num_hypotheses` > 1
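The three options above combine naturally in one generation call. A hedged sketch, assuming a converted generator model in a hypothetical directory `gpt2-ct2` and illustrative token strings:

```python
import ctranslate2

# "gpt2-ct2" is a hypothetical converted generator model directory.
generator = ctranslate2.Generator("gpt2-ct2")

results = generator.generate_batch(
    [["<|endoftext|>"]],
    suppress_sequences=[["bad", "word"]],  # never emit this exact token sequence
    end_token="<|endoftext|>",             # stop on this token instead of the model EOS
    sampling_topk=10,                      # random sampling instead of greedy argmax
    num_hypotheses=3,                      # return 3 random hypotheses per input
)
for hypothesis in results[0].sequences:
    print(hypothesis)
```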
Fixes and improvements
- Improve support for batch generation with the Whisper model:
  - Improve performance of batch generation with a context (we only require the prompts to have the same length, which is easily done by adapting the number of previous text tokens)
  - Support batch mode for option `return_no_speech_prob`
  - Support cases where some prompts in the batch have the token `<|notimestamps|>` but not others
- Enable the Conv1D layer in more Python wheels:
  - macOS x64 (using oneDNN)
  - macOS ARM64 (using a custom implementation)
  - Linux AArch64 (using a custom implementation)
- Update the OpenNMT-py converter to support the latest checkpoint structure
- Generalize the `TransformerSpec` constructor to accept arbitrary encoder and decoder specifications
- Remove the global compilation flag `-ffast-math`, which introduces unwanted side effects, and enable it only for the layer norm CPU kernel where it is actually useful
- Fix CMake error on Windows when setting `-DOPENMP_RUNTIME=COMP`