
v1.5.0

@ggerganov released this 15 Nov 21:06 · d38af15

Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:

[Plot: Encoder performance on Apple M1 Max - before and after (plot by @dreness)]

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make
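
At runtime, GPU offload can also be controlled through the new struct whisper_context_params (see the API Changes section below). A minimal sketch in C, assuming its use_gpu field and a placeholder model path:

#include "whisper.h"

int main(void) {
    // start from the defaults and toggle GPU offload explicitly
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = true; // set to false to force CPU-only processing

    struct whisper_context * ctx = whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == NULL) {
        return 1;
    }

    // ... call whisper_full() here ...

    whisper_free(ctx);
    return 0;
}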

Implementation: #1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp now supports efficient Beam Search decoding. The missing piece was an implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, decoding with 5 beams is a bit slower than with 1 beam, but it is still significantly faster than the 5x single-batch time observed with the old naive implementation.

Beam Search is now enabled by default in whisper.cpp to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
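
For reference, a sketch of how Beam Search can be configured through whisper_full_params (assumes ctx, pcm and n_samples are prepared elsewhere):

// Beam Search decoding - the new default strategy
struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.beam_search.beam_size = 5; // number of beams

// greedy decoding remains available:
// wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

// pcm: n_samples of 16 kHz mono float samples
whisper_full(ctx, wparams, pcm, n_samples);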

Implementation: #1486

Quantization support

All ggml quantization types are now supported. Quantization mixtures for the Whisper model can now be implemented. It is still unclear how quantization affects the quality - this is an interesting area that can be explored in the future.
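
For example, a model can be quantized with the quantize tool from the repository (paths are placeholders):

# build the quantize tool and produce a Q5_0 model
make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0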

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
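
For illustration, here is a toy GBNF grammar (written for this example, not taken from the repository) that would constrain the output to simple chess-like moves:

# toy grammar - transcription must be a single piece move
root   ::= " " move "."
move   ::= piece " to " square
piece  ::= "pawn" | "knight" | "bishop" | "rook" | "queen" | "king"
square ::= [a-h] [1-8]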

whisper-chess.mp4

Implementation: #1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.
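
The Hugging Face checkpoints are distributed in transformers format, so they have to be converted to ggml before use. A rough sketch with the existing conversion script - the exact arguments (in particular the path to a checkout of the original openai/whisper repo) may differ, so check models/convert-h5-to-ggml.py:

# paths are placeholders - adjust to your local setup
git clone https://huggingface.co/distil-whisper/distil-medium.en
python3 models/convert-h5-to-ggml.py ./distil-medium.en /path/to/openai/whisper .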

Implementation: #1424

Whisper Large-v3

Recently, OpenAI released a new version 3 of the Large model: openai/whisper#1761
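
To try it, the model can be fetched with the download script (a sketch; the model name accepted by the script is assumed here and may vary between versions):

# download and run the new large model
bash ./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f samples/jfk.wav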

Implementation: #1444

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.
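
The numbers can be reproduced with the bench tool from the repository; a sketch for a single configuration (the exact flags behind the Bch5 and PP columns may differ):

# benchmark the Encoder and Decoder of a given model with 1 thread
make bench
./bench -m models/ggml-base.en.bin -t 1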

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
| M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
| M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
| M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
| M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
| M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14 | 3.24 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-q5_0 | 8 | 5335.77 | 17.44 | 5.35 | 3.92 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-q5_1 | 8 | 5372.26 | 18.36 | 5.42 | 3.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium-dis | 8 | 4070.25 | 4.86 | 1.16 | 0.53 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | large-v2 | 8 | 8179.09 | 66.89 | 15.45 | 5.88 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | large-v2-dis | 8 | 7490.45 | 7.06 | 1.63 | 0.70 | ccc85b4 |

API Changes

  • Add struct whisper_context_params

  • Add whisper_log_set (see the usage sketch after the diff below)

  • Deprecate:

    • whisper_init_from_file
    • whisper_init_from_buffer
    • whisper_init
    • whisper_init_from_file_no_state
    • whisper_init_from_buffer_no_state
    • whisper_init_no_state
  • Add:

    • whisper_init_from_file_with_params
    • whisper_init_from_buffer_with_params
    • whisper_init_with_params
    • whisper_init_from_file_with_params_no_state
    • whisper_init_from_buffer_with_params_no_state
    • whisper_init_with_params_no_state
  • Diff of struct whisper_full_params

     struct whisper_full_params {
         enum whisper_sampling_strategy strategy;
@@ -338,6 +435,7 @@ extern "C" {
 
         bool translate;
         bool no_context;        // do not use past transcription (if any) as initial prompt for the decoder
+        bool no_timestamps;     // do not generate timestamps
         bool single_segment;    // force single segment output (useful for streaming)
         bool print_special;     // print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
         bool print_progress;    // print progress information
@@ -355,8 +453,12 @@ extern "C" {
         // [EXPERIMENTAL] speed-up techniques
         // note: these can significantly reduce the quality of the output
         bool speed_up;          // speed-up the audio by 2x using Phase Vocoder
+        bool debug_mode;        // enable debug_mode provides extra info (eg. Dump log_mel)
         int  audio_ctx;         // overwrite the audio context size (0 = use default)
 
+        // [EXPERIMENTAL] [TDRZ] tinydiarize
+        bool tdrz_enable;       // enable tinydiarize speaker turn detection
+
         // tokens to provide to the whisper decoder as initial prompt
         // these are prepended to any existing text context from a previous call
         const char * initial_prompt;
@@ -365,6 +467,7 @@ extern "C" {
 
         // for auto-detection, set to nullptr, "" or "auto"
         const char * language;
+        bool detect_language;
 
         // common decoding parameters:
         bool suppress_blank;    // ref: https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/decoding.py#L89
@@ -403,11 +506,24 @@ extern "C" {
         whisper_encoder_begin_callback encoder_begin_callback;
         void * encoder_begin_callback_user_data;
 
+        // called each time before ggml computation starts
+        whisper_abort_callback abort_callback;
+        void * abort_callback_user_data;
+
         // called by each decoder to filter obtained logits
         whisper_logits_filter_callback logits_filter_callback;
         void * logits_filter_callback_user_data;
+
+        const whisper_grammar_element ** grammar_rules;
+        size_t                           n_grammar_rules;
+        size_t                           i_start_rule;
+        float                            grammar_penalty;
     };
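
As an illustration of the migration path, here is a sketch that installs a log callback via whisper_log_set and replaces the deprecated whisper_init_from_file with its _with_params counterpart (the callback signature shown follows the ggml logging convention and should be treated as an assumption):

#include <stdio.h>

#include "whisper.h"

// custom log sink - whisper.cpp forwards its log messages here
static void my_log_callback(enum ggml_log_level level, const char * text, void * user_data) {
    (void) level;
    (void) user_data;
    fputs(text, stderr);
}

int main(void) {
    whisper_log_set(my_log_callback, NULL);

    // before: ctx = whisper_init_from_file("models/ggml-base.en.bin"); (deprecated)
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx = whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);

    // ... run whisper_full(), etc ...

    whisper_free(ctx);
    return 0;
}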
 

There might be some instability around the API, especially with the existing language bindings. I wasn't able to test everything, so expect some issues and feel free to submit PRs with any kind of fixes that you find.

Highlights and what's next

A lot of the updates in this release are possible thanks to the many contributions in llama.cpp - a huge shoutout to all the contributors and collaborators there!

Regarding future updates to whisper.cpp, I'm looking forward to the following things:

  • Add server example similar to the one in llama.cpp
  • Try to improve Metal's batched decoding performance
  • Look for some interesting applications of the grammar sampling functionality

  • Latest performance of the talk-llama example

    talk-llama-1.mp4

What's Changed

New Contributors

Full Changelog: v1.4.0...v1.5.0