
Releases: ggerganov/whisper.cpp

v1.5.2

14 Dec 16:06 · 88112c8

Overview

Minor maintenance release:

  • Re-enable CPU BLAS processing after fixing a regression (#1583)

Add new example: wchess

(video: wchess-0.mp4)

Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!

What's Changed

New Contributors

Full Changelog: v1.5.1...v1.5.2

v1.5.1

24 Nov 10:45 · 9d6ebd8

Overview

Minor update:

  • With Metal, auto-fallback to CPU if the device does not support the Apple7 family
  • Add server example (see the sketch below)
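
The new server example exposes a simple HTTP endpoint for transcription. A rough sketch of starting it and posting an audio file - the flag names and the /inference route are taken from the example's documentation, so double-check against ./server --help:

# build and start the HTTP server example
make server
./server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8080

# post a 16 kHz WAV file for transcription
curl 127.0.0.1:8080/inference -F file="@samples/jfk.wav"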

What's Changed

New Contributors

Full Changelog: v1.5.0...v1.5.1

v1.5.0

15 Nov 21:06 · d38af15

Overview

This major release includes the following changes:

  • Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
  • Efficient beam-search implementation via batched decoding and unified KV cache
  • Full quantization support of all available ggml quantization types
  • Support for grammar constrained sampling
  • Support for Distil Whisper models
  • Support for Whisper Large-v3

and more

Full GPU support

On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to the lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:


Encoder performance on Apple M1 Max - before and after (plot by @dreness)

For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.

The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with WHISPER_CUBLAS=1:

# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make

Implementation: #1472

Special credits to: @FSSRepo, @slaren

Batched decoding + efficient Beam Search

At last, whisper.cpp now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, decoding with 5 beams is a bit slower than with 1 beam, but it is still significantly faster than the roughly 5x single-batch cost observed with the old naive implementation.

Beam Search is now enabled by default in whisper.cpp to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.

Implementation: #1486
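
For reference, the number of beams can be controlled from the main example with the -bs (--beam-size) flag. A minimal sketch, with placeholder model and audio paths:

# beam search with 5 beams (matches the new default behavior)
./main -m models/ggml-large-v2.bin -f samples/jfk.wav -bs 5

# compare against a single beam
./main -m models/ggml-large-v2.bin -f samples/jfk.wav -bs 1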

Quantization support

All ggml quantization types are now supported. Quantization mixtures for the Whisper model can be implemented. It is still unclear how the quality is affected by quantization - this is an interesting area which can be explored in the future.

Grammar sampling

The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
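
As a rough sketch, a GBNF grammar can be supplied to the main example through the grammar options added in #1229. The colors.gbnf file and its root rule below are hypothetical placeholders, and whether --grammar expects a file path or an inline grammar should be verified with ./main --help:

# colors.gbnf could contain, e.g.:  root ::= " Red." | " Green." | " Blue."
./main -m models/ggml-base.en.bin -f audio.wav \
       --grammar colors.gbnf --grammar-rule root --grammar-penalty 100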

(video: whisper-chess.mp4)

Implementation: #1229

Special credits to @ejones

Distil Whisper

Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper

whisper.cpp offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.

Implementation: #1424

Whisper Large-v3

Recently, OpenAI released a new version 3 of the Large model: openai/whisper#1761

Implementation: #1444
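
A quick way to try it - a sketch, assuming the bundled download script already knows about large-v3 after #1444:

# fetch the converted ggml model and transcribe a sample
bash ./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f samples/jfk.wav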

Benchmarks

Below is a breakdown of the performance of whisper.cpp on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in ms/tok. The Dec. column corresponds to batch size 1. The Bch5 column corresponds to batch size 5. The PP column corresponds to batch size 128.

For optimal Beam Search performance, the Bch5 number should be 5 times smaller than Dec.
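
These numbers can be approximated on your own hardware with the bundled bench tool - a sketch; the exact columns reported depend on the version:

# build and run the benchmark for a given model
make bench
./bench -m models/ggml-base.en.bin -t 1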

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
| M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
| M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
| M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
| M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
| M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
| M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
| M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
| M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
| M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
| M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
| NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |

| Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
| AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14... | | |

v1.4.3

07 Nov 14:29 · 6a5d195 · Pre-release

This is a minor release; the main reason for it is that there hasn't been an official release for a few months and some small things have accumulated on the master branch that would be nice to upstream. I am planning a major v1.5.0 release with some new and long-awaited functionality soon:

  • Full CUDA offloading
  • Efficient Beam-Search implementation
  • Grammar support

The current version v1.4.3 should be considered a beta release, as I haven't worked intensively on whisper.cpp recently and some issues might have made their way into the code. I'll try to polish things over the next few days and prepare a stable v1.5.0 release. In the meantime, any feedback will be highly appreciated.

Detailed API changes, features and new contributor recognitions will be included in the v1.5.0 release.

v1.4.0

30 Apr 16:56 · fa8dbdc

Overview

This is a new major release adding integer quantization and partial GPU (NVIDIA) support

Integer quantization

This allows the ggml Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.

  • Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
  • Implementation details: #540
  • Usage instructions: README
  • All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com
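
For example, following the README, a model can be quantized with the bundled quantize tool and then used as usual (paths are placeholders):

# build the quantize tool and convert a model to Q5_0
make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run the quantized model as usual
./main -m models/ggml-base.en-q5_0.bin -f samples/jfk.wav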

Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:

LLaMA quantization (measured on M1 Pro)

| Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
| 7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
| 7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
| 7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
| 7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
| 13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
| 13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
| 13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
| 13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
| 13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |

ref: https://github.com/ggerganov/llama.cpp#quantization

RWKV quantization

| Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
| --- | --- | --- | --- |
| Q4_0 | 17.507 | 76 | 1.53 |
| Q4_1 | 17.187 | 72 | 1.68 |
| Q4_2 | 17.060 | 85 | 1.53 |
| Q5_0 | 16.194 | 78 | 1.60 |
| Q5_1 | 15.851 | 81 | 1.68 |
| Q8_0 | 15.652 | 89 | 2.13 |
| FP16 | 15.623 | 117 | 2.82 |
| FP32 | 15.623 | 198 | 5.64 |

ref: ggerganov/ggml#89 (comment)

This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2

GPU support via cuBLAS

Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.

  • Implementation details: #834
  • Usage instructions: README

This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together


This release remains in "beta" stage as I haven't verified that everything works as expected.

What's Changed

New Contributors

Full Changelog: v1.3.0...v1.4.0

v1.3.0

15 Apr 14:41 · c23588c

Overview

This release should be considered to be in Beta stage, since I haven't done a lot of testing and I am not sure whether something has been broken.
But overall, I believe both the performance and the quality are improved.

  • Added Core ML support #566 (see the build sketch after this list)
  • Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
  • Pad the audio with zeros instead of the spectrogram (5108b30)
  • Added talk-llama example
  • Added whisper_state which allows parallel transcriptions with a single model in memory (#523)
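
For the Core ML support mentioned above, the build roughly follows the README instructions - a sketch, assuming the Python dependencies for coremltools are installed:

# generate a Core ML model from the corresponding ggml model
./models/generate-coreml-model.sh base.en

# rebuild whisper.cpp with Core ML enabled
make clean
WHISPER_COREML=1 make -j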

The C-style API has been extended significantly to support the new whisper_state, but in general it should be backwards compatible.
The only breaking change is in the callback signatures.

Please provide feedback in the discussion if you observe any issues.

The next release v1.4.0 will follow up relatively soon and will provide 4-bit integer quantization support.

What's Changed

New Contributors

Full Changelog: v1.2.1...v1.3.0

v1.2.1

28 Feb 20:30 · ad13890

Overview

This is a minor release. The main reason for it is a critical fix for a bug that caused the software to crash randomly when the language auto-detect option is used (i.e. whisper_lang_auto_detect()).
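
For context, this is the code path exercised when the main example is asked to detect the language automatically - a sketch with placeholder paths:

# let whisper detect the spoken language instead of specifying it
./main -m models/ggml-base.bin -f audio.wav -l auto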

Other than that, the release includes refactoring of the examples, Ruby bindings and some minor changes to the C API.

You can provide feedback in the existing v1.2.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : add "split_on_word" flag when using the "max_len" option by @mightymatth in #455 and @boolemancer in #476
  • whisper : add whisper_full_lang_id() for getting the context lang by @kamranjon in #461
  • whisper : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in #474
  • whisper : suppress non-speech-related token outputs by @shibukazu in #473
  • cmake : install whisper.h header by @aviks in #485
  • whisper : fix signedness compiler warning by @shikokuchuo in #506
  • whisper : disable non-speech token suppression by default #473
  • whisper : add API for applying custom logits filters during decoding 0d22916
  • whisper : fix uninitialized exp_n_audio_ctx by @finnvoor in #520

Bindings

  • bindings : add Ruby by @taf2 in #500
  • readme : add .NET repos (#303)
  • readme : add cython bindings (#9)
  • readme : add pybind11 bindings by @aarnphm in #538

Examples

  • ci : add node addon test and optimize compilation configuration by @chenqianhe in #468
  • yt-wsp.sh : add unique filename generation by @genevera in #495
  • examples : refactor in order to reuse code and reduce duplication by @ggerganov in #482
  • main : fix stdin pipe stream by @conradg in #503
  • make : add "-mcpu=native" when building for aarch64 (#532)

C-style API

  • Add whisper_pcm_to_mel_phase_vocoder()
  • Add *(whisper_logits_filter_callback)()
  • Change struct whisper_full_params
  • Add whisper_full_lang_id()

New Contributors

Full Changelog: v1.2.0...v1.2.1

Highlights

Recently, I have been making progress on adding integer quantisation support in the ggml tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release v1.3.0 will officially add quantisation support. For now, you can keep track of the progress in #540


v1.2.0

04 Feb 08:55 · b2083c5

Overview

In this release we significantly reduce the memory usage during inference by introducing "scratch" buffers to ggml.

The new memory requirements per model are as follows:

| Model | Disk | Mem (Old) | Mem (New) |
| --- | --- | --- | --- |
| tiny | 75 MB | ~390 MB | ~125 MB |
| base | 142 MB | ~500 MB | ~210 MB |
| small | 466 MB | ~1.0 GB | ~600 MB |
| medium | 1.5 GB | ~2.6 GB | ~1.7 GB |
| large | 2.9 GB | ~4.7 GB | ~3.3 GB |

The idea is simple: instead of creating a new memory buffer for each new tensor in the computation, we reuse the memory of old tensors that are no longer needed. The implementation is in PR #431. It's not very clean - I think there is some better way to do this, but for now it will work.

Additionally, there might be some inference speed improvements on Apple Silicon in the Decoder part of the transformer. I haven't done proper benchmarks, but it seems there is about a ~30% performance boost. The results are identical to v1.1.1.

What's Changed

Core ggml / whisper

  • whisper : PPC64 big-endian support by @fitzsim in #398
  • whisper : condition sampled timestamp tokens to be monotonically increasing by @ggerganov in #425
  • wasm : fix typo in helper.js by @bhbs in #459
  • ggml/whisper : reduce memory usage during inference by @ggerganov in #431

Bindings

  • ci : run workflows on pull requests + bindings depend on .h by @ggerganov in #446
  • go : added wrappers to reset and print timings by @glaslos in #436
  • go : add WhisperLangAutoDetect method to go binding by @RobinXL in #451
  • go : add wrapper for system info by @glaslos in #456
  • go : support "auto" as an option when set language by @polarmoon in #462

Examples

  • whisper.wasm : add labels for easier radio selection by @kokes in #435
  • livestream.sh : run main with model arg instead of default by @EricTendian in #453
  • main : CSV format export trimmed spaces fix by @alex-bacart in #444
  • addon.node : using whisper as a Node.js addon by @chenqianhe in #443

New Contributors

Full Changelog: v1.1.1...v1.2.0

Highlights

I'll use these release notes to write some random thoughts about the project - sort of a short blog post.

I'm really happy with how whisper.cpp turned out to be so far. There is a very positive reception in the ML community - most people seem to be excited by the simplicity of the implementation and the fact that it is quite self-contained. I receive a lot of questions about the project and about various ideas that it can be applied to. I really enjoy it and I try to respond to everyone!

I also find it very satisfying that there are so many contributions already happening by so many people. To me this illustrates the power of open-source collaboration. The contributions not only improve the functionality and the quality of the code, but also help to generate various new ideas and approaches to explore.

Another interesting thing is that the project keeps on giving. Every time I start to think that now is a good time to put it in the background for a while and focus on other stuff, some new cool idea pops up and I can't help but start working on it. Having this custom implementation allows me to interact with the model on a lower level which opens some interesting ways to explore it.

So far the development has been focused on improving the performance, expanding the platform coverage and having robust decoding strategies with a variety of examples. During this time, several ideas have accumulated that I find interesting to explore (diarization, token-level timestamps, improved timestamp accuracy, etc). I think I'll try to focus more on these in the future and see if I can achieve something interesting.



  • "The New Yorker" article featuring whisper.cpp

v1.1.1

23 Jan 18:41 · 2c3f50a

Overview

Since the v1.1.0 pre-release there have been several reports of improved transcription quality.
Together with my observations, I think we can declare version v1.1.1 as "stable".

There were actually a couple of bug-fixes implemented since v1.1.0, so make sure to update to v1.1.1 for optimal results.

Another update is that the prototype for v1.2.0 is almost ready: #431
Initial results indicate that the memory usage can be reduced by a factor of 2-3 for the smaller models.

You can provide feedback in the existing v1.1.0 discussion.

What's Changed

Core ggml / whisper

  • whisper : perform entropy check only when we have at least 32 tokens 1a91c19
  • whisper : fix condition for providing past prompt (critical) 78f1661

Bindings

  • go : remove sample_best and sample_timestamp bindings by @Trojan295 in #409

Examples

  • main : re-enable temperature fallback f583e2d
  • main : add an option to accept optional output filenames by @garychia in #424
  • whisper.android : use AssetManager for Android by @Digipom in #415
  • whisper.wasm : add small and small.en models 206fc93
  • bench : add memcpy and ggml_mul_mat benchmarks (experimental) 1290fc6

New Contributors

Full Changelog: v1.1.0...v1.1.1

v1.1.0

15 Jan 12:00 · 8738427 · Pre-release

Overview

The major change in this pre-release is the improved decoding implementation in whisper.cpp:

  • Support for average logprob and entropy based criteria for fallback
  • Support for temperature T > 0
  • Improved Greedy decoder via best_of parameter for T > 0
  • Add beam search decoding (a.k.a beam_size)

More information about the decoding changes can be found in #291
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C platforms.
Support for POWER9 architectures has been added.

The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet and the existing bindings for other languages have not been updated to support the API changes. The official release 1.1.x will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated. So make sure to send your feedback in the discussion created for this pre-release. For now, the 1.0.4 release should be considered more stable.

What's Changed

Core ggml / whisper

  • ggml : POWER9 support by @fitzsim in #320, #349, #369
  • ggml : simplify the SIMD code by @ggerganov in #324
  • ggml : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
  • ggml : utilise Accelerate's vDSP for some computations d51fc3e
  • ggml : speed-up softmax compute via Accelerate and loop unrolling d61d55c
  • ggml : do not start extra threads when using BLAS d347a59
  • whisper : do sample_to_timestamp calculation with 64 bit precision to avoid overflow by @boolemancer in #388
  • whisper : various code clean-up and improvements by @asmaloney in #317 #318 #319 #322 etc
  • whisper : improve decoding by @ggerganov in #291
  • whisper : account for speed_up flag for short audio #405

C-style API

  • Add loader class to allow loading from buffer and others by @prsyahmi in #353
  • Add whisper_token_data::plog
  • Add whisper_init_from_file()
  • Add whisper_init_from_buffer()
  • Change whisper_init()
  • Remove whisper_sample_best()
  • Remove whisper_sample_timestamp()
  • Add whisper_n_audio_ctx()
  • Add whisper_get_logits()
  • Remove whisper_get_probs()
  • Change struct whisper_full_params

Bindings

Examples

  • whisper.android : remove android ABI constraint by @Digipom in #301
  • whisper.swiftui : SwiftUI example by @Digipom in #308
  • main : add -ocsv, aka --output-csv for writing CSV file containing millisecond timestamps by @NielsMayer in #340
  • command : refactor to split command list & general transcription modes by @asmaloney in #331
  • command : always-prompt mode by @dnhkng in #383
  • stream : fix data race on bool + avoid division-by-zero a466c34
  • stream : fix a bug that inserted a lot of empty audio at the start a6dbd91
  • bench.wasm : print system info fafd789

New Contributors

Full Changelog: v1.0.4...v1.1.0

Highlights
