Releases: ggerganov/whisper.cpp
v1.5.2
Overview
Minor maintenance release:
- Re-enable CPU BLAS processing after fixing a regression (#1583)
- Add new example: wchess
wchess-0.mp4
Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!
What's Changed
- automatically convert audio on the server by @sapoepsilon in #1539
- CI : Rectify the Clang-Related workflow issues by @bobqianic in #1551
- CI : Add CUDA 11.8.0 support by @bobqianic in #1554
- Update main program help info by @bebound in #1560
- Set default CORS headers to allow all by @kasumi-1 in #1567
- cmake : install required ggml.h header by @gjasny in #1568
- Backport .srt output format to examples/server by @osdrv in #1565
- Added support for .vtt format to Whisper server by @aleksanderandrzejewski in #1578
- ggml : re-enable blas for src0 != F32 by @ggerganov in #1583
- Fix 32-bit compiler warning by @Digipom in #1575
- Remove #if arch(arm) check in Swift Package Manager by @finnvoor in #1561
- Pass max-len argument to server wparams by @osdrv in #1574
- sync : ggml (new ops, new backend, etc) by @ggerganov in #1602
- Fix `ggml_metal_log` on Intel macs by @finnvoor in #1606
- Update CMakeLists.txt by @Kreijstal in #1615
- target windows 8 or above for prefetchVirtualMemory in llama-talk by @Kreijstal in #1617
- sync : ggml (Metal fixes, new ops, tests) by @ggerganov in #1633
- wchess: whisper assisted chess by @fraxy-v in #1595
New Contributors
- @sapoepsilon made their first contribution in #1539
- @bebound made their first contribution in #1560
- @kasumi-1 made their first contribution in #1567
- @gjasny made their first contribution in #1568
- @osdrv made their first contribution in #1565
- @aleksanderandrzejewski made their first contribution in #1578
- @Kreijstal made their first contribution in #1615
- @fraxy-v made their first contribution in #1595
Full Changelog: v1.5.1...v1.5.2
v1.5.1
Overview
Minor update:
- With Metal, auto-fallback to CPU if the device does not support the Apple7 family
- Add server example (usage sketch below)
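For the new server example, a minimal usage sketch (the port, file path and flags follow the example's README and are illustrative):

```sh
# start the transcription server
./server -m models/ggml-base.en.bin --port 8080

# transcribe a file over HTTP
curl 127.0.0.1:8080/inference \
     -H "Content-Type: multipart/form-data" \
     -F file="@samples/jfk.wav" \
     -F response_format="json"
```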
What's Changed
- ISSUE-1329: replace " with ' so it doesn't try to execute code in backticks by @spullara in #1364
- sync : ggml (ggml-alloc + linker + gguf fixes) by @ggerganov in #1501
- Fixed with_state methods, to use the correct state by @sandrohanea in #1519
- #1517 Redistribute CUDA DLLs by @tamo in #1522
- whisper : reuse whisper_decode_with_state by @ggerganov in #1521
- sdl : fix audio callback by @ggerganov in #1523
- update deprecated example by @MightyStud in #1529
- Super Simple Whisper Server by @felrock in #1380
- Close file after writing in server application by @felrock in #1533
- bench : multi-thread memcpy by @ggerganov in #1534
- Change temp file name for server application by @felrock in #1535
- Fixed Makefile for MacOS ARM 64 Go bindings by @gleicon in #1530
- Fixed metal build on macos-latest by @sandrohanea in #1544
- fix(server): typo in temperature parameter by @Okabintaro in #1545
- Request to add a new function to get the full language name by @bradmit in #1546
- server : add --print-realtime param by @ecneladis in #1541
- cuda : sync some minor stuff from llama.cpp by @ggerganov in #1548
- metal : add backend function to check device family support by @ggerganov in #1547
New Contributors
- @spullara made their first contribution in #1364
- @MightyStud made their first contribution in #1529
- @felrock made their first contribution in #1380
- @gleicon made their first contribution in #1530
- @Okabintaro made their first contribution in #1545
- @bradmit made their first contribution in #1546
- @ecneladis made their first contribution in #1541
Full Changelog: v1.5.0...v1.5.1
v1.5.0
Overview
This major release includes the following changes:
- Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
- Efficient beam-search implementation via batched decoding and unified KV cache
- Full quantization support of all available `ggml` quantization types
- Support for grammar-constrained sampling
- Support for Distil Whisper models
- Support for Whisper Large-v3
and more
Full GPU support
On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:
Encoder performance on Apple M1 Max - before and after (plot by @dreness)
For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.
The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with `WHISPER_CUBLAS=1`:

```sh
# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make
```
Implementation: #1472
Special credits to: @FSSRepo, @slaren
Batched decoding + efficient Beam Search
At last, `whisper.cpp` now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, the speed with 5 beams is a bit slower compared to 1 beam, but it is significantly faster than the 5x single-batch time observed with the old naive implementation.
Beam Search is now enabled by default in `whisper.cpp` to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
Implementation: #1486
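As a rough usage sketch, beam parameters can be set from the command line of the `main` example via the `-bs` (beam size) and `-bo` (best-of) flags (model and audio paths are illustrative):

```sh
# transcribe with 5 beams and 5 best-of candidates
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5 -bo 5
```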
Quantization support
All `ggml` quantization types are now supported. Quantization mixtures for the Whisper model can be implemented. It's still unclear how the quality is affected by quantization - this is an interesting area which can be explored in the future.
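A minimal sketch of producing a quantized model with the bundled `quantize` tool (the model name and the `q5_0` type are illustrative):

```sh
# build the quantize tool, then convert an F16 model to 5-bit
make quantize
./quantize models/ggml-medium.en.bin models/ggml-medium.en-q5_0.bin q5_0
```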
Grammar sampling
The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
whisper-chess.mp4
Implementation: #1229
Special credits to @ejones
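A hedged usage sketch, assuming the `--grammar` and `--grammar-rule` flags added to the `main` example in #1229 (the grammar file and rule names here are hypothetical):

```sh
# constrain decoding with a GBNF grammar (hypothetical file and start rule)
./main -m models/ggml-base.en.bin -f samples/commands.wav \
       --grammar chess.gbnf --grammar-rule move
```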
Distil Whisper
Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper
`whisper.cpp` offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.
Implementation: #1424
Whisper Large-v3
Recently, OpenAI released a new version 3 of the Large model: openai/whisper#1761
Implementation: #1444
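A hedged sketch of trying it out, assuming the download script has been updated with the new model name in #1444:

```sh
./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f samples/jfk.wav
```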
Benchmarks
Below is a breakdown of the performance of `whisper.cpp` on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in `ms/tok`. The `Dec.` column corresponds to batch size 1. The `Bch5` column corresponds to batch size 5. The `PP` column corresponds to batch size 128.
For optimal Beam Search performance, the `Bch5` number should be 5 times smaller than `Dec.`
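To get comparable numbers on your own hardware, a minimal sketch using the bundled `bench` example (model path is illustrative; the full tables below were produced with an extended benchmarking setup):

```sh
# build and run the benchmark with 1 thread
make bench
./bench -m models/ggml-base.en.bin -t 1
```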
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14... |
v1.4.3
This is a minor release, the main reason for which is that there hasn't been an official release for a few months now and some small things have accumulated on the `master` branch that would be nice to upstream. I am planning a major `v1.5.0` release with some new and long-awaited functionality soon:
- Full CUDA offloading
- Efficient Beam-Search implementation
- Grammar support
The current version `v1.4.3` should be considered a beta, as I haven't worked intensively on `whisper.cpp` recently and there might be some issues that made their way into the code. I'll try to polish things in the coming days and prepare a stable `v1.5.0` release. In the meantime, any feedback will be highly appreciated.
Detailed API changes, features and new contributor recognitions will be included in the `v1.5.0` release.
v1.4.0
Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the `ggml` Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
- Supported quantization modes: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`
- Implementation details: #540
- Usage instructions: README (see also the sketch below)
- All WASM examples now support `Q5` quantized models: https://whisper.ggerganov.com
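An end-to-end sketch following the README instructions (model names and the `q4_0` type are illustrative):

```sh
# build the tools, quantize a model to 4-bit, then transcribe with it
make main quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q4_0.bin q4_0
./main -m models/ggml-base.en-q4_0.bin -f samples/jfk.wav
```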
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
`Q4_0` | 17.507 | 76 | 1.53 |
`Q4_1` | 17.187 | 72 | 1.68 |
`Q4_2` | 17.060 | 85 | 1.53 |
`Q5_0` | 16.194 | 78 | 1.60 |
`Q5_1` | 15.851 | 81 | 1.68 |
`Q8_0` | 15.652 | 89 | 2.13 |
`FP16` | 15.623 | 117 | 2.82 |
`FP32` | 15.623 | 198 | 5.64 |
ref: ggerganov/ggml#89 (comment)
This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
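A build sketch, assuming the CUDA toolkit is installed:

```sh
WHISPER_CUBLAS=1 make -j
```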
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in #776
- examples : add missing #include by @pH5 in #798
- Flush upon finishing inference by @tarasglek in #811
- Escape quotes in csv output by @laytan in #815
- C++11style by @wuyudi in #768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in #812
- add some tips about in the readme of the android project folder by @Zolliner in #816
- whisper: Use correct seek_end when offset is used by @ThijsRay in #833
- ggml : fix 32-bit ARM NEON by @ggerganov in #836
- Add CUDA support via cuBLAS by @ggerganov in #834
- Integer quantisation support by @ggerganov in #540
New Contributors
- @tauseefmohammed2 made their first contribution in #776
- @pH5 made their first contribution in #798
- @tarasglek made their first contribution in #811
- @laytan made their first contribution in #815
- @wuyudi made their first contribution in #768
- @Canis-UK made their first contribution in #812
- @Zolliner made their first contribution in #816
- @ThijsRay made their first contribution in #833
Full Changelog: v1.3.0...v1.4.0
v1.3.0
Overview
This release should be considered in beta stage, since I haven't done a lot of testing and I am not sure I didn't break something.
But overall, I believe both the performance and the quality are improved.
- Added Core ML support #566 (build sketch after this list)
- Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
- Pad the audio with zeros instead of the spectrogram (5108b30)
- Added talk-llama example
- Added `whisper_state`, which allows parallel transcriptions with a single model in memory (#523)
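For the Core ML support mentioned above, a hedged sketch following the steps from the README (the model name is illustrative; the generation script requires the Python Core ML tooling):

```sh
# generate a Core ML model of the encoder, then build and run with Core ML
./models/generate-coreml-model.sh base.en
WHISPER_COREML=1 make -j
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```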
The C-style API has been extended significantly to support the new `whisper_state`, but in general it should be backwards compatible.
The only breaking change is in the callback signatures.
Please provide feedback in the discussion if you observe any issues.
The next release, `v1.4.0`, will follow relatively soon and will provide 4-bit integer quantization support.
What's Changed
- update csv output format to match OpenAI's Whisper dataframe output by @hykelvinlee42 in #552
- Go binding: NewContext now returns a clean context by @polarmoon in #537
- Added whisper state + default state on the whisper_context by @sandrohanea in #523
- whisper.android: Enable fp16 instrinsics (FP16_VA) which is supported by ARMv8.2 or later. by @tinoue in #572
- Add quality comparison helper by @venkr in #569
- whisper.android: Support benchmark for Android example. by @tinoue in #542
- Fix MUSL Linux build by @ggerganov in #576
- Change default encoding to UTF-8 by @Kamilake in #605
- Provide option for creating JSON output by @tuxpoldo in #615
- readme : add react-native bindings by @jhen0409 in #619
- Fixed language auto-detection for state provided processing. by @sandrohanea in #627
- xcodeproj : add `-O3 -DNDEBUG` in release mode by @jhen0409 in #640
- Nodejs Addon blocking main thread. Implemented Napi::AsyncWorker by @LucasZNK in #642
- Include link to R wrapper in README by @jwijffels in #626
- Add a cmake flag to disable F16C by @a5huynh in #628
- Add talk-llama example by @ggerganov in #664
- Add Alpaca support to talk-llama example by @ejones in #668
- Update README.md by @razodactyl in #682
- issue #470 - working 32-bit ARM by @clach04 in #486
- whisper : add initial_prompt param by @jhen0409 in #645
- fix typo in JSON output by @egorFiNE in #648
- Fix shell script ./models/download-ggml-model.sh to handle spaces and special characters in paths by @be-next in #677
- Fixed test to new async implementation by @LucasZNK in #686
- Minor: fixing usage message for talk-llama by @InconsolableCellist in #687
- Small typo by @ZiggerZZ in #688
- feat: add progress callback by @pajowu in #600
- ggml : fix q4_1 dot product types by @novag in #759
- Exposed various parts to the Go Interface by @bmurray in #697
- Adds shell command example for --print-colors by @bocytko in #710
- Makefile: disable avx in case f16c is not available by @duthils in #706
- Making the quick start instructions clearer. by @Onlyartist9 in #716
- Add lrc output support by @WhichWho in #718
- Corrects default speak.sh path in talk-llama by @mab122 in #720
- Add msvc compiler args /utf-8 fix error C3688 by @WhichWho in #721
- Changed convert-pt-to-ggml.py to use .tiktoken tokenizer files by @ivan-gorin in #725
- talk/talk-llama: add basic example script for eleven-labs tts by @DGdev91 in #728
- readme : add Unity3d bindings by @Macoron in #733
- Update stream.cpp by @AliAlameh in #501
- Fix typos in whisper.h by @GitAritron in #737
- Update LICENSE by @masguit42 in #739
- fix potential memory leaks by @baderouaich in #740
- readme: Add alternate swift bindings by @exPHAT in #755
- Fix the bug related to word splitting errors in the "tokenize" function. by @AfryMask in #760
- Do not launch threads for `log_mel_spectrogram` when singlethreaded by @maxilevi in #763
- Core ML support by @ggerganov in #566
- ggml : fix build on whisper.android (ARM_NEON) by @jhen0409 in #764
New Contributors
- @hykelvinlee42 made their first contribution in #552
- @tinoue made their first contribution in #572
- @venkr made their first contribution in #569
- @Kamilake made their first contribution in #605
- @tuxpoldo made their first contribution in #615
- @jhen0409 made their first contribution in #619
- @LucasZNK made their first contribution in #642
- @jwijffels made their first contribution in #626
- @a5huynh made their first contribution in #628
- @ejones made their first contribution in #668
- @razodactyl made their first contribution in #682
- @clach04 made their first contribution in #486
- @egorFiNE made their first contribution in #648
- @be-next made their first contribution in #677
- @InconsolableCellist made their first contribution in #687
- @ZiggerZZ made their first contribution in #688
- @pajowu made their first contribution in #600
- @novag made their first contribution in #759
- @bmurray made their first contribution in #697
- @bocytko made their first contribution in #710
- @duthils made their first contribution in #706
- @Onlyartist9 made their first contribution in #716
- @WhichWho made their first contribution in #718
- @mab122 made their first contribution in #720
- @ivan-gorin made their first contribution in #725
- @DGdev91 made their first contribution in #728
- @Macoron made their first contribution in #733
- @AliAlameh made their first contribution in #501
- @GitAritron made their first contribution in #737
- @masguit42 made their first contribution in #739
- @baderouaich made their first contribution in #740
- @exPHAT made their first contribution in #755
- @AfryMask made their first contribution in #760
- @maxilevi made their first contribution in #763
Full Changelog: v1.2.1...v1.3.0
v1.2.1
Overview
This is a minor release. The main reason for it is a critical fix for a bug that caused the software to crash randomly when the language auto-detect option was used (i.e. `whisper_lang_auto_detect()`).
Other than that, the release includes refactoring of the examples, ruby bindings and some minor changes to the C API.
You can provide feedback in the existing v1.2.0 discussion.
What's Changed
Core `ggml` / `whisper`
- `whisper` : add "split_on_word" flag when using the "max_len" option by @mightymatth in #455 and @boolemancer in #476
- `whisper` : add whisper_full_lang_id() for getting the context lang by @kamranjon in #461
- `whisper` : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in #474
- `whisper` : suppress non-speech-related token outputs by @shibukazu in #473
- `cmake` : install whisper.h header by @aviks in #485
- `whisper` : fix signedness compiler warning by @shikokuchuo in #506
- `whisper` : by default disable non-speech tokens suppression #473
- `whisper` : add API for applying custom logits filters during decoding 0d22916
- `whisper` : fix uninitialized `exp_n_audio_ctx` by @finnvoor in #520
Bindings
- `bindings` : add Ruby by @taf2 in #500
- `readme` : add .NET repos (#303)
- `readme` : add cython bindings (#9)
- `readme` : add pybind11 bindings by @aarnphm in #538
Examples
- `ci` : add node addon test and optimize compilation configuration by @chenqianhe in #468
- `yt-wsp.sh` : add unique filename generation by @genevera in #495
- `examples` : refactor in order to reuse code and reduce duplication by @ggerganov in #482
- `main` : fix stdin pipe stream by @conradg in #503
- `make` : add "-mcpu=native" when building for aarch64 (#532)
C-style API
- Add `whisper_pcm_to_mel_phase_vocoder()`
- Add `*(whisper_logits_filter_callback)()`
- Change `struct whisper_full_params`
- Add `whisper_full_lang_id()`
New Contributors
- @mightymatth made their first contribution in #455
- @kamranjon made their first contribution in #461
- @sandrohanea made their first contribution in #474
- @shibukazu made their first contribution in #473
- @genevera made their first contribution in #495
- @shikokuchuo made their first contribution in #506
- @conradg made their first contribution in #503
- @taf2 made their first contribution in #500
- @finnvoor made their first contribution in #520
- @aarnphm made their first contribution in #538
- @FlippFuzz made their first contribution in #532
Full Changelog: v1.2.0...v1.2.1
Highlights
Recently, I have been making progress on adding integer quantisation support in the `ggml` tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release, `v1.3.0`, will officially add quantisation support. For now, you can keep track of the progress in #540.
- 🎙️ MacWhisper by @jordibruin, powered by `whisper.cpp`
v1.2.0
Overview
In this release we significantly reduce the memory usage during inference by introducing "scratch" buffers to `ggml`.
The new memory requirements per model are as follows:
Model | Disk | Mem (Old) | Mem (New) |
---|---|---|---|
tiny | 75 MB | ~390 MB | ~125 MB |
base | 142 MB | ~500 MB | ~210 MB |
small | 466 MB | ~1.0 GB | ~600 MB |
medium | 1.5 GB | ~2.6 GB | ~1.7 GB |
large | 2.9 GB | ~4.7 GB | ~3.3 GB |
The idea is simple: instead of creating a new memory buffer for each new tensor in the computation, we reuse the memory of old tensors that are no longer needed. The implementation is in PR #431. It's not very clean - I think there is some better way to do this, but for now it will work.
Additionally, there might be some inference speed improvements on Apple Silicon in the Decoder part of the transformer. I haven't done proper benchmarks, but there seems to be about a ~30% performance boost. The results are identical to `v1.1.1`.
What's Changed
Core `ggml` / `whisper`
- `whisper` : PPC64 big-endian support by @fitzsim in #398
- `whisper` : condition sampled timestamp tokens to be monotonically increasing by @ggerganov in #425
- `wasm` : fix typo in helper.js by @bhbs in #459
- `ggml` / `whisper` : reduce memory usage during inference by @ggerganov in #431
Bindings
- `ci` : run workflows on pull requests + bindings depend on .h by @ggerganov in #446
- `go` : added wrappers to reset and print timings by @glaslos in #436
- `go` : add WhisperLangAutoDetect method to go binding by @RobinXL in #451
- `go` : add wrapper for system info by @glaslos in #456
- `go` : support "auto" as an option when setting language by @polarmoon in #462
Examples
- `whisper.wasm` : add labels for easier radio selection by @kokes in #435
- `livestream.sh` : run main with model arg instead of default by @EricTendian in #453
- `main` : CSV format export trimmed spaces fix by @alex-bacart in #444
- `addon.node` : using whisper as a Node.js addon by @chenqianhe in #443
New Contributors
- @kokes made their first contribution in #435
- @glaslos made their first contribution in #436
- @EricTendian made their first contribution in #453
- @RobinXL made their first contribution in #451
- @alex-bacart made their first contribution in #444
- @bhbs made their first contribution in #459
- @polarmoon made their first contribution in #462
- @chenqianhe made their first contribution in #443
Full Changelog: v1.1.1...v1.2.0
Highlights
I'll use these release notes to write some random thoughts about the project - sort of a short blog post.
I'm really happy with how `whisper.cpp` has turned out so far. There is a very positive reception in the ML community - most people seem to be excited by the simplicity of the implementation and the fact that it is quite self-contained. I receive a lot of questions about the project and about various ideas that it can be applied to. I really enjoy it and I try to respond to everyone!
I also find it very satisfying that there are so many contributions already happening by so many people. To me this illustrates the power of open-source collaboration. The contributions not only improve the functionality and the quality of the code, but also help to generate various new ideas and approaches to explore.
Another interesting thing is that the project keeps on giving. Every time I start to think that now is a good time to put it in the background for a while and focus on other stuff, some new cool idea pops up and I can't help but start working on it. Having this custom implementation allows me to interact with the model on a lower level which opens some interesting ways to explore it.
So far the development has been focused on improving the performance, expanding the platform coverage and having robust decoding strategies with a variety of examples. During this time, several ideas have accumulated that I find interesting to explore (diarization, token-level timestamps, improved timestamp accuracy, etc). I think I'll try to focus more on these in the future and see if I can achieve something interesting.
- Windows port of `whisper.cpp` utilising vendor-agnostic GPGPU based on DirectCompute by @Const-me
- "The New Yorker" article featuring `whisper.cpp`
v1.1.1
Overview
Since the v1.1.0 pre-release there have been several reports of improved transcription quality. Together with my observations, I think we can declare version `v1.1.1` as "stable".
There were actually a couple of bug fixes implemented since `v1.1.0`, so make sure to update to `v1.1.1` for optimal results.
Another update is that the prototype for v1.2.0 is almost ready: #431
Initial results indicate that the memory usage can be reduced by a factor of 2-3 for the smaller models.
You can provide feedback in the existing v1.1.0 discussion.
What's Changed
Core `ggml` / `whisper`
- `whisper` : perform entropy check only when we have at least 32 tokens 1a91c19
- `whisper` : fix condition for providing past prompt (critical) 78f1661
Bindings
- `go` : remove `sample_best` and `sample_timestamp` bindings by @Trojan295 in #409
Examples
- `main` : re-enable temperature fallback f583e2d
- `main` : add an option to accept optional output filenames by @garychia in #424
- `whisper.android` : use AssetManager for Android by @Digipom in #415
- `whisper.wasm` : add small and small.en models 206fc93
- `bench` : add memcpy and ggml_mul_mat benchmarks (experimental) 1290fc6
New Contributors
- @Trojan295 made their first contribution in #409
- @garychia made their first contribution in #424
Full Changelog: v1.1.0...v1.1.1
v1.1.0
Overview
The major change in this pre-release is the improved decoding implementation in `whisper.cpp`:
- Support for average logprob and entropy-based criteria for fallback
- Support for temperature `T > 0`
- Improved Greedy decoder via the `best_of` parameter for `T > 0`
- Add Beam Search decoding (a.k.a. `beam_size`)
More information about the decoding changes can be found in #291
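As a rough command-line sketch of the new decoding options, assuming the `-bo` (best-of) and `-bs` (beam size) flags of the `main` example added with this change (paths are illustrative):

```sh
# greedy decoding, keeping the best of 5 candidates
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bo 5

# beam search decoding with 5 beams
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5
```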
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C platforms.
Support for POWER9 architectures has been added.
The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet and the existing bindings for other languages have not been updated to support the API changes. The official release `1.1.x` will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated. So make sure to send your feedback in the discussion created for this pre-release. For now, the `1.0.4` release should be considered more stable.
What's Changed
Core `ggml` / `whisper`
- `ggml` : POWER9 support by @fitzsim in #320, #349, #369
- `ggml` : simplify the SIMD code by @ggerganov in #324
- `ggml` : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
- `ggml` : utilise Accelerate's vDSP for some computations d51fc3e
- `ggml` : speed-up softmax compute via Accelerate and loop unrolling d61d55c
- `ggml` : do not start extra threads when using BLAS d347a59
- `whisper` : do sample_to_timestamp calculation with 64-bit precision to avoid overflow by @boolemancer in #388
- `whisper` : various code clean-up and improvements by @asmaloney in #317 #318 #319 #322 etc
- `whisper` : improve decoding by @ggerganov in #291
- `whisper` : account for speed_up flag for short audio #405
C-style API
- Add loader class to allow loading from buffer and others by @prsyahmi in #353
- Add `whisper_token_data::plog`
- Add `whisper_init_from_file()`
- Add `whisper_init_from_buffer()`
- Change `whisper_init()`
- Remove `whisper_sample_best()`
- Remove `whisper_sample_timestamp()`
- Add `whisper_n_audio_ctx()`
- Add `whisper_get_logits()`
- Remove `whisper_get_probs()`
- Change `struct whisper_full_params`
Bindings
Examples
- `whisper.android` : remove android ABI constraint by @Digipom in #301
- `whisper.swiftui` : SwiftUI example by @Digipom in #308
- `main` : add `-ocsv`, aka `--output-csv`, for writing a CSV file containing millisecond timestamps by @NielsMayer in #340
- `command` : refactor to split command list & general transcription modes by @asmaloney in #331
- `command` : always-prompt mode by @dnhkng in #383
- `stream` : fix data race on bool + avoid division-by-zero a466c34
- `stream` : fix a bug that inserted a lot of empty audio at the start a6dbd91
- `bench.wasm` : print system info fafd789
New Contributors
- @djthorpe made their first contribution in #287
- @0xmohit made their first contribution in #296
- @asmaloney made their first contribution in #298
- @fitzsim made their first contribution in #320
- @NielsMayer made their first contribution in #340
- @aviks made their first contribution in #345
- @eltociear made their first contribution in #346
- @abitofevrything made their first contribution in #368
- @Mike-Bell made their first contribution in #381
- @dnhkng made their first contribution in #383
- @prsyahmi made their first contribution in #353
- @ianb made their first contribution in #391
Full Changelog: v1.0.4...v1.1.0
Highlights
- Sample SwiftUI application: examples/whisper.swiftui