Releases: ggerganov/whisper.cpp
v1.5.2
Overview
Minor maintenance release:
- Re-enable CPU BLAS processing after fixing a regression (#1583)
- Add new example: wchess
wchess-0.mp4
Shoutout to @fraxy-v (implementation) and @ejones (grammar) for making it work!
What's Changed
- automatically convert audio on the server by @sapoepsilon in #1539
- CI : Rectify the Clang-Related workflow issues by @bobqianic in #1551
- CI : Add CUDA 11.8.0 support by @bobqianic in #1554
- Update main program help info by @bebound in #1560
- Set default CORS headers to allow all by @kasumi-1 in #1567
- cmake : install required ggml.h header by @gjasny in #1568
- Backport .srt output format to examples/server by @osdrv in #1565
- Added support for .vtt format to Whisper server by @aleksanderandrzejewski in #1578
- ggml : re-enable blas for src0 != F32 by @ggerganov in #1583
- Fix 32-bit compiler warning by @Digipom in #1575
- Remove #if arch(arm) check in Swift Package Manager by @finnvoor in #1561
- Pass max-len argument to server wparams by @osdrv in #1574
- sync : ggml (new ops, new backend, etc) by @ggerganov in #1602
- Fix `ggml_metal_log` on Intel macs by @finnvoor in #1606
- Update CMakeLists.txt by @Kreijstal in #1615
- target windows 8 or above for prefetchVirtualMemory in llama-talk by @Kreijstal in #1617
- sync : ggml (Metal fixes, new ops, tests) by @ggerganov in #1633
- wchess: whisper assisted chess by @fraxy-v in #1595
New Contributors
- @sapoepsilon made their first contribution in #1539
- @bebound made their first contribution in #1560
- @kasumi-1 made their first contribution in #1567
- @gjasny made their first contribution in #1568
- @osdrv made their first contribution in #1565
- @aleksanderandrzejewski made their first contribution in #1578
- @Kreijstal made their first contribution in #1615
- @fraxy-v made their first contribution in #1595
Full Changelog: v1.5.1...v1.5.2
v1.5.1
Overview
Minor update:
- With Metal, auto-fallback to CPU if the device does not support the Apple7 family
- Add server example (usage sketch below)
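For the new server example, a minimal usage sketch (the port, file path and flags follow the example's README and are illustrative):

```sh
# start the transcription server
./server -m models/ggml-base.en.bin --port 8080

# transcribe a file over HTTP
curl 127.0.0.1:8080/inference \
     -H "Content-Type: multipart/form-data" \
     -F file="@samples/jfk.wav" \
     -F response_format="json"
```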
What's Changed
- ISSUE-1329: replace " with ' so it doesn't try to execute code in backticks by @spullara in #1364
- sync : ggml (ggml-alloc + linker + gguf fixes) by @ggerganov in #1501
- Fixed with_state methods, to use the correct state by @sandrohanea in #1519
- #1517 Redistribute CUDA DLLs by @tamo in #1522
- whisper : reuse whisper_decode_with_state by @ggerganov in #1521
- sdl : fix audio callback by @ggerganov in #1523
- update deprecated example by @MightyStud in #1529
- Super Simple Whisper Server by @felrock in #1380
- Close file after writing in server application by @felrock in #1533
- bench : multi-thread memcpy by @ggerganov in #1534
- Change temp file name for server application by @felrock in #1535
- Fixed Makefile for MacOS ARM 64 Go bindings by @gleicon in #1530
- Fixed metal build on macos-latest by @sandrohanea in #1544
- fix(server): typo in temperature parameter by @Okabintaro in #1545
- Request to add a new function to get the full language name by @bradmit in #1546
- server : add --print-realtime param by @ecneladis in #1541
- cuda : sync some minor stuff from llama.cpp by @ggerganov in #1548
- metal : add backend function to check device family support by @ggerganov in #1547
New Contributors
- @spullara made their first contribution in #1364
- @MightyStud made their first contribution in #1529
- @felrock made their first contribution in #1380
- @gleicon made their first contribution in #1530
- @Okabintaro made their first contribution in #1545
- @bradmit made their first contribution in #1546
- @ecneladis made their first contribution in #1541
Full Changelog: v1.5.0...v1.5.1
v1.5.0
Overview
This major release includes the following changes:
- Full GPU processing of the Encoder and the Decoder with CUDA and Metal is now supported
- Efficient beam-search implementation via batched decoding and unified KV cache
- Full quantization support of all available `ggml` quantization types
- Support for grammar-constrained sampling
- Support for Distil Whisper models
- Support for Whisper Large-v3
and more
Full GPU support
On Apple Silicon, GPU support has been available to a large extent since 15 Sep. However, part of the Encoder was still being executed on the CPU due to lack of MSL kernels for the convolution operations. These kernels are now available, resulting in an additional speed-up of the Encoder in this release:
Encoder performance on Apple M1 Max - before and after (plot by @dreness)
For NVIDIA hardware, the entire computation can now be offloaded to the GPU, which results in a significant performance boost. For a detailed performance breakdown, check out the Benchmarks section below.
The GPU processing on Apple Silicon is enabled by default, while for NVIDIA you need to build with `WHISPER_CUBLAS=1`:

```sh
# Apple Silicon
make

# NVIDIA
WHISPER_CUBLAS=1 make
```
Implementation: #1472
Special credits to: @FSSRepo, @slaren
Batched decoding + efficient Beam Search
At last, `whisper.cpp` now supports efficient Beam Search decoding. The missing piece was the implementation of batched decoding, which now closely follows the unified KV cache idea from llama.cpp. On modern NVIDIA hardware, the performance with 5 beams is the same as with 1 beam thanks to the large amount of computing power available. With Metal, the speed with 5 beams is a bit slower compared to 1 beam, but it is significantly faster than the 5x single-batch time observed with the old naive implementation.
Beam Search is now enabled by default in `whisper.cpp` to match the OG implementation of OpenAI Whisper. For more performance details, check out the Benchmarks section below.
Implementation: #1486
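As a rough usage sketch, beam parameters can be set from the command line of the `main` example via the `-bs` (beam size) and `-bo` (best-of) flags (model and audio paths are illustrative):

```sh
# transcribe with 5 beams and 5 best-of candidates
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5 -bo 5
```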
Quantization support
All `ggml` quantization types are now supported. Quantization mixtures for the Whisper model can be implemented. It's still unclear how the quality is affected by quantization - this is an interesting area which can be explored in the future.
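A minimal sketch of producing a quantized model with the bundled `quantize` tool (the model name and the `q5_0` type are illustrative):

```sh
# build the quantize tool, then convert an F16 model to 5-bit
make quantize
./quantize models/ggml-medium.en.bin models/ggml-medium.en-q5_0.bin q5_0
```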
Grammar sampling
The decoder output can now be constrained with a GBNF grammar. This can be a useful technique for further improving the transcription quality in situations where the set of possible phrases is limited.
whisper-chess.mp4
Implementation: #1229
Special credits to @ejones
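A hedged usage sketch, assuming the `--grammar` and `--grammar-rule` flags added to the `main` example in #1229 (the grammar file and rule names here are hypothetical):

```sh
# constrain decoding with a GBNF grammar (hypothetical file and start rule)
./main -m models/ggml-base.en.bin -f samples/commands.wav \
       --grammar chess.gbnf --grammar-rule move
```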
Distil Whisper
Recently, Distil Whisper models have been released: https://huggingface.co/distil-whisper
`whisper.cpp` offers support for these models, although it still lacks a full implementation of the proposed chunking strategy. Performance details for the distilled models are included in the Benchmarks section below.
Implementation: #1424
Whisper Large-v3
Recently, OpenAI released a new version 3 of the Large model: openai/whisper#1761
Implementation: #1444
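A hedged sketch of trying it out, assuming the download script has been updated with the new model name in #1444:

```sh
./models/download-ggml-model.sh large-v3
./main -m models/ggml-large-v3.bin -f samples/jfk.wav
```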
Benchmarks
Below is a breakdown of the performance of `whisper.cpp` on Apple Silicon, NVIDIA and CPU. The tables show the Encoder and Decoder speed in `ms/tok`. The `Dec.` column corresponds to batch size 1. The `Bch5` column corresponds to batch size 5. The `PP` column corresponds to batch size 128.
For optimal Beam Search performance, the `Bch5` number should be 5 times smaller than `Dec.`
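To get comparable numbers on your own hardware, a minimal sketch using the bundled `bench` example (model path is illustrative; the full tables below were produced with an extended benchmarking setup):

```sh
# build and run the benchmark with 1 thread
make bench
./bench -m models/ggml-base.en.bin -t 1
```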
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | METAL | tiny | 1 | 11.14 | 1.40 | 0.49 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_0 | 1 | 11.51 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | tiny-q5_1 | 1 | 12.21 | 1.41 | 0.52 | 0.01 | ccc85b4 |
M2 Ultra | METAL | base | 1 | 20.21 | 2.05 | 0.77 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_0 | 1 | 19.89 | 1.96 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | base-q5_1 | 1 | 20.14 | 2.02 | 0.81 | 0.02 | ccc85b4 |
M2 Ultra | METAL | small | 1 | 51.01 | 3.97 | 1.74 | 0.05 | ccc85b4 |
M2 Ultra | METAL | small-q5_0 | 1 | 56.86 | 4.09 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | small-q5_1 | 1 | 56.81 | 4.14 | 1.85 | 0.06 | ccc85b4 |
M2 Ultra | METAL | medium | 1 | 141.21 | 8.47 | 3.98 | 0.13 | ccc85b4 |
M2 Ultra | METAL | medium-q5_0 | 1 | 160.56 | 8.27 | 4.18 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-q5_1 | 1 | 160.52 | 8.40 | 4.15 | 0.14 | ccc85b4 |
M2 Ultra | METAL | medium-dis | 1 | 128.14 | 1.13 | 0.43 | 0.02 | ccc85b4 |
M2 Ultra | METAL | large-v2 | 1 | 248.73 | 11.96 | 6.08 | 0.22 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_0 | 1 | 286.31 | 11.99 | 6.60 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-q5_1 | 1 | 284.56 | 12.42 | 6.47 | 0.26 | ccc85b4 |
M2 Ultra | METAL | large-v2-dis | 1 | 224.31 | 1.26 | 0.49 | 0.02 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
M2 Ultra | COREML METAL | tiny | 1 | 7.60 | 1.41 | 0.50 | 0.01 | ccc85b4 |
M2 Ultra | COREML METAL | base | 1 | 11.90 | 2.07 | 0.78 | 0.02 | ccc85b4 |
M2 Ultra | COREML METAL | small | 1 | 32.19 | 4.10 | 1.78 | 0.05 | ccc85b4 |
M2 Ultra | COREML METAL | medium | 1 | 94.43 | 8.40 | 3.89 | 0.12 | ccc85b4 |
M2 Ultra | COREML METAL | large-v2 | 1 | 179.78 | 12.12 | 6.07 | 0.22 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
NVIDIA V100 | BLAS CUDA | tiny | 1 | 8.84 | 1.62 | 0.33 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_0 | 1 | 8.43 | 1.19 | 0.31 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | tiny-q5_1 | 1 | 8.41 | 1.19 | 0.29 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base | 1 | 14.79 | 2.31 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_0 | 1 | 15.05 | 1.66 | 0.44 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | base-q5_1 | 1 | 15.01 | 1.68 | 0.46 | 0.03 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small | 1 | 40.30 | 4.37 | 0.88 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_0 | 1 | 41.17 | 3.11 | 0.94 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | small-q5_1 | 1 | 41.12 | 3.11 | 0.82 | 0.05 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium | 1 | 104.93 | 10.06 | 1.77 | 0.11 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_0 | 1 | 107.11 | 6.13 | 2.07 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-q5_1 | 1 | 107.91 | 6.21 | 1.77 | 0.12 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | medium-dis | 1 | 103.45 | 1.11 | 0.24 | 0.02 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2 | 1 | 171.55 | 15.76 | 2.62 | 0.17 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_0 | 1 | 176.27 | 8.61 | 3.17 | 0.19 | ccc85b4 |
NVIDIA V100 | BLAS CUDA | large-v2-q5_1 | 1 | 176.23 | 8.67 | 2.59 | 0.19 | ccc85b4 |
Hw | Config | Model | Th | Enc. | Dec. | Bch5 | PP | Commit |
---|---|---|---|---|---|---|---|---|
AMD Ryzen 9 5950X | AVX2 | tiny | 8 | 197.47 | 1.22 | 0.44 | 0.25 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_0 | 8 | 222.92 | 0.87 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | tiny-q5_1 | 8 | 221.25 | 0.89 | 0.45 | 0.30 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base | 8 | 427.14 | 3.11 | 0.88 | 0.43 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_0 | 8 | 474.96 | 1.41 | 0.72 | 0.51 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | base-q5_1 | 8 | 485.05 | 1.48 | 0.73 | 0.52 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small | 8 | 1470.51 | 11.70 | 2.89 | 1.21 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_0 | 8 | 1700.43 | 5.48 | 1.98 | 1.41 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | small-q5_1 | 8 | 1719.03 | 5.79 | 2.02 | 1.42 | ccc85b4 |
AMD Ryzen 9 5950X | AVX2 | medium | 8 | 4417.70 | 35.13 | 8.14... |
v1.4.3
This is a minor release, the main reason for which is that there hasn't been an official release for a few months now and some small things have accumulated on the `master` branch that would be nice to upstream. I am planning a major `v1.5.0` release with some new and long-awaited functionality soon:
- Full CUDA offloading
- Efficient Beam-Search implementation
- Grammar support
The current version `v1.4.3` should be considered a beta, as I haven't worked intensively on `whisper.cpp` recently and there might be some issues that made their way into the code. I'll try to polish things in the coming days and prepare a stable `v1.5.0` release. In the meantime, any feedback will be highly appreciated.
Detailed API changes, features and new contributor recognitions will be included in the `v1.5.0` release.
v1.4.0
Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the `ggml` Whisper models to be converted from the default 16-bit floating point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller in disk size and memory usage and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
- Supported quantization modes: `Q4_0`, `Q4_1`, `Q4_2`, `Q5_0`, `Q5_1`, `Q8_0`
- Implementation details: #540
- Usage instructions: README (see also the sketch below)
- All WASM examples now support `Q5` quantized models: https://whisper.ggerganov.com
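An end-to-end sketch following the README instructions (model names and the `q4_0` type are illustrative):

```sh
# build the tools, quantize a model to 4-bit, then transcribe with it
make main quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q4_0.bin q4_0
./main -m models/ggml-base.en-q4_0.bin -f samples/jfk.wav
```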
Here is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results can give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
`Q4_0` | 17.507 | 76 | 1.53 |
`Q4_1` | 17.187 | 72 | 1.68 |
`Q4_2` | 17.060 | 85 | 1.53 |
`Q5_0` | 16.194 | 78 | 1.60 |
`Q5_1` | 15.851 | 81 | 1.68 |
`Q8_0` | 15.652 | 89 | 2.13 |
`FP16` | 15.623 | 117 | 2.82 |
`FP32` | 15.623 | 198 | 5.64 |
ref: ggerganov/ggml#89 (comment)
This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
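A build sketch, assuming the CUDA toolkit is installed:

```sh
WHISPER_CUBLAS=1 make -j
```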
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in #776
- examples : add missing #include by @pH5 in #798
- Flush upon finishing inference by @tarasglek in #811
- Escape quotes in csv output by @laytan in #815
- C++11style by @wuyudi in #768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in #812
- add some tips about in the readme of the android project folder by @Zolliner in #816
- whisper: Use correct seek_end when offset is used by @ThijsRay in #833
- ggml : fix 32-bit ARM NEON by @ggerganov in #836
- Add CUDA support via cuBLAS by @ggerganov in #834
- Integer quantisation support by @ggerganov in #540
New Contributors
- @tauseefmohammed2 made their first contribution in #776
- @pH5 made their first contribution in #798
- @tarasglek made their first contribution in #811
- @laytan made their first contribution in #815
- @wuyudi made their first contribution in #768
- @Canis-UK made their first contribution in #812
- @Zolliner made their first contribution in #816
- @ThijsRay made their first contribution in #833
Full Changelog: v1.3.0...v1.4.0
v1.3.0
Overview
This release should be considered in beta stage, since I haven't done a lot of testing and I am not sure I didn't break something.
But overall, I believe both the performance and the quality are improved.
- Added Core ML support #566 (build sketch after this list)
- Restored decoding fallbacks with default size of 2 instead of 5 (f19e23f)
- Pad the audio with zeros instead of the spectrogram (5108b30)
- Added talk-llama example
- Added `whisper_state`, which allows parallel transcriptions with a single model in memory (#523)
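For the Core ML support mentioned above, a hedged sketch following the steps from the README (the model name is illustrative; the generation script requires the Python Core ML tooling):

```sh
# generate a Core ML model of the encoder, then build and run with Core ML
./models/generate-coreml-model.sh base.en
WHISPER_COREML=1 make -j
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```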
The C-style API has been extended significantly to support the new `whisper_state`, but in general it should be backwards compatible.
The only breaking change is in the callback signatures.
Please provide feedback in the discussion if you observe any issues.
The next release, `v1.4.0`, will follow relatively soon and will provide 4-bit integer quantization support.
What's Changed
- update csv output format to match OpenAI's Whisper dataframe output by @hykelvinlee42 in #552
- Go binding: NewContext now returns a clean context by @polarmoon in #537
- Added whisper state + default state on the whisper_context by @sandrohanea in #523
- whisper.android: Enable fp16 instrinsics (FP16_VA) which is supported by ARMv8.2 or later. by @tinoue in #572
- Add quality comparison helper by @venkr in #569
- whisper.android: Support benchmark for Android example. by @tinoue in #542
- Fix MUSL Linux build by @ggerganov in #576
- Change default encoding to UTF-8 by @Kamilake in #605
- Provide option for creating JSON output by @tuxpoldo in #615
- readme : add react-native bindings by @jhen0409 in #619
- Fixed language auto-detection for state provided processing. by @sandrohanea in #627
- xcodeproj : add `-O3 -DNDEBUG` in release mode by @jhen0409 in #640
- Nodejs Addon blocking main thread. Implemented Napi::AsyncWorker by @LucasZNK in #642
- Include link to R wrapper in README by @jwijffels in #626
- Add a cmake flag to disable F16C by @a5huynh in #628
- Add talk-llama example by @ggerganov in #664
- Add Alpaca support to talk-llama example by @ejones in #668
- Update README.md by @razodactyl in #682
- issue #470 - working 32-bit ARM by @clach04 in #486
- whisper : add initial_prompt param by @jhen0409 in #645
- fix typo in JSON output by @egorFiNE in #648
- Fix shell script ./models/download-ggml-model.sh to handle spaces and special characters in paths by @be-next in #677
- Fixed test to new async implementation by @LucasZNK in #686
- Minor: fixing usage message for talk-llama by @InconsolableCellist in #687
- Small typo by @ZiggerZZ in #688
- feat: add progress callback by @pajowu in #600
- ggml : fix q4_1 dot product types by @novag in #759
- Exposed various parts to the Go Interface by @bmurray in #697
- Adds shell command example for --print-colors by @bocytko in #710
- Makefile: disable avx in case f16c is not available by @duthils in #706
- Making the quick start instructions clearer. by @Onlyartist9 in #716
- Add lrc output support by @WhichWho in #718
- Corrects default speak.sh path in talk-llama by @mab122 in #720
- Add msvc compiler args /utf-8 fix error C3688 by @WhichWho in #721
- Changed convert-pt-to-ggml.py to use .tiktoken tokenizer files by @ivan-gorin in #725
- talk/talk-llama: add basic example script for eleven-labs tts by @DGdev91 in #728
- readme : add Unity3d bindings by @Macoron in #733
- Update stream.cpp by @AliAlameh in #501
- Fix typos in whisper.h by @GitAritron in #737
- Update LICENSE by @masguit42 in #739
- fix potential memory leaks by @baderouaich in #740
- readme: Add alternate swift bindings by @exPHAT in #755
- Fix the bug related to word splitting errors in the "tokenize" function. by @AfryMask in #760
- Do not launch threads for `log_mel_spectrogram` when singlethreaded by @maxilevi in #763
- Core ML support by @ggerganov in #566
- ggml : fix build on whisper.android (ARM_NEON) by @jhen0409 in #764
New Contributors
- @hykelvinlee42 made their first contribution in #552
- @tinoue made their first contribution in #572
- @venkr made their first contribution in #569
- @Kamilake made their first contribution in #605
- @tuxpoldo made their first contribution in #615
- @jhen0409 made their first contribution in #619
- @LucasZNK made their first contribution in #642
- @jwijffels made their first contribution in #626
- @a5huynh made their first contribution in #628
- @ejones made their first contribution in #668
- @razodactyl made their first contribution in #682
- @clach04 made their first contribution in #486
- @egorFiNE made their first contribution in #648
- @be-next made their first contribution in #677
- @InconsolableCellist made their first contribution in #687
- @ZiggerZZ made their first contribution in #688
- @pajowu made their first contribution in #600
- @novag made their first contribution in #759
- @bmurray made their first contribution in #697
- @bocytko made their first contribution in #710
- @duthils made their first contribution in #706
- @Onlyartist9 made their first contribution in #716
- @WhichWho made their first contribution in #718
- @mab122 made their first contribution in #720
- @ivan-gorin made their first contribution in #725
- @DGdev91 made their first contribution in #728
- @Macoron made their first contribution in #733
- @AliAlameh made their first contribution in #501
- @GitAritron made their first contribution in #737
- @masguit42 made their first contribution in #739
- @baderouaich made their first contribution in #740
- @exPHAT made their first contribution in #755
- @AfryMask made their first contribution in #760
- @maxilevi made their first contribution in #763
Full Changelog: v1.2.1...v1.3.0
v1.2.1
Overview
This is a minor release. The main reason for it is a critical fix for a bug that caused the software to crash randomly when the language auto-detect option was used (i.e. `whisper_lang_auto_detect()`).
Other than that, the release includes refactoring of the examples, ruby bindings and some minor changes to the C API.
You can provide feedback in the existing v1.2.0 discussion.
What's Changed
Core `ggml` / `whisper`
- `whisper` : add "split_on_word" flag when using the "max_len" option by @mightymatth in #455 and @boolemancer in #476
- `whisper` : add whisper_full_lang_id() for getting the context lang by @kamranjon in #461
- `whisper` : fixed Beam Search Strategy and exposed whisper_pcm_to_mel_phase_vocoder by @sandrohanea in #474
- `whisper` : suppress non-speech-related token outputs by @shibukazu in #473
- `cmake` : install whisper.h header by @aviks in #485
- `whisper` : fix signedness compiler warning by @shikokuchuo in #506
- `whisper` : by default disable non-speech tokens suppression #473
- `whisper` : add API for applying custom logits filters during decoding 0d22916
- `whisper` : fix uninitialized `exp_n_audio_ctx` by @finnvoor in #520
Bindings
- `bindings` : add Ruby by @taf2 in #500
- `readme` : add .NET repos (#303)
- `readme` : add cython bindings (#9)
- `readme` : add pybind11 bindings by @aarnphm in #538
Examples
- `ci` : add node addon test and optimize compilation configuration by @chenqianhe in #468
- `yt-wsp.sh` : add unique filename generation by @genevera in #495
- `examples` : refactor in order to reuse code and reduce duplication by @ggerganov in #482
- `main` : fix stdin pipe stream by @conradg in #503
- `make` : add "-mcpu=native" when building for aarch64 (#532)
C-style API
- Add `whisper_pcm_to_mel_phase_vocoder()`
- Add `*(whisper_logits_filter_callback)()`
- Change `struct whisper_full_params`
- Add `whisper_full_lang_id()`
New Contributors
- @mightymatth made their first contribution in #455
- @kamranjon made their first contribution in #461
- @sandrohanea made their first contribution in #474
- @shibukazu made their first contribution in #473
- @genevera made their first contribution in #495
- @shikokuchuo made their first contribution in #506
- @conradg made their first contribution in #503
- @taf2 made their first contribution in #500
- @finnvoor made their first contribution in #520
- @aarnphm made their first contribution in #538
- @FlippFuzz made their first contribution in #532
Full Changelog: v1.2.0...v1.2.1
Highlights
Recently, I have been making progress on adding integer quantisation support in the `ggml` tensor library. This will eventually allow using quantised models, which require less memory and will hopefully run faster. I think the next major release, `v1.3.0`, will officially add quantisation support. For now, you can keep track of the progress in #540.
- 🎙️ MacWhisper by @jordibruin, powered by `whisper.cpp`
v1.2.0
Overview
In this release we significantly reduce the memory usage during inference by introducing "scratch" buffers to `ggml`.
The new memory requirements per model are as follows:
Model | Disk | Mem (Old) | Mem (New) |
---|---|---|---|
tiny | 75 MB | ~390 MB | ~125 MB |
base | 142 MB | ~500 MB | ~210 MB |
small | 466 MB | ~1.0 GB | ~600 MB |
medium | 1.5 GB | ~2.6 GB | ~1.7 GB |
large | 2.9 GB | ~4.7 GB | ~3.3 GB |
The idea is simple: instead of creating a new memory buffer for each new tensor in the computation, we reuse the memory of old tensors that are no longer needed. The implementation is in PR #431. It's not very clean - I think there is some better way to do this, but for now it will work.
Additionally, there might be some inference speed improvements on Apple Silicon in the Decoder part of the transformer. I haven't done proper benchmarks, but there seems to be about a ~30% performance boost. The results are identical to `v1.1.1`.
What's Changed
Core `ggml` / `whisper`
- `whisper` : PPC64 big-endian support by @fitzsim in #398
- `whisper` : condition sampled timestamp tokens to be monotonically increasing by @ggerganov in #425
- `wasm` : fix typo in helper.js by @bhbs in #459
- `ggml` / `whisper` : reduce memory usage during inference by @ggerganov in #431
Bindings
- `ci` : run workflows on pull requests + bindings depend on .h by @ggerganov in #446
- `go` : added wrappers to reset and print timings by @glaslos in #436
- `go` : add WhisperLangAutoDetect method to go binding by @RobinXL in #451
- `go` : add wrapper for system info by @glaslos in #456
- `go` : support "auto" as an option when setting language by @polarmoon in #462
Examples
- `whisper.wasm` : add labels for easier radio selection by @kokes in #435
- `livestream.sh` : run main with model arg instead of default by @EricTendian in #453
- `main` : CSV format export trimmed spaces fix by @alex-bacart in #444
- `addon.node` : using whisper as a Node.js addon by @chenqianhe in #443
New Contributors
- @kokes made their first contribution in #435
- @glaslos made their first contribution in #436
- @EricTendian made their first contribution in #453
- @RobinXL made their first contribution in #451
- @alex-bacart made their first contribution in #444
- @bhbs made their first contribution in #459
- @polarmoon made their first contribution in #462
- @chenqianhe made their first contribution in #443
Full Changelog: v1.1.1...v1.2.0
Highlights
I'll use these release notes to write some random thoughts about the project - sort of a short blog post.
I'm really happy with how `whisper.cpp` has turned out so far. There is a very positive reception in the ML community - most people seem to be excited by the simplicity of the implementation and the fact that it is quite self-contained. I receive a lot of questions about the project and about various ideas that it can be applied to. I really enjoy it and I try to respond to everyone!
I also find it very satisfying that there are so many contributions already happening by so many people. To me this illustrates the power of open-source collaboration. The contributions not only improve the functionality and the quality of the code, but also help to generate various new ideas and approaches to explore.
Another interesting thing is that the project keeps on giving. Every time I start to think that now is a good time to put it in the background for a while and focus on other stuff, some new cool idea pops up and I can't help but start working on it. Having this custom implementation allows me to interact with the model on a lower level which opens some interesting ways to explore it.
So far the development has been focused on improving the performance, expanding the platform coverage and having robust decoding strategies with a variety of examples. During this time, several ideas have accumulated that I find interesting to explore (diarization, token-level timestamps, improved timestamp accuracy, etc). I think I'll try to focus more on these in the future and see if I can achieve something interesting.
- Windows port of `whisper.cpp` utilising vendor-agnostic GPGPU based on DirectCompute by @Const-me
- "The New Yorker" article featuring `whisper.cpp`
v1.1.1
Overview
Since the v1.1.0 pre-release there have been several reports of improved transcription quality. Together with my observations, I think we can declare version `v1.1.1` as "stable".
There were actually a couple of bug fixes implemented since `v1.1.0`, so make sure to update to `v1.1.1` for optimal results.
Another update is that the prototype for v1.2.0 is almost ready: #431
Initial results indicate that the memory usage can be reduced by a factor of 2-3 for the smaller models.
You can provide feedback in the existing v1.1.0 discussion.
What's Changed
Core `ggml` / `whisper`
- `whisper` : perform entropy check only when we have at least 32 tokens 1a91c19
- `whisper` : fix condition for providing past prompt (critical) 78f1661
Bindings
- `go` : remove `sample_best` and `sample_timestamp` bindings by @Trojan295 in #409
Examples
- `main` : re-enable temperature fallback f583e2d
- `main` : add an option to accept optional output filenames by @garychia in #424
- `whisper.android` : use AssetManager for Android by @Digipom in #415
- `whisper.wasm` : add small and small.en models 206fc93
- `bench` : add memcpy and ggml_mul_mat benchmarks (experimental) 1290fc6
New Contributors
- @Trojan295 made their first contribution in #409
- @garychia made their first contribution in #424
Full Changelog: v1.1.0...v1.1.1
v1.1.0
Overview
The major change in this pre-release is the improved decoding implementation in `whisper.cpp`:
- Support for average logprob and entropy-based criteria for fallback
- Support for temperature `T > 0`
- Improved Greedy decoder via the `best_of` parameter for `T > 0`
- Add Beam Search decoding (a.k.a. `beam_size`)
More information about the decoding changes can be found in #291
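As a rough command-line sketch of the new decoding options, assuming the `-bo` (best-of) and `-bs` (beam size) flags of the `main` example added with this change (paths are illustrative):

```sh
# greedy decoding, keeping the best of 5 candidates
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bo 5

# beam search decoding with 5 beams
./main -m models/ggml-base.en.bin -f samples/jfk.wav -bs 5
```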
Additionally, there are a few performance improvements for Apple Silicon, WASM and non-F16C platforms.
Support for POWER9 architectures has been added.
The reason that this is a pre-release and not an official release is that the new implementation has not been sufficiently tested yet and the existing bindings for other languages have not been updated to support the API changes. The official release `1.1.x` will be created when there is enough feedback about the new decoding implementation and when the bindings have been updated. So make sure to send your feedback in the discussion created for this pre-release. For now, the `1.0.4` release should be considered more stable.
What's Changed
Core `ggml` / `whisper`
- `ggml` : POWER9 support by @fitzsim in #320, #349, #369
- `ggml` : simplify the SIMD code by @ggerganov in #324
- `ggml` : add SSE3 and fp16 conversion lookup table by @abitofevrything in #368
- `ggml` : utilise Accelerate's vDSP for some computations d51fc3e
- `ggml` : speed-up softmax compute via Accelerate and loop unrolling d61d55c
- `ggml` : do not start extra threads when using BLAS d347a59
- `whisper` : do sample_to_timestamp calculation with 64-bit precision to avoid overflow by @boolemancer in #388
- `whisper` : various code clean-up and improvements by @asmaloney in #317 #318 #319 #322 etc
- `whisper` : improve decoding by @ggerganov in #291
- `whisper` : account for speed_up flag for short audio #405
C-style API
- Add loader class to allow loading from buffer and others by @prsyahmi in #353
- Add `whisper_token_data::plog`
- Add `whisper_init_from_file()`
- Add `whisper_init_from_buffer()`
- Change `whisper_init()`
- Remove `whisper_sample_best()`
- Remove `whisper_sample_timestamp()`
- Add `whisper_n_audio_ctx()`
- Add `whisper_get_logits()`
- Remove `whisper_get_probs()`
- Change `struct whisper_full_params`
Bindings
Examples
- `whisper.android` : remove android ABI constraint by @Digipom in #301
- `whisper.swiftui` : SwiftUI example by @Digipom in #308
- `main` : add `-ocsv`, aka `--output-csv`, for writing a CSV file containing millisecond timestamps by @NielsMayer in #340
- `command` : refactor to split command list & general transcription modes by @asmaloney in #331
- `command` : always-prompt mode by @dnhkng in #383
- `stream` : fix data race on bool + avoid division-by-zero a466c34
- `stream` : fix a bug that inserted a lot of empty audio at the start a6dbd91
- `bench.wasm` : print system info fafd789
New Contributors
- @djthorpe made their first contribution in #287
- @0xmohit made their first contribution in #296
- @asmaloney made their first contribution in #298
- @fitzsim made their first contribution in #320
- @NielsMayer made their first contribution in #340
- @aviks made their first contribution in #345
- @eltociear made their first contribution in #346
- @abitofevrything made their first contribution in #368
- @Mike-Bell made their first contribution in #381
- @dnhkng made their first contribution in #383
- @prsyahmi made their first contribution in #353
- @ianb made their first contribution in #391
Full Changelog: v1.0.4...v1.1.0
Highlights
- Sample SwiftUI application: examples/whisper.swiftui