whisper-turbo support #2439

Open

jooray opened this issue Oct 1, 2024 · 12 comments

Comments

@jooray

jooray commented Oct 1, 2024

OpenAI released Whisper-Turbo, a drop-in replacement for the large model: multilingual, 8x faster, and lighter on memory, with minimal degradation in performance.

https://github.com/openai/whisper

@kth8

kth8 commented Oct 1, 2024

I created the GGML model and ran benchmarks on my desktop with an Nvidia GPU.
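
Roughly, the conversion looked like this (a minimal sketch using the standard whisper.cpp scripts, not my exact command history; paths depend on your checkout and build):

# Fetch the PyTorch checkpoint via openai-whisper (saved under ~/.cache/whisper)
$ python -c "import whisper; whisper.load_model('large-v3-turbo')"

# Convert the checkpoint to GGML; the second argument is a checkout of openai/whisper
$ python models/convert-pt-to-ggml.py ~/.cache/whisper/large-v3-turbo.pt /path/to/whisper ./models

# Quantize with the bundled quantize tool
$ ./quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0

Benchmark results: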

$ hyperfine --warmup 1 --runs 5 "bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav"
Benchmark 1: bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      4.354 s ±  0.480 s    [User: 5.907 s, System: 0.395 s]
  Range (min … max):    3.856 s …  5.146 s    5 runs
 
Benchmark 2: bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      5.280 s ±  0.130 s    [User: 6.837 s, System: 0.393 s]
  Range (min … max):    5.143 s …  5.478 s    5 runs
 
Benchmark 3: bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      8.489 s ±  0.142 s    [User: 9.968 s, System: 0.460 s]
  Range (min … max):    8.305 s …  8.635 s    5 runs
 
Benchmark 4: bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     17.171 s ±  0.149 s    [User: 18.368 s, System: 0.610 s]
  Range (min … max):   16.995 s … 17.390 s    5 runs
 
Benchmark 5: bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     14.601 s ±  0.086 s    [User: 15.714 s, System: 0.556 s]
  Range (min … max):   14.495 s … 14.691 s    5 runs
 
Summary
  bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.21 ± 0.14 times faster than bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    1.95 ± 0.22 times faster than bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.35 ± 0.37 times faster than bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.94 ± 0.44 times faster than bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav

@jordibruin

@kth8 nice results! Did you put the ggml online somewhere? Would save me some time converting 🙂

@milsun

milsun commented Oct 1, 2024

Maybe share the GGML?

@kth8

kth8 commented Oct 1, 2024

I haven't, but here are my benchmarks for the unquantized models as well:

$ hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo 'bin/main --flash-attn -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav'
Benchmark 1: bin/main --flash-attn -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      5.432 s ±  0.201 s    [User: 6.871 s, System: 0.391 s]
  Range (min … max):    5.156 s …  5.720 s    5 runs
 
Benchmark 2: bin/main --flash-attn -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      6.389 s ±  0.408 s    [User: 7.970 s, System: 0.388 s]
  Range (min … max):    5.867 s …  6.991 s    5 runs
 
Benchmark 3: bin/main --flash-attn -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     12.351 s ±  0.239 s    [User: 13.748 s, System: 0.545 s]
  Range (min … max):   12.021 s … 12.587 s    5 runs
 
Benchmark 4: bin/main --flash-attn -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     27.515 s ±  0.254 s    [User: 28.485 s, System: 0.777 s]
  Range (min … max):   27.084 s … 27.718 s    5 runs
 
Benchmark 5: bin/main --flash-attn -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     18.516 s ±  0.106 s    [User: 19.644 s, System: 0.827 s]
  Range (min … max):   18.392 s … 18.659 s    5 runs
 
Summary
  bin/main --flash-attn -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.18 ± 0.09 times faster than bin/main --flash-attn -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    2.27 ± 0.09 times faster than bin/main --flash-attn -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.41 ± 0.13 times faster than bin/main --flash-attn -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    5.07 ± 0.19 times faster than bin/main --flash-attn -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav

@kth8

kth8 commented Oct 1, 2024

It has been uploaded to the HF repo: https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-large-v3-turbo.bin
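
With that in place, the usual download-and-run flow should work (a sketch, assuming the download script already lists the new model name):

$ bash ./models/download-ggml-model.sh large-v3-turbo
$ ./main -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav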

@milsun

milsun commented Oct 1, 2024

great, thanks!!!

@kth8

kth8 commented Oct 1, 2024

But when I try using stream with this model, it keeps printing this:

$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin --step 1500 2>/dev/null
[Start speaking]
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
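
The stream README also documents a VAD-triggered sliding-window mode (--step 0), which might behave differently on silence; I haven't verified it with this model:

# Transcribe in 30-second windows, triggered when voice activity is detected
$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin -t 6 --step 0 --length 30000 -vth 0.6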

@solaoi

solaoi commented Oct 1, 2024

I've generated a CoreML version of Whisper-Turbo for Mac users.

You can find more information about it here:
https://huggingface.co/ggerganov/whisper.cpp/discussions/19
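
For reference, the generation roughly follows the standard whisper.cpp CoreML steps (a sketch, assuming the scripts accept the large-v3-turbo model name):

# Conversion dependencies
$ pip install ane_transformers openai-whisper coremltools

# Produces models/ggml-large-v3-turbo-encoder.mlmodelc
$ ./models/generate-coreml-model.sh large-v3-turbo

# Rebuild whisper.cpp with CoreML support
$ cmake -B build -DWHISPER_COREML=1
$ cmake --build build -j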

I've included a brief performance comparison for transcribing the samples/jfk.wav file:

Metal:

whisper_print_timings:     load time =   720.99 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.83 ms
whisper_print_timings:   sample time =    34.99 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =  1892.98 ms /     1 runs ( 1892.98 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   256.76 ms /   146 runs (    1.76 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6028.79 ms

Metal & CoreML (first run):

whisper_print_timings:     load time =   511.85 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.03 ms
whisper_print_timings:   sample time =    34.10 ms /   148 runs (    0.23 ms per run)
whisper_print_timings:   encode time =   748.83 ms /     1 runs (  748.83 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   233.59 ms /   146 runs (    1.60 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 356773.66 ms

Metal & CoreML (second run):

whisper_print_timings:     load time =   643.47 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.74 ms
whisper_print_timings:   sample time =    35.67 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   689.74 ms /     1 runs (  689.74 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   253.82 ms /   146 runs (    1.74 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  3488.60 ms

After updating the OS from Ventura to Sonoma, I ran the measurements again and found that the results were quite different. I've included the execution logs below for reference.

Metal:

whisper_print_timings:     load time =   506.43 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.22 ms
whisper_print_timings:   sample time =    31.84 ms /   148 runs (    0.22 ms per run)
whisper_print_timings:   encode time =  1786.60 ms /     1 runs ( 1786.60 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   238.42 ms /   146 runs (    1.63 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2586.65 ms

Metal & CoreML (first run):

whisper_print_timings:     load time =   824.32 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.29 ms
whisper_print_timings:   sample time =    35.41 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   734.31 ms /     1 runs (  734.31 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   239.68 ms /   146 runs (    1.64 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 43052.73 ms

Metal & CoreML (second run):

whisper_print_timings:     load time =   504.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     7.14 ms
whisper_print_timings:   sample time =    32.80 ms /   148 runs (    0.22 ms per run)
whisper_print_timings:   encode time =   575.92 ms /     1 runs (  575.92 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   258.99 ms /   146 runs (    1.77 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2077.62 ms

@kth8

kth8 commented Oct 1, 2024

On a base M1 MacBook Air running macOS Sequoia:

% hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo './main -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav'
Benchmark 1: ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      8.404 s ±  0.060 s    [User: 7.612 s, System: 0.701 s]
  Range (min … max):    8.337 s …  8.478 s    5 runs
 
Benchmark 2: ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      9.826 s ±  0.077 s    [User: 8.195 s, System: 0.763 s]
  Range (min … max):    9.702 s …  9.912 s    5 runs
 
Benchmark 3: ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     15.564 s ±  0.024 s    [User: 9.338 s, System: 0.914 s]
  Range (min … max):   15.540 s … 15.603 s    5 runs
 
Benchmark 4: ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     36.906 s ±  0.213 s    [User: 11.062 s, System: 1.384 s]
  Range (min … max):   36.701 s … 37.249 s    5 runs
 
Benchmark 5: ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     20.345 s ±  0.141 s    [User: 9.154 s, System: 1.442 s]
  Range (min … max):   20.182 s … 20.496 s    5 runs
 
Summary
  ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.17 ± 0.01 times faster than ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    1.85 ± 0.01 times faster than ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    2.42 ± 0.02 times faster than ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    4.39 ± 0.04 times faster than ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav

But if I disable the GPU, the results are faster?

% hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo './main -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu'
Benchmark 1: ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):      2.936 s ±  0.007 s    [User: 8.480 s, System: 0.260 s]
  Range (min … max):    2.928 s …  2.944 s    5 runs
 
Benchmark 2: ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):      4.360 s ±  0.050 s    [User: 13.033 s, System: 0.302 s]
  Range (min … max):    4.307 s …  4.426 s    5 runs
 
Benchmark 3: ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     12.319 s ±  0.183 s    [User: 38.069 s, System: 0.539 s]
  Range (min … max):   12.069 s … 12.485 s    5 runs
 
Benchmark 4: ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     37.105 s ±  0.217 s    [User: 111.368 s, System: 1.305 s]
  Range (min … max):   36.786 s … 37.296 s    5 runs
 
Benchmark 5: ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     17.132 s ±  0.180 s    [User: 33.666 s, System: 1.129 s]
  Range (min … max):   16.846 s … 17.340 s    5 runs
 
Summary
  ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu ran
    1.48 ± 0.02 times faster than ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
    4.20 ± 0.06 times faster than ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
    5.84 ± 0.06 times faster than ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
   12.64 ± 0.08 times faster than ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu

@milenamilka755

It has been uploaded to the HG repo: https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-large-v3-turbo.bin

Is a whisper-medium-turbo.bin model planned for release?

@spoeken

spoeken commented Oct 4, 2024

But when I try using stream with this model it keeps printing out this:

$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin --step 1500 2>/dev/null
[Start speaking]
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.

Did you find a solution to this, @kth8?

@kth8

kth8 commented Oct 4, 2024

@spoeken no, I just went back to using the small model for streaming.
