whisper-turbo support #2439

Open

jooray opened this issue Oct 1, 2024 · 12 comments

Comments

@jooray

jooray commented Oct 1, 2024

OpenAI released Whisper-Turbo, a drop-in replacement for the large model: multilingual, 8x faster, and lighter on memory, with minimal degradation in performance.

https://github.com/openai/whisper

@kth8

kth8 commented Oct 1, 2024

I created the GGML model and ran benchmarks on my desktop with an Nvidia GPU.
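
Roughly, the conversion looked like this (a minimal sketch using the standard whisper.cpp scripts, not my exact command history; paths depend on your checkout and build):

# Fetch the PyTorch checkpoint via openai-whisper (saved under ~/.cache/whisper)
$ python -c "import whisper; whisper.load_model('large-v3-turbo')"

# Convert the checkpoint to GGML; the second argument is a checkout of openai/whisper
$ python models/convert-pt-to-ggml.py ~/.cache/whisper/large-v3-turbo.pt /path/to/whisper ./models

# Quantize with the bundled quantize tool
$ ./quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0

Benchmark results: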

$ hyperfine --warmup 1 --runs 5 "bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav" \
    "bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav"
Benchmark 1: bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      4.354 s ±  0.480 s    [User: 5.907 s, System: 0.395 s]
  Range (min … max):    3.856 s …  5.146 s    5 runs
 
Benchmark 2: bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      5.280 s ±  0.130 s    [User: 6.837 s, System: 0.393 s]
  Range (min … max):    5.143 s …  5.478 s    5 runs
 
Benchmark 3: bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      8.489 s ±  0.142 s    [User: 9.968 s, System: 0.460 s]
  Range (min … max):    8.305 s …  8.635 s    5 runs
 
Benchmark 4: bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     17.171 s ±  0.149 s    [User: 18.368 s, System: 0.610 s]
  Range (min … max):   16.995 s … 17.390 s    5 runs
 
Benchmark 5: bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     14.601 s ±  0.086 s    [User: 15.714 s, System: 0.556 s]
  Range (min … max):   14.495 s … 14.691 s    5 runs
 
Summary
  bin/main --flash-attn -m models/ggml-tiny-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.21 ± 0.14 times faster than bin/main --flash-attn -m models/ggml-base.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    1.95 ± 0.22 times faster than bin/main --flash-attn -m models/ggml-small.en-q5_1.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.35 ± 0.37 times faster than bin/main --flash-attn -m models/ggml-large-v3-turbo-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.94 ± 0.44 times faster than bin/main --flash-attn -m models/ggml-medium.en-q5_0.bin -f samples/George_W_Bush_Columbia_FINAL.wav

@jordibruin

@kth8 nice results! Did you put the ggml online somewhere? Would save me some time converting 🙂

@milsun

milsun commented Oct 1, 2024

Maybe share the GGML?

@kth8

kth8 commented Oct 1, 2024

I haven't, but here are my benchmarks for the unquantized models as well:

$ hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo 'bin/main --flash-attn -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav'
Benchmark 1: bin/main --flash-attn -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      5.432 s ±  0.201 s    [User: 6.871 s, System: 0.391 s]
  Range (min … max):    5.156 s …  5.720 s    5 runs
 
Benchmark 2: bin/main --flash-attn -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      6.389 s ±  0.408 s    [User: 7.970 s, System: 0.388 s]
  Range (min … max):    5.867 s …  6.991 s    5 runs
 
Benchmark 3: bin/main --flash-attn -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     12.351 s ±  0.239 s    [User: 13.748 s, System: 0.545 s]
  Range (min … max):   12.021 s … 12.587 s    5 runs
 
Benchmark 4: bin/main --flash-attn -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     27.515 s ±  0.254 s    [User: 28.485 s, System: 0.777 s]
  Range (min … max):   27.084 s … 27.718 s    5 runs
 
Benchmark 5: bin/main --flash-attn -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     18.516 s ±  0.106 s    [User: 19.644 s, System: 0.827 s]
  Range (min … max):   18.392 s … 18.659 s    5 runs
 
Summary
  bin/main --flash-attn -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.18 ± 0.09 times faster than bin/main --flash-attn -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    2.27 ± 0.09 times faster than bin/main --flash-attn -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    3.41 ± 0.13 times faster than bin/main --flash-attn -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    5.07 ± 0.19 times faster than bin/main --flash-attn -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav

@kth8

kth8 commented Oct 1, 2024

It has been uploaded to the HF repo: https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-large-v3-turbo.bin
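
With that in place, the usual download-and-run flow should work (a sketch, assuming the download script already lists the new model name):

$ bash ./models/download-ggml-model.sh large-v3-turbo
$ ./main -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav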

@milsun

milsun commented Oct 1, 2024

great, thanks!!!

@kth8

kth8 commented Oct 1, 2024

But when I try using stream with this model, it keeps printing this:

$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin --step 1500 2>/dev/null
[Start speaking]
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
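
The stream README also documents a VAD-triggered sliding-window mode (--step 0), which might behave differently on silence; I haven't verified it with this model:

# Transcribe in 30-second windows, triggered when voice activity is detected
$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin -t 6 --step 0 --length 30000 -vth 0.6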

@solaoi

solaoi commented Oct 1, 2024

I've generated a CoreML version of Whisper-Turbo for Mac users.

You can find more information about it here:
https://huggingface.co/ggerganov/whisper.cpp/discussions/19
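
For reference, the generation roughly follows the standard whisper.cpp CoreML steps (a sketch, assuming the scripts accept the large-v3-turbo model name):

# Conversion dependencies
$ pip install ane_transformers openai-whisper coremltools

# Produces models/ggml-large-v3-turbo-encoder.mlmodelc
$ ./models/generate-coreml-model.sh large-v3-turbo

# Rebuild whisper.cpp with CoreML support
$ cmake -B build -DWHISPER_COREML=1
$ cmake --build build -j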

I've included a brief performance comparison for transcribing the samples/jfk.wav file:

Metal:

whisper_print_timings:     load time =   720.99 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.83 ms
whisper_print_timings:   sample time =    34.99 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =  1892.98 ms /     1 runs ( 1892.98 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   256.76 ms /   146 runs (    1.76 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  6028.79 ms

Metal & CoreML (first run):

whisper_print_timings:     load time =   511.85 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.03 ms
whisper_print_timings:   sample time =    34.10 ms /   148 runs (    0.23 ms per run)
whisper_print_timings:   encode time =   748.83 ms /     1 runs (  748.83 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   233.59 ms /   146 runs (    1.60 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 356773.66 ms

Metal & CoreML (second run):

whisper_print_timings:     load time =   643.47 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.74 ms
whisper_print_timings:   sample time =    35.67 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   689.74 ms /     1 runs (  689.74 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   253.82 ms /   146 runs (    1.74 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  3488.60 ms

After updating the OS from Ventura to Sonoma, I ran the measurements again and found that the results were quite different. I've included the execution logs below for reference.

Metal:

whisper_print_timings:     load time =   506.43 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.22 ms
whisper_print_timings:   sample time =    31.84 ms /   148 runs (    0.22 ms per run)
whisper_print_timings:   encode time =  1786.60 ms /     1 runs ( 1786.60 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   238.42 ms /   146 runs (    1.63 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2586.65 ms

Metal & CoreML (first run):

whisper_print_timings:     load time =   824.32 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     8.29 ms
whisper_print_timings:   sample time =    35.41 ms /   148 runs (    0.24 ms per run)
whisper_print_timings:   encode time =   734.31 ms /     1 runs (  734.31 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   239.68 ms /   146 runs (    1.64 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 43052.73 ms

Metal & CoreML (second run):

whisper_print_timings:     load time =   504.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     7.14 ms
whisper_print_timings:   sample time =    32.80 ms /   148 runs (    0.22 ms per run)
whisper_print_timings:   encode time =   575.92 ms /     1 runs (  575.92 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   258.99 ms /   146 runs (    1.77 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  2077.62 ms

@kth8

kth8 commented Oct 1, 2024

On a base M1 MacBook Air running macOS Sequoia:

% hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo './main -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav'
Benchmark 1: ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      8.404 s ±  0.060 s    [User: 7.612 s, System: 0.701 s]
  Range (min … max):    8.337 s …  8.478 s    5 runs
 
Benchmark 2: ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):      9.826 s ±  0.077 s    [User: 8.195 s, System: 0.763 s]
  Range (min … max):    9.702 s …  9.912 s    5 runs
 
Benchmark 3: ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     15.564 s ±  0.024 s    [User: 9.338 s, System: 0.914 s]
  Range (min … max):   15.540 s … 15.603 s    5 runs
 
Benchmark 4: ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     36.906 s ±  0.213 s    [User: 11.062 s, System: 1.384 s]
  Range (min … max):   36.701 s … 37.249 s    5 runs
 
Benchmark 5: ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
  Time (mean ± σ):     20.345 s ±  0.141 s    [User: 9.154 s, System: 1.442 s]
  Range (min … max):   20.182 s … 20.496 s    5 runs
 
Summary
  ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav ran
    1.17 ± 0.01 times faster than ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    1.85 ± 0.01 times faster than ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    2.42 ± 0.02 times faster than ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav
    4.39 ± 0.04 times faster than ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav

But if I disable the GPU, the results are faster?

% hyperfine --warmup 1 --runs 5 --parameter-list model tiny.en,base.en,small.en,medium.en,large-v3-turbo './main -m models/ggml-{model}.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu'
Benchmark 1: ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):      2.936 s ±  0.007 s    [User: 8.480 s, System: 0.260 s]
  Range (min … max):    2.928 s …  2.944 s    5 runs
 
Benchmark 2: ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):      4.360 s ±  0.050 s    [User: 13.033 s, System: 0.302 s]
  Range (min … max):    4.307 s …  4.426 s    5 runs
 
Benchmark 3: ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     12.319 s ±  0.183 s    [User: 38.069 s, System: 0.539 s]
  Range (min … max):   12.069 s … 12.485 s    5 runs
 
Benchmark 4: ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     37.105 s ±  0.217 s    [User: 111.368 s, System: 1.305 s]
  Range (min … max):   36.786 s … 37.296 s    5 runs
 
Benchmark 5: ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
  Time (mean ± σ):     17.132 s ±  0.180 s    [User: 33.666 s, System: 1.129 s]
  Range (min … max):   16.846 s … 17.340 s    5 runs
 
Summary
  ./main -m models/ggml-tiny.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu ran
    1.48 ± 0.02 times faster than ./main -m models/ggml-base.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
    4.20 ± 0.06 times faster than ./main -m models/ggml-small.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
    5.84 ± 0.06 times faster than ./main -m models/ggml-large-v3-turbo.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu
   12.64 ± 0.08 times faster than ./main -m models/ggml-medium.en.bin -f samples/George_W_Bush_Columbia_FINAL.wav --no-gpu

@milenamilka755

It has been uploaded to the HG repo: https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-large-v3-turbo.bin

Is a whisper-medium-turbo.bin model planned for release?

@spoeken

spoeken commented Oct 4, 2024

But when I try using stream with this model it keeps printing out this:

$ ./bin/stream -fa -m models/ggml-large-v3-turbo.bin --step 1500 2>/dev/null
[Start speaking]
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.
 Thank you.

Did you find a solution to this, @kth8?

@kth8

kth8 commented Oct 4, 2024

@spoeken no, I just went back to using the small model for streaming.
