
examples: Fix the encoding issues on Windows #1313

Closed · wants to merge 35 commits

Conversation

@bobqianic (Collaborator) commented Sep 20, 2023

The problem with Windows is that it carries a heavy historical burden. Many narrow-string (char) APIs do not support Unicode; only the wide-character (wchar_t) APIs do. Using the narrow-string (char) APIs can therefore produce garbled text for some languages. This PR addresses this issue.

In this PR, we've enabled the Windows terminal to accept wchar_t arrays encoded in UTF-16LE. Before processing, we convert them to char arrays encoded in UTF-8. By using SetConsoleOutputCP, we've adjusted the Windows terminal's output encoding to UTF-8, ensuring that multiple languages can be correctly displayed in the Windows terminal. Additionally, based on the documentation provided by Microsoft, we've enabled Virtual Terminal Processing in the Windows terminal, allowing text colors to be displayed correctly. If you have a better solution, please feel free to make suggestions.
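For illustration, here is a minimal sketch of those three steps. This is a sketch of the approach, not the exact PR diff; ConvertUTF16toUTF8 is an illustrative helper name:

// Sketch only: illustrates the approach described above, not the PR's exact code.
#include <windows.h>

#include <string>
#include <vector>

// convert a UTF-16LE wide string to a UTF-8 narrow string
static std::string ConvertUTF16toUTF8(const std::wstring & w) {
    const int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int) w.size(), nullptr, 0, nullptr, nullptr);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int) w.size(), &s[0], n, nullptr, nullptr);
    return s;
}

// wmain is the wide-character entry point on Windows (MSVC)
int wmain(int argc, wchar_t ** argv) {
    // 1) accept the arguments as UTF-16LE and convert them to UTF-8 up front
    std::vector<std::string> argv_utf8;
    for (int i = 0; i < argc; i++) {
        argv_utf8.push_back(ConvertUTF16toUTF8(argv[i]));
    }

    // 2) make the console interpret our output as UTF-8
    SetConsoleOutputCP(CP_UTF8);

    // 3) enable Virtual Terminal Processing so ANSI color escapes render
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (hOut != INVALID_HANDLE_VALUE && GetConsoleMode(hOut, &mode)) {
        SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING);
    }

    // ... the rest of the program works with argv_utf8 ...
    return 0;
}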

The issue being addressed: printing colors (-pc) causes garbled text in non-alphabetic languages.

Related issues: #399 #554 #1151

  • multi-language filenames
  • multi-language prompt
  • multi-language voice to text
  • printing color
Test Results

English:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\micro-machine.wav -pc
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\micro-machine.wav' (478214 samples, 29.9 sec), 4 threads, 1 processors, lang = en, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.080]   This is the Micro Machine Man presenting the most midget miniature motorcade of Micro Machine.
[00:00:03.080 --> 00:00:06.520]   Each one has dramatic details, terrific trims, precision paint jobs, plus incredible Micro Machine Pocket Playsets.
[00:00:06.520 --> 00:00:08.640]   There's a police station, fire station, restaurant, service station, and more.
[00:00:08.640 --> 00:00:10.160]   Perfect pocket portables to take any place.
[00:00:10.160 --> 00:00:12.640]   And there are many miniature play sets to play with, and each one comes with its own special edition.
[00:00:12.640 --> 00:00:15.120]   Micro Machine Vehicle and fun, fantastic features that miraculously move.
[00:00:15.120 --> 00:00:17.440]   Raise the boat lift at the airport, Marina Man, the gun turret at the Army Base,
[00:00:17.440 --> 00:00:19.080]   clean your car at the car wash, raise the toll bridge.
[00:00:19.080 --> 00:00:21.040]   And these play sets fit together to form a Micro Machine World.
[00:00:21.040 --> 00:00:24.040]   Micro Machine Pocket Playsets are tremendously tiny, so perfectly precise, so dazzlingly detailed.
[00:00:24.040 --> 00:00:25.160]   You all want to pocket them all.
[00:00:25.160 --> 00:00:27.600]   Micro Machines and Micro Machine Pocket Playsets sold separately from Golube.
[00:00:27.600 --> 00:00:29.480]   The smaller they are, the better they are.


whisper_print_timings:     load time =   248.23 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    21.02 ms
whisper_print_timings:   sample time =   184.69 ms /   280 runs (    0.66 ms per run)
whisper_print_timings:   encode time =  3915.51 ms /     2 runs ( 1957.75 ms per run)
whisper_print_timings:   decode time =  2681.14 ms /   275 runs (    9.75 ms per run)
whisper_print_timings:   prompt time =    52.34 ms /     4 runs (   13.09 ms per run)
whisper_print_timings:    total time =  7263.82 ms

Chinese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav -l zh
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav' (713721 samples, 44.6 sec), 4 threads, 1 processors, lang = zh, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:05.500]  你爷爷、奶奶住在哪儿?
[00:00:05.500 --> 00:00:08.200]  他们住在南京。
[00:00:08.200 --> 00:00:13.200]  他们跟我叔叔和沈人一起住。
[00:00:13.200 --> 00:00:17.700]  你叔叔是哪一年结婚的?
[00:00:17.700 --> 00:00:20.600]  他是去年结婚的。
[00:00:20.600 --> 00:00:24.400]  你叔叔做什么工作?
[00:00:24.400 --> 00:00:26.300]  在哪儿工作?
[00:00:26.300 --> 00:00:28.800]  我叔叔是老师。
[00:00:28.800 --> 00:00:32.800]  他在同名中学工作。
[00:00:32.800 --> 00:00:38.300]  你常常跟你爸爸家的亲戚见面吗?
[00:00:38.300 --> 00:00:41.800]  我常常跟他们见面。
[00:00:41.800 --> 00:00:42.800]  我赶紧讲讲
[00:00:42.800 --> 00:00:44.920]  不许你把东西收回去


whisper_print_timings:     load time =   244.23 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    29.08 ms
whisper_print_timings:   sample time =   140.36 ms /   195 runs (    0.72 ms per run)
whisper_print_timings:   encode time =  5813.29 ms /     3 runs ( 1937.76 ms per run)
whisper_print_timings:   decode time =  1766.77 ms /   188 runs (    9.40 ms per run)
whisper_print_timings:   prompt time =   340.00 ms /     5 runs (   68.00 ms per run)
whisper_print_timings:    total time =  8417.58 ms

Japanese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\日本語での自然な自己紹介The Best Way to Introduce Yourself in Japanese [TubeRipper.com].wav" -l ja
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\日本語での自然な自己紹介The Best Way to Introduce Yourself in Japanese [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = ja, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.600]  こんにちは、はじめまして、私はアッキーです。
[00:00:04.600 --> 00:00:08.800]  日本から来ました。どうぞよろしく。
[00:00:08.800 --> 00:00:11.000]  3本塾
[00:00:11.000 --> 00:00:14.160]  チャス!3本塾のアッキーです。
[00:00:14.160 --> 00:00:19.080]  今回は自己紹介についてお話します。
[00:00:19.080 --> 00:00:26.600]  自己紹介は知らない人と初めて会った時にするとても大事な表現です。
[00:00:26.600 --> 00:00:36.000]  日本語を勉強する時も初級のはじめの方で勉強をするとても基本的な表現です。
[00:00:36.000 --> 00:00:44.520]  でも皆さん、自己紹介しっかりと自然に上手にできていますか?
[00:00:44.520 --> 00:00:51.280]  今回は上手な、自然な自己紹介を覚えていきましょう。
[00:00:51.280 --> 00:00:57.760]  まず、最初に日本語の初級の教科書の自己紹介はこんな感じ。
[00:00:57.760 --> 00:00:59.960]  こんにちは、はじめまして。


whisper_print_timings:     load time =   242.21 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    40.68 ms
whisper_print_timings:   sample time =   119.10 ms /   211 runs (    0.56 ms per run)
whisper_print_timings:   encode time =  5848.74 ms /     3 runs ( 1949.58 ms per run)
whisper_print_timings:   decode time =  1948.26 ms /   208 runs (    9.37 ms per run)
whisper_print_timings:   prompt time =   849.27 ms /     3 runs (  283.09 ms per run)
whisper_print_timings:    total time =  9134.42 ms

Russian:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce Yourself in Russian Super Easy Russian 28 [TubeRipper.com].wav" -l ru
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce Yourself in Russian Super Easy Russian 28 [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.600]   Всем привет и добро пожаловать в новый выпуск "Super Easy Russian".
[00:00:04.600 --> 00:00:10.760]   В этот раз мы решили спросить у московских прохожих ответы на самые простые вопросы,
[00:00:10.760 --> 00:00:13.800]   которые обычно задают, чтобы узнать людей получше.
[00:00:13.800 --> 00:00:18.600]   Например, как вас зовут, какой у вас возраст, чем вы занимаетесь.
[00:00:18.600 --> 00:00:25.040]   Это видео был записано нашим любезным и отважным коллегой из команды 14-20.
[00:00:25.040 --> 00:00:26.240]   Давайте посмотрим.
[00:00:26.240 --> 00:00:40.640]   [музыка]
[00:00:40.640 --> 00:00:50.200]   Первый вопрос - как тебя зовут? Или как вас зовут? Или как твое имя? Или как ваше имя?
[00:00:50.200 --> 00:00:58.600]   Можно спросить просто, но не очень вежливо - кто ты или кто вы? Например, я - Никита.
[00:00:58.600 --> 00:00:59.860]   Меня зовут.


whisper_print_timings:     load time =   253.56 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    44.06 ms
whisper_print_timings:   sample time =   125.90 ms /   203 runs (    0.62 ms per run)
whisper_print_timings:   encode time =  8671.46 ms /     4 runs ( 2167.86 ms per run)
whisper_print_timings:   decode time =  1907.80 ms /   195 runs (    9.78 ms per run)
whisper_print_timings:   prompt time =   912.13 ms /     6 runs (  152.02 ms per run)
whisper_print_timings:    total time = 11975.46 ms

French:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce yourself in French Super Easy French 62 [TubeRipper.com].wav" -l fr
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce yourself in French Super Easy French 62 [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = fr, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.600]   Bonjour et bienvenue dans ce nouvel épisode de Super Easy French.
[00:00:03.600 --> 00:00:08.440]   Aujourd'hui, vous allez apprendre à vous présenter en français de manière formelle et informal.
[00:00:08.440 --> 00:00:09.440]   Allons-y !
[00:00:09.440 --> 00:00:16.000]   Bonjour.
[00:00:16.000 --> 00:00:17.000]   Bonjour.
[00:00:17.000 --> 00:00:18.400]   Comment vous appelez-vous ?
[00:00:18.400 --> 00:00:20.600]   Je m'appelle Rita et vous ?
[00:00:20.600 --> 00:00:22.000]   Moi, je m'appelle Judith.
[00:00:22.000 --> 00:00:23.600]   Comment tu t'appelles ?
[00:00:23.600 --> 00:00:25.000]   Je m'appelle Sory Carme.
[00:00:25.000 --> 00:00:26.000]   Et toi ?
[00:00:26.000 --> 00:00:28.000]   Je m'appelle Hélène.
[00:00:28.000 --> 00:00:29.200]   D'où venez-vous ?
[00:00:29.200 --> 00:00:31.200]   Je suis française et vous ?
[00:00:31.200 --> 00:00:32.800]   Je suis française également.
[00:00:32.800 --> 00:00:33.800]   Tu viens d'où ?
[00:00:33.800 --> 00:00:35.200]   Je viens de Istanbul.
[00:00:35.200 --> 00:00:36.800]   Et toi, Hélène ?
[00:00:36.800 --> 00:00:39.200]   Moi, je viens de Bordeaux.
[00:00:39.200 --> 00:00:40.600]   Quelle âge avez-vous ?
[00:00:40.600 --> 00:00:41.800]   J'ai 32 ans.
[00:00:41.800 --> 00:00:43.000]   Quelle âge as-tu ?
[00:00:43.000 --> 00:00:44.400]   J'ai 23 ans.
[00:00:44.400 --> 00:00:46.000]   Et toi ?
[00:00:46.000 --> 00:00:47.000]   J'ai 24 ans.
[00:00:47.000 --> 00:00:48.400]   Où habitez-vous ?
[00:00:48.400 --> 00:00:51.200]   J'habite à Saint-Germain-en-Laye près de Paris.
[00:00:51.200 --> 00:00:52.200]   Et vous ?
[00:00:52.200 --> 00:00:55.000]   Moi, j'habite à Paris dans le 14e arrondissement.
[00:00:55.000 --> 00:00:56.400]   Tu habites où ?
[00:00:56.400 --> 00:01:00.000]   j'habite dans le 20e arrondissement à Paris et toi ?


whisper_print_timings:     load time =   339.25 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    38.36 ms
whisper_print_timings:   sample time =   170.33 ms /   310 runs (    0.55 ms per run)
whisper_print_timings:   encode time =  5573.36 ms /     3 runs ( 1857.79 ms per run)
whisper_print_timings:   decode time =  2906.41 ms /   307 runs (    9.47 ms per run)
whisper_print_timings:   prompt time =   373.41 ms /     3 runs (  124.47 ms per run)
whisper_print_timings:    total time =  9468.37 ms

Vietnamese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Easy Vietnamese 1 - Whats typical Vietnamese [TubeRipper.com].wav" -l vi
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Easy Vietnamese 1 - Whats typical Vietnamese [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = vi, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:01.720]   Chị có từ hào là người Việt Nam không?
[00:00:01.720 --> 00:00:03.620]   Rất là từ hào là người Việt Nam
[00:00:03.620 --> 00:00:05.120]   Chị hãy nói về những đặc trưng
[00:00:05.120 --> 00:00:09.360]   Người Việt Nam thì phải nói đến tạo dài và nói đến chiếc nông lá
[00:00:09.360 --> 00:00:12.520]   Và hơn nữa phẩm chất hãy nói về sự duyên gián
[00:00:12.520 --> 00:00:14.860]   Công dung nguồn hạnh vân vân rất là nhiều
[00:00:14.860 --> 00:00:17.460]   Vậy Việt Nam về du lịch thì thế nhớ nào có chị hả?
[00:00:17.460 --> 00:00:21.360]   Về du lịch ở Việt Nam thì khá phổ biến
[00:00:21.360 --> 00:00:27.160]   Thường thì người khách du lịch từ nước ngoài đến đây thì có thể đi du lịch ở miền Tây
[00:00:27.160 --> 00:00:32.360]   Có thể đến những tỉnh như là Bến Che, Long An, Cần Thơ, Bạc Liêu, Vân Vân
[00:00:32.360 --> 00:00:36.960]   Còn nếu miền Đông Nam Bộ thì có thể đi Tây Ninh, Củ Chi
[00:00:36.960 --> 00:00:40.280]   Có thể đi chiến qua Hà Nội
[00:00:40.280 --> 00:00:44.020]   Nếu mà đi ra khu vực miền Bắc miền Trung Hịt đi Hà Nội
[00:00:44.020 --> 00:00:47.020]   Mình thích ở Việt Nam đó là về con người Việt Nam
[00:00:47.020 --> 00:00:50.560]   Tâm hồi Việt Nam là thân thiện và họ hiếu khách nữa
[00:00:50.560 --> 00:00:55.880]   Ở Việt Nam thì mình thấy rất là nhiều nơi rất là đẹp và cũng có nhiều món ăn ngon nữa
[00:00:55.880 --> 00:01:00.000]   Món ăn thì nếu chúng đứng mình thì đã có một vài món ăn ở trên kênh đây


whisper_print_timings:     load time =   249.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    39.15 ms
whisper_print_timings:   sample time =   199.92 ms /   356 runs (    0.56 ms per run)
whisper_print_timings:   encode time =  5546.89 ms /     3 runs ( 1848.96 ms per run)
whisper_print_timings:   decode time =  3320.92 ms /   353 runs (    9.41 ms per run)
whisper_print_timings:   prompt time =   494.16 ms /     3 runs (  164.72 ms per run)
whisper_print_timings:    total time =  9910.90 ms

@bobqianic marked this pull request as ready for review September 21, 2023 12:27
@bobqianic (Collaborator, Author):

I'll provide a description of how this works tomorrow and update the test results. I was so busy last week because I was traveling.

Review comment on examples/main/main.cpp, lines 457 to 465:
std::ofstream open(const std::string & path) {
#if WIN32
std::ofstream file_out(ConvertUTF8toUTF16(path));
#else
std::ofstream file_out(path);
#endif
return file_out;
}

@ggerganov (Owner):

Can we avoid this change and simply call: std::ofstream fout(ConvertUTF8toUTF16(fname)); regardless if WIN32?

@bobqianic (Collaborator, Author):

> Can we avoid this change and simply call: std::ofstream fout(ConvertUTF8toUTF16(fname)); regardless if WIN32?

AFAIK, Linux might have some issues with UTF-16LE support. It either doesn't play well with it, or if it does, it's kinda glitchy. So, I was thinking we could have two versions of this ConvertUTF8toUTF16 function: one that genuinely converts and another basic one that doesn't really do anything. If we use those #if and #endif tags, we can have the do-nothing version for non-Windows stuff like Linux. But that's kinda misleading, right? I mean, the function's name literally says it's converting. We've gotta rethink this. I guess we're gonna have to throw in some platform-specific code using #if and #endif. It's a bit of a pain, but if we want this to work across different platforms, we might not have much choice.
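To make that concrete, here is a minimal sketch of the two-version idea, assuming _WIN32 as the platform guard; the exact signatures in the PR may differ:

#ifdef _WIN32
#include <windows.h>
#include <string>

// real conversion: on Windows the file-stream constructors accept wide paths
static std::wstring ConvertUTF8toUTF16(const std::string & s) {
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int) s.size(), nullptr, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int) s.size(), &w[0], n);
    return w;
}
#else
#include <string>

// do-nothing version: POSIX systems open UTF-8 narrow paths directly
static const std::string & ConvertUTF8toUTF16(const std::string & s) {
    return s;
}
#endif

// call sites can then be identical on every platform:
//     std::ofstream fout(ConvertUTF8toUTF16(fname));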

@linmi commented Oct 14, 2023

@bobqianic Thanks for the hard work. When I tested with -ml 1, some of the Chinese text could not be output.
(screenshots of the new output and of the original output attached)

@bobqianic added the "need feedback" label Oct 17, 2023
@bobqianic linked an issue Oct 20, 2023 that may be closed by this pull request
@bobqianic (Collaborator, Author):

> When using long sentences, swallowing words appear.

This issue has now been fixed. I forgot to flush the buffer at the end of the segment.

> And this is an English comma , not a Chinese one.

This is due to the limitations of the Whisper models; there's hardly anything we can do about it.

> English is not output according to words.

I need more time to look into this problem.

@linmi commented Oct 22, 2023

@bobqianic Still not normal output: 破产 and 小组.
(screenshot attached)

@bobqianic (Collaborator, Author):

> Still not normal output: 破产 and 小组.

Strange. Could you send over the audio?
I tested https://thatisbiz.fireside.fm/107, and it looks good.

@linmi commented Oct 22, 2023

> Still not normal output: 破产 and 小组.

> Strange. Could you send over the audio? I tested https://thatisbiz.fireside.fm/107, and it looks good.

十五年中国商业极简史.mp3.zip

./main -m models/ggml-medium.bin -l auto -f

@bobqianic (Collaborator, Author) commented Oct 22, 2023

> Still not normal output: 破产 and 小组.
>
> Strange. Could you send over the audio? I tested https://thatisbiz.fireside.fm/107, and it looks good.
>
> 十五年中国商业极简史.mp3.zip
>
> ./main -m models/ggml-medium.bin -l auto -f

(screenshots of the resulting transcription attached)

@@ -272,6 +273,51 @@ void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct wh
}
}

whisper_merged_tokens whisper_merge_tokens(struct whisper_context * ctx, const whisper_params & params, int s0, int n_segments) {
@ggerganov (Owner):

Hm, AFAIK such "post-processing" should not be necessary. If it makes a difference now, it most likely means that we have a bug in the tokenizer, which is actually very likely.

Will need to review this later in more detail, after #1422

@bobqianic (Collaborator, Author):

I just discovered some interesting code in OpenAI Whisper.
https://github.com/openai/whisper/blob/0efe52add20d980c850b0bac972adefd54d4eb8e/whisper/tokenizer.py#L277-L327

tokenizer.py

    def split_to_word_tokens(self, tokens: List[int]):
        if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
            # These languages don't typically use spaces, so it is difficult to split words
            # without morpheme analysis. Here, we instead split words at any
            # position where the tokens are decoded as valid unicode points
            return self.split_tokens_on_unicode(tokens)

        return self.split_tokens_on_spaces(tokens)

    def split_tokens_on_unicode(self, tokens: List[int]):
        decoded_full = self.decode_with_timestamps(tokens)
        replacement_char = "\ufffd"

        words = []
        word_tokens = []
        current_tokens = []
        unicode_offset = 0

        for token in tokens:
            current_tokens.append(token)
            decoded = self.decode_with_timestamps(current_tokens)

            if (
                replacement_char not in decoded
                or decoded_full[unicode_offset + decoded.index(replacement_char)]
                == replacement_char
            ):
                words.append(decoded)
                word_tokens.append(current_tokens)
                current_tokens = []
                unicode_offset += len(decoded)

        return words, word_tokens

    def split_tokens_on_spaces(self, tokens: List[int]):
        subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
        words = []
        word_tokens = []

        for subword, subword_tokens in zip(subwords, subword_tokens_list):
            special = subword_tokens[0] >= self.eot
            with_space = subword.startswith(" ")
            punctuation = subword.strip() in string.punctuation
            if special or with_space or punctuation or len(words) == 0:
                words.append(subword)
                word_tokens.append(subword_tokens)
            else:
                words[-1] = words[-1] + subword
                word_tokens[-1].extend(subword_tokens)

        return words, word_tokens

@morukutsu:

Regarding the tokenization: I just started today getting familiar with whisper.cpp and Japanese audio files, especially with per-token timestamps. While I don't have a fix, I can provide a test case:

Audio file: test_1.wav.zip
Ground truth: 5月1日のメーデーに、パリでは毎年働く人たちがデモを行なっています
Whisper: 5月1日のメイデーに、パリでは毎年働く人たちがデモを行っています。
./main -m models/ggml-large.bin -f samples/test_1.wav -l japanese
=> Almost perfect result

With -ml 1, some tokens are output as "�", as in the log below:


[00:00:00.000 --> 00:00:00.260] 5
[00:00:00.260 --> 00:00:00.380] 月
[00:00:00.380 --> 00:00:00.760] 1
[00:00:00.760 --> 00:00:00.760] 日
[00:00:00.760 --> 00:00:00.950] の
[00:00:00.950 --> 00:00:01.140] メ
[00:00:01.140 --> 00:00:01.330] イ
[00:00:01.330 --> 00:00:01.520] デ
[00:00:01.520 --> 00:00:01.710] ー
[00:00:01.710 --> 00:00:02.140] に
[00:00:02.140 --> 00:00:02.300] 、
[00:00:02.300 --> 00:00:02.320] パ
[00:00:02.320 --> 00:00:02.460] リ
[00:00:02.460 --> 00:00:02.790] では
[00:00:02.790 --> 00:00:02.900] �
[00:00:02.900 --> 00:00:02.950] �
[00:00:02.950 --> 00:00:03.110] 年
[00:00:03.110 --> 00:00:03.230] �
[00:00:03.230 --> 00:00:03.270] �
[00:00:03.270 --> 00:00:03.430] く
[00:00:03.430 --> 00:00:03.590] 人
[00:00:03.590 --> 00:00:03.770] た
[00:00:03.770 --> 00:00:03.950] ち
[00:00:03.950 --> 00:00:04.070] が
[00:00:04.070 --> 00:00:04.230] デ
[00:00:04.230 --> 00:00:04.390] モ
[00:00:04.390 --> 00:00:04.550] を
[00:00:04.550 --> 00:00:04.710] 行
[00:00:04.710 --> 00:00:05.040] って
[00:00:05.040 --> 00:00:05.700] います
[00:00:05.700 --> 00:00:05.700] 。

By debug printing tokens in the code

# printing tokens in whisper_wrap_segment()
DBGDBG 15: � token id: 7256
DBGDBG 16: � token id: 236
DBGDBG 17: 年 token id: 5157
DBGDBG 18: � token id: 9502
DBGDBG 19: � token id: 235

# by printing tokens in whisper_model_load() we can see that:
token 7256 is: 0xE6AF
token 236 is: 0x8E
=> the utf8 sequence of 毎

token 9502 is: 0xE583
token 235 is: 0x8D
=> the utf8 sequence of 働

So UTF-8 sequences are cut in the middle of characters. This can be solved by the post-processing step mentioned above.
But I am wondering: is this something that can be fixed during model loading/conversion? Or was the model also trained with some tokens that are incomplete UTF-8 sequences?

This test was done with the latest large model and code.
Thank you!
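For reference, a minimal C++ sketch of the post-processing step discussed above, which buffers token texts until they form complete UTF-8 sequences. The helper names are illustrative and timestamp merging is omitted:

#include <string>
#include <vector>

// true if s is a sequence of complete, well-formed UTF-8 code points
static bool is_complete_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        size_t len = 0;
        if      (c < 0x80)         len = 1; // ASCII
        else if ((c >> 5) == 0x06) len = 2; // 110xxxxx
        else if ((c >> 4) == 0x0E) len = 3; // 1110xxxx
        else if ((c >> 3) == 0x1E) len = 4; // 11110xxx
        else                       return false; // stray continuation byte
        if (i + len > s.size())    return false; // sequence cut off at the end
        for (size_t j = 1; j < len; j++) {
            if (((unsigned char) s[i + j] & 0xC0) != 0x80) return false; // bad continuation byte
        }
        i += len;
    }
    return true;
}

// merge raw token texts so that each emitted string is valid UTF-8
static std::vector<std::string> merge_utf8_tokens(const std::vector<std::string> & token_texts) {
    std::vector<std::string> out;
    std::string buf;
    for (const auto & t : token_texts) {
        buf += t;                  // keep buffering while the tail is incomplete
        if (is_complete_utf8(buf)) {
            out.push_back(buf);    // the merged timestamps would be attached here too
            buf.clear();
        }
    }
    if (!buf.empty()) out.push_back(buf); // genuinely invalid leftovers, if any
    return out;
}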

@magicse commented Nov 26, 2023

Use _WIN32 instead of WIN32 when possible.
GCC (MinGW) does not define WIN32 on Windows when it compiles in strict-standard mode, such as -std=c++11 or a later standard.
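In other words, a guard like this works with both compilers (the branch bodies are placeholders):

// _WIN32 is defined by both MSVC and MinGW GCC, even with -std=c++11;
// WIN32 (no underscore) is absent under MinGW's strict-standard mode.
#if defined(_WIN32)
    // Windows-specific path
#else
    // portable path
#endif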

@bobqianic (Collaborator, Author):

> So UTF-8 sequences are cut in the middle of characters. This can be solved by the post-processing step mentioned above.
> But I am wondering: is this something that can be fixed during model loading/conversion? Or was the model also trained with some tokens that are incomplete UTF-8 sequences?

I recently did some research and found out why whisper.cpp behaves oddly. It skips the final step where token IDs are decoded by the tokenizer; instead, it just maps token IDs directly to their token text. OpenAI's Whisper uses a tokenizer called tiktoken, which merges internally using a regex pattern string and mergeable ranks. Tiktoken isn't too big (the whole repo has around 3000 lines of code), so I believe creating a full C++ implementation of it is feasible. @ggerganov @morukutsu

whisper/decoding.py Line 757

        texts: List[str] = [tokenizer.decode(t).strip() for t in tokens]

whisper/tokenizer.py Line 164-166

    def decode(self, token_ids: List[int], **kwargs) -> str:
        token_ids = [t for t in token_ids if t < self.timestamp_begin]
        return self.encoding.decode(token_ids, **kwargs)

whisper/tokenizer.py Line 330-363

@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
    ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in open(vocab_path) if line)
    }
    n_vocab = len(ranks)
    special_tokens = {}

    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]

    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1

    return tiktoken.Encoding(
        name=os.path.basename(vocab_path),
        explicit_n_vocab=n_vocab,
        pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        mergeable_ranks=ranks,
        special_tokens=special_tokens,
    )

tiktoken/core.py Line 12-50

class Encoding:
    def __init__(
        self,
        name: str,
        *,
        pat_str: str,
        mergeable_ranks: dict[bytes, int],
        special_tokens: dict[str, int],
        explicit_n_vocab: Optional[int] = None,
    ):
        """Creates an Encoding object.

        See openai_public.py for examples of how to construct an Encoding object.

        Args:
            name: The name of the encoding. It should be clear from the name of the encoding
                what behaviour to expect, in particular, encodings with different special tokens
                should have different names.
            pat_str: A regex pattern string that is used to split the input text.
            mergeable_ranks: A dictionary mapping mergeable token bytes to their ranks. The ranks
                must correspond to merge priority.
            special_tokens: A dictionary mapping special token strings to their token values.
            explicit_n_vocab: The number of tokens in the vocabulary. If provided, it is checked
                that the number of mergeable tokens and special tokens is equal to this number.
        """
        self.name = name

        self._pat_str = pat_str
        self._mergeable_ranks = mergeable_ranks
        self._special_tokens = special_tokens

        self.max_token_value = max(
            max(mergeable_ranks.values()), max(special_tokens.values(), default=0)
        )
        if explicit_n_vocab:
            assert len(mergeable_ranks) + len(special_tokens) == explicit_n_vocab
            assert self.max_token_value == explicit_n_vocab - 1

        self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)

@ggerganov (Owner):

Probably we can utilize the BPE tokenizer implementation from llama.cpp

@morukutsu:

@bobqianic Thank you for the research! So I had a look at the tiktoken documentation and it answered my original question. The tokenizer works byte by byte so it can lead to a representation where tokens are incomplete utf-8 sequences. See here, section 4, "Warning: although .decode() can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries.".

So as long as text is encoded and decoded with the same tokenizer, it should work correctly.
The current tokenize function seems to be a reimplementation of this one.

In my current understanding:

If we decode the tokens as a byte sequence (for each token, join the byte representation of each) and then interpret it as a utf-8 string then it works OK.

But since the model is trained on abstract byte sequences, -ml 1 will sometimes lead to incomplete UTF-8 sequences when trying to convert to a string token by token, because it exposes the raw tokens the model was trained with. In that specific use case, a post-processing step is required to merge the timestamps of contiguous "invalid" tokens.
I don't think the tokenizer itself will have an impact on this problem: the issue doesn't show up in English (or in other languages, as long as no character is wider than one byte).

Please correct me if I'm wrong; that's just my analysis after re-thinking about the problem with the info you added!

@bobqianic bobqianic mentioned this pull request Jan 13, 2024
@bobqianic (Collaborator, Author):

> But since the model is trained on abstract byte sequences, -ml 1 will sometimes lead to incomplete UTF-8 sequences when trying to convert to a string token by token, because it exposes the raw tokens the model was trained with. In that specific use case, a post-processing step is required to merge the timestamps of contiguous "invalid" tokens.

Yes, this is correct.

> I don't think the tokenizer itself will have an impact on this problem?

I'm not entirely certain, but it seems there's an ongoing issue in whisper.cpp. When you look at the decoding output at the token level, you'll notice that whisper.cpp often requires two tokens to represent text that OpenAI's Whisper represents with a single token (for example, [47911] versus [35414, 9497]). There's also a problem with the timestamp tokens: ideally, each segment should have both a start and an end timestamp token, but in whisper.cpp this doesn't happen consistently.

@bobqianic (Collaborator, Author):

I'm going to close this PR and move everything to #1768
