
examples: Fix the encoding issues on Windows #1313

Closed · wants to merge 35 commits

Conversation

@bobqianic (Collaborator) commented Sep 20, 2023

The problem with Windows is that it carries a heavy historical burden. Many narrow-string (char) APIs do not support Unicode; only the wide-character (wchar_t) APIs do. Using the narrow-string (char) APIs can therefore produce garbled text for some languages. This PR addresses this issue.

In this PR, we've enabled the Windows terminal to accept wchar_t arrays encoded in UTF-16LE. Before processing, we convert them to char arrays encoded in UTF-8. By using SetConsoleOutputCP, we've adjusted the Windows terminal's output encoding to UTF-8, ensuring that multiple languages can be correctly displayed in the Windows terminal. Additionally, based on the documentation provided by Microsoft, we've enabled Virtual Terminal Processing in the Windows terminal, allowing text colors to be displayed correctly. If you have a better solution, please feel free to make suggestions.
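For illustration, here is a minimal sketch of those three steps. This is a sketch of the approach, not the exact PR diff; ConvertUTF16toUTF8 is an illustrative helper name:

// Sketch only: illustrates the approach described above, not the PR's exact code.
#include <windows.h>

#include <string>
#include <vector>

// convert a UTF-16LE wide string to a UTF-8 narrow string
static std::string ConvertUTF16toUTF8(const std::wstring & w) {
    const int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int) w.size(), nullptr, 0, nullptr, nullptr);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int) w.size(), &s[0], n, nullptr, nullptr);
    return s;
}

// wmain is the wide-character entry point on Windows (MSVC)
int wmain(int argc, wchar_t ** argv) {
    // 1) accept the arguments as UTF-16LE and convert them to UTF-8 up front
    std::vector<std::string> argv_utf8;
    for (int i = 0; i < argc; i++) {
        argv_utf8.push_back(ConvertUTF16toUTF8(argv[i]));
    }

    // 2) make the console interpret our output as UTF-8
    SetConsoleOutputCP(CP_UTF8);

    // 3) enable Virtual Terminal Processing so ANSI color escapes render
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (hOut != INVALID_HANDLE_VALUE && GetConsoleMode(hOut, &mode)) {
        SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING);
    }

    // ... the rest of the program works with argv_utf8 ...
    return 0;
}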

The issue being addressed: printing colors (-pc) causes garbled text in non-alphabetic languages.

Related issues: #399 #554 #1151

  • multi-language filenames
  • multi-language prompt
  • multi-language voice to text
  • printing color
Test Results

English:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\micro-machine.wav -pc
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\micro-machine.wav' (478214 samples, 29.9 sec), 4 threads, 1 processors, lang = en, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.080]   This is the Micro Machine Man presenting the most midget miniature motorcade of Micro Machine.
[00:00:03.080 --> 00:00:06.520]   Each one has dramatic details, terrific trims, precision paint jobs, plus incredible Micro Machine Pocket Playsets.
[00:00:06.520 --> 00:00:08.640]   There's a police station, fire station, restaurant, service station, and more.
[00:00:08.640 --> 00:00:10.160]   Perfect pocket portables to take any place.
[00:00:10.160 --> 00:00:12.640]   And there are many miniature play sets to play with, and each one comes with its own special edition.
[00:00:12.640 --> 00:00:15.120]   Micro Machine Vehicle and fun, fantastic features that miraculously move.
[00:00:15.120 --> 00:00:17.440]   Raise the boat lift at the airport, Marina Man, the gun turret at the Army Base,
[00:00:17.440 --> 00:00:19.080]   clean your car at the car wash, raise the toll bridge.
[00:00:19.080 --> 00:00:21.040]   And these play sets fit together to form a Micro Machine World.
[00:00:21.040 --> 00:00:24.040]   Micro Machine Pocket Playsets are tremendously tiny, so perfectly precise, so dazzlingly detailed.
[00:00:24.040 --> 00:00:25.160]   You all want to pocket them all.
[00:00:25.160 --> 00:00:27.600]   Micro Machines and Micro Machine Pocket Playsets sold separately from Golube.
[00:00:27.600 --> 00:00:29.480]   The smaller they are, the better they are.


whisper_print_timings:     load time =   248.23 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    21.02 ms
whisper_print_timings:   sample time =   184.69 ms /   280 runs (    0.66 ms per run)
whisper_print_timings:   encode time =  3915.51 ms /     2 runs ( 1957.75 ms per run)
whisper_print_timings:   decode time =  2681.14 ms /   275 runs (    9.75 ms per run)
whisper_print_timings:   prompt time =    52.34 ms /     4 runs (   13.09 ms per run)
whisper_print_timings:    total time =  7263.82 ms

Chinese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav -l zh
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\01-03(轻松学中文+第二版+课本2).wav' (713721 samples, 44.6 sec), 4 threads, 1 processors, lang = zh, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:05.500]  你爷爷、奶奶住在哪儿?
[00:00:05.500 --> 00:00:08.200]  他们住在南京。
[00:00:08.200 --> 00:00:13.200]  他们跟我叔叔和沈人一起住。
[00:00:13.200 --> 00:00:17.700]  你叔叔是哪一年结婚的?
[00:00:17.700 --> 00:00:20.600]  他是去年结婚的。
[00:00:20.600 --> 00:00:24.400]  你叔叔做什么工作?
[00:00:24.400 --> 00:00:26.300]  在哪儿工作?
[00:00:26.300 --> 00:00:28.800]  我叔叔是老师。
[00:00:28.800 --> 00:00:32.800]  他在同名中学工作。
[00:00:32.800 --> 00:00:38.300]  你常常跟你爸爸家的亲戚见面吗?
[00:00:38.300 --> 00:00:41.800]  我常常跟他们见面。
[00:00:41.800 --> 00:00:42.800]  我赶紧讲讲
[00:00:42.800 --> 00:00:44.920]  不许你把东西收回去


whisper_print_timings:     load time =   244.23 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    29.08 ms
whisper_print_timings:   sample time =   140.36 ms /   195 runs (    0.72 ms per run)
whisper_print_timings:   encode time =  5813.29 ms /     3 runs ( 1937.76 ms per run)
whisper_print_timings:   decode time =  1766.77 ms /   188 runs (    9.40 ms per run)
whisper_print_timings:   prompt time =   340.00 ms /     5 runs (   68.00 ms per run)
whisper_print_timings:    total time =  8417.58 ms

Japanese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\日本語での自然な自己紹介The Best Way to Introduce Yourself in Japanese [TubeRipper.com].wav" -l ja
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

init: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\日本語での自然な自己紹介The Best Way to Introduce Yourself in Japanese [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = ja, task = false, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.600]  こんにちは、はじめまして、私はアッキーです。
[00:00:04.600 --> 00:00:08.800]  日本から来ました。どうぞよろしく。
[00:00:08.800 --> 00:00:11.000]  3本塾
[00:00:11.000 --> 00:00:14.160]  チャス!3本塾のアッキーです。
[00:00:14.160 --> 00:00:19.080]  今回は自己紹介についてお話します。
[00:00:19.080 --> 00:00:26.600]  自己紹介は知らない人と初めて会った時にするとても大事な表現です。
[00:00:26.600 --> 00:00:36.000]  日本語を勉強する時も初級のはじめの方で勉強をするとても基本的な表現です。
[00:00:36.000 --> 00:00:44.520]  でも皆さん、自己紹介しっかりと自然に上手にできていますか?
[00:00:44.520 --> 00:00:51.280]  今回は上手な、自然な自己紹介を覚えていきましょう。
[00:00:51.280 --> 00:00:57.760]  まず、最初に日本語の初級の教科書の自己紹介はこんな感じ。
[00:00:57.760 --> 00:00:59.960]  こんにちは、はじめまして。


whisper_print_timings:     load time =   242.21 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    40.68 ms
whisper_print_timings:   sample time =   119.10 ms /   211 runs (    0.56 ms per run)
whisper_print_timings:   encode time =  5848.74 ms /     3 runs ( 1949.58 ms per run)
whisper_print_timings:   decode time =  1948.26 ms /   208 runs (    9.37 ms per run)
whisper_print_timings:   prompt time =   849.27 ms /     3 runs (  283.09 ms per run)
whisper_print_timings:    total time =  9134.42 ms

Russian:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce Yourself in Russian Super Easy Russian 28 [TubeRipper.com].wav" -l ru
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce Yourself in Russian Super Easy Russian 28 [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:04.600]   Всем привет и добро пожаловать в новый выпуск "Super Easy Russian".
[00:00:04.600 --> 00:00:10.760]   В этот раз мы решили спросить у московских прохожих ответы на самые простые вопросы,
[00:00:10.760 --> 00:00:13.800]   которые обычно задают, чтобы узнать людей получше.
[00:00:13.800 --> 00:00:18.600]   Например, как вас зовут, какой у вас возраст, чем вы занимаетесь.
[00:00:18.600 --> 00:00:25.040]   Это видео был записано нашим любезным и отважным коллегой из команды 14-20.
[00:00:25.040 --> 00:00:26.240]   Давайте посмотрим.
[00:00:26.240 --> 00:00:40.640]   [музыка]
[00:00:40.640 --> 00:00:50.200]   Первый вопрос - как тебя зовут? Или как вас зовут? Или как твое имя? Или как ваше имя?
[00:00:50.200 --> 00:00:58.600]   Можно спросить просто, но не очень вежливо - кто ты или кто вы? Например, я - Никита.
[00:00:58.600 --> 00:00:59.860]   Меня зовут.


whisper_print_timings:     load time =   253.56 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    44.06 ms
whisper_print_timings:   sample time =   125.90 ms /   203 runs (    0.62 ms per run)
whisper_print_timings:   encode time =  8671.46 ms /     4 runs ( 2167.86 ms per run)
whisper_print_timings:   decode time =  1907.80 ms /   195 runs (    9.78 ms per run)
whisper_print_timings:   prompt time =   912.13 ms /     6 runs (  152.02 ms per run)
whisper_print_timings:    total time = 11975.46 ms

French:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce yourself in French Super Easy French 62 [TubeRipper.com].wav" -l fr
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Introduce yourself in French Super Easy French 62 [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = fr, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.600]   Bonjour et bienvenue dans ce nouvel épisode de Super Easy French.
[00:00:03.600 --> 00:00:08.440]   Aujourd'hui, vous allez apprendre à vous présenter en français de manière formelle et informal.
[00:00:08.440 --> 00:00:09.440]   Allons-y !
[00:00:09.440 --> 00:00:16.000]   Bonjour.
[00:00:16.000 --> 00:00:17.000]   Bonjour.
[00:00:17.000 --> 00:00:18.400]   Comment vous appelez-vous ?
[00:00:18.400 --> 00:00:20.600]   Je m'appelle Rita et vous ?
[00:00:20.600 --> 00:00:22.000]   Moi, je m'appelle Judith.
[00:00:22.000 --> 00:00:23.600]   Comment tu t'appelles ?
[00:00:23.600 --> 00:00:25.000]   Je m'appelle Sory Carme.
[00:00:25.000 --> 00:00:26.000]   Et toi ?
[00:00:26.000 --> 00:00:28.000]   Je m'appelle Hélène.
[00:00:28.000 --> 00:00:29.200]   D'où venez-vous ?
[00:00:29.200 --> 00:00:31.200]   Je suis française et vous ?
[00:00:31.200 --> 00:00:32.800]   Je suis française également.
[00:00:32.800 --> 00:00:33.800]   Tu viens d'où ?
[00:00:33.800 --> 00:00:35.200]   Je viens de Istanbul.
[00:00:35.200 --> 00:00:36.800]   Et toi, Hélène ?
[00:00:36.800 --> 00:00:39.200]   Moi, je viens de Bordeaux.
[00:00:39.200 --> 00:00:40.600]   Quelle âge avez-vous ?
[00:00:40.600 --> 00:00:41.800]   J'ai 32 ans.
[00:00:41.800 --> 00:00:43.000]   Quelle âge as-tu ?
[00:00:43.000 --> 00:00:44.400]   J'ai 23 ans.
[00:00:44.400 --> 00:00:46.000]   Et toi ?
[00:00:46.000 --> 00:00:47.000]   J'ai 24 ans.
[00:00:47.000 --> 00:00:48.400]   Où habitez-vous ?
[00:00:48.400 --> 00:00:51.200]   J'habite à Saint-Germain-en-Laye près de Paris.
[00:00:51.200 --> 00:00:52.200]   Et vous ?
[00:00:52.200 --> 00:00:55.000]   Moi, j'habite à Paris dans le 14e arrondissement.
[00:00:55.000 --> 00:00:56.400]   Tu habites où ?
[00:00:56.400 --> 00:01:00.000]   j'habite dans le 20e arrondissement à Paris et toi ?


whisper_print_timings:     load time =   339.25 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    38.36 ms
whisper_print_timings:   sample time =   170.33 ms /   310 runs (    0.55 ms per run)
whisper_print_timings:   encode time =  5573.36 ms /     3 runs ( 1857.79 ms per run)
whisper_print_timings:   decode time =  2906.41 ms /   307 runs (    9.47 ms per run)
whisper_print_timings:   prompt time =   373.41 ms /     3 runs (  124.47 ms per run)
whisper_print_timings:    total time =  9468.37 ms

Vietnamese:

C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release>C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\main.exe -m C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin -f "C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Easy Vietnamese 1 - Whats typical Vietnamese [TubeRipper.com].wav" -l vi
whisper_init_from_file_no_state: loading model from 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\ggml-model-whisper-small.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =  464.68 MB
whisper_model_load: model size    =  464.44 MB
whisper_init_state: kv self size  =   15.75 MB
whisper_init_state: kv cross size =   52.73 MB
whisper_init_state: compute buffer (conv)   =   19.96 MB
whisper_init_state: compute buffer (encode) =  122.04 MB
whisper_init_state: compute buffer (cross)  =    5.86 MB
whisper_init_state: compute buffer (decode) =   36.17 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

run: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\Easy Vietnamese 1 - Whats typical Vietnamese [TubeRipper.com].wav' (960000 samples, 60.0 sec), 4 threads, 1 processors, lang = vi, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:01.720]   Chị có từ hào là người Việt Nam không?
[00:00:01.720 --> 00:00:03.620]   Rất là từ hào là người Việt Nam
[00:00:03.620 --> 00:00:05.120]   Chị hãy nói về những đặc trưng
[00:00:05.120 --> 00:00:09.360]   Người Việt Nam thì phải nói đến tạo dài và nói đến chiếc nông lá
[00:00:09.360 --> 00:00:12.520]   Và hơn nữa phẩm chất hãy nói về sự duyên gián
[00:00:12.520 --> 00:00:14.860]   Công dung nguồn hạnh vân vân rất là nhiều
[00:00:14.860 --> 00:00:17.460]   Vậy Việt Nam về du lịch thì thế nhớ nào có chị hả?
[00:00:17.460 --> 00:00:21.360]   Về du lịch ở Việt Nam thì khá phổ biến
[00:00:21.360 --> 00:00:27.160]   Thường thì người khách du lịch từ nước ngoài đến đây thì có thể đi du lịch ở miền Tây
[00:00:27.160 --> 00:00:32.360]   Có thể đến những tỉnh như là Bến Che, Long An, Cần Thơ, Bạc Liêu, Vân Vân
[00:00:32.360 --> 00:00:36.960]   Còn nếu miền Đông Nam Bộ thì có thể đi Tây Ninh, Củ Chi
[00:00:36.960 --> 00:00:40.280]   Có thể đi chiến qua Hà Nội
[00:00:40.280 --> 00:00:44.020]   Nếu mà đi ra khu vực miền Bắc miền Trung Hịt đi Hà Nội
[00:00:44.020 --> 00:00:47.020]   Mình thích ở Việt Nam đó là về con người Việt Nam
[00:00:47.020 --> 00:00:50.560]   Tâm hồi Việt Nam là thân thiện và họ hiếu khách nữa
[00:00:50.560 --> 00:00:55.880]   Ở Việt Nam thì mình thấy rất là nhiều nơi rất là đẹp và cũng có nhiều món ăn ngon nữa
[00:00:55.880 --> 00:01:00.000]   Món ăn thì nếu chúng đứng mình thì đã có một vài món ăn ở trên kênh đây


whisper_print_timings:     load time =   249.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    39.15 ms
whisper_print_timings:   sample time =   199.92 ms /   356 runs (    0.56 ms per run)
whisper_print_timings:   encode time =  5546.89 ms /     3 runs ( 1848.96 ms per run)
whisper_print_timings:   decode time =  3320.92 ms /   353 runs (    9.41 ms per run)
whisper_print_timings:   prompt time =   494.16 ms /     3 runs (  164.72 ms per run)
whisper_print_timings:    total time =  9910.90 ms

@bobqianic marked this pull request as ready for review September 21, 2023 12:27
@bobqianic (Collaborator, Author):

I'll provide a description of how this works tomorrow and update the test results. I was so busy last week because I was traveling.

Review comment on examples/main/main.cpp, lines 457 to 465:
std::ofstream open(const std::string & path) {
#if WIN32
std::ofstream file_out(ConvertUTF8toUTF16(path));
#else
std::ofstream file_out(path);
#endif
return file_out;
}

@ggerganov (Owner):

Can we avoid this change and simply call: std::ofstream fout(ConvertUTF8toUTF16(fname)); regardless if WIN32?

@bobqianic (Collaborator, Author):

> Can we avoid this change and simply call: std::ofstream fout(ConvertUTF8toUTF16(fname)); regardless if WIN32?

AFAIK, Linux might have some issues with UTF-16LE support. It either doesn't play well with it, or if it does, it's kinda glitchy. So, I was thinking we could have two versions of this ConvertUTF8toUTF16 function: one that genuinely converts and another basic one that doesn't really do anything. If we use those #if and #endif tags, we can have the do-nothing version for non-Windows stuff like Linux. But that's kinda misleading, right? I mean, the function's name literally says it's converting. We've gotta rethink this. I guess we're gonna have to throw in some platform-specific code using #if and #endif. It's a bit of a pain, but if we want this to work across different platforms, we might not have much choice.
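To make that concrete, here is a minimal sketch of the two-version idea, assuming _WIN32 as the platform guard; the exact signatures in the PR may differ:

#ifdef _WIN32
#include <windows.h>
#include <string>

// real conversion: on Windows the file-stream constructors accept wide paths
static std::wstring ConvertUTF8toUTF16(const std::string & s) {
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int) s.size(), nullptr, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int) s.size(), &w[0], n);
    return w;
}
#else
#include <string>

// do-nothing version: POSIX systems open UTF-8 narrow paths directly
static const std::string & ConvertUTF8toUTF16(const std::string & s) {
    return s;
}
#endif

// call sites can then be identical on every platform:
//     std::ofstream fout(ConvertUTF8toUTF16(fname));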

@linmi commented Oct 14, 2023

@bobqianic Thanks for the hard work. When I tested with -ml 1, some of the Chinese text could not be output.
(screenshots of the new output and of the original output attached)

@bobqianic added the "need feedback" label Oct 17, 2023
@bobqianic linked an issue Oct 20, 2023 that may be closed by this pull request
@bobqianic (Collaborator, Author):

> When using long sentences, swallowing words appear.

This issue has now been fixed. I forgot to flush the buffer at the end of the segment.

> And this is an English comma , not a Chinese one.

This is due to the limitations of the Whisper models; there's hardly anything we can do about it.

> English is not output according to words.

I need more time to look into this problem.

@linmi commented Oct 22, 2023

@bobqianic Still not normal output: 破产 and 小组.
(screenshot attached)

@bobqianic (Collaborator, Author):

> Still not normal output: 破产 and 小组.

Strange. Could you send over the audio?
I tested https://thatisbiz.fireside.fm/107, and it looks good.

@linmi commented Oct 22, 2023

> Still not normal output: 破产 and 小组.

> Strange. Could you send over the audio? I tested https://thatisbiz.fireside.fm/107, and it looks good.

十五年中国商业极简史.mp3.zip

./main -m models/ggml-medium.bin -l auto -f

@bobqianic (Collaborator, Author) commented Oct 22, 2023

> Still not normal output: 破产 and 小组.
>
> Strange. Could you send over the audio? I tested https://thatisbiz.fireside.fm/107, and it looks good.
>
> 十五年中国商业极简史.mp3.zip
>
> ./main -m models/ggml-medium.bin -l auto -f

(screenshots of the resulting transcription attached)

@@ -272,6 +273,51 @@ void whisper_print_progress_callback(struct whisper_context * /*ctx*/, struct wh
}
}

whisper_merged_tokens whisper_merge_tokens(struct whisper_context * ctx, const whisper_params & params, int s0, int n_segments) {
@ggerganov (Owner):

Hm, AFAIK such "post-processing" should not be necessary. If it makes a difference now, it most likely means that we have a bug in the tokenizer, which is actually very likely.

Will need to review this later in more detail, after #1422

@bobqianic (Collaborator, Author):

I just discovered some interesting code in OpenAI Whisper.
https://github.com/openai/whisper/blob/0efe52add20d980c850b0bac972adefd54d4eb8e/whisper/tokenizer.py#L277-L327

tokenizer.py

    def split_to_word_tokens(self, tokens: List[int]):
        if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
            # These languages don't typically use spaces, so it is difficult to split words
            # without morpheme analysis. Here, we instead split words at any
            # position where the tokens are decoded as valid unicode points
            return self.split_tokens_on_unicode(tokens)

        return self.split_tokens_on_spaces(tokens)

    def split_tokens_on_unicode(self, tokens: List[int]):
        decoded_full = self.decode_with_timestamps(tokens)
        replacement_char = "\ufffd"

        words = []
        word_tokens = []
        current_tokens = []
        unicode_offset = 0

        for token in tokens:
            current_tokens.append(token)
            decoded = self.decode_with_timestamps(current_tokens)

            if (
                replacement_char not in decoded
                or decoded_full[unicode_offset + decoded.index(replacement_char)]
                == replacement_char
            ):
                words.append(decoded)
                word_tokens.append(current_tokens)
                current_tokens = []
                unicode_offset += len(decoded)

        return words, word_tokens

    def split_tokens_on_spaces(self, tokens: List[int]):
        subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
        words = []
        word_tokens = []

        for subword, subword_tokens in zip(subwords, subword_tokens_list):
            special = subword_tokens[0] >= self.eot
            with_space = subword.startswith(" ")
            punctuation = subword.strip() in string.punctuation
            if special or with_space or punctuation or len(words) == 0:
                words.append(subword)
                word_tokens.append(subword_tokens)
            else:
                words[-1] = words[-1] + subword
                word_tokens[-1].extend(subword_tokens)

        return words, word_tokens

@morukutsu:

Regarding the tokenization: I just started today getting familiar with whisper.cpp and Japanese audio files, especially with per-token timestamps. While I don't have a fix, I can provide a test case:

Audio file: test_1.wav.zip
Ground truth: 5月1日のメーデーに、パリでは毎年働く人たちがデモを行なっています
Whisper: 5月1日のメイデーに、パリでは毎年働く人たちがデモを行っています。
./main -m models/ggml-large.bin -f samples/test_1.wav -l japanese
=> Almost perfect result

With -ml 1, some tokens are output as "�", as in the log below:


[00:00:00.000 --> 00:00:00.260] 5
[00:00:00.260 --> 00:00:00.380] 月
[00:00:00.380 --> 00:00:00.760] 1
[00:00:00.760 --> 00:00:00.760] 日
[00:00:00.760 --> 00:00:00.950] の
[00:00:00.950 --> 00:00:01.140] メ
[00:00:01.140 --> 00:00:01.330] イ
[00:00:01.330 --> 00:00:01.520] デ
[00:00:01.520 --> 00:00:01.710] ー
[00:00:01.710 --> 00:00:02.140] に
[00:00:02.140 --> 00:00:02.300] 、
[00:00:02.300 --> 00:00:02.320] パ
[00:00:02.320 --> 00:00:02.460] リ
[00:00:02.460 --> 00:00:02.790] では
[00:00:02.790 --> 00:00:02.900] �
[00:00:02.900 --> 00:00:02.950] �
[00:00:02.950 --> 00:00:03.110] 年
[00:00:03.110 --> 00:00:03.230] �
[00:00:03.230 --> 00:00:03.270] �
[00:00:03.270 --> 00:00:03.430] く
[00:00:03.430 --> 00:00:03.590] 人
[00:00:03.590 --> 00:00:03.770] た
[00:00:03.770 --> 00:00:03.950] ち
[00:00:03.950 --> 00:00:04.070] が
[00:00:04.070 --> 00:00:04.230] デ
[00:00:04.230 --> 00:00:04.390] モ
[00:00:04.390 --> 00:00:04.550] を
[00:00:04.550 --> 00:00:04.710] 行
[00:00:04.710 --> 00:00:05.040] って
[00:00:05.040 --> 00:00:05.700] います
[00:00:05.700 --> 00:00:05.700] 。

By debug printing tokens in the code

# printing tokens in whisper_wrap_segment()
DBGDBG 15: � token id: 7256
DBGDBG 16: � token id: 236
DBGDBG 17: 年 token id: 5157
DBGDBG 18: � token id: 9502
DBGDBG 19: � token id: 235

# by printing tokens in whisper_model_load() we can see that:
token 7256 is: 0xE6AF
token 236 is: 0x8E
=> the utf8 sequence of 毎

token 9502 is: 0xE583
token 235 is: 0x8D
=> the utf8 sequence of 働

So UTF-8 sequences are cut in the middle of characters. This can be solved by the post-processing step mentioned above.
But I am wondering: is this something that can be fixed during model loading/conversion? Or was the model also trained with some tokens that are incomplete UTF-8 sequences?

This test was done with the latest large model and code.
Thank you!
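For reference, a minimal C++ sketch of the post-processing step discussed above, which buffers token texts until they form complete UTF-8 sequences. The helper names are illustrative and timestamp merging is omitted:

#include <string>
#include <vector>

// true if s is a sequence of complete, well-formed UTF-8 code points
static bool is_complete_utf8(const std::string & s) {
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = (unsigned char) s[i];
        size_t len = 0;
        if      (c < 0x80)         len = 1; // ASCII
        else if ((c >> 5) == 0x06) len = 2; // 110xxxxx
        else if ((c >> 4) == 0x0E) len = 3; // 1110xxxx
        else if ((c >> 3) == 0x1E) len = 4; // 11110xxx
        else                       return false; // stray continuation byte
        if (i + len > s.size())    return false; // sequence cut off at the end
        for (size_t j = 1; j < len; j++) {
            if (((unsigned char) s[i + j] & 0xC0) != 0x80) return false; // bad continuation byte
        }
        i += len;
    }
    return true;
}

// merge raw token texts so that each emitted string is valid UTF-8
static std::vector<std::string> merge_utf8_tokens(const std::vector<std::string> & token_texts) {
    std::vector<std::string> out;
    std::string buf;
    for (const auto & t : token_texts) {
        buf += t;                  // keep buffering while the tail is incomplete
        if (is_complete_utf8(buf)) {
            out.push_back(buf);    // the merged timestamps would be attached here too
            buf.clear();
        }
    }
    if (!buf.empty()) out.push_back(buf); // genuinely invalid leftovers, if any
    return out;
}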

@magicse commented Nov 26, 2023

Use _WIN32 instead of WIN32 when possible.
GCC (MinGW) does not define WIN32 on Windows when it compiles in strict-standard mode, such as -std=c++11 or a later standard.
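In other words, a guard like this works with both compilers (the branch bodies are placeholders):

// _WIN32 is defined by both MSVC and MinGW GCC, even with -std=c++11;
// WIN32 (no underscore) is absent under MinGW's strict-standard mode.
#if defined(_WIN32)
    // Windows-specific path
#else
    // portable path
#endif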

@bobqianic (Collaborator, Author):

> So UTF-8 sequences are cut in the middle of characters. This can be solved by the post-processing step mentioned above.
> But I am wondering: is this something that can be fixed during model loading/conversion? Or was the model also trained with some tokens that are incomplete UTF-8 sequences?

I recently did some research and found out why whisper.cpp behaves oddly. It skips the final step where token IDs are decoded by the tokenizer; instead, it just maps token IDs directly to their token text. OpenAI's Whisper uses a tokenizer called tiktoken, which merges internally using a regex pattern string and mergeable ranks. Tiktoken isn't too big (the whole repo has around 3000 lines of code), so I believe creating a full C++ implementation of it is feasible. @ggerganov @morukutsu

whisper/decoding.py Line 757

        texts: List[str] = [tokenizer.decode(t).strip() for t in tokens]

whisper/tokenizer.py Line 164-166

    def decode(self, token_ids: List[int], **kwargs) -> str:
        token_ids = [t for t in token_ids if t < self.timestamp_begin]
        return self.encoding.decode(token_ids, **kwargs)

whisper/tokenizer.py Line 330-363

@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
    ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in open(vocab_path) if line)
    }
    n_vocab = len(ranks)
    special_tokens = {}

    specials = [
        "<|endoftext|>",
        "<|startoftranscript|>",
        *[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
        *[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
    ]

    for token in specials:
        special_tokens[token] = n_vocab
        n_vocab += 1

    return tiktoken.Encoding(
        name=os.path.basename(vocab_path),
        explicit_n_vocab=n_vocab,
        pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        mergeable_ranks=ranks,
        special_tokens=special_tokens,
    )

tiktoken/core.py Line 12-50

class Encoding:
    def __init__(
        self,
        name: str,
        *,
        pat_str: str,
        mergeable_ranks: dict[bytes, int],
        special_tokens: dict[str, int],
        explicit_n_vocab: Optional[int] = None,
    ):
        """Creates an Encoding object.

        See openai_public.py for examples of how to construct an Encoding object.

        Args:
            name: The name of the encoding. It should be clear from the name of the encoding
                what behaviour to expect, in particular, encodings with different special tokens
                should have different names.
            pat_str: A regex pattern string that is used to split the input text.
            mergeable_ranks: A dictionary mapping mergeable token bytes to their ranks. The ranks
                must correspond to merge priority.
            special_tokens: A dictionary mapping special token strings to their token values.
            explicit_n_vocab: The number of tokens in the vocabulary. If provided, it is checked
                that the number of mergeable tokens and special tokens is equal to this number.
        """
        self.name = name

        self._pat_str = pat_str
        self._mergeable_ranks = mergeable_ranks
        self._special_tokens = special_tokens

        self.max_token_value = max(
            max(mergeable_ranks.values()), max(special_tokens.values(), default=0)
        )
        if explicit_n_vocab:
            assert len(mergeable_ranks) + len(special_tokens) == explicit_n_vocab
            assert self.max_token_value == explicit_n_vocab - 1

        self._core_bpe = _tiktoken.CoreBPE(mergeable_ranks, special_tokens, pat_str)

@ggerganov (Owner):

Probably we can utilize the BPE tokenizer implementation from llama.cpp

@morukutsu:

@bobqianic Thank you for the research! So I had a look at the tiktoken documentation and it answered my original question. The tokenizer works byte by byte so it can lead to a representation where tokens are incomplete utf-8 sequences. See here, section 4, "Warning: although .decode() can be applied to single tokens, beware that it can be lossy for tokens that aren't on utf-8 boundaries.".

So as long as text is encoded and decoded with the same tokenizer, it should work correctly.
The current tokenize function seems to be a reimplementation of this one.

In my current understanding:

If we decode the tokens as a byte sequence (for each token, join the byte representation of each) and then interpret it as a utf-8 string then it works OK.

But since the model is trained on abstract byte sequences, -ml 1 will sometimes lead to incomplete UTF-8 sequences when trying to convert to a string token by token, because it exposes the raw tokens the model was trained with. In that specific use case, a post-processing step is required to merge the timestamps of contiguous "invalid" tokens.
I don't think the tokenizer itself will have an impact on this problem: the issue doesn't show up in English (or in other languages, as long as no character is wider than one byte).

Please correct me if I'm wrong; that's just my analysis after re-thinking about the problem with the info you added!

@bobqianic bobqianic mentioned this pull request Jan 13, 2024
@bobqianic (Collaborator, Author):

> But since the model is trained on abstract byte sequences, -ml 1 will sometimes lead to incomplete UTF-8 sequences when trying to convert to a string token by token, because it exposes the raw tokens the model was trained with. In that specific use case, a post-processing step is required to merge the timestamps of contiguous "invalid" tokens.

Yes, this is correct.

> I don't think the tokenizer itself will have an impact on this problem?

I'm not entirely certain, but it seems there's an ongoing issue in whisper.cpp. When you look at the decoding output at the token level, you'll notice that whisper.cpp often requires two tokens to represent text that OpenAI's Whisper represents with a single token (for example, [47911] versus [35414, 9497]). There's also a problem with the timestamp tokens: ideally, each segment should have both a start and an end timestamp token, but in whisper.cpp this doesn't happen consistently.

@bobqianic (Collaborator, Author):

I'm going to close this PR and move everything to #1768
