Broken transcribing for russian language #59

Closed
takezie opened this issue Nov 26, 2023 · 11 comments


takezie commented Nov 26, 2023

After voice input: "Это проверка русского языка для... "
I got transcribed text: "Это пѐогеѐка ѐууукого џзыка длџ..."

It looks like Cyrillic "я" is replaced with "џ", "р" with "ѐ", "н" with "о", "с" with "у", and so on.
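
A minimal sketch of how the substitutions above can be inspected, assuming Python 3.8 or later. The helper name `describe` is made up for illustration and is not part of the plugin. It prints the code point and UTF-8 bytes of each expected/observed pair from the report:

```python
# Pairs of (expected, observed) characters taken from the report above.
PAIRS = [("я", "џ"), ("р", "ѐ"), ("н", "о"), ("с", "у")]


def describe(expected: str, observed: str) -> str:
    """Format the code point and UTF-8 bytes of both characters (hypothetical helper)."""
    e = expected.encode("utf-8")
    o = observed.encode("utf-8")
    return (
        f"{expected} U+{ord(expected):04X} [{e.hex(' ')}] -> "
        f"{observed} U+{ord(observed):04X} [{o.hex(' ')}]"
    )


for exp, obs in PAIRS:
    print(describe(exp, obs))
```

In each pair the first byte is unchanged and only the second byte differs: by 0x10 for я/џ and р/ѐ, and in the low bits for н/о and с/у.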

Collaborator

royshil commented Nov 27, 2023

thanks for the issue report.
are you on Windows?

royshil added this to the 0.0.8 milestone Nov 27, 2023
Author

takezie commented Nov 27, 2023

Yes. Microsoft Windows [Version 10.0.22631.2715]

Collaborator

royshil commented Nov 28, 2023

@takezie can you perhaps give me a recording of audio that produces this problem so i can test on my end?
i know some Russian but not good enough to effectively debug

Author

takezie commented Nov 28, 2023

@royshil Here is a set of pangrams: every phrase contains the full set of Cyrillic characters.

mp3: google.drive

Transcript:

А ещё хорошо бы уметь всем на зависть чётко и наглядно писать буквы и цифры.

Аэрофотосъёмка ландшафта уже выявила земли богачей и процветающих крестьян.

Бегом марш! У месторождения кварцующихся фей без слёз хочется электрическую пыль.

Безмозглый широковещательный цифровой передатчик сужающихся экспонент.

Блеф разъедает ум, чаще цыгана живёшь беспокойно, юля — грех это!

В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!

Вопрос футбольных энциклопедий замещая чушью: эй, где съеден ёж?

Всё ускоряющаяся эволюция компьютерных технологий предъявила жёсткие требования к производителям как собственно вычислительной техники, так и периферийных устройств.

Вступив в бой с шипящими змеями — эфой и гадюкой, — маленький, цепкий, храбрый ёж съел их.

Государев указ: душегубцев да шваль всякую высечь, да калёным железом по щекам этих физиономий съездить!

Друг мой эльф! Яшке б свёз птиц южных чащ!

Завершён ежегодный съезд эрудированных школьников, мечтающих глубоко проникнуть в тайны физических явлений и химических реакций.

Collaborator

royshil commented Nov 29, 2023

works for me...
[attached image]

Author

takezie commented Nov 29, 2023

Tried on another PC, same problem.
[attached image: Screenshot_1]

Very strange. OK, I'll try to build it on my PC; maybe something is wrong with the installed locales...

Collaborator

royshil commented Nov 29, 2023

so this image you attached is wrong? it looks like the Cyrillic letters are showing up... are there specific letters that have problems?

Author

takezie commented Nov 30, 2023

Your output also has wrong characters, but different ones.

I built it on my PC and got the same problem. This is what I'm getting at the moment:

Original:
А ещё хорошо бы уметь всем на зависть.

Transcribed:
А еще хоѐошо было уметќ гуем оа загџзќ.

ѐ (\xD1\x90) should be р (\xD1\x80)
ќ (\xD1\x9C) should be ь (\xD1\x8C)
and so on...

It looks like the errors land in the \x90...\x9A range, but then things get weirder.
In гуем, у (\xD1\x83) should be с (\xD1\x81), yet in уметќ the same bytes \xD1\x83 are transcribed correctly.

And in your sample, уметќ is broken (the correct spelling is уметь) and it is decoded the same way for both of us, but хоѐошо is decoded wrongly for me and correctly for you.

I don't understand how this is possible.
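
The byte comparison above can be reproduced with a short script. This is a sketch under stated assumptions: Python 3, the helper name `byte_diffs` is hypothetical, and it only compares words whose expected and observed forms encode to the same number of bytes. The full sentences cannot be compared directly because the transcription also changes words, for example "бы" became "было".

```python
def byte_diffs(expected: str, observed: str):
    """Return (offset, expected_byte, observed_byte, xor) for each differing UTF-8 byte.

    Hypothetical helper; raises ValueError if the encoded lengths differ.
    """
    e = expected.encode("utf-8")
    o = observed.encode("utf-8")
    if len(e) != len(o):
        raise ValueError("encoded lengths differ; compare word by word")
    return [(i, a, b, a ^ b) for i, (a, b) in enumerate(zip(e, o)) if a != b]


# Word pairs (expected, observed) taken from the sample in this comment.
WORDS = [("хорошо", "хоѐошо"), ("уметь", "уметќ"), ("всем", "гуем")]

for exp, obs in WORDS:
    for offset, a, b, x in byte_diffs(exp, obs):
        print(f"{exp} -> {obs}: byte {offset}: {a:#04x} -> {b:#04x} (xor {x:#04x})")
```

For these three words every corrupted byte is the second byte of a two-byte sequence and differs from the expected byte in one bit, which points at corruption of individual bytes rather than a wrong code page.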

Author

takezie commented Nov 30, 2023

@royshil could you take a look at this PR, which probably solves the same problem? I'm not strong in C++, but maybe it will be useful.

github.com/ggerganov/whisper.cpp/pull/1313

Collaborator

royshil commented Nov 30, 2023

@takezie yes ive seen it. i have my own fix which i think is more complete https://github.com/occ-ai/obs-localvocal/blob/master/src/transcription-filter.cpp#L249
however it looks like there's a bit more work needed
it's not just Russian that's affected by this Whisper.cpp bug. i've had people say Polish, Greek, Chinese, Korean..
so anything i do here needs to support all languages

Collaborator

royshil commented Jun 6, 2024

stale

royshil closed this as completed Jun 6, 2024