Unicode Error for Hindi transcription #1700

rahulshivajipawar · 2023-12-29T15:05:25Z

When transcribing a Hindi audio file, I get invalid Unicode characters in the output.

I have noticed this with many Hindi files.

I used the whisper-large-v2 model for inference on CPU, and I see the same issue when running inference on GPU.

My guess is that the model's token output (BPE encoded) is not being mapped correctly to Unicode characters.
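To illustrate the guess above, here is a minimal Python sketch (not whisper.cpp code) of how byte-level tokens can produce invalid Unicode. It assumes the tokenizer emits raw UTF-8 byte chunks and that a token boundary can fall in the middle of a multi-byte character; the token boundaries and the helper names `decode_per_token` and `decode_buffered` are made up for illustration.

```python
import codecs

def decode_per_token(token_bytes):
    """Naive approach: decode each token's bytes on its own.

    Any token that starts or ends inside a multi-byte character
    yields U+FFFD replacement characters.
    """
    return "".join(b.decode("utf-8", errors="replace") for b in token_bytes)

def decode_buffered(token_bytes):
    """Feed bytes to an incremental decoder, which holds back an
    incomplete sequence until the remaining bytes arrive."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(b) for b in token_bytes]
    pieces.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(pieces)

text = "नमस्ते"
raw = text.encode("utf-8")  # each Devanagari code point takes 3 bytes

# Hypothetical token boundaries that cut through the middle of characters.
tokens = [raw[:4], raw[4:8], raw[8:]]

naive = decode_per_token(tokens)
buffered = decode_buffered(tokens)

print(repr(naive))     # contains U+FFFD replacement characters
print(repr(buffered))  # the original text
```

If the problem is of this kind, the fix would be to accumulate token bytes and emit text only once a complete UTF-8 sequence is available, rather than converting token by token.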

rahulshivajipawar · 2023-12-29T15:08:09Z

This is an example of a file on which it fails - https://drive.google.com/file/d/1_BFuNOAqM3yv4P2A0i8KOT_RZ6LnYSCt/view?usp=sharing

bobqianic · 2023-12-29T15:31:23Z

I'll explore this further. There might be an issue with the tokenizer.
Take a look at this discussion for more insight: #1313 (comment)

rahulshivajipawar · 2023-12-29T15:42:10Z

> I'll explore this further. There might be an issue with the tokenizer. Take a look at this discussion for more insight: #1313 (comment)

I looked at the discussion you referenced. This issue looks similar to #1313 (comment).

rahulshivajipawar · 2023-12-30T17:07:09Z

Another side observation - I have found the accuracy for Hindi transcription via whisper.cpp to be much lower than when using openai API directly.

bobqianic · 2023-12-30T17:14:02Z

> Another side observation - I have found the accuracy for Hindi transcription via whisper.cpp to be much lower than when using openai API directly.

Agreed. You don't even need to use their API. Just run pip install openai-whisper and use large-v2 locally, and you can get better results than whisper.cpp in most languages other than English.

bobqianic added the bug label Dec 29, 2023

bobqianic linked a pull request Jan 14, 2024 that will close this issue: Fix the decoding issues #1768 (Open, 11 tasks)
