Unicode Error for Hindi transcription #1700

Open
rahulshivajipawar opened this issue Dec 29, 2023 · 5 comments · May be fixed by #1768
Labels
bug Something isn't working

Comments

@rahulshivajipawar

When transcribing a Hindi audio file, I get invalid Unicode characters in the output.

[Screenshot: 2023-12-29 at 8 29 09 PM]

I have noticed this with many Hindi files.

I used the whisper-large-v2 model for inference on CPU, and I see the same issue when running inference on GPU.

My guess is that the Whisper model's output tokens (BPE-encoded) are not being mapped correctly to Unicode characters.
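To illustrate the kind of failure being guessed at here, the following is a minimal sketch and not whisper.cpp's actual decoding code. It assumes a byte-level tokenizer whose token boundaries can fall inside a multi-byte UTF-8 character. The `pieces` list is a hypothetical split chosen for the demonstration; the real Whisper vocabulary splits text differently. Decoding each piece on its own yields U+FFFD replacement characters, while buffering incomplete byte sequences across pieces recovers the text.

```python
import codecs

# Each Devanagari code point occupies 3 bytes in UTF-8.
text = "नमस्ते"
raw = text.encode("utf-8")

# Hypothetical token byte pieces (an assumption for illustration only):
# the cuts at offsets 4 and 8 land in the middle of characters.
pieces = [raw[:4], raw[4:8], raw[8:]]


def decode_per_token(token_bytes):
    """Decode every piece independently; partial characters become U+FFFD."""
    return "".join(p.decode("utf-8", errors="replace") for p in token_bytes)


def decode_buffered(token_bytes):
    """Carry incomplete byte sequences over to the next piece before decoding."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in token_bytes]
    out.append(decoder.decode(b"", final=True))  # flush any trailing bytes
    return "".join(out)


broken = decode_per_token(pieces)
fixed = decode_buffered(pieces)
```

If the garbled output in the screenshot has this cause, the invalid characters would appear wherever a token boundary cuts through a character, which would affect scripts with multi-byte characters such as Devanagari far more than ASCII text.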

@rahulshivajipawar
Author

Here is an example of a file on which it fails: https://drive.google.com/file/d/1_BFuNOAqM3yv4P2A0i8KOT_RZ6LnYSCt/view?usp=sharing

@bobqianic
Collaborator

I'll explore this further. There might be an issue with the tokenizer.
Take a look at this discussion for more insight: #1313 (comment)

@bobqianic added the bug label on Dec 29, 2023
@rahulshivajipawar
Author

> I'll explore this further. There might be an issue with the tokenizer. Take a look at this discussion for more insight: #1313 (comment)

I looked at the discussion you referenced; this issue looks similar to #1313 (comment).

@rahulshivajipawar
Author

Another side observation: I have found the accuracy of Hindi transcription via whisper.cpp to be much lower than when using the OpenAI API directly.

@bobqianic
Collaborator

> Another side observation: I have found the accuracy of Hindi transcription via whisper.cpp to be much lower than when using the OpenAI API directly.

Agreed. You don't even need to use their API. Just by running `pip install openai-whisper` and using large-v2 locally, you can get better results than whisper.cpp in most languages other than English.
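For anyone who wants to reproduce that comparison, here is a minimal sketch of running large-v2 locally with the `openai-whisper` package. The function name `transcribe_hindi` and the audio path in the usage comment are hypothetical, and passing `language="hi"` is an assumption made to force Hindi decoding instead of relying on auto-detection.

```python
def transcribe_hindi(audio_path, model_name="large-v2"):
    """Transcribe one audio file with the reference Whisper implementation."""
    # Imported lazily so this module loads even without the package installed.
    import whisper  # provided by `pip install openai-whisper`

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(audio_path, language="hi")
    return result["text"]


# Usage (the file name is a placeholder):
# print(transcribe_hindi("hindi_sample.wav"))
```

Running the same file through whisper.cpp and comparing the two transcripts would show whether the invalid characters and the accuracy gap are specific to whisper.cpp.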

@bobqianic linked a pull request on Jan 14, 2024 that will close this issue (11 tasks)