Invalid encoding #1761

Open
thewh1teagle opened this issue Jan 12, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@thewh1teagle
Contributor

thewh1teagle commented Jan 12, 2024

When transcribing Hebrew audio files, I get an invalid UTF-8 error from whisper-rs, so I guess it fails to decode some of the text.
It happens only when getting the text of individual tokens with the function
whisper.cpp#L5988::whisper_full_get_token_text_from_state

but with
whisper.cpp#L5972::whisper_full_get_segment_text_from_state

it works
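
The difference between the two calls can be illustrated with a minimal sketch. It assumes that a token boundary can fall in the middle of a multi-byte character, which is what the error suggests; the split position and the helper names `split_like_tokens` and `decodes` are hypothetical and are not part of whisper.cpp or whisper-rs.

```rust
/// Hypothetical helper: cut the UTF-8 bytes of `text` at byte offset `at`,
/// mimicking two consecutive tokens whose boundary ignores character limits.
fn split_like_tokens(text: &str, at: usize) -> (Vec<u8>, Vec<u8>) {
    let bytes = text.as_bytes();
    (bytes[..at].to_vec(), bytes[at..].to_vec())
}

/// Returns true when `bytes` is a complete, valid UTF-8 sequence.
fn decodes(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Joins two byte pieces, the way segment-level text is assembled
/// before it is decoded as a whole.
fn join(a: &[u8], b: &[u8]) -> Vec<u8> {
    let mut joined = a.to_vec();
    joined.extend_from_slice(b);
    joined
}
```

Each Hebrew letter takes two bytes in UTF-8, so cutting "שלום" at byte offset 3 leaves both pieces undecodable on their own, while the joined bytes decode fine. That matches the per-token call failing while the per-segment call works.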

tazz4843/whisper-rs#115
audio.mp3

@bobqianic added the bug and enhancement labels on Jan 13, 2024
@bobqianic
Collaborator

I plan to address this issue over the weekend. Many users have reported it, and it seems to stem from the absence of a tokenizer in the decoding stage. #1313 (comment)

@bobqianic bobqianic linked a pull request Jan 14, 2024 that will close this issue
11 tasks
@bobqianic
Collaborator

Unfortunately, we won't be able to resolve this issue at its source: BPE tokenization splits Unicode characters into subtokens, so a single token can hold an incomplete character. However, the updated version I've proposed in #1768 includes a workaround: you can set max_len=1 so that you receive the smallest valid segment. Alternatively, you can keep using the whisper_full_get_token_text_from_state function. Accumulating the token text in a buffer and checking the buffer with the newly proposed whisper_utf8_is_valid function is also a viable approach.
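
The buffering approach might look roughly like the sketch below. It assumes the caller can get the raw bytes of each token rather than an already-decoded string. The `Utf8TokenBuffer` type is hypothetical, and the standard library's UTF-8 validation stands in for `whisper_utf8_is_valid`, whose exact signature is not shown in this thread.

```rust
/// Hypothetical accumulator for raw token bytes. Text is released only
/// once the buffered bytes form complete UTF-8.
pub struct Utf8TokenBuffer {
    pending: Vec<u8>,
}

impl Utf8TokenBuffer {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Appends the bytes of one token. Returns the decoded text when the
    /// buffer is valid UTF-8, or `None` while a character is still incomplete.
    pub fn push(&mut self, token_bytes: &[u8]) -> Option<String> {
        self.pending.extend_from_slice(token_bytes);
        // Stand-in for whisper_utf8_is_valid: String::from_utf8 validates
        // the bytes and hands them back unchanged on failure.
        match String::from_utf8(std::mem::take(&mut self.pending)) {
            Ok(text) => Some(text),
            Err(err) => {
                // Keep the incomplete bytes and wait for the next token.
                self.pending = err.into_bytes();
                None
            }
        }
    }
}
```

Pushing the bytes `[0xD7, 0xA9, 0xD7]` returns `None` because the second character is cut in half, and pushing `[0x9C]` next returns the two-letter string "של". This sketch never discards bytes, so a genuinely corrupt byte would stall the buffer. A real implementation would probably need a size cap or a lossy fallback.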

@bobqianic bobqianic removed a link to a pull request Feb 5, 2024
11 tasks
@bobqianic removed the bug label on Feb 5, 2024