Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.
The issue you're experiencing with Japanese PDFs in MegaParse may come from how the library handles non-Latin characters. The DoctrParser class in the MegaParse codebase is responsible for text extraction from PDFs and takes language-specific settings through the TextDetConfig and TextRecoConfig classes [1]. Support for non-Latin scripts appears to be limited at the moment, as the related GitHub issue you mentioned also suggests.
To address this, first make sure the language settings in your configuration are set to Japanese, and check whether any updates or patches in the repository address the problem. If it persists, consider trying a text extraction library with robust support for Japanese and other non-Latin scripts, or join the discussion in the GitHub issue to ask about workarounds or planned improvements.
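To compare parsers or language settings objectively, it can help to measure how much of the extracted text is actually Japanese. The snippet below is a minimal sketch using only the Python standard library; the helper names `is_japanese_char` and `japanese_ratio` are hypothetical and are not part of MegaParse. It only counts Hiragana, Katakana, and the basic CJK Unified Ideographs block, so Japanese punctuation and rarer kanji are treated as non-Japanese.

```python
def is_japanese_char(ch: str) -> bool:
    """Return True if ch falls in a common Japanese Unicode block."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F      # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs (most kanji)
    )


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look Japanese."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_japanese_char(ch) for ch in chars) / len(chars)
```

A ratio near zero on a PDF known to be Japanese suggests the parser is emitting mis-recognized characters rather than the real text, which makes it easier to tell whether a configuration change had any effect.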
Hi @nerixim, are you using the PyPI Python package megaparse or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the Unstructured Parser should work with the PyPI package!
Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following.
I've tried setting the language based on code I found inside this repo.
Probably related to #92.