How to use with languages with non-Latin characters? #219

Open
nerixim opened this issue Jan 13, 2025 · 2 comments
Labels
bug Something isn't working question Further information is requested

Comments

nerixim commented Jan 13, 2025

Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following.

E -
6
) -
18 BE E# #B
BE
B
5E
A
471 64123
- 5

I've tried setting the language based on code I found inside this repo.

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser_config = ParseFileConfig(
    llm_model_name="gpt-4o",
    # method=method,
    # strategy=strategy,
    model=model,
    language="ja",
    # parsing_instruction=parsing_instruction,
)

parser_builder = ParserBuilder()
parser = parser_builder.build(parser_config)
megaparse = MegaParse(parser)
response = megaparse.load("./document.pdf")
print(response)
megaparse.save("./document.md")

This is probably related to #92.


dosubot bot commented Jan 13, 2025

Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The issue you're experiencing with processing Japanese PDFs in MegaParse might be related to the library's handling of non-Latin characters. The DoctrParser class in the MegaParse codebase is responsible for text extraction from PDFs and includes language-specific configurations through the TextDetConfig and TextRecoConfig classes [1]. However, it seems that there might be limitations or issues with the current support for non-Latin scripts, as highlighted in the related GitHub issue you mentioned.

To address this, you might want to ensure that the language settings in your configuration are correctly set to Japanese. Additionally, check if there are any updates or patches in the repository that address this issue. If the problem persists, it might be worth exploring alternative text extraction libraries that have robust support for Japanese and other non-Latin scripts, or contributing to the discussion in the GitHub issue to see if there are any workarounds or planned improvements.
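
When comparing language settings or alternative extraction libraries, it can help to measure whether the output contains any Japanese text at all. The following is a minimal sketch using only the standard library; the helper name `japanese_ratio` and the choice of Unicode ranges are illustrative assumptions, not part of MegaParse.

```python
def japanese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are kana or CJK ideographs.

    Hypothetical helper for sanity-checking parser output; not a MegaParse API.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        cp = ord(c)
        return (
            0x3040 <= cp <= 0x309F     # Hiragana
            or 0x30A0 <= cp <= 0x30FF  # Katakana
            or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs
            or 0xFF66 <= cp <= 0xFF9F  # Halfwidth Katakana
        )

    return sum(is_japanese(c) for c in chars) / len(chars)


# The gibberish reported above contains no Japanese characters at all.
sample = "E -\n6\n) -\n18 BE E# #B"
print(japanese_ratio(sample))
```

A ratio near zero for a document known to be in Japanese suggests the text layer or OCR model is not decoding the script, rather than a formatting problem further down the pipeline.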


@dosubot dosubot bot added bug Something isn't working labels Jan 13, 2025
chloedia (Collaborator) commented

Hi @nerixim, are you using the megaparse Python package from PyPI or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the Unstructured Parser should work with the PyPI package!
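
One way to answer the PyPI-versus-clone question is to inspect the installed distribution's metadata. This is a rough sketch under stated assumptions: `describe_install` is a hypothetical helper, and it relies on the PEP 610 convention that installers write a `direct_url.json` file for installs from a local path or VCS URL and omit it for installs by name from an index. Older installers may not follow this, so treat the result as a hint.

```python
import json
from importlib import metadata


def describe_install(package: str) -> dict:
    """Summarize how a package was installed (hypothetical helper, not a MegaParse API)."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return {"installed": False}

    # read_text returns None when the file is absent from the distribution metadata.
    raw = dist.read_text("direct_url.json")
    info = {
        "installed": True,
        "version": dist.version,
        # No direct_url.json usually means the package came from an index such as PyPI.
        "from_index": raw is None,
    }
    if raw is not None:
        direct = json.loads(raw)
        info["url"] = direct.get("url")
        info["editable"] = direct.get("dir_info", {}).get("editable", False)
    return info


print(describe_install("megaparse"))
```

If `from_index` is false and `url` points at a local checkout, the code being run is the unreleased repository version rather than the PyPI release.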

@dosubot dosubot bot added the question Further information is requested label Jan 14, 2025