How to use with languages with non-Latin characters? #219

nerixim · 2025-01-13T17:02:35Z

Hi.
When I try to process a PDF in Japanese, it produces gibberish like the following.

E -
6
) -
18 BE E# #B
BE
B
5E
A
471 64123
- 5

I've tried setting the language based on code I found inside this repo.

import os

from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser_config = ParseFileConfig(
    llm_model_name="gpt-4o",
    # method=method,
    # strategy=strategy,
    model=model,
    language="ja",
    # parsing_instruction=parsing_instruction,
)

parser_builder = ParserBuilder()
parser = parser_builder.build(parser_config)
megaparse = MegaParse(parser)
response = megaparse.load("./document.pdf")
print(response)
megaparse.save("./document.md")

Probably related to #92.


dosubot · 2025-01-13T17:03:49Z

Hello @nerixim! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

The issue you're experiencing with processing Japanese PDFs in MegaParse might be related to the library's handling of non-Latin characters. The DoctrParser class in the MegaParse codebase is responsible for text extraction from PDFs and includes language-specific configurations through the TextDetConfig and TextRecoConfig classes ^[1]. However, it seems that there might be limitations or issues with the current support for non-Latin scripts, as highlighted in the related GitHub issue you mentioned.

To address this, you might want to ensure that the language settings in your configuration are correctly set to Japanese. Additionally, check if there are any updates or patches in the repository that address this issue. If the problem persists, it might be worth exploring alternative text extraction libraries that have robust support for Japanese and other non-Latin scripts, or contributing to the discussion in the GitHub issue to see if there are any workarounds or planned improvements.



chloedia · 2025-01-14T09:16:53Z

Hi @nerixim, are you using the PyPI package megaparse or the cloned repository? (The current codebase is not yet released, so there are a lot of changes.) Using MegaParse with the Unstructured Parser should work with the PyPI package!
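
When comparing the PyPI package against the cloned repository, it can help to check the output programmatically instead of eyeballing it. The following is a minimal, standard-library-only sketch, not part of MegaParse: the helper name `japanese_ratio` and the code point ranges chosen are illustrative assumptions. It reports what fraction of the non-whitespace characters in a string are hiragana, katakana, or common CJK ideographs.

```python
# Hypothetical helper (not part of MegaParse): estimates how much of a
# string is written in Japanese script, using fixed Unicode ranges.
JAPANESE_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (common kanji)
)


def is_japanese_char(ch: str) -> bool:
    """Return True if the character falls in one of the ranges above."""
    code = ord(ch)
    return any(low <= code <= high for low, high in JAPANESE_RANGES)


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or kanji."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_japanese_char(ch) for ch in chars) / len(chars)
```

Assuming the value returned by `megaparse.load(...)` is the extracted text as a string (as the `print(response)` call in the original snippet suggests), calling `japanese_ratio(response)` on a Japanese document should give a value well above zero. A value near zero, as the gibberish sample above would produce, suggests the text recognition step is not handling Japanese script.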

dosubot bot added the bug (Something isn't working) label on Jan 13, 2025

dosubot bot added the question (Further information is requested) label on Jan 14, 2025
