Use docling package with default PdfConverter #192

Merged
merged 2 commits into main on Mar 6, 2025

Conversation

danieltremblay
Collaborator

@danieltremblay danieltremblay commented Mar 5, 2025

Keep pdfminer in PdfMinerConverter to allow comparisons between the two alternatives.

The docling library is pinned at 2.15.0 (January 2025) because later versions have a stricter transformers~=4.42.0 dependency, which conflicts with our optional finetune group that requires transformers~=4.45.

If this is an issue, we could look into Docling MCP or look into applying conflict-groups (astral-sh/uv#8976)
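
To illustrate why the two constraints cannot be satisfied together, here is a minimal sketch of the PEP 440 compatible-release operator (`~=`) in plain Python. The helper names are hypothetical and the logic is simplified (final releases only, no pre-release or epoch handling); a real resolver such as uv or pip does considerably more.

```python
from itertools import zip_longest


def compatible_release_range(version: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return (lower, upper) bounds for `~=version`; upper is exclusive.

    Simplified reading of PEP 440: `~=X.Y.Z` means `>=X.Y.Z, ==X.Y.*`.
    """
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) < 2:
        raise ValueError("~= needs at least two release segments")
    prefix = parts[:-1]
    # Bump the last segment of the prefix: 4.42.0 -> upper bound 4.43
    upper = prefix[:-1] + (prefix[-1] + 1,)
    return parts, upper


def _less(a: tuple[int, ...], b: tuple[int, ...]) -> bool:
    """Compare release tuples after zero-padding, so 4.42 equals 4.42.0."""
    pairs = list(zip_longest(a, b, fillvalue=0))
    return tuple(x for x, _ in pairs) < tuple(y for _, y in pairs)


def ranges_overlap(spec_a: str, spec_b: str) -> bool:
    """True if some version could satisfy both `~=spec_a` and `~=spec_b`."""
    lo_a, hi_a = compatible_release_range(spec_a)
    lo_b, hi_b = compatible_release_range(spec_b)
    # Two half-open intervals overlap when each starts before the other ends.
    return _less(lo_a, hi_b) and _less(lo_b, hi_a)


if __name__ == "__main__":
    # ~=4.42.0 allows 4.42.x only; ~=4.45 allows 4.45 up to (not including) 5.0
    print(ranges_overlap("4.42.0", "4.45"))
```

Under this reading, `~=4.42.0` admits only 4.42.x releases while `~=4.45` needs at least 4.45, so no single transformers version fits both groups in one resolution.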

Description by Korbit AI

What change is being made?

Introduce the docling package as the default PDF converter in the configuration, replacing the use of pdfminer where applicable, and integrate the respective changes across the codebase to support this transition.

Why are these changes being made?

The docling package provides enhanced functionality for PDF conversion, including better handling of tables, making it a more suitable default option over pdfminer. This change aims to improve document conversion accuracy and efficiency, ensuring users get precise and detailed content extraction results. Additionally, maintaining compatibility with both docling and pdfminer provides flexibility in handling PDF documents.
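
As a rough illustration of "docling by default, pdfminer still available", the sketch below registers both backends behind one function. It is not the code from this PR: `PDF_BACKENDS`, `register`, and `convert_pdf` are hypothetical names, and it assumes docling's `DocumentConverter` and pdfminer.six's `extract_text` are installed when the corresponding backend is actually called.

```python
from typing import Callable, Dict

# Registry of PDF-to-text backends keyed by a short name (hypothetical layout).
PDF_BACKENDS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a conversion function to the registry."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PDF_BACKENDS[name] = fn
        return fn
    return decorator


@register("docling")
def convert_with_docling(path: str) -> str:
    # Imported lazily so this module loads even when docling is not installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()


@register("pdfminer")
def convert_with_pdfminer(path: str) -> str:
    # pdfminer.six returns plain text without table structure.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def convert_pdf(path: str, backend: str = "docling") -> str:
    """Convert a PDF with the chosen backend; docling is the default."""
    try:
        convert = PDF_BACKENDS[backend]
    except KeyError:
        known = ", ".join(sorted(PDF_BACKENDS))
        raise ValueError(f"unknown PDF backend {backend!r}; choose from: {known}") from None
    return convert(path)
```

Calling `convert_pdf(path, backend="pdfminer")` on the same file then gives a second output to compare against the default.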



@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
| Category | Issue | Fix Detected |
|---|---|---|
| Error Handling | Silent module import failure | |
| Error Handling | Missing error handling for failed conversion | |
| Design | Poor separation of concerns in document reading logic | |
| Error Handling | Redundant variable initialization masks errors | |
Files scanned
File Path Reviewed
examples/convert_document.py
tapeagents/tools/document_reader.py
tapeagents/tools/converters.py
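
The two error-handling findings above follow a common pattern, sketched here in generic form. This is not the code Korbit reviewed: `ConverterUnavailableError`, `ConversionError`, `require_module`, and `convert_or_raise` are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType
from typing import Callable


class ConverterUnavailableError(ImportError):
    """Raised when an optional converter dependency is missing."""


class ConversionError(RuntimeError):
    """Raised when a converter fails or returns nothing usable."""


def require_module(name: str) -> ModuleType:
    """Import an optional dependency, failing loudly instead of silently."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ConverterUnavailableError(
            f"optional dependency {name!r} is not installed"
        ) from exc


def convert_or_raise(convert: Callable[[str], str], path: str) -> str:
    """Run a converter and surface failures rather than returning empty text."""
    try:
        text = convert(path)
    except Exception as exc:
        # Keep the original exception as the cause for debugging.
        raise ConversionError(f"failed to convert {path!r}") from exc
    if not text or not text.strip():
        raise ConversionError(f"converter returned no text for {path!r}")
    return text
```

Raising a dedicated exception keeps the caller from mistaking a missing package or a failed conversion for an empty document.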


@danieltremblay
Collaborator Author

/korbit-review


@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
| Category | Issue | Fix Detected |
|---|---|---|
| Logging | Replace print with proper logging | |
| Logging | Inconsistent logger usage | |
Files scanned
File Path Reviewed
examples/convert_document.py
tapeagents/tools/document_reader.py
tapeagents/tools/converters.py
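
For the two logging findings, a minimal sketch of the usual remedy: one module-level logger obtained with `logging.getLogger(__name__)` and used everywhere in place of `print`. The function `report_conversion` and its message format are hypothetical, not taken from the reviewed files.

```python
import logging

# One logger per module, named after the module, so output can be filtered
# and configured centrally by the application.
logger = logging.getLogger(__name__)


def report_conversion(path: str, n_chars: int) -> None:
    """Log a conversion result instead of printing it."""
    # Lazy %-style arguments avoid formatting when the level is disabled.
    logger.info("Converted %s (%d characters)", path, n_chars)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_conversion("example.pdf", 1234)
```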


@ollmer
Collaborator

ollmer commented Mar 5, 2025

High-level question: Docling landing page states that it's capable of parsing XLSX, DOCX, and others, not just PDF. Could we replace our entire convert_document.py borrowed from MS Autogen with it?

@ollmer
Collaborator

ollmer commented Mar 5, 2025

I would propose renaming DocumentReader to LegacyDocumentReader and creating a new DocumentReader(Tool) built entirely around Docling. @danieltremblay, does it make sense to do it in this PR or in a separate one?

@danieltremblay
Collaborator Author

> I would propose renaming DocumentReader to LegacyDocumentReader and creating a new DocumentReader(Tool) built entirely around Docling. @danieltremblay, does it make sense to do it in this PR or in a separate one?

@ollmer Sure, that can be done. We could replace the underlying FileConverter class with one that uses Docling for more document types. I'm tempted to do it in a separate PR so others can experiment with the Docling PDF converter in main first. I noticed that some LLMs perform better with pdfminer's output than with docling's output for certain table formats.
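
One possible way to inspect such differences, sketched with the standard library only: count Markdown-style table rows in each output and produce a unified diff. The helper names are hypothetical, and the row heuristic is an assumption (it also counts the `|---|` separator row and knows nothing about pdfminer's plain-text tables).

```python
import difflib


def count_table_rows(text: str) -> int:
    """Count lines that look like Markdown table rows, separators included."""
    rows = 0
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            rows += 1
    return rows


def compare_outputs(pdfminer_text: str, docling_text: str) -> str:
    """Return a unified diff between the two converters' outputs."""
    diff = difflib.unified_diff(
        pdfminer_text.splitlines(),
        docling_text.splitlines(),
        fromfile="pdfminer",
        tofile="docling",
        lineterm="",
    )
    return "\n".join(diff)
```

Feeding both strings to the same LLM prompt and reading the diff alongside the answers may help narrow down which table formats cause the gap.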

Collaborator

@ollmer ollmer left a comment


LGTM! Let's do the full replacement of the DocumentReader in the next PR.

@danieltremblay danieltremblay merged commit 07b660f into main Mar 6, 2025
5 checks passed
2 participants