Use `docling` package with default PdfConverter #192
Keep `pdfminer` in PdfMinerConverter to allow for comparisons between 2 alternatives
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Issue | Fix Detected |
---|---|
Silent module import failure | ✅ |
Missing Error Handling for Failed Conversion | ✅ |
Poor separation of concerns in document reading logic | ✅ |
Redundant Variable Initialization Masks Errors | ✅ |
Files scanned
File Path | Reviewed |
---|---|
examples/convert_document.py | ✅ |
tapeagents/tools/document_reader.py | ✅ |
tapeagents/tools/converters.py | ✅ |
/korbit-review
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Issue | Fix Detected |
---|---|
Replace print with proper logging | |
Inconsistent Logger Usage | |
Files scanned
File Path | Reviewed |
---|---|
examples/convert_document.py | ✅ |
tapeagents/tools/document_reader.py | ✅ |
tapeagents/tools/converters.py | ✅ |
12e0b08 to 6abb531
High-level question: Docling's landing page states that it can parse XLSX, DOCX, and other formats, not just PDF. Could we replace our entire convert_document.py, borrowed from MS Autogen, with it?
I would propose renaming DocumentReader to LegacyDocumentReader and creating a new DocumentReader(Tool) built entirely around Docling. @danieltremblay, does it make sense to do it in this PR or in a separate one?
@ollmer Sure, that can be done. We could replace the underlying FileConverter class with one that uses Docling for more document types. I'm tempted to do it in a separate PR so others can experiment with the Docling PDF converter in
LGTM! Let's do the full replacement of DocumentReader in the next PR.
- Keep `pdfminer` in PdfMinerConverter to allow for comparisons between the 2 alternatives
- The `docling` library is pinned at 2.15.0 (January 2025) because later versions have a stricter `transformers~=4.42.0` dependency, which conflicts with our optional `finetune` group that requires `transformers~=4.45`
- If this is an issue, we could look into Docling MCP or look into applying conflict-groups (astral-sh/uv#8976)
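To see why the two `transformers` constraints quoted above cannot be satisfied together, here is a minimal sketch of the compatible-release operator (`~=`) from PEP 440. The helper `compatible_release` and the candidate version strings are hypothetical and only illustrate the rule for plain numeric versions; this is not how uv or pip resolve dependencies.

```python
from typing import List, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn a plain numeric version such as '4.42.0' into (4, 42, 0)."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate: str, spec: str) -> bool:
    """Hypothetical helper: does `candidate` satisfy `~=spec` (PEP 440)?

    `~=X.Y.Z` means `>=X.Y.Z` and `==X.Y.*`; `~=X.Y` means `>=X.Y` and `==X.*`.
    Only plain numeric versions are handled (no pre-releases or epochs).
    """
    cand, base = parse(candidate), parse(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = base[:-1]  # everything except the last component must match
    # Pad both versions with zeros so tuples of different length compare fairly.
    width = max(len(cand), len(base))
    cand_padded = cand + (0,) * (width - len(cand))
    base_padded = base + (0,) * (width - len(base))
    return cand_padded >= base_padded and cand_padded[: len(prefix)] == prefix


# The two constraints from the description above.
docling_pin = "4.42.0"   # transformers~=4.42.0 (newer docling)
finetune_pin = "4.45"    # transformers~=4.45 (optional finetune group)

# Illustrative version strings, chosen only to exercise both constraints.
candidates: List[str] = ["4.42.0", "4.42.4", "4.44.2", "4.45.0", "4.46.1"]
satisfies_both = [
    v for v in candidates
    if compatible_release(v, docling_pin) and compatible_release(v, finetune_pin)
]
print(satisfies_both)
```

The first constraint fixes the `4.42` prefix while the second requires at least `4.45`, so no version can satisfy both, which is why pinning `docling` or separating the dependency groups is needed.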
Description by Korbit AI
What change is being made?
Introduce the `docling` package as the default PDF converter in the configuration, replacing the use of `pdfminer` where applicable, and integrate the respective changes across the codebase to support this transition.
Why are these changes being made?
The `docling` package provides enhanced functionality for PDF conversion, including better handling of tables, making it a more suitable default option over `pdfminer`. This change aims to improve document conversion accuracy and efficiency, ensuring users get precise and detailed content extraction results. Additionally, maintaining compatibility with both `docling` and `pdfminer` provides flexibility in handling PDF documents.
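One way to keep both backends selectable is sketched below, under stated assumptions. The registry `PDF_CONVERTERS`, the function names, and the `pdf_converter` parameter are hypothetical and are not taken from this PR. The `docling` and `pdfminer` calls follow those libraries' basic documented usage as I understand it; each backend is imported lazily so that a missing package raises an explicit error rather than failing silently.

```python
from typing import Callable, Dict


def convert_with_docling(path: str) -> str:
    """Convert a PDF to Markdown with docling (imported only when used)."""
    try:
        from docling.document_converter import DocumentConverter
    except ImportError as exc:
        raise ImportError(
            "docling is not installed; install it or select 'pdfminer'"
        ) from exc
    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()


def convert_with_pdfminer(path: str) -> str:
    """Extract plain text from a PDF with pdfminer (imported only when used)."""
    try:
        from pdfminer.high_level import extract_text
    except ImportError as exc:
        raise ImportError(
            "pdfminer.six is not installed; install it or select 'docling'"
        ) from exc
    return extract_text(path)


# Hypothetical registry: configured name -> converter function.
PDF_CONVERTERS: Dict[str, Callable[[str], str]] = {
    "docling": convert_with_docling,
    "pdfminer": convert_with_pdfminer,
}


def convert_pdf(path: str, pdf_converter: str = "docling") -> str:
    """Dispatch to the configured backend, failing loudly on unknown names."""
    try:
        convert = PDF_CONVERTERS[pdf_converter]
    except KeyError:
        raise ValueError(
            f"Unknown PDF converter {pdf_converter!r}; "
            f"expected one of {sorted(PDF_CONVERTERS)}"
        ) from None
    return convert(path)
```

With this shape, comparing the two alternatives on the same file amounts to calling `convert_pdf(path, pdf_converter="docling")` and `convert_pdf(path, pdf_converter="pdfminer")` and inspecting the outputs side by side.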