added lxml feature #27001

Closed
wants to merge 4 commits into from
Ahmetyasin
Contributor

PR title: "community: switch GrobidParser to lxml for XML parsing"

Description:
This PR modifies GrobidParser in the langchain_community module so that BeautifulSoup parses Grobid's XML output with the "lxml" parser instead of the "xml" parser. The change is intended to provide the following benefits:

Better handling of malformed XML: In some cases, the documents being parsed may not conform to strict XML standards (e.g., missing or improperly nested tags). The default "xml" parser in BeautifulSoup expects well-formed XML and may fail or return incomplete results when the XML structure is not perfect. lxml, on the other hand, is more forgiving and can handle slightly malformed XML better, making it more suitable for real-world use cases where document formats are not always ideal.

Performance improvements: lxml is known for being faster and more efficient than the default parser when working with large or complex XML documents. This helps improve the overall performance of the GrobidParser when dealing with larger datasets or heavily nested XML structures.

Enhanced feature set: lxml provides more advanced XML processing features, such as better support for namespaces, XPath, and schema validation, which could be useful for future extensions or enhancements of the GrobidParser.

Given that the documents being processed by GrobidParser may not always adhere to ideal XML formatting, switching to lxml ensures more robust parsing and better overall resilience to minor document inconsistencies.
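
For reviewers weighing the points above, here is a minimal sketch for comparing the two parser arguments on a small TEI-like snippet. The helper name `tag_names` and the sample markup are hypothetical and are not taken from grobid.py. One fact worth keeping in mind while reading the output: per the BeautifulSoup documentation, "xml" selects lxml's XML parser, while "lxml" selects lxml's HTML parser, so both options already depend on lxml and the HTML parser may normalize element names.

```python
def tag_names(markup: str, feature: str) -> list[str]:
    """Return element names in document order for the given BeautifulSoup feature.

    `feature` is the parser argument passed to BeautifulSoup, e.g. "xml" or "lxml".
    """
    # Lazy import, so defining this helper does not itself require bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, feature)
    # find_all(True) matches every tag in the parsed tree.
    return [tag.name for tag in soup.find_all(True)]
```

Calling `tag_names(sample, "xml")` and `tag_names(sample, "lxml")` on the same Grobid output makes any differences in element names or added wrapper elements visible before the parser argument is changed, which matters if downstream code looks up mixed-case TEI tags by name.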

Modified file:

langchain_community/document_loaders/parsers/grobid.py

Dependency:
The lxml library must be installed. If it’s not installed, it can be added via:

pip install lxml
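
A quick way to confirm the dependency is present before running the parser is sketched below. The function name `missing_parser_dependencies` is a hypothetical helper for illustration, not something that exists in langchain_community; it only uses the standard library.

```python
import importlib.util


def missing_parser_dependencies(packages=("bs4", "lxml")) -> list[str]:
    """Return the names of the given top-level packages that cannot be found."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in packages if importlib.util.find_spec(name) is None]
```

An empty list means both packages can be imported; otherwise the returned names are the ones to install with pip.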

Lint and test:

Code has been formatted with make format.
Linting was performed with make lint.
All tests were run using make test, with no issues encountered.


@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Sep 30, 2024
@Ahmetyasin
Contributor Author

Hi @eyurtsev,

Could you take a look at this?

@langcarl

langcarl bot commented Nov 4, 2024

Thank you for the PR. This PR is marked Needs Support and has not yet received the 5 upvotes required by maintainers for review. It has been open for at least 25 days. Per the LangChain review process, this PR will be closed in 5 days if it does not reach the required threshold.

The Needs Support status is intended to prioritize review time on features that have demonstrated community support. If you feel this status was assigned in error or need more time to gather the required upvotes, please ping (at)ccurme and (at)efriis.

@langcarl langcarl bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 4, 2024
@langcarl langcarl bot closed this Nov 11, 2024