Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(community): Adds an HTML loader for URLS #7184

Merged
merged 5 commits into from
Nov 17, 2024

Conversation

philnash
Copy link
Contributor

I recently discovered the same issue that was reported in #2467. The documentation and examples suggest chaining the CheerioWebBaseLoader with MozillaReadabilityTransformer to load and transform HTML documents. However, the CheerioWebBaseLoader uses Cheerio's text method to extract the text from the HTML document, or provided selector. This is great if you want the text, but the MozillaReadabilityTransformer needs to act on HTML.

I have added an HTMLWebBaseLoader that simply uses fetch to get an HTML document and returns the full HTML content.

I've also updated the MozillaReadabilityTransformer example to use the HTMLWebBaseLoader instead of the CheerioWebBaseLoader.

Fixes #2467

I'm @philnash if you're still doing shout outs :)

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Nov 11, 2024
Copy link

vercel bot commented Nov 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Nov 17, 2024 0:32am
1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchainjs-api-refs ⬜️ Ignored (Inspect) Nov 17, 2024 0:32am

@dosubot dosubot bot added the auto:improvement Medium size change to existing code to handle new use-cases label Nov 11, 2024
@philnash philnash force-pushed the philnash-html_loader branch 2 times, most recently from ce78d96 to 4572f1d Compare November 12, 2024 00:01
@jacoblee93 jacoblee93 changed the title Adds an HTML loader for URLS feat(community): Adds an HTML loader for URLS Nov 12, 2024
jacoblee93
jacoblee93 previously approved these changes Nov 12, 2024
Copy link
Collaborator

@jacoblee93 jacoblee93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Looks great.

@dosubot dosubot bot added the lgtm PRs that are ready to be merged as-is label Nov 12, 2024
@jacoblee93
Copy link
Collaborator

Sorry, noticed one small thing. Can you also run yarn format?

@jacoblee93 jacoblee93 added close PRs that need one or two touch-ups to be ready and removed lgtm PRs that are ready to be merged as-is labels Nov 12, 2024
@philnash
Copy link
Contributor Author

@jacoblee93 I moved the WebBaseLoader and WebBaseLoaderParams to the HTML loader as they were relevant and it seems that loaders export their own types. I also added the selector property back to WebBaseLoaderParams and deprecated it from there.

I then imported and re-exported the WebBaseLoaderParams from the Cheerio loader with deprecation, so that it doesn't break for anyone that was importing it from here.

Hope that's all ok!

I recently discovered the same issue that was reported in langchain-ai#2467. The documentation and examples suggest chaining the `CheerioWebBaseLoader` with `MozillaReadabilityTransformer` to load and transform HTML documents. However, the `CheerioWebBaseLoader` uses [Cheerio's `text` method to extract the text](https://github.com/langchain-ai/langchainjs/blob/05e5813715150cd69d9e384924818562e3b7c1fa/libs/langchain-community/src/document_loaders/web/cheerio.ts#L144C34-L144C40) from the HTML document, or provided selector. This is great if you want the text, but the MozillaReadabilityTransformer needs to act on HTML.

I have added an `HTMLWebBaseLoader` that simply uses `fetch` to get an HTML document and returns the full HTML content.

I've also updated the `MozillaReadabilityTransformer` example to use the `HTMLWebBaseLoader` instead of the `CheerioWebBaseLoader`.
I moved the WebBaseLoader and WebBaseLoaderParams to the HTML loader
as they were relevant and it seems that loaders export their own types.
I also added the `selector` property back to WebBaseLoaderParams and
deprecated it from there. I then imported and re-exported the
WebBaseLoaderParams from the Cheerio loader with deprecation, so that
it doesn't break for anyone that was importing it from here.
@philnash
Copy link
Contributor Author

Force pushed to fix the merge conflict.

@philnash
Copy link
Contributor Author

@jacoblee93 can we trigger the jobs again to see if this is looking good? Thanks!

Copy link
Collaborator

@jacoblee93 jacoblee93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Sorry for the delay.

@jacoblee93 jacoblee93 merged commit 54decfe into langchain-ai:main Nov 17, 2024
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:improvement Medium size change to existing code to handle new use-cases close PRs that need one or two touch-ups to be ready lgtm PRs that are ready to be merged as-is size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible mistake in the documentation | Document Transformers | Mozilla Readability
2 participants