Add WARC resource containing DOM tree after load #730

Open: wants to merge 4 commits into main


magbb commented Dec 5, 2024

I use Browsertrix to handle JavaScript-heavy pages and extract text from them. While the text extraction in Browsertrix definitely works, I need a fast and simple way to do boilerplate removal (with tools like Trafilatura or jusText) on the DOM tree after rendering, which means I need the HTML tags.

Perhaps there is already another way to do this, but I implemented a "dom" option (similar to "text" and heavily inspired by it) which creates a WARC resource record from the output of DOM.getOuterHTML. That way, I can simply run Trafilatura or jusText on these records while traversing the archive, without having to replay the whole archive in a browser.

Does that make sense?

Thanks for creating browsertrix. It makes crawling so much easier!
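
To make the idea above concrete, here is a minimal sketch of what a WARC resource record holding a rendered DOM could look like, and how a consumer might pick such records out of an archive. It is not the code from this PR. The `urn:dom:` target URI prefix, the function names, and the sample URL are assumptions chosen for illustration, and the reader only handles uncompressed WARC data.

```python
import io
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC 'resource' record: headers, blank line, payload."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLFs after the payload.
    return head + payload + b"\r\n\r\n"


def iter_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is the version line (e.g. WARC/1.1); named headers follow.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical archive: one DOM record and one text record for the same page.
page = "https://example.com/"
dom_html = b"<html><body><article>Rendered text</article></body></html>"
archive = build_resource_record("urn:dom:" + page, dom_html, "text/html")
archive += build_resource_record("urn:text:" + page, b"Rendered text", "text/plain")

# Keep only the DOM records; each payload could then be passed to a
# boilerplate-removal tool such as Trafilatura or jusText.
dom_records = {
    headers["WARC-Target-URI"]: payload
    for headers, payload in iter_records(io.BytesIO(archive))
    if headers["WARC-Target-URI"].startswith("urn:dom:")
}
```

Real archives are usually gzip-compressed per record, so in practice a WARC library would likely be a better choice than the hand-rolled reader shown here.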

ikreymer (Member) commented Feb 1, 2025

Hi, apologies for the late response - I've been meaning to comment, but didn't have a chance. I actually started a similar approach for DOM snapshots on this branch:
https://github.com/webrecorder/browsertrix-crawler/tree/dom-snapshot
I think the approach I was taking was to use the existing DOMSnapshot.captureSnapshot that we use for text, and also use it to serialize to html...
main...dom-snapshot
The reason for this approach was to have more flexibility: for example, we probably want to strip out <script> tags and potentially be able to capture shadow DOM, which https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot should include.
Otherwise, the approach is about the same, I believe!
Do you think the approach I'd started would make sense for your use case?
Perhaps it's worth comparing the results a bit more...
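
As a rough illustration of the script-stripping point above, the sketch below removes <script> elements from already serialized HTML using only the Python standard library. It is not how the dom-snapshot branch works (that branch builds on DOMSnapshot.captureSnapshot); it post-processes an HTML string instead, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit parsed HTML while dropping <script> elements and their content."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit(self, text):
        if not self.in_script:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            # Raw source text of the tag, so attributes are preserved as written.
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_scripts(html: str) -> str:
    """Return `html` with all <script> elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<!DOCTYPE html><html><head><script>var a = 1;</script></head>"
    "<body><p>Hi &amp; bye<br></p></body></html>"
)
cleaned = strip_scripts(sample)
```

Working on serialized HTML like this cannot recover shadow DOM content, which is one reason a snapshot-based approach may be more flexible.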
