Add WARC resource containing DOM tree after load #730

magbb · 2024-12-05T16:34:33Z

I use Browsertrix to handle JavaScript-heavy pages and extract text from them. While the text extraction in Browsertrix definitely works, I need a fast and simple way to run boilerplate removal (with tools like Trafilatura or jusText) on the DOM tree after rendering, i.e. I need the HTML tags.

Perhaps there is already another way to do this, but I implemented a "dom" option (similar to "text" and heavily inspired by it) which creates a WARC resource record from the output of DOM.getOuterHTML. That way, I can simply run Trafilatura or jusText on these records while traversing the archive, without having to replay the whole archive in a browser.

Does that make sense?

Thanks for creating browsertrix. It makes crawling so much easier!
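
For illustration, the archive traversal described above could look roughly like the sketch below. It is a minimal, stdlib-only reader for uncompressed WARC data, not the code from this PR. The `urn:dom:` URI prefix and the helper names `iter_warc_records` and `iter_dom_html` are assumptions made for the example; the actual record naming used by the "dom" option may differ, and in practice a dedicated WARC library would be a better fit than this hand-rolled parser.

```python
from typing import BinaryIO, Dict, Iterator, Tuple

# Hypothetical URI prefix for the DOM resource records (an assumption,
# not confirmed by the PR).
DOM_PREFIX = "urn:dom:"


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload length is given by the Content-Length header.
        payload = stream.read(int(headers.get("content-length", "0")))
        yield headers, payload


def iter_dom_html(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (page_url, html) for every resource record with the DOM prefix."""
    for headers, payload in iter_warc_records(stream):
        uri = headers.get("warc-target-uri", "")
        if headers.get("warc-type") == "resource" and uri.startswith(DOM_PREFIX):
            yield uri[len(DOM_PREFIX):], payload.decode("utf-8", errors="replace")
```

Each HTML string yielded by `iter_dom_html` could then be passed to an extractor such as Trafilatura or jusText. For a `.warc.gz` file, opening it with `gzip.open(path, "rb")` should give a stream this reader can consume.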

ikreymer · 2025-02-01T06:39:56Z

Hi, apologies for the late response. I've been meaning to comment but didn't have a chance. I actually started a similar approach for DOM snapshots on this branch:
https://github.com/webrecorder/browsertrix-crawler/tree/dom-snapshot
The approach I was taking was to reuse the existing DOMSnapshot.captureSnapshot call that we already use for text, and also use it to serialize to HTML...
main...dom-snapshot
The reason for this approach was to have more flexibility: for example, we probably want to strip out <script> tags and potentially be able to capture shadow DOM, which https://chromedevtools.github.io/devtools-protocol/tot/DOMSnapshot/#method-captureSnapshot should include.
Otherwise, the approach is about the same, I believe!
Do you think the approach I'd started would make sense for your use case?
Perhaps it's worth comparing the results a bit more...
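
To make the script-stripping idea mentioned above concrete, here is a rough sketch that removes `<script>` elements from already-serialized HTML. Note that this swaps in a different technique: it post-processes an HTML string with Python's standard `html.parser` module rather than serializing from the DOMSnapshot.captureSnapshot output, and it does nothing for shadow DOM. The names `ScriptStripper` and `strip_scripts` are hypothetical.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-serialize HTML while dropping <script> elements and their contents."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []         # output fragments, joined at the end
        self.in_script = False  # True while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "script" and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_scripts(html):
    """Return the given HTML with all <script> elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Doing this at capture time, as the snapshot-based approach would, avoids a second parsing pass; a post-processing step like this one is mainly useful for comparing the two outputs.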

magbb and others added 4 commits December 5, 2024 16:47

add a DOM resource type for boilerplate removal

975b760

remove .directory

3c33f0f

Delete src/util/.directory

44bdeeb

more consistent naming

4ea9669
