Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Jsoup to parse HTML #327

Closed
wants to merge 6 commits into from

Conversation

andreasrosdal
Copy link
Contributor

@andreasrosdal andreasrosdal commented May 26, 2024

Use JSOUP to parse HTML.
https://jsoup.org/

Deprecate XMLResource.
Add HTMLResource.

Jsoup is a HTML parser, so it could work better to parse HTML than the SAX parser currently in use.
I consider this a proposal, which I think is a step in the right direction. However, I am not fully sure that I understand all the consequences of this yet. So I would like to propose this change, maybe it will be accepted. Related: #279 and #282.

Further, Jsoup is a HTML parser, and most users of Flying saucer will be expecting HTML syntax to be valid, not just XHTML. The current parser in Flysing Saucer is a SAX based XML parser which will throw exceptions if there input is not valid XHTML.

And importantly, we need to make sure that this doesn't introduce any XSS or HTML based vulnerabilities.

@andreasrosdal
Copy link
Contributor Author

@jhy Does this use of Jsoup look fine?

Copy link
Contributor

@asolntsev asolntsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea.
The only thing I wish as an end-user of FS: the ability to specify tolerance of the parsing.
In my project, I use valid HTML and want FS to throw an exception if my HTML is invalid.

@andreasrosdal
Copy link
Contributor Author

I think Jsoup can help bring HTML5 support to FS eventually, and generally improve the HTML parsing.
Specifying parsing tolerance would be nice. We can try to find out how to do this using Jsoup.

@pbrant
Copy link
Member

pbrant commented May 27, 2024

I agree with Andrei.

FS is fundamentally a library meant to be used as part of a larger application. We wouldn't want to force every user to include jsoup as a dependency.

A good alternative approach would be to make this a separate optional module (ala flying-saucer-log4j) that users can use if they want, but can otherwise ignore if they're happy with what they're currently doing.

@pbrant pbrant self-requested a review May 27, 2024 18:24
@@ -0,0 +1,93 @@
/*
* {{{ header & license
* Flying Saucer - Copyright (c) 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyrights should be owned by individual contributors.


public static HTMLResource load(Reader reader) {
try {
InputStream stream = convertReaderToInputStream(reader);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can jsoup read from a Reader directly? It would be nice to avoid the conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PDF, and images.

The new [Jsoup](https://jsoup.org/) based HTML parser will allow supporting HTML 5 syntax
and features in Flying Saucer, with a goal of implementing full HTML5 support over time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enthusiasm is awesome, but I worry this is so far away as to be borderline misleading.

Supporting HTML5 syntax as input (which is already possible without code changes) is one thing. It doesn't make it any easier to implement the vast array of new features that have been added to HTML/CSS. For example, CSS Grid and Flexbox are huge, complicated features in and off themselves. They are simply out of reach as things stand now.

Copy link
Member

@pbrant pbrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address comments and move jsoup support to a separate module.

@andreasrosdal
Copy link
Contributor Author

Yes, I can address these comments and move the jsoup support to a separate module.

How about a separate module for htmlunit-neko also, in a similar way. As pointed out by @rbri in #282 (comment) it seems that htmlunit-neko is also a quite capable html parser. So I am thinking it could be useful to support both jsoup and htmlunit-neko parsers, but I'm not sure how at this time.

@rbri
Copy link

rbri commented May 28, 2024

How about a separate module for htmlunit-neko also, in a similar way. As pointed out by @rbri in #282 (comment) it seems that htmlunit-neko is also a quite capable html parser. So I am thinking it could be useful to support both jsoup and htmlunit-neko parsers, but I'm not sure how at this time.

Great idea - will try to support this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants