-
Notifications
You must be signed in to change notification settings - Fork 563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Jsoup to parse HTML #327
Conversation
Use HTML in error messages.
@jhy Does this use of Jsoup look fine? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea.
The only thing I wish as an end-user of FS: the ability to specify tolerance of the parsing.
In my project, I use valid HTML and want FS to throw an exception if my HTML is invalid.
I think Jsoup can help bring HTML5 support to FS eventually, and generally improve the HTML parsing. |
I agree with Andrei. FS is fundamentally a library meant to be used as part of a larger application. We wouldn't want to force every user to include jsoup as a dependency. A good alternative approach would be to make this a separate optional module (ala flying-saucer-log4j) that users can use if they want, but can otherwise ignore if they're happy with what they're currently doing. |
@@ -0,0 +1,93 @@ | |||
/* | |||
* {{{ header & license | |||
* Flying Saucer - Copyright (c) 2024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyrights should be owned by individual contributors.
|
||
public static HTMLResource load(Reader reader) { | ||
try { | ||
InputStream stream = convertReaderToInputStream(reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can jsoup read from a Reader directly? It would be nice to avoid the conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main Jsoup class has no Reader method here: https://jsoup.org/apidocs/org/jsoup/Jsoup.html
However, there is https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#parseInput(java.io.Reader,java.lang.String)
PDF, and images. | ||
|
||
The new [Jsoup](https://jsoup.org/) based HTML parser will allow supporting HTML 5 syntax | ||
and features in Flying Saucer, with a goal of implementing full HTML5 support over time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enthusiasm is awesome, but I worry this is so far away as to be borderline misleading.
Supporting HTML5 syntax as input (which is already possible without code changes) is one thing. It doesn't make it any easier to implement the vast array of new features that have been added to HTML/CSS. For example, CSS Grid and Flexbox are huge, complicated features in and off themselves. They are simply out of reach as things stand now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address comments and move jsoup support to a separate module.
Yes, I can address these comments and move the jsoup support to a separate module. How about a separate module for htmlunit-neko also, in a similar way. As pointed out by @rbri in #282 (comment) it seems that htmlunit-neko is also a quite capable html parser. So I am thinking it could be useful to support both jsoup and htmlunit-neko parsers, but I'm not sure how at this time. |
Great idea - will try to support this |
Use JSOUP to parse HTML.
https://jsoup.org/
Deprecate XMLResource.
Add HTMLResource.
Jsoup is a HTML parser, so it could work better to parse HTML than the SAX parser currently in use.
I consider this a proposal, which I think is a step in the right direction. However, I am not fully sure that I understand all the consequences of this yet. So I would like to propose this change, maybe it will be accepted. Related: #279 and #282.
Further, Jsoup is a HTML parser, and most users of Flying saucer will be expecting HTML syntax to be valid, not just XHTML. The current parser in Flysing Saucer is a SAX based XML parser which will throw exceptions if there input is not valid XHTML.
And importantly, we need to make sure that this doesn't introduce any XSS or HTML based vulnerabilities.