Use Jsoup to parse HTML #327

andreasrosdal · 2024-05-26T19:33:16Z

Use JSOUP to parse HTML.
https://jsoup.org/

Deprecate XMLResource.
Add HTMLResource.

Jsoup is an HTML parser, so it could parse HTML better than the SAX parser currently in use.
I consider this a proposal, and I think it is a step in the right direction. However, I am not sure that I understand all of its consequences yet, so I am putting it forward for discussion in the hope that it will be accepted. Related: #279 and #282.

Further, most users of Flying Saucer will expect HTML syntax to be accepted, not just XHTML. The current parser in Flying Saucer is a SAX-based XML parser, which throws exceptions if the input is not valid XHTML.
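
To make the strictness concrete, here is a minimal sketch of how an XML parser treats ordinary HTML. It uses the JDK's `DocumentBuilder` as a stand-in and is not Flying Saucer's actual loading code; the class name `StrictXmlParseDemo` and the method `isWellFormed` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Flying Saucer.
class StrictXmlParseDemo {

    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The XML parser gives up at the first well-formedness error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: every element is closed.
        System.out.println(isWellFormed("<html><body><p>Hello<br/></p></body></html>"));
        // HTML style: <br> and <p> are left open, which is legal HTML but not XML.
        System.out.println(isWellFormed("<html><body><p>Hello<br></body></html>"));
    }
}
```

The first call returns `true` and the second returns `false`: the unclosed `<br>` is acceptable HTML but a fatal error for an XML parser.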

And importantly, we need to make sure that this doesn't introduce any XSS or HTML based vulnerabilities.

Use HTML in error messages.

andreasrosdal · 2024-05-26T21:49:04Z

@jhy Does this use of Jsoup look fine?

asolntsev

I like the idea.
The only thing I wish as an end-user of FS: the ability to specify tolerance of the parsing.
In my project, I use valid HTML and want FS to throw an exception if my HTML is invalid.

andreasrosdal · 2024-05-27T08:02:51Z

I think Jsoup can help bring HTML5 support to FS eventually, and generally improve the HTML parsing.
Specifying parsing tolerance would be nice. We can try to find out how to do this using Jsoup.

pbrant · 2024-05-27T18:20:14Z

I agree with Andrei.

FS is fundamentally a library meant to be used as part of a larger application. We wouldn't want to force every user to include jsoup as a dependency.

A good alternative approach would be to make this a separate optional module (à la flying-saucer-log4j) that users can use if they want, but can otherwise ignore if they're happy with what they're currently doing.
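
One possible shape for such a split, as a rough sketch only: the core owns a small parser interface plus a strict default, and an optional module supplies another implementation. Every name here (`MarkupParser`, `StrictXmlParser`, `MarkupLoader`) is hypothetical and does not exist in Flying Saucer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// All names below are hypothetical; none of them exist in Flying Saucer.

/** Extension point owned by the core; it has no third-party dependencies. */
interface MarkupParser {
    Document parse(Reader reader);
}

/** Default that would live in the core module: strict XML parsing via the JDK. */
final class StrictXmlParser implements MarkupParser {
    @Override
    public Document parse(Reader reader) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Core entry point: callers pass a parser, or get the strict default. */
final class MarkupLoader {
    private final MarkupParser parser;

    MarkupLoader() {
        this(new StrictXmlParser());
    }

    MarkupLoader(MarkupParser parser) {
        this.parser = parser;
    }

    Document load(Reader reader) {
        return parser.parse(reader);
    }

    public static void main(String[] args) {
        Document doc = new MarkupLoader().load(new StringReader("<html><body/></html>"));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

With this layout, a jsoup-backed `MarkupParser` could live in its own artifact, so only users who opt in would pull in the jsoup dependency.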

pbrant · 2024-05-27T18:28:50Z

flying-saucer-core/src/main/java/org/xhtmlrenderer/resource/HTMLResource.java

@@ -0,0 +1,93 @@
+/*
+ * {{{ header & license
+ * Flying Saucer - Copyright (c) 2024


Copyrights should be owned by individual contributors.

pbrant · 2024-05-27T18:32:15Z

flying-saucer-core/src/main/java/org/xhtmlrenderer/resource/HTMLResource.java

+
+    public static HTMLResource load(Reader reader) {
+        try {
+            InputStream stream = convertReaderToInputStream(reader);


Can jsoup read from a Reader directly? It would be nice to avoid the conversion.

The main Jsoup class has no Reader method here: https://jsoup.org/apidocs/org/jsoup/Jsoup.html
However, there is https://jsoup.org/apidocs/org/jsoup/parser/Parser.html#parseInput(java.io.Reader,java.lang.String)
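
A minimal sketch of what reading from a `Reader` via `Parser.parseInput(Reader, String)` might look like, followed by the `W3CDom` conversion mentioned in the commits. It assumes jsoup is on the classpath; the class name `JsoupReaderDemo` and the empty base URI are illustrative choices, not what the PR does.

```java
import java.io.Reader;
import java.io.StringReader;
import org.jsoup.helper.W3CDom;
import org.jsoup.parser.Parser;

// Hypothetical demo class; requires the jsoup library on the classpath.
class JsoupReaderDemo {

    /** Parses HTML straight from a Reader and converts the result to a W3C DOM document. */
    static org.w3c.dom.Document parse(Reader reader) {
        // An empty base URI means relative links are left unresolved (an assumption of this sketch).
        org.jsoup.nodes.Document jsoupDoc = Parser.htmlParser().parseInput(reader, "");
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        // Tag-soup input: no <html>, no closing tags.
        org.w3c.dom.Document doc = parse(new StringReader("<p>Hello<br>world"));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

This would avoid the `Reader` to `InputStream` round trip, at the cost of calling the `Parser` class directly instead of the `Jsoup` facade.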

pbrant · 2024-05-27T18:44:30Z

README.md

+PDF, and images. 
+
+The new [Jsoup](https://jsoup.org/) based HTML parser will allow supporting HTML 5 syntax
+and features in Flying Saucer, with a goal of implementing full HTML5 support over time. 


The enthusiasm is awesome, but I worry this is so far away as to be borderline misleading.

Supporting HTML5 syntax as input (which is already possible without code changes) is one thing. It doesn't make it any easier to implement the vast array of new features that have been added to HTML/CSS. For example, CSS Grid and Flexbox are huge, complicated features in and of themselves. They are simply out of reach as things stand now.

pbrant

Please address comments and move jsoup support to a separate module.

andreasrosdal · 2024-05-27T19:41:09Z

Yes, I can address these comments and move the jsoup support to a separate module.

How about a separate module for htmlunit-neko also, in a similar way. As pointed out by @rbri in #282 (comment) it seems that htmlunit-neko is also a quite capable html parser. So I am thinking it could be useful to support both jsoup and htmlunit-neko parsers, but I'm not sure how at this time.

rbri · 2024-05-28T07:16:48Z

> How about a separate module for htmlunit-neko also, in a similar way. As pointed out by @rbri in #282 (comment) it seems that htmlunit-neko is also a quite capable html parser. So I am thinking it could be useful to support both jsoup and htmlunit-neko parsers, but I'm not sure how at this time.

Great idea - will try to support this

andreasrosdal added 2 commits May 26, 2024 21:28

Use JSOUP to parse HTML. Deprecate XMLResource. Add HTMLResource.

8cacef6

Update HTMLResource.java

49dd8de

Use HTML in error messages.

asolntsev approved these changes May 27, 2024

Andreas Rosdal added 3 commits May 27, 2024 09:42

Use W3CDom class for converting from JSoup document.

1895583

Update JavaDoc.

585dedb

Use Jsoup HTML parser.

997b108

Mention new Jsoup parser in readme, with hopes of HTML5 support.

7e26547

andreasrosdal mentioned this pull request May 27, 2024

Support for HTML 5 #282

Open

pbrant self-requested a review May 27, 2024 18:24

pbrant reviewed May 27, 2024

pbrant requested changes May 27, 2024

andreasrosdal closed this May 28, 2024
