How skip HTML validation while generating PDF? #228

Open
asolntsev opened this issue Nov 21, 2023 · 2 comments

@asolntsev
Contributor

Reported by Ezhil

Team, I am trying to create a PDF from a page URL, but I am getting the following error:

Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 14; Open quote is expected for attribute "name" associated with an element type "meta".

It looks like renderer.setDocument(urlcheck) checks whether the page at the URL has proper start and end HTML tags. Is there any way we can skip this validation?

try {
  // Define the URL
  String urlcheck = "https://en.wikipedia.org/wiki/IPhone_15";

  // Establish a URL connection
  HttpURLConnection connection = (HttpURLConnection) new URL(urlcheck).openConnection();
  connection.setRequestMethod("GET");

  // Check the response code (200 indicates success)
  int responseCode = connection.getResponseCode();
  if (responseCode == 200) {
    // Get the input stream from the connection
    InputStream urlInputStream = connection.getInputStream();

    // Create an ITextRenderer instance
    ITextRenderer renderer = new ITextRenderer();

    // Set the HTML content as the document
    renderer.setDocument(urlcheck);

    // Render to PDF
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    renderer.layout();
    renderer.createPDF(outputStream);
    renderer.finishPDF();
  }
} catch (Exception e) {
  // The SAXParseException quoted above surfaces here
  e.printStackTrace();
}

@asolntsev
Contributor Author

Answer from Peter Brand:

I'm afraid that won't work in general. Flying Saucer (FS) is a fairly complete static implementation of CSS 2.1. It does not support JavaScript or the many features added to CSS and HTML since then.

To limit the number of external dependencies, FS only supports XML input out of the box, but it provides the facilities to use your own parser, as long as that parser's output is a W3C DOM Document (org.w3c.dom.Document).
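
To illustrate why XML-only input rejects a typical web page, here is a minimal sketch that uses only the JDK's built-in XML parser and does not call Flying Saucer at all. The class name StrictXmlDemo and the helper parseStrict are hypothetical names chosen for this illustration. The sketch shows that an unquoted attribute value, which is legal in HTML, makes a strict XML parser fail in the way the report describes, while the same markup written as well-formed XHTML yields an org.w3c.dom.Document, the kind of value the answer above refers to.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: JDK-only, no Flying Saucer dependency.
final class StrictXmlDemo {

    /**
     * Parses markup with the JDK's strict XML parser.
     * Returns the W3C Document on success, or null if the markup
     * is not well-formed XML.
     */
    static Document parseStrict(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            // SAXParseException (as in the report) is a subclass of SAXException.
            System.out.println("Rejected: " + e.getMessage());
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unquoted attribute value: fine for browsers, not well-formed XML.
        String html =
            "<html><head><meta name=viewport></head><body><p>Hi</p></body></html>";
        // The same content written as well-formed XHTML.
        String xhtml =
            "<html><head><meta name=\"viewport\"/></head><body><p>Hi</p></body></html>";

        System.out.println(parseStrict(html) == null); // prints "true"

        Document doc = parseStrict(xhtml);
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"
    }
}
```

A lenient HTML parser plugged in front of the renderer would do the equivalent of turning the first string into the second before handing over the resulting Document.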

@pbrant
Member

pbrant commented Mar 19, 2024

See also the JSoup example provided in #299. A similar technique would work with https://github.com/HtmlUnit/htmlunit-neko or the validator.nu HTML5 parser.

I'm afraid the first paragraph above still applies though. Sites that use JavaScript, CSS Flexbox, CSS Grid, etc. will still be pretty broken. There is no easy fix there.
