A short note about parsing HTML and XML documents using the W3C’s DOM (Document Object Model), as implemented in Java. Learning to parse with the DOM is worthwhile because the DOM is a widely implemented standard: once you know how it works, you can parse HTML (and XML) documents in JavaScript, Python, .NET, …
This should successfully parse any document that can be transformed to a DOM according to the W3C standard, including valid HTML documents in XML syntax. It uses the bootstrapping approach (described in the DOM Level 3 Core specification) and the LS feature (described in the DOM Level 3 Load and Save specification).
Document doc;
Path input = Path.of("input.xhtml");
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSParser builder = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
doc = builder.parseURI(input.toUri().toString());
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());
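The same bootstrapping approach can also read content that is already in memory, by handing the parser an LSInput instead of a URI. The following is a minimal sketch under that assumption; the class name LsStringParsing and the sample markup are made up for illustration.

```java
import java.io.StringReader;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper, not part of the original note.
final class LsStringParsing {
    /** Parses XML held in a string, using the DOM Level 3 LS feature. */
    static Document parse(String xml) {
        DOMImplementationLS impl;
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM LS implementation available", e);
        }
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        // The LSInput wraps the character data to parse.
        LSInput input = impl.createLSInput();
        input.setCharacterStream(new StringReader(xml));
        return parser.parse(input);
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body/></html>";
        Document doc = parse(xhtml);
        System.out.println("Main tag name: " + doc.getDocumentElement().getTagName());
    }
}
```

Malformed input makes the parser throw an unchecked LSException, so callers that deal with untrusted content may want to catch it.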
This approach uses the JAXP DocumentBuilder API (which builds on SAX) rather than the standard DOM bootstrapping approach. I recommend using the previous one instead where applicable.
Document doc;
Path input = Path.of("input.xhtml");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
doc = builder.parse(input.toUri().toString());
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());
(To manipulate general XML, add factory.setNamespaceAware(true); before creating the builder.)
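To see what namespace awareness changes in practice, the sketch below parses the same markup twice and reports the namespace URI of the root element. The class name NamespaceAwareParsing and the sample markup are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original note.
final class NamespaceAwareParsing {
    /** Returns the namespace URI of the root element, or null if the parser did not record one. */
    static String rootNamespace(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return root.getNamespaceURI();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body/></html>";
        System.out.println("Namespace-aware: " + rootNamespace(xhtml, true));
        System.out.println("Not namespace-aware: " + rootNamespace(xhtml, false));
    }
}
```

With a factory that is not namespace-aware, the xmlns declaration is kept as an ordinary attribute and the element carries no namespace information, which is why methods such as getElementsByTagNameNS would find nothing.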
HTML documents in the wild are seldom valid. You may use the jsoup library for these cases.
Document doc;
File inputFile = new File("input.html");
org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(inputFile, StandardCharsets.UTF_8.name());
doc = new W3CDom().fromJsoup(jsoupDoc);
Element docE = doc.getDocumentElement();
LOGGER.info("Main tag name: {}.", docE.getTagName());
Here is how to programmatically write an HTML document (in XML syntax, with namespaces), assuming a private static final String XHTML_NAME_SPACE = "http://www.w3.org/1999/xhtml"; field.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();
Element html = document.createElementNS(XHTML_NAME_SPACE, "html");
html.setAttribute("lang", "en");
document.appendChild(html);
Element head = document.createElementNS(XHTML_NAME_SPACE, "head");
html.appendChild(head);
Element meta = document.createElementNS(XHTML_NAME_SPACE, "meta");
meta.setAttribute("http-equiv", "Content-type");
meta.setAttribute("content", "text/html; charset=utf-8");
head.appendChild(meta);
Element body = document.createElementNS(XHTML_NAME_SPACE, "body");
html.appendChild(body);
// And so on.
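As an illustration of how the construction might continue, here is a self-contained sketch that adds a title and a paragraph with text content, then reads the paragraph back through the namespace-aware lookup. The class name XhtmlBuilder, the method buildPage and the sample texts are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the original note.
final class XhtmlBuilder {
    static final String XHTML_NAME_SPACE = "http://www.w3.org/1999/xhtml";

    /** Builds a minimal XHTML document with a title and a single paragraph. */
    static Document buildPage(String titleText, String paragraphText) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a document builder", e);
        }
        Element html = document.createElementNS(XHTML_NAME_SPACE, "html");
        document.appendChild(html);

        Element head = document.createElementNS(XHTML_NAME_SPACE, "head");
        html.appendChild(head);
        Element title = document.createElementNS(XHTML_NAME_SPACE, "title");
        title.setTextContent(titleText);
        head.appendChild(title);

        Element body = document.createElementNS(XHTML_NAME_SPACE, "body");
        html.appendChild(body);
        Element p = document.createElementNS(XHTML_NAME_SPACE, "p");
        p.setTextContent(paragraphText);
        body.appendChild(p);
        return document;
    }

    public static void main(String[] args) {
        Document document = buildPage("Hello", "First paragraph.");
        // Look the paragraph up again by namespace and local name.
        String text = document.getElementsByTagNameNS(XHTML_NAME_SPACE, "p").item(0).getTextContent();
        System.out.println(text); // prints "First paragraph."
    }
}
```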
And here is how to print your document to a string.
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer ser = impl.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", true);
// Do not use ser.writeToString: it always declares the output as UTF-16.
LSOutput output = impl.createLSOutput();
StringWriter writer = new StringWriter();
output.setCharacterStream(writer);
ser.write(document, output);
return writer.toString();
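When the document is meant for a file or a network stream rather than a string, one option is to give the LSOutput a byte stream and an explicit encoding. This is a sketch under that assumption; the class name Utf8Serialization and the sample document are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper, not part of the original note.
final class Utf8Serialization {
    /** Serializes the document to bytes, requesting UTF-8 explicitly. */
    static byte[] toUtf8Bytes(Document document) {
        DOMImplementationLS impl;
        try {
            impl = (DOMImplementationLS) DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM LS implementation available", e);
        }
        LSSerializer ser = impl.createLSSerializer();
        LSOutput output = impl.createLSOutput();
        output.setEncoding("UTF-8");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);
        ser.write(document, output);
        return bytes.toByteArray();
    }

    /** Builds a tiny sample document to serialize. */
    static Document sampleDocument() {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            document.appendChild(document.createElementNS("http://www.w3.org/1999/xhtml", "html"));
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a document builder", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes(sampleDocument());
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The same LSOutput could wrap a file output stream instead of the in-memory buffer used here.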
Validating the XML on the fly while writing it with an LSSerializer seems unsupported. An alternative approach is to use the Java Validation API, as illustrated below.
Schema schema = SchemaFactory.newDefaultInstance().newSchema(getClass().getResource("xhtml1-strict.xsd"));
DOMSource docAsSource = new DOMSource();
docAsSource.setNode(document);
schema.newValidator().validate(docAsSource);
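For a version that runs without an external schema file, the sketch below validates two small documents against a tiny inline XSD and reports the outcome as a boolean. The schema, the element names note and to, and the class name DomValidation are invented for illustration; the documents are parsed with a namespace-aware factory, which schema validation of a DOM generally expects.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original note.
final class DomValidation {
    // A made-up schema: a <note> element containing exactly one <to> element.
    private static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"note\"><xs:complexType><xs:sequence>"
            + "<xs:element name=\"to\" type=\"xs:string\"/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";

    /** Parses the XML into a DOM, then validates that DOM against the inline schema. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newDefaultInstance()
                    .newSchema(new StreamSource(new StringReader(XSD)));
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            try {
                schema.newValidator().validate(new DOMSource(document));
                return true;
            } catch (SAXException validationError) {
                // Without a custom ErrorHandler, validation errors are thrown.
                return false;
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not load the schema or parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<note><to>Alice</to></note>"));
        System.out.println(isValid("<note><from>Bob</from></note>"));
    }
}
```

To collect every problem instead of stopping at the first, a custom ErrorHandler can be set on the validator before calling validate.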
Apart from the jsoup example, the above code samples run without any added dependencies. In particular, do not include xml-apis:xml-apis among your dependencies. The java.xml module of the JRE probably provides everything you might need from xml-apis, and conflicts will occur if both are reachable from your project. You might have to explicitly exclude xml-apis:xml-apis from the transitive dependencies of your dependencies.
If possible, do not include xerces:xercesImpl either, for a similar reason: the JRE also includes the Xerces2 code as part of the java.xml module. Note however that there’s a subtlety: the Xerces2 implementation included with the JRE is private, and its packages have been renamed; for example, the package org.apache.xerces becomes com.sun.org.apache.xerces.internal inside the JRE. Thus, one reason for including xercesImpl is when one of your dependencies (incorrectly, probably) itself wants access to the Xerces implementation classes (e.g., tries to explicitly load org.apache.xerces.dom.DocumentImpl, as for example org.apache.odftoolkit:simple-odf:0.8.2-incubating does). This will (hopefully) not create problems, provided you exclude xml-apis:xml-apis from xercesImpl’s transitive dependencies, as indicated above.
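A quick way to check which implementation your project actually picks up is to print the class names of the factories at run time. This is only a diagnostic sketch; the class name XmlImplementationCheck is made up, and the package named in the comment assumes a standard JDK with no external xercesImpl on the class path.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.validation.SchemaFactory;

// Hypothetical diagnostic, not part of the original note.
final class XmlImplementationCheck {
    /** Returns the class name of the DocumentBuilderFactory implementation in use. */
    static String documentBuilderFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Returns the class name of the default SchemaFactory implementation. */
    static String schemaFactoryClass() {
        return SchemaFactory.newDefaultInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // On a plain JDK, both names are expected to sit under
        // com.sun.org.apache.xerces.internal; a name under org.apache.xerces
        // suggests that an external xercesImpl is being picked up.
        System.out.println(documentBuilderFactoryClass());
        System.out.println(schemaFactoryClass());
    }
}
```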
- Parsing from DOM and related technologies in Java: see the JAXP tutorials here (focus on the parts related to the DOM) and there.
- (JAXP (JSR 206) has been withdrawn as a standalone project following its inclusion in OpenJDK 7, see section 11.5 in the specification PDF of JAXP 1.6 (Maintenance Release 3, 4 March 2014).)
- dom4j, a well-written library for simpler DOM manipulation
- Comparison of HTML parsers (Wikipedia)
- W3C DOM4 (Recommendation 19 November 2015), a snapshot of the DOM Living Standard
- XOM, seems to be high-quality