Parsing of Elsevier XML documents

Haoyan Huo edited this page Jun 12, 2019 · 1 revision

Overview

Elsevier is a major publisher and the source of a significant number of articles in the synthesis database. Most Elsevier articles are downloaded from an API endpoint that returns XML files. This document describes the background, design patterns, and practical implementation considerations of the Elsevier XML parser in LimeSoup.

Table of contents

  1. Overview of XML format.
  2. Analysis of Elsevier XML definitions.
  3. Implementation details of the ElsevierSoup.

Overview of XML format

Namespaces

XML and HTML are similar markup languages. They share common features, such as using tags to enclose information and using key-value pairs to designate tag attributes. However, XML differs from HTML in two important ways:

  1. XML is designed to carry information, while the main purpose of HTML is to display webpages.
  2. XML is extensible. The XML standard itself predefines no tag vocabulary; the allowed tags and their arrangement are usually declared in a document type definition (DTD) that the document includes or references.

Many organizations have defined their own XML standards, and these may share tag names. For example, Elsevier defines that one or more <section> tags can be enclosed by a <sections> tag, which might not be the case for the American Chemical Society (ACS). Thus, it is important for each organization to define its own namespace and write its tags as, for example, <elsevier:section> and <elsevier:sections>. The following is an example of an Elsevier XML document:

<full-text-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/article/dtd" xmlns:ce="http://www.elsevier.com/xml/common/dtd">
	<ce:abstract-sec>
		<ce:simple-para id="simple-para0005" view="all">We explore the influence of electrolyte concentration on the adsorption of charged spheres using modeling techniques based on random sequential adsorption (RSA). We present a parametric study of the effects of double layer interactions between the charged particles and between the particle and the substrate on the jamming limit using a two-dimensional RSA simulation similar to that of Z. Adamczyk<ce:italic>et al.</ce:italic>(1990,<ce:italic>J. Colloid Interface Sci.</ce:italic>
			<ce:bold>140,</ce:bold>123) along with a simple method of estimating jamming limit coverages. In addition, we present a more realistic RSA algorithm that includes explicit energetic interaction in three dimensions, that is, particle–particle and particle–surface interactions during the approach of a particle to the substrate. The calculation of interaction energies in the 3-D RSA model is achieved with the aid of a three-body superposition approximation. The 3-D RSA model differs from the 2-D model in that the extent of coverage is controlled by kinetic rather than energetic considerations. Results of both models capture the experimentally observed trend of increased surface coverage with increased electrolyte concentration, and both models require the value of a key model parameter to be specified for a quantitative match to experimental data. However, the 3-D model more effectively captures the governing physics, and the parameter in this case takes on more meaningful values than for the 2-D model.</ce:simple-para>
	</ce:abstract-sec>
</full-text-retrieval-response>

Here, the attribute xmlns="http://www.elsevier.com/xml/svapi/article/dtd" declares a default namespace, so <full-text-retrieval-response> and any other unprefixed tag inside it belong to that namespace. A second namespace with the prefix ce is declared by xmlns:ce="http://www.elsevier.com/xml/common/dtd". Every tag inside <full-text-retrieval-response> that has the prefix ce: belongs to the ce namespace. As with variable namespaces in Python, it is important to know which namespace a tag is in. We do not cover the topic further here; you are encouraged to read online materials to better understand XML namespaces.
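
As a minimal sketch of how namespaces show up when parsing, the snippet below reads a shortened copy of the document above with Python's standard xml.etree.ElementTree module. The abstract text is abbreviated, and the prefix svapi in the NS mapping is a name chosen only for this example; what has to match the document is the namespace URI, not the prefix.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document (abstract text abbreviated for this sketch).
XML = """<full-text-retrieval-response
    xmlns="http://www.elsevier.com/xml/svapi/article/dtd"
    xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:abstract-sec>
    <ce:simple-para id="simple-para0005" view="all">We explore the influence of electrolyte concentration.</ce:simple-para>
  </ce:abstract-sec>
</full-text-retrieval-response>"""

# Prefixes are local to this script ("svapi" is an arbitrary choice);
# only the URIs have to match the ones declared in the document.
NS = {
    "svapi": "http://www.elsevier.com/xml/svapi/article/dtd",
    "ce": "http://www.elsevier.com/xml/common/dtd",
}

root = ET.fromstring(XML)

# ElementTree expands every tag name to "{namespace-uri}local-name".
print(root.tag)

# Prefixed paths are resolved through the NS mapping.
para = root.find("ce:abstract-sec/ce:simple-para", NS)
print(para.tag)
print(para.get("id"))
```

The unprefixed root tag still carries a namespace URI (the one declared with xmlns=), which is why a search for the bare name full-text-retrieval-response would not match it.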

Document type definition (DTD)

Think about designing a database that stores team member information. You want to constrain what can be stored, and the data type of each field needs to be defined beforehand. Usually a schema is created for this purpose, so that every record created in the database can be validated against it.

Because XML is designed to carry information, people can also define schemas for XML. One kind of schema, which defines the organization of an XML document, is called a "document type definition" (DTD). Let's look at a synthetic DTD:

<!DOCTYPE member
[
<!ELEMENT member (name,dob,id)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT dob (#PCDATA)>
<!ELEMENT id (#PCDATA)>
]>

The above DTD is interpreted as follows:

  1. The root element of the XML document is <member>.
  2. A <member> tag must contain exactly three children, in this order: <name>, <dob>, and <id>.
  3. A <name> tag contains only text, of type #PCDATA (parsed character data), and no child elements. The same holds for <dob> and <id>.

According to the above DTD, a valid XML document can then be created:

<member><name>Bob</name><dob>Jan 01, 1990</dob><id>12345</id></member>
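
To make the three rules concrete, here is a small hand-written check for this one content model, using only the Python standard library. It is a sketch, not a real DTD validator: the function name is_valid_member is made up for this example, the rules are hard-coded rather than read from the DTD, and stray text between the children is not checked. A validating XML parser would derive the same checks from the DTD itself.

```python
import xml.etree.ElementTree as ET

# Content model from the DTD above: <member> holds exactly these children, in this order.
MEMBER_CHILDREN = ["name", "dob", "id"]

def is_valid_member(xml_text):
    """Hand-written check mirroring the three rules above (not a real DTD validator)."""
    root = ET.fromstring(xml_text)
    # Rule 1: the root element must be <member>.
    if root.tag != "member":
        return False
    # Rule 2: exactly <name>, <dob>, <id>, in that order.
    children = list(root)
    if [child.tag for child in children] != MEMBER_CHILDREN:
        return False
    # Rule 3: (#PCDATA) allows text only, so no child may contain elements.
    return all(len(child) == 0 for child in children)

# The valid document from the text.
print(is_valid_member("<member><name>Bob</name><dob>Jan 01, 1990</dob><id>12345</id></member>"))
# Children in the wrong order.
print(is_valid_member("<member><dob>Jan 01, 1990</dob><name>Bob</name><id>12345</id></member>"))
# A required child is missing.
print(is_valid_member("<member><name>Bob</name><id>12345</id></member>"))
```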

Whitespaces in XML documents

XML documents are meant to be readable by humans. Thus, extra spaces, tabs, and newlines can be inserted to make a document more "beautiful". For example, the above example is usually rewritten as:

<member>
	<name>Bob</name>
	<dob>Jan 01, 1990</dob>
	<id>12345</id>
</member>

However, read literally, the added whitespace and newlines seem to violate the DTD, because the content of <member> now looks something like ("\n\t",name,"\n\t",dob,"\n\t",id,"\n") rather than the declared (name,dob,id).

To solve this problem, we need to understand how the XML standard treats whitespace. XML defines whitespace as four characters: carriage return "\r", newline "\n", tab "\t", and space " ". As a Python regular expression, a run of whitespace is [\r\n\t ]+. XML processors classify any run of these four characters as "whitespace".

Whitespace in XML is one of two types: significant or insignificant. As the names suggest, significant whitespace is actual data, whereas insignificant whitespace serves solely to improve readability and should be ignored.

By default, the W3C XML specification requires XML processors to pass all whitespace through to the application; in other words, all whitespace is treated as significant unless something says otherwise. This is demonstrated by the following Python snippet:

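from bs4 import BeautifulSoup

# The whitespace between the tags is kept as text nodes.
# Note: 'lxml' selects lxml's HTML parser; pass 'xml' to use its XML parser.
doc = BeautifulSoup("""<tag>
	<tag1>Demo</tag1>
</tag>""", 'lxml')
print(list(doc.tag.children))
# Prints ['\n\t', <tag1>Demo</tag1>, '\n']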
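
The same behavior can be seen with the standard library alone. The sketch below parses the same small document with xml.etree.ElementTree and then uses the whitespace regular expression given earlier to tell whitespace-only strings apart from real text. The helper is_whitespace_only is a name invented for this example, and treating every whitespace-only string as insignificant is an assumption of the sketch; whether it is safe depends on the DTD.

```python
import re
import xml.etree.ElementTree as ET

# The four XML whitespace characters: carriage return, newline, tab, space.
XML_WHITESPACE = re.compile(r"[\r\n\t ]+")

def is_whitespace_only(text):
    """True if text is None, empty, or made up only of XML whitespace characters."""
    return text is None or text == "" or XML_WHITESPACE.fullmatch(text) is not None

root = ET.fromstring("<tag>\n\t<tag1>Demo</tag1>\n</tag>")
child = root[0]

# ElementTree keeps the whitespace: text before the first child is root.text,
# and text that follows a child is stored in that child's .tail.
print(repr(root.text))   # '\n\t'
print(repr(child.tail))  # '\n'

# Whitespace-only strings are candidates for "insignificant"; real text is not.
print(is_whitespace_only(root.text))   # True
print(is_whitespace_only(child.text))  # False
```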

Elsevier XML DTD

Elsevier has published a series of articles and books on the XML DTDs used for its publications at https://www.elsevier.com/authors/author-schemas/elsevier-xml-dtds-and-transport-schemas. The main DTD we need to analyze is the so-called "ja" (journal article) DTD. One part of the ja DTD is the "Elsevier Common Element Pool", which consists mainly of the <ce:*> tags seen in Elsevier XML documents.

The latest file for the ja-common-element DTD can be found at https://www.elsevier.com/__data/assets/text_file/0006/275667/ja5_common150_ent.txt. This file contains many declarations that define how a valid Elsevier journal article XML document is structured. For example, the following lines:

<!-- keywords -->

<!ELEMENT   ce:keywords         ( ce:section-title?, ce:keyword+ )>
<!ATTLIST   ce:keywords
                %common-link.att;
                %common-view.att;
                class           CDATA               "keyword"
                xml:lang        %iso639;            #IMPLIED>
<!ELEMENT   ce:keyword          ( ce:text, ce:keyword* )>
<!ATTLIST   ce:keyword
                %common-link.att; >

tell us that the element <ce:keywords> contains an optional <ce:section-title> tag followed by one or more <ce:keyword> tags, and that the attributes of <ce:keywords> are those defined by the parameter entities %common-link.att; and %common-view.att;, plus class (default value "keyword") and the optional xml:lang. The element <ce:keyword> always contains a <ce:text> element holding the text of the keyword, followed by zero or more nested <ce:keyword> elements. This information is also documented in the book The Elsevier DTD 5 Family of XML DTDs, pages 365-366.
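
As an illustration of reading code off a content model, the following sketch walks a <ce:keywords> element with the standard library. The sample XML is invented to fit the declarations above, and the function names read_keywords and read_keyword are hypothetical; they are not taken from LimeSoup.

```python
import xml.etree.ElementTree as ET

NS = {"ce": "http://www.elsevier.com/xml/common/dtd"}

# Hypothetical sample that follows the content model above.
XML = """<ce:keywords xmlns:ce="http://www.elsevier.com/xml/common/dtd" class="keyword">
  <ce:section-title>Keywords</ce:section-title>
  <ce:keyword><ce:text>Adsorption</ce:text>
    <ce:keyword><ce:text>Random sequential adsorption</ce:text></ce:keyword>
  </ce:keyword>
  <ce:keyword><ce:text>Colloids</ce:text></ce:keyword>
</ce:keywords>"""

def read_keyword(node):
    """ce:keyword = ( ce:text, ce:keyword* ): one text, then zero or more nested keywords."""
    text = node.find("ce:text", NS)
    return {
        "text": "".join(text.itertext()).strip(),
        # findall with a bare child name only looks at direct children.
        "children": [read_keyword(k) for k in node.findall("ce:keyword", NS)],
    }

def read_keywords(node):
    """ce:keywords = ( ce:section-title?, ce:keyword+ )."""
    title = node.find("ce:section-title", NS)  # optional, so this may be None
    keywords = [read_keyword(k) for k in node.findall("ce:keyword", NS)]
    if not keywords:
        raise ValueError("ce:keywords requires at least one ce:keyword")
    return {
        "title": None if title is None else "".join(title.itertext()).strip(),
        "keywords": keywords,
    }

result = read_keywords(ET.fromstring(XML))
print(result["title"])
print([k["text"] for k in result["keywords"]])
print(result["keywords"][0]["children"][0]["text"])
```

The recursion in read_keyword mirrors the recursion in the DTD: a keyword may contain further keywords, each with its own text.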

Just like a Python module, the Elsevier DTD can import definitions from outside resources. For example, Elsevier adopts the Mathematical Markup Language (MathML) DTDs developed by the W3C. The ja-common-element DTD imports these external DTDs via declarations such as the following:

<!ENTITY % mathml-dtd
    PUBLIC "-//W3C//DTD MathML 2.0 Mod ES//EN"
    "mathml2-mod-ES.dtd">
%mathml-dtd;

This declaration instructs an XML processor to open the external DTD file mathml2-mod-ES.dtd and load the definitions in it. As users of the Elsevier DTDs, we must therefore ensure that these resources are available. Luckily, Elsevier provides complete zip files in which these external DTDs are already present. They can also be downloaded from https://www.elsevier.com/authors/author-schemas/elsevier-xml-dtds-and-transport-schemas; for example, the JA DTD 5.5.0 and CEP 1.5.0 complete zip file contains everything we need for Elsevier journal article XML files.

Implementing LimeSoup using Elsevier DTDs

So far, we have obtained the Elsevier DTD for journal article common elements. A parser for Elsevier XML files should conform to these definitions and behave accordingly. According to the DTD, the paragraphs of an article are arranged inside the element <ce:sections>:

<!ELEMENT   ce:sections         ( %parsec; )>
<!ENTITY    % parsec            "( ce:para | ce:section )+" >

<!ELEMENT   ce:para             ( %par.data; )* >
<!ELEMENT   ce:section          ( ( ( ce:section-title | ( ce:label, ce:section-title? ) ),
                                  %parsec; ) | ce:section+ )>
...

Two types of objects appear here: XML elements and XML entities. LimeSoup implements a parsing function for each: functions named extract_<tag name> for XML elements, such as extract_ce_sections, and functions named process_<entity name> for XML entities, such as process_par_data. Each of these functions strictly follows the rules defined in the DTD. As a result, the implementation of ElsevierSoup needs only a few calls to two of these functions: extract_ce_para and extract_ce_section.
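
The general shape of such DTD-driven functions can be sketched as follows. This is not LimeSoup's actual code: the function names reuse the naming pattern described above, but their bodies, the returned dictionaries, the helper text_of, and the sample XML are all assumptions made for illustration. In particular, %par.data; is reduced here to plain text extraction, and ce:label is ignored.

```python
import xml.etree.ElementTree as ET

CE = "{http://www.elsevier.com/xml/common/dtd}"

def text_of(node):
    """Flatten an element to plain text; a simplified stand-in for handling %par.data;."""
    return " ".join("".join(node.itertext()).split())

def extract_ce_para(node):
    return {"type": "paragraph", "text": text_of(node)}

def process_parsec(nodes):
    """%parsec; = ( ce:para | ce:section )+ : dispatch each child on its tag."""
    out = []
    for node in nodes:
        if node.tag == CE + "para":
            out.append(extract_ce_para(node))
        elif node.tag == CE + "section":
            out.append(extract_ce_section(node))
    return out

def extract_ce_section(node):
    """ce:section = optional label/title followed by %parsec;, or nested ce:section+."""
    title = node.find(CE + "section-title")
    body = [c for c in node if c.tag in (CE + "para", CE + "section")]
    return {
        "type": "section",
        "title": None if title is None else text_of(title),
        "content": process_parsec(body),
    }

def extract_ce_sections(node):
    """ce:sections = ( %parsec; )."""
    return process_parsec(list(node))

# Hypothetical input that follows the declarations above.
XML = """<ce:sections xmlns:ce="http://www.elsevier.com/xml/common/dtd">
  <ce:para>Opening paragraph.</ce:para>
  <ce:section>
    <ce:section-title>Methods</ce:section-title>
    <ce:para>We used <ce:italic>RSA</ce:italic> simulations.</ce:para>
  </ce:section>
</ce:sections>"""

parsed = extract_ce_sections(ET.fromstring(XML))
print(parsed[0]["text"])
print(parsed[1]["title"])
print(parsed[1]["content"][0]["text"])
```

Each function corresponds to one declaration, so a change in the DTD maps to a change in exactly one function, and the mutual recursion between process_parsec and extract_ce_section handles sections nested to any depth.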