When running a query, the user needs to provide a list of file names to run the query over.
An Apache Spark resilient distributed dataset (RDD) is created from these file names. The file names are partitioned and distributed across the processing nodes in the Spark cluster.
Each file is read and parsed into a set of Python objects, an object model, that represents the contents of the file. These objects are used by the Python code that implements each query.
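A minimal sketch of this flow, assuming `pyspark` is available and a `files.txt` listing the file names; `parse_document` and `run_query` are hypothetical stand-ins for the model-specific parsing and query code:

```python
from pyspark import SparkConf, SparkContext

def parse_document(filename):
    """Hypothetical stand-in for model-specific parsing."""
    return {"filename": filename}

def run_query(document):
    """Hypothetical stand-in for a query over one document."""
    return (document["filename"], 0)

context = SparkContext(conf=SparkConf().setAppName("defoe-sketch"))

# The user supplies a list of file names to query over.
with open("files.txt") as f:
    filenames = [line.strip() for line in f if line.strip()]

# The file names are partitioned across the cluster; each file is
# parsed into its object model and the query applied to the result.
documents = context.parallelize(filenames).map(parse_document)
results = documents.map(run_query).collect()
```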
The object model to use is specified by the user, depending upon the files they are querying. At present there are four object models:

- `books`: for British Library Books (BLB)
- `fmp`: for the ALTO-compliant subset of Find My Past newspapers (FMP)
- `papers`: for British Library Newspapers (BLN) or Times Digital Archive (TDA)
- `generic_xml`: for arbitrary XML documents
BLB and the ALTO-compliant subset of FMP both conform to the ALTO format. An abstract object model, `alto`, which is a parent of both the `books` and `fmp` object models, represents the commonality between these models.
The query selected by the user is also object model-specific:

- The `defoe.alto.queries` package contains queries for `books` and `fmp`.
- The `defoe.papers.queries` package contains queries for `papers`.
- The `defoe.generic_xml.queries` package contains queries for `generic_xml`.
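To illustrate the shape of a query, here is a hypothetical sketch; the `do_query` entry point, its signature and the `page_count` attribute are assumptions, not a confirmed defoe API:

```python
# Hypothetical sketch of a query module for the alto-based models;
# all names here are illustrative, not defoe's confirmed API.
def do_query(documents, config_file=None, logger=None):
    """Count the total number of pages across all documents.

    documents is assumed to be an RDD of object-model Document objects.
    """
    return {"total_pages": documents.map(lambda d: d.page_count).sum()}
```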
The `alto` object model is an abstract object model representing ALTO documents.
The list of files to run the query over is expected to be a list of ZIP files.
A ZIP file is expected to contain one or more documents in ALTO format:
- Each document has an associated XML metadata file.
- Each document has one or more associated XML page files, one for each page of the document. Each of these page files is listed in the metadata file.
An `Archive` object reads a ZIP file. It checks the ZIP file contents and, using pattern matching, extracts the file names for metadata files and page files, caching these indexed by document and page IDs.
When asked for a specific document, the `Archive` creates a new `Document` object. This is done for every request; no caching is done.
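A minimal sketch of this indexing step, with placeholder patterns and hypothetical class and attribute names:

```python
import re
import zipfile

class Document:
    """Stub standing in for the object model's Document class."""
    def __init__(self, archive, code):
        self.archive = archive
        self.code = code

class Archive:
    """Sketch of the indexing step; these patterns are placeholders
    that the books and fmp subclasses would supply."""

    METADATA_PATTERN = re.compile(r"([0-9]*)_metadata\.xml")
    PAGE_PATTERN = re.compile(r"ALTO/([0-9]*)_([0-9_]*)\.xml")

    def __init__(self, filename):
        self.zip = zipfile.ZipFile(filename)
        self.metadata = {}  # document ID -> metadata file name
        self.pages = {}     # (document ID, page ID) -> page file name
        for name in self.zip.namelist():
            match = self.METADATA_PATTERN.fullmatch(name)
            if match:
                self.metadata[match.group(1)] = name
                continue
            match = self.PAGE_PATTERN.fullmatch(name)
            if match:
                self.pages[(match.group(1), match.group(2))] = name

    def document(self, document_id):
        # A new Document is created on every request; nothing is cached.
        return Document(self, document_id)
```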
The `Document` object reads a metadata file for a specific document from the archive and extracts information about the document: code, title, publisher, place, years, page codes, and number of pages.
When asked for a specific page, the `Document` creates a new `Page` object. This is done for every request; no caching is done.
The `Page` object reads a page file from the archive and extracts information about the page: code, width, height. The `Page` object also caches the raw XML as a collection of Python `lxml` objects.
When asked for its words, the `Page` retrieves these on demand by running an XPath query over the cached XML objects. This is done for every request.
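For example, word extraction might look like the following sketch, assuming the standard ALTO convention of `String` elements carrying `CONTENT` attributes (the file name is illustrative):

```python
from lxml import etree

# Parse one ALTO page file.
tree = etree.parse("000001143_000001.xml")

# In ALTO, each word is a String element with a CONTENT attribute.
# A namespace-agnostic XPath is used here because the ALTO namespace
# URI varies between versions of the format.
words = tree.xpath("//*[local-name()='String']/@CONTENT")
words = [str(word) for word in words]
```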
For the `books` object model, a ZIP file contains one or more books in ALTO format:
- Each book has an XML metadata file.
- Each book has one or more XML page files, one for each page of the book.
The ZIP archive is assumed to hold the following files and directories:
```
<METADATA_CODE>_metadata.xml
<METADATA_CODE>_metadata.xml
...
ALTO/
    <METADATA_CODE>_<FILE_CODE>.xml
    <METADATA_CODE>_<FILE_CODE>.xml
    ...
    <METADATA_CODE>_<FILE_CODE>.xml
    <METADATA_CODE>_<FILE_CODE>.xml
    ...
```
where:

- `<METADATA_CODE>` is `[0-9]*`
- `<FILE_CODE>` is `[0-9_]*`
For example, `000000874_0_1-22pgs__570785_dat.zip` holds:

```
000001143_metadata.xml
ALTO/
    000001143_000001.xml
    000001143_000002.xml
    000001143_000003.xml
    ...
    000001143_000020.xml
```
An `Archive` object subclassed from `alto.archive.Archive` provides the patterns for the metadata file names and page file names.
All other aspects of this object model are as for `alto`.
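As an illustration, the patterns described above might be expressed as regular expressions along the following lines (a hypothetical sketch, checked against the example file names above; defoe's actual pattern definitions may differ):

```python
import re

# Hypothetical regular expressions for the file-name patterns above;
# a books Archive subclass would supply these to the shared alto logic.
METADATA_PATTERN = re.compile(r"([0-9]*)_metadata\.xml")
PAGE_PATTERN = re.compile(r"ALTO/([0-9]*)_([0-9_]*)\.xml")

assert METADATA_PATTERN.fullmatch("000001143_metadata.xml")
assert PAGE_PATTERN.fullmatch("ALTO/000001143_000001.xml")
```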
BLB ZIP files may be grouped into directories, one per time period, e.g. `1510_1699`, `1700_1799`, ..., `1890_1899`. The object model is agnostic to this.
For the `fmp` object model, a ZIP file contains one or more newspaper issues in ALTO format:
- Each newspaper issue has an XML metadata file.
- Each newspaper issue has one or more XML page files, one for each page of the newspaper.
A ZIP archive corresponds to a document and each file to a page.
The ZIP archive is assumed to hold the following files:
```
<METADATA_CODE>_mets.xml
<METADATA_CODE>_<FILE_CODE>.xml
<METADATA_CODE>_<FILE_CODE>.xml
...
<METADATA_CODE>_mets.xml
<METADATA_CODE>_<FILE_CODE>.xml
<METADATA_CODE>_<FILE_CODE>.xml
...
```
where:

- `<METADATA_CODE>` is `[0-9]*?_[0-9]*?`
- `<FILE_CODE>` is `[0-9_]*`
For example, `0318.zip` holds:

```
0000239_18000318_mets.xml
0000239_18000318_0001.xml
0000239_18000318_0002.xml
0000239_18000318_0003.xml
0000239_18000318_0004.xml
```
An `Archive` object subclassed from `alto.archive.Archive` provides the patterns for the metadata file names and page file names.
All other aspects of this object model are as for `alto`.
For the `papers` object model, the list of files to run the query over is expected to be a list of XML files.
Each XML file is expected to contain one issue of a newspaper:
- Each XML file is compliant with one of:
  - LTO_issue.dtd schema (TDA)
  - ncco_issue.dtd schema (TDA)
  - GALENP.dtd schema (TDA)
  - bl_ncnp_issue_apex.dtd (BLN)
- Each XML file contains an `issue` element with one or more `article` elements, corresponding to articles.
An `Issue` object reads the XML file and extracts information about the issue: ID, date, page count, day of week. The `Issue` object also caches the raw XML as a collection of Python `lxml` objects.
The `Issue` object also creates one `Article` object for each `article` element within the XML file.
The `Issue` object uses conditionals and XPath queries to pull out the ID, issue metadata, and articles. These allow both the TDA and BLN XML schemas to be handled.
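A minimal sketch of this schema-dependent extraction; the XPath expressions are illustrative, not the actual TDA/BLN paths:

```python
from lxml import etree

# Illustrative only: the real TDA/BLN element paths are not shown here.
tree = etree.parse("issue.xml")

issue_id = tree.xpath("string(/issue/id)")
if not issue_id:
    # Fall back to a location used by the other schema.
    issue_id = tree.xpath("string(/GALENP/issue/id)")

# One Article object would be created per article element.
articles = tree.xpath("//article")
```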
An `Article` object extracts from an `article` element information about the article: ID, title, OCR quality, and page IDs. It also extracts any preamble and the words within the article. The `Article` object also caches its XML fragment as a collection of Python `lxml` objects.
The `Article` object uses conditionals and pattern matching to pull out the page IDs. These allow both the TDA and BLN XML schemas to be handled.
British Library Newspapers XML files may be grouped into directories, one per newspaper, e.g. `0000164- The Courier and Argus`, `0000187- The Bath Chronicle`, etc. The code is agnostic to this.
The list of files to run the query over is expected to be a list of XML files.
Each XML file is expected to contain one issue of a newspaper:
- Each XML file is compliant with either:
  - GALENP.dtd schema (TDA)
  - bl_ncnp_issue_apex.dtd (BLN)
- Each XML file contains an `issue` element with one or more `article` elements, corresponding to articles.
An `Articles` object reads the XML file, extracts information about one or more articles within it, and creates one `Article` object for each `result` element within the XML file.
An `Article` object extracts from an `article` element information about the article: title, date, enclosing newspaper name, article type, and content. The `Article` object also caches its XML fragment as a collection of Python `lxml` objects.
For the `generic_xml` object model, the list of files to run the query over is expected to be a list of XML files.
Unlike in the other models, strict XML parsing, using `lxml.etree.XMLParser`, is used. The parser's `recover` option is disabled, meaning no attempt is made to parse broken XML.
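For example, a minimal sketch of such strict parsing (`document.xml` is an illustrative file name):

```python
from lxml import etree

# recover=False makes parsing strict: broken XML raises an
# XMLSyntaxError instead of being silently repaired.
parser = etree.XMLParser(recover=False)
try:
    document_tree = etree.parse("document.xml", parser)
except etree.XMLSyntaxError as error:
    print("Failed to parse:", error)
```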
The limitations of the code and the object model are as follows.
The user needs to specify, as a command-line parameter, the object model they want the data to be parsed into.
There is no testing done to check whether the data is compliant with the object model. defoe takes it on trust that the files it is given can be parsed into the object model specified by the user. If the data and object model are not compliant, unhandled errors may arise.
There is no testing for compliance with any file structure or XML schema.
The query selected by the user is also object model-specific. Again, it is the user's responsibility to ensure that the query is consistent with, and can work with, their chosen object model.
No caching of the RDDs is done. Every time a query is run the files are reread, and RDDs and objects are recreated.
No preprocessing of the text prior to running a query is done.
Normalization of words, to remove any non-alphabetic (`[^a-zA-Z]`) characters from both data and search terms provided by the user, is the responsibility of individual query implementations (a `defoe.query_utils.normalize` helper function is provided for this purpose).
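A minimal sketch of such a helper (the actual `defoe.query_utils.normalize` implementation may differ):

```python
import re

def normalize(word):
    """Remove all non-[a-zA-Z] characters from a word."""
    return re.sub(r"[^a-zA-Z]", "", word)

assert normalize("co-operate!") == "cooperate"
```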
For the `generic_xml` model:
- Namespace extraction is done solely from the document's root element's `nsmap`. If namespaces are defined in sub-elements then the XML would need to be traversed, e.g. by iterating over the tree's elements.
- Schema locations are accessed from the root element as an attribute. An alternative is to run an XPath query, e.g.:
```python
from lxml import etree

document_tree = etree.parse("document.xml")  # illustrative file name

namespaces = {"xsi": "http://www.w3.org/2001/XMLSchema-instance"}

query = "//@xsi:schemaLocation"
result = document_tree.getroot().xpath(query, namespaces=namespaces)
result = [str(r) for r in result]

query = "//@xsi:noNamespaceSchemaLocation"
result = document_tree.getroot().xpath(query, namespaces=namespaces)
result = [str(r) for r in result]
```
There are functions that could be reused in a preprocessing tool that, for example, parses raw XML, strips out stopwords, and stores both metadata and raw text.
Normalization of words could be done when XML files are parsed, rather than by the individual queries.
The `alto`, `books` and `fmp` models could be refactored to parse pages and extract text once, as done in `papers`, rather than on demand. The converse is also possible, with `papers` parsing files and `article` elements on demand.
The `papers` model could be refactored into a generic super-class with sub-classes for BLN and TDA. This would be cleaner, and more in the spirit of object-oriented design, than using conditionals and XPath queries to handle these differences within a single class.
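A minimal sketch of this idea, with illustrative class names and XPath expressions (not defoe's actual code):

```python
from lxml import etree

class GenericIssue:
    """Generic super-class; schema specifics live in subclasses."""
    ID_XPATH = None  # supplied by schema-specific subclasses

    def __init__(self, filename):
        self.tree = etree.parse(filename)
        self.issue_id = self.tree.xpath("string(%s)" % self.ID_XPATH)

class BlnIssue(GenericIssue):
    ID_XPATH = "/issue/id"        # illustrative BLN path

class TdaIssue(GenericIssue):
    ID_XPATH = "/GALENP/issue/id" # illustrative TDA path
```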
The ALTO-compliant FMP subset of the newspapers data is structured differently from the BLN/TDA newspapers data. The FMP subset consists of one XML file per page and one XML file per issue (the metadata file). This issue file also contains information on the articles in the issue; the content of the articles is spread across the XML page files. In contrast, the BLN/TDA data consists of one XML file per issue, with articles held as XML elements within each issue's file. A common super-class for FMP, BLN and TDA newspapers would require more than just different XPath queries. For example, to get article text from BLN/TDA requires pulling content out of the `article` element in a single issue file. In contrast, to get article text from FMP may require getting, from the issue's metadata file, information on the pages, and so the files, that contain the article's text, and then parsing each of these in turn. This may be complicated by the fact that the page files don't seem to have any metadata specifically identifying articles. It may then be necessary to search the page text for the article title as recorded in the metadata file, and then to search for the title of the following article, to know when the end of the current article has been reached.