Specify data to query

Spark needs to know the data which you want to run a query over. You need to create a plain-text file with a list of the paths to the data files to query.

British Library Books dataset

To run queries over this dataset, the file needs a list of the paths to the zip files corresponding to the books over which the query is to be run.

For example, suppose you want to run a query over two books:

1510_1699/000001143_0_1-20pgs__560409_dat.zip
1510_1699/000000874_0_1-22pgs__570785_dat.zip

If running on your local machine, your file would be as follows, where <BLB> is the path to the directory in your home directory, where you have the data:

<BLB>/1510_1699/000001143_0_1-20pgs__560409_dat.zip
<BLB>/1510_1699/000000874_0_1-22pgs__570785_dat.zip

If running on Urika, your file would be as follows:

/mnt/lustre/<project>/<project>/<username>/BLB/1510_1699/000001143_0_1-20pgs__560409_dat.zip
/mnt/lustre/<project>/<project>/<username>/BLB/1510_1699/000000874_0_1-22pgs__570785_dat.zip

You can write these files by hand.

Alternatively, you can use the bash find command, to create a list of all available files:

On a local machine:

find <BLB> -name "*.zip" | sort > data.txt

On Urika:

find /mnt/lustre/<project>/<project>/<username>/BLB -name "*.zip" | sort > data.txt

You can then filter this file to select subsets of data e.g. to select books in the folder 1510_1699 you can run:

grep 1510_1699 data.txt > data_1510_1699.txt

British Library Newspapers dataset

To run queries over this dataset, the file needs a list of the paths to the XML files corresponding to the newspapers over which the query is to be run.

For example, suppose you want to run a query over two newspapers:

0000164- The Courier and Argus/0000164_19070603.xml
0000164- The Courier and Argus/0000164_19151123.xml

If running on your local machine, your file would be as follows, where <BLN> is the path to the directory in your home directory, where you have the data:

<BLN>/Part 1/0000164- The Courier and Argus/1907/0000164_19070603/0000164_19070603.xml
<BLN>/Part 1/0000164- The Courier and Argus/1915/0000164_19151123/0000164_19151123.xml

If running on Urika, your file would be as follows:

/mnt/lustre/<project>/<project>/<username>/BLN/Part 1/0000164- The Courier and Argus/1907/0000164_19070603/0000164_19070603.xml
/mnt/lustre/<project>/<project>/<username>/BLN/Part 1/0000164- The Courier and Argus/1915/0000164_19151123/0000164_19151123.xml

You can write these files by hand.

Alternatively, you can use the bash find command, to create a list of all available files:

On a local machine:

find <BLN> -name "*.xml" | sort > data.txt

On Urika:

find /mnt/lustre/<project>/<project>/<username>/BLN -name "*.xml" | sort > data.txt

You can then filter this file to select subsets of data e.g. to select papers in the folder 0000164- The Courier and Argus/, you can run:

grep "0000164-\ The\ Courier\ and\ Argus" data.txt > data.courier.txt

Times Digital Archive dataset

To run queries over this dataset, the file needs a list of the paths to the XML files corresponding to the newspapers over which the query is to be run.

For example, suppose you want to run a query over two newspapers:

0FFO-1785-APR01.xml
0FFO-1785-APR02.xml

If running on your local machine, your file would be as follows, where <TDA> is the path to the directory in your home directory, where you have the data:

<TDA>/1785/17850401/0FFO-1785-APR01.xml
<TDA>/1785/17850402/0FFO-1785-APR02.xml

If running on Urika, your file would be as follows:

/mnt/lustre/<project>/<project>/<username>/TDA/1785/17850401/0FFO-1785-APR01.xml
/mnt/lustre/<project>/<project>/<username>/TDA/1785/17850402/0FFO-1785-APR02.xml

You can write these files by hand.

Alternatively, you can use the bash find command, to create a list of all available files:

On a local machine:

find <TDA> -name "*.xml" | sort > data.txt

On Urika:

find /mnt/lustre/<project>/<project>/<username>/TDA -name "*.xml" | sort > data.txt

Papers Past New Zealand and Pacific newspapers dataset

To run queries over this dataset, the file needs a list of the paths to the XML files corresponding to the newspapers over which the query is to be run.

For example, suppose you want to run a query over two documents:

1000.xml
1001.xml

If running on your local machine, your file would be as follows, where <NZPP> is the path to the directory in your home directory, where you have the data:

<NZPP>/1000.xml
<NZPP>/1001.xml

If running on Urika, your file would be as follows:

/mnt/lustre/<project>/<project>/<username>/NZPP/1000.xml
/mnt/lustre/<project>/<project>/<username>/NZPP/1001.xml

You can write these files by hand.

Alternatively, you can use the bash find command, to create a list of all available files:

On a local machine:

find <NZPP> -name "*.xml" | sort > data.txt

On Urika:

find /mnt/lustre/<project>/<project>/<username>/NZPP -name "*.xml" | sort > data.txt

NLS dataset

When you want to work with a NLS dataset in defoe, you will need to have a data file with just the list of directories (with the full path) to query. See an example of a data file for this dataset (using the Encyclopaedia Britannica) bellow:

XXXX/nls-data-encyclopaediaBritannica/193108323
XXXX/nls-data-encyclopaediaBritannica/193696080
XXX/nls-data-encyclopaediaBritannica/144850378
...

Each directory of those directories has to have inside a METS XML file, and alto directory, with an ALTO XML file per page.

> cd XXXX/nls-data-encyclopaediaBritannica/193108323
> ls
alto/
193108323-mets.xml
> ls alto
193201290.34.xml  193202343.34.xml ...

So, a data file for a NLS dataset (e.g. Chapbooks, Scottish gazetteers, Encyclopaedia Britannica, etc) do not require a list of XML files to query. Instead, it just require a list of directories to query.

Arbitrary XML documents

To run queries over arbitrary XML documents, the file needs a list of the paths to the XML files.

This can be created as for the British Library Newspapers dataset above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

specify-data-to-query.md

specify-data-to-query.md

Specify data to query

British Library Books dataset

British Library Newspapers dataset

Times Digital Archive dataset

Papers Past New Zealand and Pacific newspapers dataset

NLS dataset

Arbitrary XML documents

Files

specify-data-to-query.md

Latest commit

History

specify-data-to-query.md

File metadata and controls

Specify data to query

British Library Books dataset

British Library Newspapers dataset

Times Digital Archive dataset

Papers Past New Zealand and Pacific newspapers dataset

NLS dataset

Arbitrary XML documents