A query can be submitted to Spark as follows.
ZIP up the source code (Spark needs this to run the query):
zip -r defoe.zip defoe
Submit the source code to Spark along with information about your query:
spark-submit --py-files defoe.zip defoe/run_query.py <DATA_FILE> <MODEL_NAME> <QUERY_NAME> <QUERY_CONFIG_FILE> [-r <RESULTS_FILE>] [-e <ERRORS_FILE>] [-n <NUM_CORES>]
where:
- <DATA_FILE> is a file that lists either URLs or file paths which are the files over which the query is to be run, one per line. Either URLs or file paths should be used exclusively, not both. (An illustrative data file is sketched after this list.)
- <MODEL_NAME> specifies which text model is to be used, one of:
  - books: British Library Books
  - papers: British Library Newspapers
  - fmp: Find My Past Newspapers
  - nzpp: Papers Past New Zealand and Pacific newspapers
  - generic_xml: arbitrary XML documents
  - nls: National Library of Scotland digital collections
  - nlsArticles: for extracting automatically the articles (at page level) from the Encyclopaedia Britannica
  - For example, books tells the code that the data files listed in data.txt are books and so should be parsed into a books data model.
- <QUERY_NAME> is the name of a Python module implementing the query to run, for example defoe.alto.queries.find_words_group_by_word or defoe.papers.queries.articles_containing_words. The query must be compatible with the chosen model.
- <QUERY_CONFIG_FILE> is a query-specific configuration file. This is optional and depends on the query implementation.
- <RESULTS_FILE> is the query results file, to hold the query results in YAML format. If omitted the default is results.yml.
- <ERRORS_FILE> is the errors file, to hold information on any errors in YAML format. If omitted the default is errors.yml.
- <NUM_CORES> is the number of computer processor cores requested for the job. If omitted the default is 1.
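For illustration, a data file is just plain text with one path (or URL) per line. The sketch below creates a hypothetical data.txt; the paths are placeholders, so substitute the locations of your own files:

# Create a data file listing the files to query, one per line
# (these paths are hypothetical examples)
cat > data.txt <<EOF
/mnt/lustre/<project>/<project>/<username>/data/book_1.zip
/mnt/lustre/<project>/<project>/<username>/data/book_2.zip
EOF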
Note for Urika users
- It is recommended that the value 144 be used for <NUM_CORES>. This, with the number of cores per node, determines the number of workers/executors and nodes. As Urika has 36 cores per node, this would request 144/36 = 4 workers/executors and nodes.
- This is required as spark-runner --total-executor-cores seems to be ignored.
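Putting this together, a submission on Urika might look like the following; this simply combines the usage line above (using its -n option to request 144 cores) with the books example described in the next section, so adjust the data file, model and query to your own:

spark-submit --py-files defoe.zip defoe/run_query.py data.txt books defoe.alto.queries.find_words_group_by_year queries/hearts.txt -n 144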
For example, to submit a query to search a set of books for occurrences of some words (e.g. "heart" or "hearts") and return the counts of these occurrences grouped by year, you could run:
spark-submit --py-files defoe.zip defoe/run_query.py data.txt books defoe.alto.queries.find_words_group_by_year queries/hearts.txt
where:
- data.txt is the file with the paths to the books files to run the query over. Examples of this can be found under the others directory.
- defoe.alto.queries.find_words_group_by_year is the module that runs the query.
- queries/hearts.txt is a configuration file for the query which contains a list of the words, one per line, to search for (a sketch of such a file follows this list).
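The configuration file itself is just a word list, one word per line. A minimal sketch of queries/hearts.txt for this example (the exact words are up to you):

# Create a word list for the query, one word per line
cat > queries/hearts.txt <<EOF
heart
hearts
EOF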
For example, to submit a query to search a set of newspapers for occurrences of gender-specific words (e.g. "she", "he" etc.) and return the counts of these occurrences grouped by year, you could run:
spark-submit --py-files defoe.zip defoe/run_query.py ~/data/papers.2.txt papers defoe.papers.queries.articles_containing_words queries/gender.txt
To create a file with the file paths (data.txt), see how to specify data to defoe queries.
where:
- data.txt is the file with the paths to the papers files to run the query over. Examples of this can be found under the others directory.
- defoe.papers.queries.articles_containing_words is the module that runs the query.
- queries/gender.txt is a configuration file for the query which contains a list of the words, one per line, to search for.
If successful, the results will be written into a new file (by default called results.yml) in the current directory.
Note for Cirrus/HPC cluster users
You will need to have a Spark cluster job running; you can then submit defoe queries from within that job or from another job. You will very likely have to install Spark in your user account. For an example, see the defoe + Cirrus documentation.
Note for Cloud/VM cluster users
You will need to install Spark, along with the other tools that defoe requires. For an example, see the following documentation.
To submit a job to Spark as a background process, meaning you can do other things while Spark is running your query, use nohup and capture the output from Spark in a log.txt file. For example:
nohup spark-submit --py-files defoe.zip defoe/run_query.py <DATA_FILE> <MODEL_NAME> <QUERY_NAME> <QUERY_CONFIG_FILE> [-r <RESULTS_FILE>] [-n <NUM_CORES>] > log.txt &
You can expect to see at least one python and one java process:
ps
PID TTY TIME CMD
...
92250 pts/1 00:00:02 java
92368 pts/1 00:00:00 python
...
Caution: If you see <RESULTS_FILE> appear, do not assume that the query has completed, and do not prematurely copy or otherwise try to use that file. If there are many query results then it may take a while for these to be written to the file after it is opened. Check that the background job has completed before using <RESULTS_FILE>.
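One way to confirm that the background job has finished is to check the shell's job list, or simply to wait for it. A minimal sketch, run in the same shell from which you started nohup:

# List background jobs; the spark-submit job shows as Done once it has finished
jobs
# Or block until all background jobs started from this shell have completed
wait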
If any problems arise in reading data files or converting these into objects before running queries, then an attempt will be made to capture these errors and record them in the errors file (default name errors.yml). If present, this file provides a list of the problematic files and the errors that arose. For example:
- [/mnt/lustre/<project>/<project>/<username>/data/book.zip, File is not a zip file]
- [/mnt/lustre/<project>/<project>/<username>/data/sample-book.zip, '[Errno 2] No such file or directory: ''sample-book.zip''']
A quick-and-dirty way to get the Spark application ID, if you have used nohup and captured the output in log.txt, is to run:
grep Framework\ registered log.txt
For example:
I0125 09:07:58.364142 188697 sched.cpp:743] Framework registered with 6646eaa2-999d-4c87-a657-d4109b4f120b-0691
A quick-and-dirty way to check the number of executors used, if you have used nohup and captured the output in log.txt, is to run:
grep Exec log.txt | wc -l
If, when running spark-submit locally, you get:
bash: spark-submit: command not found...
then add Apache Spark to your PATH, e.g.:
export PATH=~/spark-2.4.0-bin-hadoop2.7/bin:$PATH
If you get an error like:
IOError: [Errno 2] No such file or directory: ''
or:
IOError: Error reading file '': failed to load external entity ""
then check for blank lines in your data file and, if there are any, remove them.
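One way to strip blank lines is a sed one-liner; this is a sketch assuming GNU sed and a data file called data.txt:

# Delete empty (or whitespace-only) lines from the data file in place
sed -i '/^[[:space:]]*$/d' data.txt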
If you run:
head results.yml
and see:
{}
then check that the files the query is being run over exist.
If the files are on the local file system then also check their permissions. This error can arise if, for example, a data file has permissions like:
ls -l /mnt/lustre/<project>/<project>/<user>/blpaper/0000164_19010101.xml
---------- 1 <user> at01 3374189 May 31 13:57
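If the permissions are the problem, making the file readable should resolve it, for example:

# Give the owner read/write access and everyone else read access
chmod 644 /mnt/lustre/<project>/<project>/<user>/blpaper/0000164_19010101.xml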
If you get an exception
raise BadZipfile, "File is not a zip file"
then check that the files the query is being run over exist and are valid ZIP files.