This repository contains implementations, written in SQL++ for AsterixDB, of the High-Energy Physics (HEP) analysis queries from the IRIS HEP benchmark.
The purpose of this repository is to study the suitability of SQL++ for HEP analyses and to serve as a use case for improving database technologies. Since SQL++ was designed to deal with semi-structured data such as JSON documents, which often have nesting similar to or even deeper than that of HEP data, it seems like a promising candidate for this benchmark.
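To illustrate the fit, events in the restructured data set contain nested arrays of particle objects that SQL++ can traverse directly with `UNNEST`. Below is a minimal sketch; the dataset name matches the one created later in this README, but the `Muon` and `event` field names are assumptions based on the benchmark's schema, not a quote from the repository's queries.

```sql
-- Count the muons per event by unnesting the nested Muon array.
-- Field names are assumptions, not the repository's exact schema.
SELECT ev.event AS event_id, COUNT(mu) AS num_muons
FROM Run2012B_SingleMu_1000_typed_external_parquet AS ev
UNNEST ev.Muon AS mu
GROUP BY ev.event;
```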
- Install Python 3 with pip.
- Install the Python requirements:
  ```bash
  pip3 install -r requirements.txt
  ```
- Install Docker and docker-compose.
- Clone this repository and bring up the services with Docker Compose.
The benchmark defines a data set in the ROOT format, which is not supported by AsterixDB. However, the Rumble implementation of the benchmark provides scripts to convert the data to Parquet, which AsterixDB can load or query in place.
You can run the queries against "external tables" consisting of files on HDFS. A basic HDFS installation is part of the services brought up by docker-compose. Read the instructions in that repository for details. The main steps are as follows:
- Copy `Run2012B_SingleMu-restructured-1000.parquet` from this repository to the `data/` directory of the Docker compose project.
- Upload it to HDFS:
  ```bash
  docker exec -it docker-asterixdb_namenode_1 hadoop fs -mkdir /Run2012B_SingleMu-restructured-1000/
  docker exec -it docker-asterixdb_namenode_1 hadoop fs -put /data/Run2012B_SingleMu-restructured-1000.parquet /Run2012B_SingleMu-restructured-1000/
  ```
- Create an external table with the provided script (other configurations of the script are discussed below; a sketch of the DDL this corresponds to follows the list):
  ```bash
  scripts/create_table.py \
      --asterixdb-server localhost:19002 \
      --external-server hdfs://namenode:8020 \
      --external-path "/Run2012B_SingleMu-restructured-1000/*.parquet" \
      --dataset-name Run2012B_SingleMu_1000_typed_external_parquet \
      --datatype eventType \
      --file-format parquet \
      --storage-location external \
      --log-level INFO
  ```
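For orientation, `--storage-location external` corresponds to DDL roughly like the following. This is a sketch only: the adapter parameters are assumptions that depend on the AsterixDB version, so consult `scripts/create_table.py` for the statements it actually issues.

```sql
-- Rough sketch of the DDL behind the external Parquet dataset; the adapter
-- parameters below are assumptions and may differ from what the script emits.
CREATE EXTERNAL DATASET Run2012B_SingleMu_1000_typed_external_parquet(eventType)
USING hdfs (
    ("hdfs"="hdfs://namenode:8020"),
    ("path"="/Run2012B_SingleMu-restructured-1000"),
    ("input-format"="parquet-input-format")
);
```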
Instead of querying files on HDFS, you can also load the data into AsterixDB's internal storage. To do so, use `--storage-location internal` (and adapt the name of the dataset).
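Conceptually, the internal variant amounts to creating a regular dataset and filling it, for example by copying from an external dataset. The sketch below makes two assumptions: internal datasets need a primary key, modeled here as an autogenerated UUID in a small open wrapper type, and the script itself may proceed differently.

```sql
-- Sketch only: one way to materialize the data internally. The id field and
-- wrapper type are assumptions; create_table.py may issue different DDL.
CREATE TYPE eventWithIdType AS { id: uuid };  -- open type: event fields remain allowed

CREATE DATASET Run2012B_SingleMu_1000_typed_internal_parquet(eventWithIdType)
    PRIMARY KEY id AUTOGENERATED;

INSERT INTO Run2012B_SingleMu_1000_typed_internal_parquet (
    SELECT VALUE e FROM Run2012B_SingleMu_1000_typed_external_parquet AS e
);
```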
You can also query (or load) data in the JSON format (converted with `scripts/parquet2json.py`). To do so, use `--file-format json` (and adapt the name of the dataset and the files).
You can create the tables either without or with a schema (i.e., either with an empty open type or with a closed type declaring all possible attributes). Use `--datatype anyType` or `--datatype eventType`, respectively (and adapt the name of the dataset).
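The two flavors correspond to datatype definitions along these lines. This is a sketch: the closed type is abbreviated, and its field names are assumptions based on the benchmark's schema rather than the repository's exact definition.

```sql
-- Schemaless flavor: an empty open type accepts arbitrary fields.
CREATE TYPE anyType AS { };

-- Typed flavor: a closed type spelling out every attribute
-- (abbreviated here; field names are assumptions).
CREATE TYPE eventType AS CLOSED {
    run: int,
    luminosityBlock: int,
    event: bigint,
    MET: { pt: float, phi: float },
    Muon: [{ pt: float, eta: float, phi: float, mass: float, charge: int }]
    -- ...remaining attributes omitted
};
```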
Queries are run through `test_queries.py`. Run the following command to see its options:
```
$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]
...
custom options:
  -Q QUERY_ID, --query-id=QUERY_ID
                        Folder name of query to run.
  -F FREEZE_RESULT, --freeze-result=FREEZE_RESULT
                        Whether the results of the query should be persisted
                        to disk.
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input file. This
                        influences which reference file should be taken.
  -I INPUT_TABLE, --input-table=INPUT_TABLE
                        Name of input table or view.
  -S ASTERIXDB_SERVER, --asterixdb-server=ASTERIXDB_SERVER
                        URL as <host>:<port> of the AsterixDB REST interface.
  -C ASTERIXDB_DATAVERSE, --asterixdb-dataverse=ASTERIXDB_DATAVERSE
                        Default dataverse to use.
  --plot-histogram      Plot resulting histogram as PNG file.
```
For example, the following command runs queries `6-1` and `6-2` against the table created above:
```bash
./test_queries.py -vs --num-events 1000 \
    --input-table Run2012B_SingleMu_1000_typed_external_parquet \
    --query-id query-6-1 --query-id query-6-2
```
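To give a feel for the queries themselves, here is a sketch in the spirit of query 1 of the benchmark (a histogram of the missing transverse energy). It is not the repository's exact query, and the `MET.pt` field name is an assumption based on the benchmark's schema.

```sql
-- Histogram of MET in 2-GeV-wide bins (sketch; field name assumed).
SELECT bin, COUNT(*) AS cnt
FROM Run2012B_SingleMu_1000_typed_external_parquet AS e
LET bin = FLOOR(e.MET.pt / 2) * 2
GROUP BY bin
ORDER BY bin;
```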