# Adaptive Partitioning for Very Large RDF Data
State-of-the-art distributed RDF systems partition data across multiple compute nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics that aim to minimize inter-node communication during query evaluation. The latter approach requires an expensive data pre-processing phase, leading to high startup costs for very large RDF knowledge bases. A priori knowledge of the query workload has also been used to create partitions; these, however, are static and do not adapt to workload changes. As a result, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning.
We propose AdHash, a distributed RDF system that addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning to the initial data, distributing triples by hashing on their subjects; this keeps its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of this partitioning to (i) support fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among the workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems come online, and (iii) gracefully adapts to the query load, evaluating queries on billion-scale RDF data in sub-second time.
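To illustrate the idea behind the initial subject-hash partitioning, here is a toy shell sketch; this is only an illustration, not AdHash's actual (compiled, far more efficient) partitioner. It assumes a hypothetical input file data.nt with one triple per line, subject first, and splits it across 4 workers:

NUM_WORKERS=4
while IFS= read -r triple; do
  subject=${triple%% *}   # first whitespace-delimited token is the subject
  h=$(( $(printf '%s' "$subject" | cksum | cut -d' ' -f1) % NUM_WORKERS ))
  printf '%s\n' "$triple" >> "worker_${h}.nt"   # triples with the same subject land on the same worker
done < data.nt

Because all triples sharing a subject end up on the same worker, star join patterns on subjects can be evaluated fully in parallel without inter-node communication.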
For more details, visit http://cloud.kaust.edu.sa/Pages/adhash.aspx
AdHash is released under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
If you use AdHash in your research, please cite our paper:
@article{DBLP:journals/pvldb/HarbiAKM15,
  author    = {Razen Harbi and
               Ibrahim Abdelaziz and
               Panos Kalnis and
               Nikos Mamoulis},
  title     = {Evaluating {SPARQL} Queries on Massive {RDF} Datasets},
  journal   = {{PVLDB}},
  volume    = {8},
  number    = {12},
  pages     = {1848--1859},
  year      = {2015},
  url       = {http://www.vldb.org/pvldb/vol8/p1848-harbi.pdf},
  timestamp = {Fri, 14 Aug 2015 15:24:11 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/HarbiAKM15},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}
AdHash was tested on 64-bit Ubuntu Linux 12.04 (precise), 14.04 (trusty), and Linux Mint 17.1 (Rebecca).
## Dependencies
The following dependencies must be satisfied in order to compile and run AdHash:
- build-essential and g++ (>= 4.4.7) [Required]
  - Needed for compiling AdHash.
- openssh-server [Required]
  - Required to initialize MPI and establish connections among compute nodes.
- MPICH2 [Required]
  - AdHash uses MPI for inter-node communication. Open MPI has not been tested with AdHash.
- Boost (>= 1.46) [Required]
  - AdHash requires all headers of the non-compiled Boost libraries (libboost-dev) as well as several compiled Boost libraries; specifically, it uses Boost's iostreams, system, and filesystem libraries.
- Qt (>= 5.4) [Required]
  - AdHash has a simple Qt-based GUI for managing, partitioning, and querying RDF datasets. AdHash links dynamically against Qt.
- OpenGL (libGL) [Required]
  - Needed by Qt for GUI rendering.
- gnuplot (>= 4.4) [Required]
  - Needed to plot some images within the AdHash GUI.
All of the dependencies except Qt can be installed from the Ubuntu/Linux Mint repositories:
sudo apt-get update
sudo apt-get install gcc g++ build-essential openssh-server libmpich2-dev libboost-dev libboost-iostreams-dev libboost-system-dev libboost-filesystem-dev libglu1-mesa-dev gnuplot git
Download Qt from http://www.qt.io/download/. Both the online and offline installers were tested with AdHash.
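To verify that the installed versions meet the minimum requirements listed above, the following standard commands can be used (the Boost check assumes the apt packages from the previous step):

g++ --version
dpkg -s libboost-dev | grep Version
qmake -v
gnuplot --version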
Note: for MPI to work, all workers must be able to SSH to each other without passwords; one way to set this up is sketched below.
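The following commands are a minimal sketch for enabling passwordless SSH, assuming hypothetical hostnames worker1 and worker2 (substitute your own accounts and hostnames, and repeat ssh-copy-id for every worker, including the master):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # generate a key pair with an empty passphrase
ssh-copy-id user@worker1                   # install the public key on each worker
ssh user@worker1 hostname                  # should print the hostname without a password prompt

Once this works, a quick MPICH sanity check can be run across the workers (machinefile lists one hostname per line):

printf 'worker1\nworker2\n' > machinefile
mpiexec -f machinefile -n 2 hostname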
## Download and Compile
You can download AdHash from GitHub, either by cloning the repository with git or as a zip archive from the repository page.
To clone the repository using git, execute the following command:
git clone https://github.com/razen-alharbi/AdHash.git
- Change your current working directory to the downloaded directory. This directory is referred to as ADHASH_HOME hereafter.
cd ADHASH_HOME
- Generate the make files by running qmake:
qmake AdHashGUI.pro -r -spec linux-g++
- Compile AdHash. You can run multiple parallel build jobs; we used 4:
make -j4
- Run the AdHash management console by executing the following command. AdHash has a simple, self-explanatory GUI for loading, partitioning, and querying RDF datasets:
./Release/mgmt
## Running Issues
If your system's locale is not set properly, AdHash may fail to run. We noticed this specifically after installing a desktop environment on Ubuntu servers. To work around this, execute the following command:
export LC_ALL="en_US.UTF-8"
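To make this setting persistent across sessions (assuming a bash login shell), append it to your shell profile:

echo 'export LC_ALL="en_US.UTF-8"' >> ~/.bashrc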
If you encounter issues, please send an email to [email protected] and [email protected].
## Tested Datasets
The following datasets/benchmarks have been tested successfully with AdHash:
* LUBM Benchmark (http://swat.cse.lehigh.edu/projects/lubm/).
* WatDiv Benchmark (http://dsg.uwaterloo.ca/watdiv/).
* YAGO2 (http://yago-knowledge.org/).
* Bio2RDF (http://www.bio2rdf.org/).