- Servers: ontop, fuseki
- number of query repeats: 30
- client ip: 128.176.254.12
- client hostname: micropop046
- client software: apache-jena-5.2.0/bin/rsparql
- rsparql version reported as 4.2.0
- connection: vpn to uni-muenster
- note: all queries had already been run once about 30 min earlier. We noticed reduced query times on the 00-construct and other queries. Caching?
- As there is no API to control caching, neither server-side nor client-side, we may have to do some kind of warm-up. In subsequent runs, change the order of the loops (the repetition of queries should be a higher-order loop than the iteration over servers).
- executing all queries
- Connected to Münster VPN
- Server host: 10.14.28.137
- checked that all queries can be executed.
- query 09 failed; moved it to queries/query_foundry.
- observed that queries run after the test respond faster; suspect there is some caching going on.
- results are saved in /queries/saved_runs/20241210T091800
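The warm-up idea from the notes above could look like the following sketch. This is a hypothetical helper, not the script actually used; the `warm_up` function name, the `warmup` label argument, and the `DRY_RUN` switch are assumptions layered on top of the `timer.sh` usage seen elsewhere in this log.

```shell
# Hypothetical warm-up pass: run every query once against an endpoint and
# discard the timings, so that later measured rounds hit warm caches.
# Set DRY_RUN=echo to preview the commands instead of executing them.
warm_up() {
  local endpoint=$1
  local query
  for query in ??-*.rq; do
    ${DRY_RUN:-} ./timer.sh "$query" "$endpoint" warmup 1
  done
}
```

One such pass per endpoint before the measured rounds would make the first measured repetition comparable to the later ones.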
shell command for ontop:

```shell
cd queries
for q in ??-*.rq; do
  ./timer.sh "$q" http://128.176.233.7:8080/sparql ontop 30
done
```

shell command for fuseki:

```shell
cd queries
for q in ??-*.rq; do
  ./timer.sh "$q" http://128.176.233.7:3030/OME/sparql fuseki 30
done
```

In all queries, ontop's query response time is faster than fuseki's. We must be aware of caching, hence we should try other query submission strategies (see Run 2 and Run 3). By changing the order with respect to Run 1, we try to eliminate the effect of caching: each server never sees the same query twice in immediate succession, whereas before, all 30 repetitions of a query were run one after the other.
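The collected CSVs can then be summarized per query. A minimal sketch, assuming the `Wall (s),User (s),Sys (s)` header and one row per repeat as written by the scripts in this log; the `mean_wall` helper itself is not part of the original workflow.

```shell
# Hypothetical post-processing helper: average the wall-clock column of a
# *.timings.collected.csv file. Skips the header row (NR > 1) and prints
# the mean with three decimals.
mean_wall() {
  awk -F, 'NR > 1 { sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }' "$1"
}
```

Usage: `mean_wall 00-construct.ontop.timings.collected.csv` prints the mean wall time in seconds over all repeats.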
- Servers: ontop, fuseki
- number of query repeats: 30
- client ip: 128.176.254.12
- client hostname: micropop046
- client software: apache-jena-5.2.0/bin/rsparql
- rsparql version reported as 4.2.0
- connection: vpn to uni-muenster
- note: changed the order of looping over endpoints, rounds, and queries; see script below.
- data generating script:
run_all_queries.sh
```shell
#! /bin/bash
for query in ??-*.rq; do
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
done
for round in {1..30}; do
  echo "round=$round "
  for query in ??-*.rq; do
    echo "$query "
    echo ontop
    ./timer.sh $query http://10.14.28.137:8080/sparql ontop 1
    cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv
    echo fuseki
    ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
    cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
  done
done
```

The timings come out differently: for many queries, fuseki is now faster, and ontop is a factor of 2 slower. The observed factor of 2 in ontop's query response time compared to fuseki is likely caused by the overhead in ontop: ontop has to rewrite the SPARQL query into SQL, run the SQL query against the relational database, and re-translate the SQL response into RDF, whereas fuseki runs the SPARQL query directly on its triplestore database.
- Servers: ontop, fuseki, virtuoso
- number of query repeats: 30
- client ip: 128.176.254.12
- client hostname: micropop046
- client software: apache-jena-5.2.0/bin/rsparql
- rsparql version reported as 4.2.0
- connection: vpn to uni-muenster
- data generating script:
run_all_queries.sh
```shell
#! /bin/bash
for query in ??-*.rq; do
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).virtuoso.timings.collected.csv
done
for round in {1..30}; do
  echo "round=$round "
  for query in ??-*.rq; do
    echo "$query "
    echo ontop
    ./timer.sh $query http://10.14.28.137:8080/sparql ontop 1
    cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv
    echo fuseki
    ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
    cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
    echo virtuoso
    ./timer.sh $query http://10.14.28.137:8890/sparql virtuoso 1
    cat $(basename ${query} .rq).virtuoso.timings.csv | tail -1 >> $(basename ${query} .rq).virtuoso.timings.collected.csv
  done
done
```

We first generated rdf.ttl files of reduced size with
```sparql
construct {?s ?p ?o} where {?s ?p ?o} limit <NTRIPLES>
```

where `<NTRIPLES>` is a placeholder which takes on the values 1000, 2000, 5000, 10000, 20000, 50000, 100000, and 200000.
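The substitution step can be sketched as follows. This is an assumption about the mechanics, not the original workflow: the template file name `construct_template.rq` and the output naming `construct_<n>.rq` are hypothetical.

```shell
# Sketch: generate one reduced-size construct query per triple count by
# substituting the <NTRIPLES> placeholder in a template query file.
make_reduced_queries() {
  local template=$1
  local n
  for n in 1000 2000 5000 10000 20000 50000 100000 200000; do
    sed "s/<NTRIPLES>/$n/" "$template" > "construct_${n}.rq"
  done
}
```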
Starting with the 200k-triples graph loaded into the fuseki triplestore, we run the 10 queries against the fuseki endpoint and repeat that sequence 10 times. The script is pasted below.
- After each run, we drop the default graph with `drop default` and upload the next reduced graph.
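The graph swap between runs could be scripted as below. This is a hedged sketch assuming Fuseki's default service names (SPARQL Update at `/OME/update`, Graph Store Protocol at `/OME/data`); adjust if the dataset is configured differently. The `swap_graph` helper and `DRY_RUN` switch are assumptions, not part of the original workflow.

```shell
# Hypothetical graph-swap step between runs. Set DRY_RUN=echo to preview
# the curl commands without a running server.
swap_graph() {
  local base=$1 ttl=$2
  # drop the current default graph via SPARQL Update ...
  ${DRY_RUN:-} curl -s -X POST --data-urlencode 'update=DROP DEFAULT' "${base}/update"
  # ... then upload the next reduced graph as the new default graph
  ${DRY_RUN:-} curl -s -X PUT -H 'Content-Type: text/turtle' --data-binary "@${ttl}" "${base}/data?default"
}
```

Usage: `swap_graph http://10.14.28.137:3030/OME graph_100000.ttl` (the ttl file name here is illustrative).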
```shell
#! /bin/bash
for query in ??-*.rq; do
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
done
for round in {1..10}; do
  echo "round=$round "
  for query in ??-*.rq; do
    echo "$query "
    echo fuseki
    ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
    cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
  done
done
```
The figure shows the measured query response time (Wall, User, and System clocks [fn:1]) as a function of the number of triples loaded in the Fuseki triplestore (left: linear x-axis scale; right: logarithmic x-axis scale). Each point is the average over 10 identical queries; the various queries are color-coded. The shaded areas mark 1 standard deviation above and below the marker.
The observed query response times vary with the number of triples and the type of query. The longest response time (wall time) is measured for the "image properties" query, which retrieves all key-value annotations from all images. For very small graphs (1000 and 2000 triples), all measured response (wall) times coincide at approx. 1 s. Up to a certain graph size, each query's response time is at first independent of the graph size; beyond that, it increases approximately linearly (query response wall time ~ number of triples).

[fn:1] Wall time = time elapsed on a "wall" clock; User time = sum of the times that any CPU spends in user code within the process; System time = sum of the times that any CPU spends in system code within the process. See this SO post for details.
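The `timer.sh` script used throughout this log is not shown. The following is a speculative reconstruction, assuming it wraps Jena's `rsparql` and records the three clocks defined in the footnote; the real script may differ (for example, it might use GNU `/usr/bin/time` instead of bash's `time` builtin).

```shell
# Hypothetical reconstruction of timer.sh: time <repeats> runs of a query
# and append "wall,user,sys" seconds to <query>.<server>.timings.csv.
# Usage: timer <query.rq> <endpoint-url> <server-label> <repeats>
timer() {
  local query=$1 endpoint=$2 server=$3 repeats=$4
  local out i
  out="$(basename "$query" .rq).${server}.timings.csv"
  echo "Wall (s),User (s),Sys (s)" > "$out"
  TIMEFORMAT='%R,%U,%S'   # real (wall), user, and sys seconds as one CSV row
  for i in $(seq 1 "$repeats"); do
    # the time report goes to stderr of the group, which we append to the CSV
    { time rsparql --service "$endpoint" --query "$query" >/dev/null 2>&1 ; } 2>> "$out"
  done
}
```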
- Servers: ontop, fuseki, virtuoso
- number of query repeats: 10
- client ip: 128.176.254.12
- client hostname: micropop046
- client software: apache-jena-5.2.0/bin/rsparql
- rsparql version reported as 4.2.0
- connection: public network (server IP 128.176.233.7)
```shell
#! /bin/bash
for query in ../../??-*.rq; do
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
  echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).virtuoso.timings.collected.csv
done
for round in {1..30}; do
  echo ""
  echo ""
  echo "round=$round "
  for query in ../../??-*.rq; do
    echo "$query "
    echo ontop
    ../../timer.sh $query http://128.176.233.7:8080/sparql ontop 1
    cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv
    echo fuseki
    ../../timer.sh $query http://128.176.233.7:3030/OME/sparql fuseki 1
    cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
    echo virtuoso
    ../../timer.sh $query http://128.176.233.7:8890/sparql virtuoso 1
    cat $(basename ${query} .rq).virtuoso.timings.csv | tail -1 >> $(basename ${query} .rq).virtuoso.timings.collected.csv
  done
done
```

In previous runs, we connected to the SPARQL endpoints via VPN; here we connect via the public network. We find that this results in a slight reduction of query response time in all queries.
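For comparing the three servers on one query, the per-server collected CSVs can be reduced to one mean each. A hypothetical summary helper, assuming the CSV layout written by the script above; `compare_servers` is not part of the original workflow.

```shell
# Hypothetical summary step: print the mean wall time per server for one
# query stem, e.g. "compare_servers 00-construct".
compare_servers() {
  local stem=$1 server
  for server in ontop fuseki virtuoso; do
    # skip the header row, average column 1 (wall seconds), label by server
    awk -F, -v s="$server" 'NR > 1 { sum += $1; n++ } END { if (n) printf "%s %.3f\n", s, sum / n }' \
        "${stem}.${server}.timings.collected.csv"
  done
}
```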