Logbook

Dec. 10 2024

Run 1

Testing all queries from remote via vpn

  • Servers: ontop, fuseki
  • number of query repeats: 30
  • client ip: 128.176.254.12
  • client hostname: micropop046
  • client software: apache-jena-5.2.0/bin/rsparql
  • rsparql version reported as 4.2.0
  • connection: vpn to uni-muenster
  • note: all queries were run once about 30 min before this run. Query times for 00-construct and other queries were noticeably shorter in this run, which suggests caching.
  • As there is no API to control caching on either the server side or the client side, we may need some kind of warm-up. In subsequent runs, change the order of the loops: repetition of queries should be a higher-order (outer) loop than iteration over servers.
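
The warm-up idea and the changed loop order from the note above can be sketched as follows. This is only an illustration of the execution order, not a script that was used for any run: the function name print_schedule, the "warmup" label and the query file 01-placeholder.rq are hypothetical. Round 0 stands for a warm-up pass whose timings would be discarded.

```shell
#!/bin/bash
# Sketch: print the execution order for a warm-up pass (round 0) followed by
# the measured rounds. Rounds are the outermost loop, servers the innermost.
print_schedule() {
    local rounds=$1; shift          # number of measured rounds
    local servers="ontop fuseki"    # server labels as used in this logbook
    local round query server label
    for ((round = 0; round <= rounds; round++)); do
        if ((round == 0)); then label="warmup"; else label="round=$round"; fi
        for query in "$@"; do
            for server in $servers; do
                echo "$label $query $server"
            done
        done
    done
}

# Example with two placeholder query files and two measured rounds.
print_schedule 2 00-construct.rq 01-placeholder.rq
```

With this ordering, two executions of the same query on the same server are always separated by all other queries of that round.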

Workflow

  • executing all queries
  • Connected to Münster VPN
  • Server host: 10.14.28.137
  • checked all queries can be executed.
  • query 09 failed, moved it to queries/query_foundry.
  • observed that queries run after the test respond faster, suspect there is some caching going on.
  • results are saved in /queries/saved_runs/20241210T091800

Shell commands

shell command for ontop:

cd queries
for query in ??-*.rq; do ./timer.sh $query http://128.176.233.7:8080/sparql ontop 30; done

shell command for fuseki:

cd queries
for query in ??-*.rq; do ./timer.sh $query http://128.176.233.7:3030/OME/sparql fuseki 30; done

Results

queries/saved_runs/20241210T091800/facet_walltime.png

Discussion

For all queries, ontop’s query response time is shorter than fuseki’s. Caching may distort this result, so we should try other query submission strategies (-> see Run 2, Run 3).

Run 2: Loop over 30 rounds, 10 queries, 2 servers, in that nesting order

By changing the loop order with respect to Run 1, we try to eliminate the effect of caching: no server sees the same query twice in immediate succession, whereas before, all 30 repetitions of a query were run one after the other.

Run parameters

  • Servers: ontop, fuseki
  • number of query repeats: 30
  • client ip: 128.176.254.12
  • client hostname: micropop046
  • client software: apache-jena-5.2.0/bin/rsparql
  • rsparql version reported as 4.2.0
  • connection: vpn to uni-muenster
  • note: changing order of looping over endpoints, rounds, and queries, see script below.
  • data generating script: run_all_queries.sh

Script

The script was run while connected to the VPN. Adjust the IP if you take this as basis for your own runs.
#! /bin/bash

for query in ??-*.rq; do
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
done

for round in {1..30}; do
    echo "round=$round "
    for query in ??-*.rq; do
        echo "$query "

        echo ontop
        ./timer.sh $query http://10.14.28.137:8080/sparql ontop 1
        cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv

        echo fuseki
        ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
        cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
    done
done

Results

The timings differ from Run 1: for many queries, fuseki is now faster, and ontop is slower by roughly a factor of 2.

queries/saved_runs/20241210T113800/facet_walltime.png

Discussion

The observed factor of 2 in ontop’s query response time compared to fuseki is likely caused by translation overhead in ontop. Ontop has to rewrite the SPARQL query into SQL, run the SQL query against the relational database, and translate the SQL result back into RDF, whereas fuseki runs the SPARQL query directly on its triplestore database.

Run 3: Same logic as Run 2 but including virtuoso

Run parameters

  • Servers: ontop, fuseki, virtuoso
  • number of query repeats: 30
  • client ip: 128.176.254.12
  • client hostname: micropop046
  • client software: apache-jena-5.2.0/bin/rsparql
  • rsparql version reported as 4.2.0
  • connection: vpn to uni-muenster
  • data generating script: run_all_queries.sh

Script

#! /bin/bash

for query in ??-*.rq; do
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).virtuoso.timings.collected.csv
done

for round in {1..30}; do
    echo "round=$round "
    for query in ??-*.rq; do
        echo "$query "

        echo ontop
        ./timer.sh $query http://10.14.28.137:8080/sparql ontop 1
        cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv

        echo fuseki
        ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
        cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv

        echo virtuoso
        ./timer.sh $query http://10.14.28.137:8890/sparql virtuoso 1
        cat $(basename ${query} .rq).virtuoso.timings.csv | tail -1 >> $(basename ${query} .rq).virtuoso.timings.collected.csv

    done
done

Results

queries/saved_runs/20241210T143500/facet_walltime.png

Dec. 11 2024

Run 4: Query response time vs. number of triples

We first generated rdf.ttl files of reduced size with
construct {?s ?p ?o} where {?s ?p ?o} limit <NTRIPLES>

NTRIPLES is a placeholder which takes on values of 1000, 2000, 5000, 10000, 20000, 50000, 100000, and 200000.
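
As an illustration, the query files for all graph sizes could be generated in a loop. This is a minimal sketch, not the procedure actually used: the output directory reduced_graphs and the file names construct_<N>.rq and rdf_<N>.ttl are hypothetical, and the rsparql call in the comment assumes an ENDPOINT variable holding the SPARQL endpoint URL.

```shell
#!/bin/bash
# Sketch: write one CONSTRUCT query file per graph size.
sizes="1000 2000 5000 10000 20000 50000 100000 200000"   # values from the text above
outdir=reduced_graphs                                    # hypothetical output directory
mkdir -p "$outdir"

for n in $sizes; do
    printf 'construct {?s ?p ?o} where {?s ?p ?o} limit %d\n' "$n" \
        > "$outdir/construct_$n.rq"
    # Assumption: the reduced graph could then be retrieved from an endpoint with
    #   rsparql --service "$ENDPOINT" --query "$outdir/construct_$n.rq" > "$outdir/rdf_$n.ttl"
done
```

Note that without an ORDER BY clause, which triples a LIMIT query returns is up to the endpoint, so the reduced graphs are not guaranteed to be nested subsets of each other.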

Starting with the 200k-triple graph loaded into the fuseki triplestore, we run the 10 queries on the fuseki endpoint and repeat that sequence 10 times. The script is pasted below.

After each run, we drop the default graph

drop default

and upload the next reduced graph.
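
The drop-and-upload step could also be scripted over HTTP. The sketch below is an assumption, not what was done in this run: it presumes the OME dataset exposes Fuseki's standard update and data services (SPARQL Update and the Graph Store Protocol) with write access enabled. The function name reload_graph, the DRY_RUN switch and the file name rdf_1000.ttl are hypothetical.

```shell
#!/bin/bash
# Sketch: replace the default graph of a dataset with the content of a Turtle file.
# With DRY_RUN set to a non-empty value, the curl commands are only printed.
reload_graph() {
    local base=$1                 # dataset URL, e.g. http://10.14.28.137:3030/OME
    local ttl=$2                  # Turtle file to upload
    local run=${DRY_RUN:+echo}    # "echo" in dry-run mode, empty otherwise
    # SPARQL Update request that drops the default graph
    $run curl -sS -X POST --data-urlencode 'update=DROP DEFAULT' "$base/update"
    # Graph Store Protocol request that adds the file to the default graph
    $run curl -sS -X POST -H 'Content-Type: text/turtle' \
        --data-binary "@$ttl" "$base/data?default"
}

# Dry run: prints the two requests without contacting the server.
DRY_RUN=1 reload_graph http://10.14.28.137:3030/OME rdf_1000.ttl
```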

Script

#! /bin/bash

for query in ??-*.rq; do
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
done

for round in {1..10}; do
    echo "round=$round "
    for query in ??-*.rq; do
        echo "$query "

        echo fuseki
        ./timer.sh $query http://10.14.28.137:3030/OME/sparql fuseki 1
        cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv
    done
done

Results

queries/saved_runs/20241211T102000/fuseki_clock_vs_ntriples_linear.png

queries/saved_runs/20241211T102000/fuseki_clock_vs_ntriples_log.png

The figures show the measured query response time (wall, user, and system clocks [fn:1]) as a function of the number of triples loaded in the Fuseki triplestore (first figure: linear x-axis scale, second figure: logarithmic x-axis scale). Each point is the average over 10 repetitions of the same query; the queries are color coded. The shaded areas mark 1 standard deviation above and below the marker.
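
For reference, the mean and standard deviation of the wall time can be computed from one of the collected CSV files as sketched below. This is not the plotting code behind the figures. The function name wall_stats is hypothetical, and the sketch uses the sample standard deviation (division by n - 1), which may differ from what the plots use.

```shell
#!/bin/bash
# Sketch: print "count,mean,stddev" of the first CSV column (wall time in s),
# skipping the header line "Wall (s),User (s),Sys (s)".
wall_stats() {
    awk -F, 'NR > 1 && NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n == 0) exit 1
            mean = sum / n
            # sample variance; guard against tiny negative rounding errors
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0
            printf "%d,%.3f,%.3f\n", n, mean, sqrt(var)
        }' "$1"
}
```

A call would look like wall_stats 00-construct.fuseki.timings.collected.csv, where the file name follows the naming pattern of the scripts in this logbook.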

Discussion

The observed query response times vary with the number of triples and the type of query. The longest query response time (wall time) is measured for the “image properties” query, which retrieves all key-value annotations from all images. For very small graphs (1000 and 2000 triples), all measured response (wall) times coincide at approx. 1 s. Each query’s response time is at first independent of the graph size; beyond a query-specific graph size, it increases approximately linearly with the number of triples.

[fn:1] Wall time = time elapsed on a “wall” clock; User time = sum of times that any CPU spends in user code within the process; System time = sum of times that any CPU spends in system code within the process. See this SO post for details.
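
The three clocks can be recorded in bash with the time keyword and the TIMEFORMAT variable. The sketch below only illustrates the measurement; timer.sh is not reproduced in this logbook and may work differently. The function name time_csv is hypothetical.

```shell
#!/bin/bash
# Sketch: run a command and print "wall,user,sys" in seconds on stdout.
# The output of the command itself is discarded.
time_csv() {
    local TIMEFORMAT='%R,%U,%S'   # %R = wall (real), %U = user, %S = system
    { time "$@" > /dev/null 2>&1; } 2>&1
}

# Example: a command that mostly waits has a wall time near 1 s
# and user/system times near 0.
time_csv sleep 1
```

User and system times measured this way refer to the timed client process only. Time spent on the server or on the network shows up in the wall time alone.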

Dec. 12 2024

Run 5: Query response time on public network

Run parameters

  • Servers: ontop, fuseki, virtuoso
  • number of query repeats: 10
  • client ip: 128.176.254.12
  • client hostname: micropop046
  • client software: apache-jena-5.2.0/bin/rsparql
  • rsparql version reported as 4.2.0
  • connection: public network (server IP 128.176.233.7)

Script

queries/saved_runs/20241212T164000/run5.sh
#! /bin/bash

for query in ../../??-*.rq; do
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).ontop.timings.collected.csv
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).fuseki.timings.collected.csv
    echo "Wall (s),User (s),Sys (s)" > $(basename ${query} .rq).virtuoso.timings.collected.csv
done

for round in {1..30}; do
    echo ""
    echo ""
    echo "round=$round "
    for query in ../../??-*.rq; do
        echo "$query "

        echo ontop
        ../../timer.sh $query http://128.176.233.7:8080/sparql ontop 1
        cat $(basename ${query} .rq).ontop.timings.csv | tail -1 >> $(basename ${query} .rq).ontop.timings.collected.csv

        echo fuseki
        ../../timer.sh $query http://128.176.233.7:3030/OME/sparql fuseki 1
        cat $(basename ${query} .rq).fuseki.timings.csv | tail -1 >> $(basename ${query} .rq).fuseki.timings.collected.csv

        echo virtuoso
        ../../timer.sh $query http://128.176.233.7:8890/sparql virtuoso 1
        cat $(basename ${query} .rq).virtuoso.timings.csv | tail -1 >> $(basename ${query} .rq).virtuoso.timings.collected.csv

    done
done

Results

queries/saved_runs/20241212T164000/facet_walltime.png

Discussion

VPN vs. Public Network

In previous runs, we connected to the SPARQL endpoints via VPN; here we connect via the public network. We find that this results in a slight reduction of query response time for all queries.

queries/saved_runs/20241212T164000/facet_walltime_vpn-www.png