This repository contains code and data for reproducing experiments from the manuscript accompanying the Dashing software application, available at https://github.com/dnbaker/dashing.
To build executables, either git clone --recursive bonsai && make
, or if you have bonsai available on your system (with its submodules), make BONSAI_DIR=$PATH_TO_BONSAI
.
Like dashing and bonsai, these require C++14.
dsexp contains experiments testing the performance of various data structures for Jaccard-coefficient calculation.
- dsexp.cpp
- Performs numerical simulations comparing the error rates of bloom filters, minhash sketches, and hyperloglogs at specific sketch sizes for Jaccard-coefficient estimation of sets of varying sizes.
- dsexp.Rmd
- Contains code for visualizing results from dsexp.cpp, which can be used to reproduce Fig. 1 and Supplementary Table 1 from the manuscript.
This code was used to generate Fig. 3, Table 2, and Supplementary Table 3.
all_pairwise.py
- Performs all pairwise comparisons between a set of genomes across varying sketch size and kmer length, comparing bindash, mash, and several estimation methods for HyperLogLogs.
- filenames.txt
- A list of all genomes used for this experiment. These were fetched using bonsai's (https://github.com/dnbaker/bonsai/)
download_genomes.py
script, requestingall
genomes.
- A list of all genomes used for this experiment. These were fetched using bonsai's (https://github.com/dnbaker/bonsai/)
This code and data were used to generate Table 1, Fig. 2, and Supplementary Table 2.
- pairselector.py
- This script finds candidate genome pairs in specified ranges of Jaccard indices from a large, upper-triangular table of pairwise distances.
- We generated this table with
dashing dist
, with k=31 and p=16.
- We generated this table with
- This script finds candidate genome pairs in specified ranges of Jaccard indices from a large, upper-triangular table of pairwise distances.
pairwise_benchmark.cpp
- For all pairs of genomes provided, calculate the exact Jaccard index with hash sets, and report the difference and errors between estimates and the true value for bindash, mash, and several estimation methods for HyperLogLogs.
- This can be rather memory-intensive due to the use of full hash sets; for this reason, we suggest omitting large genomes from the call generating the table used in pairselector.py.
genomes_for_exp.txt
- The set of genomes emitted by pairselector.py.
ji_range.Rmd
- Contains code for generating Fig. 2 from the output of
pairwise_benchmark
.
- Contains code for generating Fig. 2 from the output of
ji_range_postprocess.py
- Contains code for finalizing preparation of Table 1 from the output of
pairwise_benchmark
.
- Contains code for finalizing preparation of Table 1 from the output of
This code was used to evaluate the accuracy of JI estimates as a function of hash function selection.
- testhash.cpp
- This code evaluates the performance of the HLL and experimental, related structures for cardinality estimation.
- We used this code, extracting only the HLL-relevant data, to provide experimental results relating to the performance of hash functions in the text.