GitHub - langmead-lab/dashing-experiments: Experiments for the Dashing manuscript

Dashing Experiments

This repository contains code and data for reproducing experiments from the manuscript accompanying the Dashing software application, available at https://github.com/dnbaker/dashing.

Build instructions

To build the executables, either clone bonsai recursively and run make (git clone --recursive bonsai && make), or, if bonsai and its submodules are already available on your system, run make BONSAI_DIR=$PATH_TO_BONSAI.

Like dashing and bonsai, these require C++14.

dsexp

dsexp contains experiments testing the performance of various data structures for Jaccard-coefficient calculation.

dsexp.cpp
1. Performs numerical simulations comparing the error rates of Bloom filters, MinHash sketches, and HyperLogLogs at specific sketch sizes for Jaccard-coefficient estimation of sets of varying sizes.
dsexp.Rmd
1. Contains code for visualizing results from dsexp.cpp, which can be used to reproduce Fig. 1 and Supplementary Table 1 from the manuscript.
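
As an illustration of the kind of simulation described above, the following is a minimal Python sketch, not the code in dsexp.cpp. It covers only the MinHash case, using a bottom-k sketch on synthetic integer sets with a known overlap. The function names, the 64-bit mixing function, and the set sizes are assumptions made for this example.

```python
import random

MASK64 = (1 << 64) - 1


def mix64(x, salt=0):
    """SplitMix64-style finalizer, used here as a stand-in hash function."""
    x = (x + salt + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def make_sets(size_a, size_b, overlap, seed=0):
    """Build two integer sets that share exactly `overlap` elements."""
    rng = random.Random(seed)
    total = size_a + size_b - overlap
    universe = rng.sample(range(1 << 40), total)
    shared = universe[:overlap]
    only_a = universe[overlap:size_a]
    only_b = universe[size_a:]
    return set(shared) | set(only_a), set(shared) | set(only_b)


def exact_jaccard(a, b):
    """True Jaccard coefficient computed from the full sets."""
    return len(a & b) / len(a | b)


def bottom_k(items, k):
    """Bottom-k MinHash sketch: the k smallest hash values of the set."""
    return sorted(mix64(x) for x in items)[:k]


def minhash_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard from two bottom-k sketches.

    Take the k smallest hashes of the merged sketches and count how many
    of them appear in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:k]
    in_both = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in in_both) / len(merged)
```

Comparing minhash_jaccard against exact_jaccard over a range of set sizes and values of k gives an error curve of the sort the simulations report; Bloom filters and HyperLogLogs would need their own estimators.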

timing

This code was used to generate Fig. 3, Table 2, and Supplementary Table 3.

all_pairwise.py
1. Performs all pairwise comparisons between a set of genomes across varying sketch sizes and k-mer lengths, comparing bindash, mash, and several estimation methods for HyperLogLogs.
filenames.txt
1. A list of all genomes used for this experiment. These were fetched using bonsai's (https://github.com/dnbaker/bonsai/) download_genomes.py script, requesting all genomes.
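
The shape of such a parameter sweep can be sketched as follows. This is a hypothetical outline, not the contents of all_pairwise.py: the function name and the example parameter values are assumptions, and the actual script invokes external tools that are not modeled here.

```python
from itertools import combinations, product


def comparison_grid(genomes, sketch_sizes, kmer_lengths):
    """Yield (k, sketch_size, genome_a, genome_b) for every unordered pair.

    Each unordered genome pair is visited once per (k, sketch size)
    combination, so the total number of comparisons is
    len(kmer_lengths) * len(sketch_sizes) * n * (n - 1) / 2.
    """
    for k, size in product(kmer_lengths, sketch_sizes):
        for genome_a, genome_b in combinations(genomes, 2):
            yield k, size, genome_a, genome_b
```

Because the number of pairs grows quadratically with the number of genomes, the genome list dominates the total running time of a sweep like this.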

accuracy

This code and data were used to generate Table 1, Fig. 2, and Supplementary Table 2.

pairselector.py
1. This script finds candidate genome pairs in specified ranges of Jaccard indices from a large, upper-triangular table of pairwise distances.
  1. We generated this table with dashing dist, with k=31 and p=16.
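
A minimal sketch of this kind of selection is shown below. It is not the code in pairselector.py: it assumes the table has already been parsed into memory as a list of rows holding Jaccard values, where row i stores the values for genomes i+1, i+2, and so on. The function name, the half-open ranges, and the per_range limit are all assumptions for illustration.

```python
def select_pairs(names, upper, ranges, per_range=1):
    """Pick genome pairs whose Jaccard index falls in each requested range.

    names     -- genome names, in table order
    upper     -- upper-triangular values; upper[i][j - i - 1] is the
                 Jaccard index of names[i] and names[j] for j > i
    ranges    -- list of (low, high) tuples, treated as low <= ji < high
    per_range -- maximum number of pairs to keep for each range
    """
    chosen = {r: [] for r in ranges}
    for i, row in enumerate(upper):
        for offset, ji in enumerate(row):
            j = i + 1 + offset
            for low, high in ranges:
                if low <= ji < high and len(chosen[(low, high)]) < per_range:
                    chosen[(low, high)].append((names[i], names[j], ji))
    return chosen
```

Keeping the first per_range hits for each range is the simplest policy; sampling at random within each range would avoid favoring genomes that appear early in the table.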
pairwise_benchmark.cpp
1. For all pairs of genomes provided, calculates the exact Jaccard index with hash sets, and reports the differences and errors between the estimates and the true value for bindash, mash, and several estimation methods for HyperLogLogs.
2. This can be rather memory-intensive due to the use of full hash sets; for this reason, we suggest omitting large genomes from the call generating the table used in pairselector.py.
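
The exact computation can be illustrated with a small Python sketch. This is an assumption-laden simplification rather than the logic of pairwise_benchmark.cpp: it stores k-mers as plain strings, ignores reverse complements, and takes the estimate as an input rather than computing it.

```python
def kmer_set(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def exact_jaccard(seq_a, seq_b, k):
    """Exact Jaccard index of the k-mer sets of two sequences."""
    kmers_a = kmer_set(seq_a, k)
    kmers_b = kmer_set(seq_b, k)
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def estimate_error(estimate, truth):
    """Signed and absolute error of an estimate against the exact value."""
    diff = estimate - truth
    return diff, abs(diff)
```

Every distinct k-mer is held in memory at once, which is why the exact calculation becomes expensive for large genomes.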
genomes_for_exp.txt
1. The set of genomes emitted by pairselector.py.
ji_range.Rmd
1. Contains code for generating Fig. 2 from the output of pairwise_benchmark.
ji_range_postprocess.py
1. Contains code for finalizing preparation of Table 1 from the output of pairwise_benchmark.

hash

This code was used to evaluate the accuracy of JI estimates as a function of hash function selection.

testhash.cpp
1. This code evaluates the performance of the HLL and related experimental structures for cardinality estimation.
2. We used this code, extracting only the HLL-relevant data, to provide experimental results relating to the performance of hash functions in the text.
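
For readers unfamiliar with the structure, the following is a minimal Python sketch of a HyperLogLog with a pluggable hash function. It is not the implementation in testhash.cpp or in Dashing: it uses the classic estimator from the original HyperLogLog paper with the small-range (linear counting) correction, and the class name, default hash, and parameter names are assumptions for illustration.

```python
import math

MASK64 = (1 << 64) - 1


def mix64(x):
    """SplitMix64-style finalizer, used here as an example 64-bit hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


class HyperLogLog:
    def __init__(self, p=12, hash_fn=mix64):
        if not 7 <= p <= 32:
            raise ValueError("p must be between 7 and 32 in this sketch")
        self.p = p
        self.m = 1 << p            # number of registers
        self.hash_fn = hash_fn     # must return a 64-bit unsigned integer
        self.registers = [0] * self.m

    def add(self, item):
        h = self.hash_fn(item)
        index = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1 + number of leading zeros among the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # linear counting
        return raw
```

Passing a different hash_fn to the constructor while holding p and the input fixed is one way to observe how the choice of hash function affects estimate quality.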

