
How did you create the set of edit distance tags?

I created the set of tags by running the following script, increasing the value of --tag-length for longer tags:

python make_levenshtein_tags.py --tag-length=10 --edit-distance=3 --no-polybase \
    --gc --comp --use-c --multiprocessing --min-and-greater | tee 10_nt.txt

This runs make_levenshtein_tags.py with the options given: it generates all combinations of ACGT of the requested length, filters out polybase tags (more than 2 identical bases in a row), filters out tags of low (GC < 40%) or high (GC > 60%) GC content, and then finds the largest set of tags separated by the given minimum edit distance (3 in the example above). The code then iterates through higher edit distance values, up to the maximum possible for that tag length.
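
For reference, the generation and filtering steps can be sketched roughly as follows. This is a minimal illustration, not the script's actual code; the function name and keyword defaults are hypothetical:

    import itertools

    def candidate_tags(tag_length, max_run=2, gc_min=0.40, gc_max=0.60):
        """Yield ACGT tags of tag_length passing the polybase and GC filters.

        Illustrative sketch -- the name and defaults here are assumptions,
        not make_levenshtein_tags.py's actual internals.
        """
        for combo in itertools.product("ACGT", repeat=tag_length):
            tag = "".join(combo)
            # polybase filter: reject runs of more than max_run identical bases
            if any(base * (max_run + 1) in tag for base in "ACGT"):
                continue
            # GC filter: keep tags with 40-60% GC content
            gc = (tag.count("G") + tag.count("C")) / tag_length
            if gc_min <= gc <= gc_max:
                yield tag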

Because of this iteration, you do not always get the exact maximum number of tags in the edit distance categories above the minimum (in the example above, these categories are edit distances [4, 5, 6, 7, 8, 9]). However, the counts are usually very close to the maximum, and this method of identifying tags is faster than starting the run anew for each group of edit distance tags.
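
The --min-and-greater iteration can be sketched like this (again with hypothetical names; `distance` stands for any pairwise edit-distance function):

    def groups_by_distance(group, min_distance, max_distance, distance):
        """Derive a tag group for each edit distance >= min_distance.

        `group` is the set already found at min_distance; each higher-distance
        group is built by greedily filtering the previous one rather than
        re-searching all candidates, so it may fall slightly short of the true
        maximum. Illustrative sketch, not the script's actual code.
        """
        groups = {}
        current = list(group)
        for d in range(min_distance, max_distance + 1):
            kept = []
            for tag in current:
                if all(distance(tag, other) >= d for other in kept):
                    kept.append(tag)
            groups[d] = kept
            current = kept  # reuse this group as the next candidate set
        return groups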

Why is creating a set of edit distance tags so slow (for tag lengths > 8 nt)?

Because you have to compute pairwise edit distances between all the tags in the candidate set. For the filtered set of 10 nt tags (484,000 tags pass the filters), that is:

(483,999 * 484,000) / 2 = 117,127,758,000

comparisons to make. We skip many of these comparisons by first finding the tags that are separated from the largest number of other tags by at least the specified minimum edit distance. Then, we iterate one-by-one across this reduced set, growing a group of tags in which each member is at least the minimum edit distance from every other member. Finally, we pick the tags that compose the largest such group.
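
A rough Python sketch of this two-stage heuristic follows. All names here are hypothetical, and the real code also parallelizes the work and uses a C extension for the distance calculation (per the --multiprocessing and --use-c flags above):

    from itertools import combinations

    def levenshtein(a, b):
        """Standard dynamic-programming (Wagner-Fischer) edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def largest_distance_group(tags, min_distance):
        """Illustrative two-stage heuristic, not the script's actual code.

        Stage 1: count, for each tag, how many other tags sit at
        >= min_distance from it, and keep only the best-connected tags.
        Stage 2: greedily grow a mutually-distant group from each of those
        seeds and return the largest group found.
        """
        counts = {t: 0 for t in tags}
        for a, b in combinations(tags, 2):
            if levenshtein(a, b) >= min_distance:
                counts[a] += 1
                counts[b] += 1
        best = max(counts.values())
        seeds = [t for t in tags if counts[t] == best]
        largest = []
        for seed in seeds:
            group = [seed]
            for t in tags:
                if t != seed and all(levenshtein(t, g) >= min_distance
                                     for g in group):
                    group.append(t)
            if len(group) > len(largest):
                largest = group
        return largest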