The MetaGenomQuest Tutorial

Each of you will get a random metagenomic sample and with the help of this tutorial, it is your quest to report the sample that you were tasked to study!

Introduction

As sequencing methods become more inexpensive, producing genomic and metagenomic data becomes more available for large-scale analyses. Therefore, the development of quick and accuarate computational approaches become essential. Here, we introduce one such program, sourmash, which enables the minimization of sequencing data, similarity comparisons, species identification, among other tasks[1].

FracMinHash Sketch

Utilizing sourmash, we can sketch a simpler representation of the original genomic or metagenomic datasets using the FracMinHash approach.

Simply put it, a minimized set of k-mers (a subset of a string of length k) is produced from a fasta file with sequences of interest. The threshold to which the set is minimized is indicated by a scale factor, a parameter used to reduce the original set. Note that if a scale factor of 1 is utilized, the final FracMinHash sketch will be the original set of k-mers and as this number increases, the more the original set of k-mers will decrease for the final FracMinHash sketch. Each k-mer in the original set is passed through a Hash function, k-mers that minimize this function by having a value that is equal or less to the scale factor are used in the final FracMinHash sketch, FRAC(A) for example.

Comparing similarity of two FracMinHash sketches

One of the beneficial tasks of sourmash is to estimate similarity between sketches offering the similarity indexes Jaccard and Containment.

Jaccard. The Jaccard index is one of an earlier similarity index used to estimate the size of the union of two sets. However, as the sizes between two sets becomes more dissimilar, the less reliable is this estimation[2].

Containment. The Containment index addresses this issue by estimating what is contained from one Set in another so the differing size of two sets is not an issue[2].

Preprocessing data utilizing sourmash sketch

To start our sourmash tutorial, we will sketch our fasta file of interest: sample_001.fna. This file does not contain any meaningfull information but is just a play dataset for tutorial purposes.

sourmash sketch dna datasets/sample_001.fna -p k=31,scaled=500

Parameter	Description
sourmash sketch	The command that sourmash utilizes to produce a sketch.
dna	Identify that the sequences in our X file are DNA. If a different sequence type is used, like protein sequences, then this would be indicated as protein instead.
sample_001.fna	Filename of interest. Please note that sourmash can produce sketches from either FASTA or FASTQ fules.
-p	Flag to indicate a list of parameters.
k=33	Tunable parameter required by the user to set. Check how the ksize can affect further analyses here: (CITE).
scaled=500	Our scale factor, which is also tunable.

Utilizing sourmash sketch, we have produced the following file, known as a signature file: sample_001.fna.sig.

Let's be a bit anosey and see what this file contains:

sourmash sig describe sample_001.fna.sig

Below is the information you should see displayed on your end. Information such as the signature filename, sequence type, ksize and scale factor used are report. Additional information, such as total number of signatures and hashes are also shown, alongside other information.

== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'sample_001.fna.sig'
path filetype: MultiIndex
location: sample_001.fna.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 21
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000             21 total hashes

Comparing and Searching signatures

Compare similarity of two sketches with sourmash compare

Now that we feel a bit more comfortable with a sketching fasta file. Let's sketch a another file to compare it to the previous signature file we produced. Notice, that we are using the same ksize and scale factor, sourmash requires that two signatures have parameters tuned in the same way for comparison.

sourmash sketch dna datasets/sample_002.fna -p k=31,scaled=500

Let's estimate the containment index of our signature files utilizing sourmash compare!

sourmash compare sample_001.fna.sig sample_002.fna.sig --containment

You should see the following output, where we have the names of the original fasta files and the containment between these files.

== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 2 signatures total.
NOTE: downsampling to scaled value of 1000

0-sample_001.fna        [1.    0.696]
1-sample_002.fna        [0.762 1.   ]
min similarity in matrix: 0.696
WARNING: size estimation for at least one of these sketches may be inaccurate. ANI values will be set to 1 for these comparisons.

These results can be reported to a csv file for further analyses.

sourmash compare sample_001.fna.sig sample_002.fna.sig --containment --csv compare.csv

Parameter	Description
sourmash compare	The command that sourmash utilizes to compare two signatures.
sample_001.fna.sig	signature filename 1.
sample_002.fna.sig	signature filename 2.
--containment	Flag to indicate we are using the similarity index containment. Other similarity indexes that can be used are --jaccard or --ani
--csv	Flag to indicate we want to produce a CSV file to report a matrix with similarity indexes

Looking at our newly produced csv file, we would expect the following similarity matrix:

sample_001.fna,sample_002.fna
1.0,0.6956521739836137
0.7619047624764425,1.0

Compare individual sequences between signatures.

As you may have noticed, the comparisons reported are between two fasta files as a whole. However, you might be interested in comparing the sequences individually than the sequences as a whole. To accomplish this, we revisit sourmash sketch.

Let's modify our sourmaash sketch command by adding the --singleton flag, this will produce a signature file where each sequence is sketched individually.

sourmash sketch dna datasets/sample_001.fna --singleton -p k=31,scaled=500 -o sample_001.fna.singleton.sig

Compare the description between the first signature file we produced in which the --singleton flag was not utilized (sample_001.fna.sig) ato our new siganture file that does utilize the --singleton flag, sample_001.fna.singleton.sig.

sourmash sig fileinfo sample_001.fna.singleton.sig

Instead of having just one sketch representing the fasta file as a whole, the --singleton flag produces a signature file with 10 sketches representing the 10 sequences within our fasta file of interest.


== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'test.sig'
path filetype: MultiIndex
location: sample_001.fna.singleton.sig
is database? no
has manifest? yes
num signatures: 10
** examining manifest...
total hashes: 210
summary of sketches:
   10 sketches with DNA, k=31, scaled=500             210 total hashes

We can run the same command for coomparing two signature files, and the csv file produced will contain the matrix for individual sequences.

sourmash compare sample_001.fna.singleton.sig sample_002.fna.singleton.sig --containment --csv compare.singleton.sig.csv

Search and report overall similarity percentages using sourmash search

Maybe you would like to report the percent of how much there is of one sample in another sample.

sourmash search sample_001.fna.sig sample_002.fna.sig --containment

Parameter	Description
sourmash search	The command that sourmash utilizes to search how much a query is in a sample.
sample_001.fna.sig	signature filename 1.
sample_002.fna.sig	signature filename 2.
--containment	Flag to indicate we are using the similarity index containment. Other similarity indexes that can be used are `--jaccard`

According to the report below, there is ~76% of sample_001.fna.sig in sample_002.fna.sig.

== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: sample_001.fna... (k=31, DNA)
--
loaded 1 total signatures from 1 locations.
after selecting signatures compatible with search, 1 remain.

1 matches above threshold 0.080:
similarity   match
----------   -----
 76.2%       sample_002.fna

Search an md5 identifier and report similarity

Let's revisit the signature file that was produced using the --singleton flag. A manifest file contains md5 identifiers for each sketch within a signature file. To create your manifest file, use the following command:

sourmash sig manifest sample_001.fna.singleton.sig -o sample_001.manifest.csv

Parameter	Description
sourmash sig manifest	The command that sourmash utilizes to produce a manifest file.
sample_001.fna.singleton.sig	signature filename 1 that includes a sketch for each sequence.
-o	Output name of manifest file. The output name can also be indicated using `--csv`

If you open the newly produced manifest file, you'll observe that each sketch in your signature file has a md5 identifier. For tutorial purposes, I only show the first few lines.

internal_location	md5	md5short	ksize	moltype	scaled	n_hashes	name	filename
sample_001.fna.sig	d3513280b35b2a918a7181875c0683c8	d3513280	31	DNA	500	22	genome_A_sequence_A	datasets/sample_001.fna
sample_001.fna.sig	2d96ee330b6b295a06b70fdbfb75af34	2d96ee33	31	DNA	500	21	genome_A_sequence_0	datasets/sample_001.fna
sample_001.fna.sig	364c20a1ca43d3ae3e9ae7d9e9a6c837	364c20a1	31	DNA	500	19	genome_B_sequence_0	datasets/sample_001.fna

If we copy the first md5 identifer in this file, we can sourmash search how much of this sequence is in sample_002.fna.singleton.sig

sourmash search sample_001.fna.singleton.sig sample_002.fna.singleton.sig --md5 d3513280b35b2a918a7181875c0683c8 --containment

The md5 identifier that we are interested in is found across three sequences in sample_002.fna.singleton.sig with the following percentages.

== This is sourmash version 4.8.6. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: genome_A_sequence_A... (k=31, DNA)
--
loaded 10 total signatures from 1 locations.
after selecting signatures compatible with search, 10 remain.

9 matches above threshold 0.080; showing first 3:
similarity   match
----------   -----
100.0%       ref_gene
 90.9%       genome_A_2
 86.4%       genome_A_1
WARNING: size estimation for at least one of these sketches may be inaccurate. ANI values will not be reported for these comparisons.

Report abundance using sourmash gather

One of the many tasks in metagenomic research, is reporting the abundanes of species within a sample. We can accomplish this utilizing sourmash gather.

Unsure if this is working, running sourmash gather --help returning error.

References

[1] Brown, C. T., & Irber, L. (2016). sourmash: a library for MinHash sketching of DNA. Journal of open source software, 1(5), 27.

[2] Koslicki, D., & Zabeti, H. (2019). Improving minhash via the containment index with applications to metagenomic analysis. Applied Mathematics and Computation, 354, 206-215.

[3] Hera, M. R., Pierce-Ward, N. T., & Koslicki, D. (2023). Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome research, 33(7), 1061-1068.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
data		data
datasets		datasets
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The MetaGenomQuest Tutorial

Introduction

FracMinHash Sketch

Comparing similarity of two FracMinHash sketches

Preprocessing data utilizing sourmash sketch

Comparing and Searching signatures

Compare similarity of two sketches with sourmash compare

Compare individual sequences between signatures.

Search and report overall similarity percentages using sourmash search

Search an md5 identifier and report similarity

Report abundance using sourmash gather

References

About

Releases

Packages

KoslickiLab/sourmash_tutorial

Folders and files

Latest commit

History

Repository files navigation

The MetaGenomQuest Tutorial

Introduction

FracMinHash Sketch

Comparing similarity of two FracMinHash sketches

Preprocessing data utilizing sourmash sketch

Comparing and Searching signatures

Compare similarity of two sketches with sourmash compare

Compare individual sequences between signatures.

Search and report overall similarity percentages using sourmash search

Search an md5 identifier and report similarity

Report abundance using sourmash gather

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages