
Importing Denoised Sequences from DADA2


You may prefer to denoise your sequences before aggregating their temporal profiles with Ananke. The R library DADA2 is a good option for denoising Illumina sequences, and its output can be imported with the ananke import_from_dada2 command. This tutorial demonstrates downloading sequences from EBI, pre-processing them with DADA2, and importing the denoised sequences into Ananke.

Note: This tutorial is for pre-assembled sequences. For paired-end data, follow the tutorial at https://benjjneb.github.io/dada2/tutorial.html and export seqtab.nochim with write.table(t(seqtab.nochim), "dada2_export.csv") (transposed so that sequences are rows, matching the export step at the end of this tutorial).
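
For reference, the tail end of a paired-end run looks roughly like the sketch below. It follows the DADA2 tutorial and is not part of this tutorial's pipeline; filtFs and filtRs are hypothetical placeholders for your filtered forward and reverse FASTQ files.

errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

# Merge the denoised forward and reverse reads into full-length sequences
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

# Transpose so that sequences are rows, matching the single-end export below
write.table(t(seqtab.nochim), "dada2_export.csv")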

Pre-requisites

  • curl
  • wget
  • awk
  • dada2 version 1.4 (R library; see the install sketch below)
  • lubridate (R library)
  • FASTQ amplicon data
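
Both R libraries can be installed from within R; a minimal sketch (dada2 is distributed through Bioconductor, lubridate through CRAN):

# dada2 comes from Bioconductor; biocLite was the standard install route at the
# time of writing (newer Bioconductor releases use BiocManager instead)
source("https://bioconductor.org/biocLite.R")
biocLite("dada2")

# lubridate is on CRAN
install.packages("lubridate")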

Obtaining Input Data

In this case, our input data are pre-assembled Illumina FASTQ files fetched from EBI. Here is a BASH script that will download the metadata and the sequence files.

fetchData.sh:

#!/bin/bash

#Set the project accession (example given here is Lake Mendota 11-year time-series)
ACCESSION=PRJEB14911

#Put everything in a folder
mkdir -p sequence_data
cd sequence_data

#Fetch the project file manifest
curl -sLo MANIFEST.txt "http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22study_accession%3D%22${ACCESSION}%22%22&result=read_run&fields=fastq_ftp,sample_alias,sample_accession&display=report"

#Fetch each of the noted FASTQ files
awk 'BEGIN{FS="\t";}{if (NR>1) {split($2, f, ";"); system("wget " f[1]); if (f[2] != "") system("wget " f[2]);}}' MANIFEST.txt

#Fetch the metadata for each sample
for SAMPLE_ACCESSION in `tail -n +2 MANIFEST.txt | cut -f 4`
do
    #Get the XML report from EBI
    curl -sLo ${SAMPLE_ACCESSION}.txt "https://www.ebi.ac.uk/ena/data/view/${SAMPLE_ACCESSION}&display=xml"

    #If there is no metadata file, write the first line
    if [ ! -f "METADATA.txt" ]
    then
        #Scrape the metadata categories from the XML file, save them as the header
        awk 'BEGIN{ORS=""; OFS=""; i=1} {if ($0~/<SAMPLE_ATTRIBUTE>/) { getline; split($0,x,">"); split(x[2], y, "<"); tag[i]=y[1]; i+=1;}} END{print "#SampleID" "\t" "sample_name"; for (j=1; j<i; j++){print "\t" tag[j];}}' ${SAMPLE_ACCESSION}.txt > METADATA.txt
    fi

    #Scrape the metadata values from the XML file, save them as a new row
    awk 'BEGIN{ORS=""; OFS=""; i=1} {if ($0~/ENA-RUN/) {getline; split($0, x, ">"); split(x[2], y, "<"); run=y[1];} if ($0~/SUBMITTER_ID/) { split($0, x, ">"); split(x[2], y, "<"); samplename=y[1];}; if ($0~/<SAMPLE_ATTRIBUTE>/) { getline; getline; split($0,x,">"); split(x[2], y, "<"); value[i]=y[1]; i+=1;}} END{print "\n"; print run "\t" samplename; for (j=1; j<i; j++){print "\t" value[j];}}' ${SAMPLE_ACCESSION}.txt >> METADATA.txt
done

In the sequence_data directory, you will now have the .fastq.gz sequence files, and a METADATA.txt file that contains all relevant metadata.
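
Before fixing the dates, it is worth a quick look at the scraped metadata to confirm the download and scraping worked; a short R check (the exact column names, such as collection_timestamp, depend on what the submitters recorded for this project):

# Peek at the scraped metadata; comment.char="" prevents the leading "#SampleID"
# header line from being dropped as a comment
md <- read.table("sequence_data/METADATA.txt", sep="\t", header=TRUE, comment.char="")
dim(md)        # one row per run in the manifest
colnames(md)   # should include a date column, e.g. collection_timestamp
head(md)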

Now we need to modify the METADATA.txt file to contain a column where the dates are given as integer offsets from the earliest date. We will use the R library lubridate for this purpose.

fixDates.R:

library(lubridate)

md <- read.table("sequence_data/METADATA.txt",sep="\t",header=TRUE,comment.char="")

minDate <- min(as.Date(md$collection_timestamp, format="%m/%d/%Y"))
md$time_points <- as.numeric(as.Date(md$collection_timestamp, format="%m/%d/%Y") - minDate)

colnames(md)[1] <- "#SampleID"

write.table(md, "sequence_data/METADATA_time.txt", sep="\t", quote=FALSE, row.names=FALSE)
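
The %m/%d/%Y format string matches how collection_timestamp is recorded for this particular project; if your dates are stored differently, adjust the format (lubridate parsers such as mdy() or ymd() can also be used). A quick sanity check after running the script:

# Confirm that every sample received a valid integer offset
md <- read.table("sequence_data/METADATA_time.txt", sep="\t", header=TRUE, comment.char="")
any(is.na(md$time_points))   # should be FALSE; NAs mean the date format did not match
range(md$time_points)        # spans 0 to the number of days covered by the time-series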

Denoise with DADA2

Once again in R, we will denoise the sequences that were just downloaded.

runDADA2.R:

library(dada2)

path <- "./sequence_data"
fqs <- sort(list.files(path, pattern=".fastq.gz"))
sample.names <- sapply(strsplit(fqs, "\\."), `[`, 1)
fqs <- file.path(path, fqs)

# The quality profile step helps inform the trimming length in filterAndTrim, 
# which has been omitted here as these data were filtered before submission to EBI
#plotQualityProfile(fqs[1:2])
filt_path <- file.path(path, "filtered")
filtfqs <- file.path(filt_path, paste0(sample.names, "_filt.fastq.gz"))

# Leaving out truncLen since these data were already truncated before submission to EBI
out <- filterAndTrim(fqs, filtfqs,
              maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=TRUE)

# Error estimation step
err <- learnErrors(filtfqs, multithread=TRUE)
derepFs <- derepFastq(filtfqs, verbose=TRUE)
names(derepFs) <- sample.names

# This is the main denoise step; OMEGA_A sets the abundance p-value threshold
# for forming a new partition (the dada2 default is 1e-40)
dadaFs <- dada(derepFs, err=err, multithread=TRUE, OMEGA_A=1e-20)

# Build the table
seqtab <- makeSequenceTable(dadaFs)
# Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

# This is used to track the sequences through each stage, just as a sanity check
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "tabled", "nonchim")
rownames(track) <- sample.names
print(track)

# Export table from DADA2 to text table
write.table(t(seqtab.nochim), "dada2_export.csv")
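
The exported table has one row per denoised sequence variant (the sequence itself is the row name) and one column per sample. A quick way to confirm the orientation before importing, assuming the default write.table settings above:

# Sanity-check the export: rows should be sequences, columns should be samples
exported <- read.table("dada2_export.csv")
dim(exported)                         # (number of sequence variants) x (number of samples)
substr(rownames(exported)[1], 1, 20)  # start of the first sequence variant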

Import into Ananke

Importing the denoised sequences into Ananke now takes a single command:

ananke import_from_dada2 -i dada2_export.csv -m sequence_data/METADATA_time.txt -t time_points -f seq.unique.dada2.fasta -o mendota_ananke.h5

The mendota_ananke.h5 file can now be filtered and clustered with Ananke and visualized with Ananke-UI. The seq.unique.dada2.fasta file can be used in your preferred classification pipeline, and can also be clustered by sequence identity so that sequence-identity clusters can be contrasted with the time-series clusters.