Importing Denoised Sequences from DADA2
You may prefer to denoise your sequences before aggregating their temporal profiles with Ananke. The R library DADA2 is a good option for denoising Illumina sequences, and the result can then be imported with the ananke import_from_dada2 command. This tutorial demonstrates downloading sequences from EBI, pre-processing them with DADA2, and importing the denoised sequences into Ananke.
Note: This tutorial is for pre-assembled sequences. For paired-end data, follow the DADA2 tutorial at https://benjjneb.github.io/dada2/tutorial.html and export seqtab.nochim (transposed, as in the final export step below) with write.table(t(seqtab.nochim), "dada2_export.csv").
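For reference, here is a minimal sketch of how a paired-end run might finish, following the linked DADA2 tutorial; the object names (dadaFs, derepFs, dadaRs, derepRs) are taken from that tutorial, and the export matches the format used at the end of this walkthrough:
# Sketch only: assumes filtered reads, learned error models, and denoised
# forward/reverse reads (dadaFs, dadaRs) as in the DADA2 paired-end tutorial
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
# Write sequences as rows and samples as columns, matching the export step below
write.table(t(seqtab.nochim), "dada2_export.csv")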
You will need the following (an install sketch for the two R libraries follows this list):
- curl
- wget
- awk
- dada2 version 1.4 (R library)
- lubridate (R library)
- FASTQ amplicon data
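Both R libraries are available from Bioconductor and CRAN. A minimal install sketch, assuming a recent R installation (setups from the dada2 1.4 era used Bioconductor's biocLite() rather than BiocManager):
# Install dada2 from Bioconductor and lubridate from CRAN
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")
install.packages("lubridate")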
In this case, our input data are pre-assembled Illumina FASTQ files fetched from EBI. Here is a BASH script that will download the metadata and the sequence files.
fetchData.sh:
#!/bin/bash
#Set the project accession (example given here is Lake Mendota 11-year time-series)
ACCESSION=PRJEB14911
#Put everything in a folder
mkdir -p sequence_data
cd sequence_data
#Fetch the project file manifest
curl -sLo MANIFEST.txt "http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22study_accession%3D%22${ACCESSION}%22%22&result=read_run&fields=fastq_ftp,sample_alias,sample_accession&display=report"
#Fetch each of the noted FASTQ files
awk 'BEGIN{FS="\t";}{if (NR>1) {split($2, f, ";"); system("wget " f[1]); system("wget " f[2]);}}' MANIFEST.txt
#Fetch the metadata for each sample
for SAMPLE_ACCESSION in `tail -n +2 MANIFEST.txt | cut -f 4`
do
#Get the XML report from EBI
curl -sLo ${SAMPLE_ACCESSION}.txt "https://www.ebi.ac.uk/ena/data/view/${SAMPLE_ACCESSION}&display=xml"
#If there is no metadata file, write the first line
if [ ! -f "METADATA.txt" ]
then
#Scrape the metadata categories from the XML file, save them as the header
awk 'BEGIN{ORS=""; OFS=""; i=1} {if ($0~/<SAMPLE_ATTRIBUTE>/) { getline; split($0,x,">"); split(x[2], y, "<"); tag[i]=y[1]; i+=1;}} END{print "#SampleID" "\t" "sample_name"; for (j=1; j<i; j++){print "\t" tag[j];}}' ${SAMPLE_ACCESSION}.txt > METADATA.txt
fi
#Scrape the metadata values from the XML file, save them as a new row
awk 'BEGIN{ORS=""; OFS=""; i=1} {if ($0~/ENA-RUN/) {getline; split($0, x, ">"); split(x[2], y, "<"); run=y[1];} if ($0~/SUBMITTER_ID/) { split($0, x, ">"); split(x[2], y, "<"); samplename=y[1];}; if ($0~/<SAMPLE_ATTRIBUTE>/) { getline; getline; split($0,x,">"); split(x[2], y, "<"); value[i]=y[1]; i+=1;}} END{print "\n"; print run "\t" samplename; for (j=1; j<i; j++){print "\t" value[j];}}' ${SAMPLE_ACCESSION}.txt >> METADATA.txt
done
In the sequence_data directory, you will now have the .fastq.gz sequence files and a METADATA.txt file that contains all of the relevant metadata.
Now we need to add a time_points column that gives each collection date as an integer offset (in days) from the earliest date; the result is written to a new file, METADATA_time.txt. We will use the R library lubridate for this purpose.
fixDates.R:
library(lubridate)
# Read the metadata; comment.char="" keeps the "#SampleID" header line from being skipped as a comment
md <- read.table("sequence_data/METADATA.txt", sep="\t", header=TRUE, comment.char="")
# Find the earliest collection date
minDate <- min(as.Date(md$collection_timestamp, format="%m/%d/%Y"))
# Express each collection date as an integer number of days since the earliest date
md$time_points <- as.numeric(as.Date(md$collection_timestamp, format="%m/%d/%Y") - minDate)
# Restore the first column name, which read.table mangles to "X.SampleID"
colnames(md)[1] <- "#SampleID"
write.table(md, "sequence_data/METADATA_time.txt", sep="\t", quote=FALSE, row.names=FALSE)
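As a quick sanity check (a sketch assuming the Mendota metadata column names used above), you can confirm that the earliest sample maps to an offset of zero:
# Read the new metadata file back in and inspect the time_points column
md_check <- read.table("sequence_data/METADATA_time.txt", sep="\t", header=TRUE, comment.char="")
range(md_check$time_points)  # the minimum should be 0 (the earliest collection date)
head(md_check[, c("collection_timestamp", "time_points")])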
Once again in R, we will denoise the sequences that were just downloaded.
runDADA2.R:
library(dada2)
path <- "./sequence_data"
fqs <- sort(list.files(path, pattern="\\.fastq\\.gz$"))
sample.names <- sapply(strsplit(fqs, "\\."), `[`, 1)
fqs <- file.path(path, fqs)
# The quality profile plot helps choose a truncation length for filterAndTrim;
# truncation is omitted here because these reads were already quality-filtered before submission to EBI
#plotQualityProfile(fqs[1:2])
filt_path <- file.path(path, "filtered")
filtfqs <- file.path(filt_path, paste0(sample.names, "_filt.fastq.gz"))
# Leaving out truncLen since these reads were already truncated before submission to EBI
out <- filterAndTrim(fqs, filtfqs,
maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE)
# Error estimation step
err <- learnErrors(filtfqs, multithread=TRUE)
derepFs <- derepFastq(filtfqs, verbose=TRUE)
names(derepFs) <- sample.names
# This is the main denoise step
dadaFs <- dada(derepFs, err=err, multithread=TRUE, OMEGA_A=1e-20)
# Build the table
seqtab <- makeSequenceTable(dadaFs)
# Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
# This is used to track the sequences through each stage, just as a sanity check
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "tabled", "nonchim")
rownames(track) <- sample.names
print(track)
# Export the sequence table as text, transposed so sequences are rows and samples are columns
write.table(t(seqtab.nochim), "dada2_export.csv")
Importing the denoised sequences into Ananke is now a single command:
ananke import_from_dada2 -i dada2_export.csv -m sequence_data/METADATA_time.txt -t time_points -f seq.unique.dada2.fasta -o mendota_ananke.h5
The mendota_ananke.h5 file can now be filtered and clustered with Ananke and visualized with Ananke-UI. The seq.unique.dada2.fasta file can be used in your preferred classification pipeline, and it can also be clustered with your usual sequence-identity approach, allowing contrasts between sequence-identity clusters and time-series clusters.