Assembly

De novo assembly of short reads

Overview

Most sequencing machines are unable to produce long reads, and even when they do they don't produce a single read per chromosome. Moreover, the sequencers currently able to produce long reads (Oxford Nanopore and PacBio) usually have a higher error-rate (Nanopore), a higher cost-per-base (PacBio) and a lower throughput if compared with the more commonly available Illumina sequencers.

The sequencing of small genomes usually involves the whole genome shotgun approach, that will produce a stack of short reads from the genome.

Programs able to perform a de novo genome assembly are then used to create a consensus, usually becoming a set of contiguous sequences (known as contigs).

Gzip compression

A popular compression tool used in Linux servers is called gzip. Unlike other compression systems, it's only able to compress a single file at the time, but this has some advantages, as we'll see.

Go to the phage directory we found yesterday. If you don't remember where it is:

find ~ -name "phage" -type d

Now that you found it:

cd ~/learn_bash/phage/

let's check the size of some files:

ls -lh *f

now let's compress the very same files (they will get a .gz suffix):

gzip *f
ls -lh *f.gz

To decompress them:

gunzip *.gz

Our input files

To start, create a directory called denovo, placed inside your home directory:

mkdir ~/denovo

Now we want to locate input reads. Usual extensions from reads from a machine are .fastq or .fq, but being usually large files, they are compressed with gzip:

find ~/learn_bash -name "*.fastq.gz" | sort

You should find at least 6 files (3 pairs):

/home/ubuntu/learn_bash/phage/reads/sample1_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample1_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R2.fastq.gz

These are paired reads, meaning that R1 and R2 files are coming from the same library and contain the forward and reverse read, respectively.

Now try to:

Decompress them
Have a look at their structure (using less -S)
Count the lines
Re-compress them

Seqkit to count the number of reads

We can use one of the many programs to calculate statistics about FASTQ files:

seqkit stats ~/learn_bash/phage/reads/*fastq.gz

seqfu count ~/learn_bash/phage/reads/*fastq.gz

The "SPAdes" assembly program

In our server there is a version of SPAdes pre-installed. To learn more we can check the online documentation, but also try to see if the program comes with some help (it is usually the case). Try typing spades.py, and hit enter, without any parameter.

Is the help page printed in the standard output or in the standard error?
Save the help screen in a file called "spades.txt" in your freshly created denovo directory.

As you could see from the manual, to assemble paired ends the (minimal) syntax is:

spades.py -1 {first_pair_fastq} -2 {second_pair_fastq} -o {new_output_directory}

For example:

spades.py -t 4 -1 ~/learn_bash/phage/reads/sample2_R1.fastq.gz \
  -2 ~/learn_bash/phage/reads/sample2_R2.fastq.gz -o ~/assembly

What file have been created in the output directory? Ar there FASTA files?
How many sequences are there in contigs.fasta?
How long are they [the size is printed in the sequence header]?

All the samples have been preassembled in the /data directory, so we can run a program that will check the statistics for all the assemblies:

seqkit stats /data/assembly-*/contigs.fasta

· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020

Menu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly