-
Notifications
You must be signed in to change notification settings - Fork 4
Assembly
Most sequencing machines are unable to produce long reads, and even when they do they don't produce a single read per chromosome. Moreover, the sequencers currently able to produce long reads (Oxford Nanopore and PacBio) usually have a higher error-rate (Nanopore), a higher cost-per-base (PacBio) and a lower throughput if compared with the more commonly available Illumina sequencers.
The sequencing of small genomes usually involves the whole genome shotgun approach, that will produce a stack of short reads from the genome.
Programs able to perform a de novo genome assembly are then used to create a consensus, usually becoming a set of contiguous sequences (known as contigs).
A popular compression tool used in Linux servers is called gzip
. Unlike other compression systems, it's only able to compress a single file at the time, but this has some advantages, as we'll see.
Go to the phage
directory we found yesterday. If you don't remember where it is:
find ~ -name "phage" -type d
Now that you found it:
cd ~/learn_bash/phage/
let's check the size of some files:
ls -lh *f
now let's compress the very same files (they will get a .gz
suffix):
gzip *f
ls -lh *f.gz
To decompress them:
gunzip *.gz
To start, create a directory called denovo, placed inside your home directory:
mkdir ~/denovo
Now we want to locate input reads. Usual extensions from reads from a machine are .fastq or .fq, but being usually large files, they are compressed with gzip
:
find ~/learn_bash -name "*.fastq.gz" | sort
You should find at least 6 files (3 pairs):
/home/ubuntu/learn_bash/phage/reads/sample1_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample1_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample2_R2.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R1.fastq.gz
/home/ubuntu/learn_bash/phage/reads/sample3_R2.fastq.gz
These are paired reads, meaning that R1 and R2 files are coming from the same library and contain the forward and reverse read, respectively.
Now try to:
- Decompress them
- Have a look at their structure (using
less -S
) - Count the lines
- Re-compress them
We can use one of the many programs to calculate statistics about FASTQ files:
seqkit stats ~/learn_bash/phage/reads/*fastq.gz
seqfu count ~/learn_bash/phage/reads/*fastq.gz
In our server there is a version of SPAdes pre-installed. To learn more we can check the online documentation, but also try to see if the program comes with some help (it is usually the case). Try typing spades.py
, and hit enter, without any parameter.
- Is the help page printed in the standard output or in the standard error?
- Save the help screen in a file called "spades.txt" in your freshly created denovo directory.
As you could see from the manual, to assemble paired ends the (minimal) syntax is:
spades.py -1 {first_pair_fastq} -2 {second_pair_fastq} -o {new_output_directory}
For example:
spades.py -t 4 -1 ~/learn_bash/phage/reads/sample2_R1.fastq.gz \
-2 ~/learn_bash/phage/reads/sample2_R2.fastq.gz -o ~/assembly
- What file have been created in the output directory? Ar there FASTA files?
- How many sequences are there in contigs.fasta?
- How long are they [the size is printed in the sequence header]?
All the samples have been preassembled in the /data
directory, so we can run a program that will check the statistics for all the assemblies:
seqkit stats /data/assembly-*/contigs.fasta
· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020
Menu