Bioinformatics

Preamble: 15 minutes of glory

Try doing this de novo section in 15 minutes. You'll try performing a de novo assembly following some guidance, but this time it's not a step-by-step tutorial. Skip questions that you find unclear.

A the end of the 15 minutes type in the classroom chat where did you arrive. No need to finish everything!

First, install the needed tools (spades and n50, both available in the bioconda channel):

conda install -y -c bioconda spades n50

To start, create a denovo directory in your home directory.

We downloaded a new program, SPAdes. To learn more we can check the online documentation, but also try to see if the program comes with some help (it is usually the case). Try typing spades.py, and hit enter, without any parameter.

Is the help page printed in the standard output or in the standard error?
Save the help screen in a file called "spades.txt" in your freshly created denovo directory.

As you could see from the manual, to assemble paired ends the (minimal) syntax is:

spades.py -1 {first_pair_fastq} -2 {second_pair_fastq} -o {new_output_directory}

I used placeholders surrounded by curly brackets. I will specify that the output directory must be ~/assembly (i.e. a directory called assembly placed in your home folder. You don't need to create one as SPAdes will do this for you).

Locate in your ~/learn_bash/ directory a set of paired-end reads with ".fq" extension
How many reads are there in those files? Save the answer in "~/stats/reads_count.txt" (you will need to create the "stats" directory first)
Run spades with the freshly identified ".fq" files.
Check with ls the content of the output folder
Count the reads present in the FASTA files in the output folder, saving the result in "~/stats/fasta_count.txt"

If you finished, try invoking the n50 program without parameters. Follow its instruction to read the manual (it will behave like man so you know how to exit).

Use n50 to calculate the n50 statistics of your contigs, saving the output in ~/stats/n50_test.txt.

Short reads mapping with BWA

Today we will need some packages that we already installed: bwa, samtools, maybe also bedtools and covtobed. If you didn't install them you can do it now (using Miniconda):

conda install -y -c bioconda bwa samtools bedtools covtobed

The "SAM" format

Today we introduce the "SAM" format, to store sequence mappings. It's a highly structured TSV file (tab-delimited) with a header section (lines starting with '@').

Understanding a program

When testing a program it's always a very good idea to make a tiny dataset that we fully understand. That's why your homework has been to create a simple FASTA file with pieces of sequences coming from the Lambda genome. If you want you can copy and paste the sequences presented at the bottom of this page, saving them as ~/reads.fa.

The program that we will try today is a popular mapper called bwa and developed by Heng Li.

Indexing the genome

This is a one-step procedure: when you download a new FASTA file to be used as a reference for the alignment, you have to index it first.

bwa index  ~/learn_bash/phage/vir_genomic.fna

Check with ls what happened inside the "phage" directory. If you want to sort the files by creation date, what switches can you use (you can check with man).

Alignment: first test!

Now we can align sequences, we'll use the small dataset that we should have in ~/reads.fa.

bwa mem ~/learn_bash/phage/vir_genomic.fna ~/reads.fa > ~/first_alignment.sam

Time to inspect your first SAM file!

Alignment: a dataset

bwa mem  ~/learn_bash/phage/vir_genomic.fna ~/learn_bash/phage/vir_reads1.fq > ~/phage.sam

Samtools primer

samtools is the swiss-army knife for manipulating SAM files. We will see only the minimal pipeline to convert a SAM file to its binary version (BAM), sorting it by coordinate and finally indexing it.

This is a mock workflow, where I did put {placeholders} inside curly brackets. Try to adapt it for your needs:

# Convert SAM to BAM: you can provide the reference but it's not mandatory. The following commands are thus _alternatives_
samtools view -b -T {reference} {sam_file} > {bam_output}
samtools view -b -S {sam_file} > {bam_output}
 
# Sort a BAM file 
samtools sort -o {sorted_bam} {unsorted_bam} 
 
# Indexing a _sorted_ BAM file
samtools index {sorted_bam}
# see with an 'ls' that a new file has been created

Now that you have a (sorted) BAM file, you can have a look using samtools view:

samtools view {bam_file} | less -S

reads.fa

Lowercase letters are artificially added "errors". "INS" means an insertion was done (lowercase), "DEL" a deletion. "RC" means reverse complemented.

>0-Lambda:980-1040
GCCTACTTTATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGACGGATGGTGATG
>1-Lambda:100-200
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2a-Lambda:100-200_INS
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTcagataGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>2b-Lambda:100-200_DEL
CTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAA
>3-Lambda:100-200_RC
TTGACTTCCATTGTTCATTCCACGGACAAAAACAGAGAAAGGAAACGACAGAGGCCAAAAAGCCTCGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAG
>4-Lambda:500-540
ACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGTG
>5-Lambda:500-540_RC
CACGAAAGTACAGAATGCGGTTTCCACCACTTCAGCGGAGT
>6-Lambda:900-940_980-1040_Errors
TGGGCAGCGACTACATaCGTGAGGTGAATGTGGTGAAGTCTGCCTACaaaaATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGtcttcgGGTcAaG
>7-Lambda:1900-1940_1980-2040
CGCCGGTATCGACTCCCAGCTGGACCGCTACGAAATGCGCGnACCGGCAGATTATTATGGGCCGCCACGACGATGAACAGACGCTGCTGCGTGTGGATGAGGC

· Bioinformatics at the Command Line - Andrea Telatin, 2017-2020

Menu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly