Tobias Hofmann edited this page Jul 29, 2016 · 4 revisions

assemble_reads.py

Assembly software

This pipeline uses the free assembly software Trinity or Abyss to assemble contigs. Assembly is probably the most time-intensive step in the whole workflow and can easily take several hours or days, depending on the number of samples. We recommend Abyss for most regular DNA-read assemblies: it is commonly used for this purpose and also runs faster than Trinity. You can choose the assembler with the --assembler flag (default: abyss). Before running the script, check the installation path of Trinity or Abyss and provide the full path to the program executable (Trinity.pl or abyss-pe) with the --trinity or --abyss flag, depending on which assembler you choose.
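
If you want to confirm that the path you plan to pass to --abyss or --trinity really points to a runnable program before starting a long assembly, a small check like the following may help. This is only a sketch and not part of assemble_reads.py; the function name is_executable is made up for illustration.

```python
import os


def is_executable(path):
    """Return True if path is an existing file the current user may execute.

    Hypothetical helper for illustration; not part of assemble_reads.py.
    """
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

Calling is_executable("/installation/path/of/abyss-pe") with your own path should return True before you start the assembly.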

Run the script

As before, you can check the basic syntax of the script by typing:
python2.7 assemble_reads.py --h

This will return:
usage: assemble_reads.py [-h] --input INPUT --output OUTPUT [--assembler {trinity,abyss}] [--trinity TRINITY] [--abyss ABYSS] [--kmer KMER] [--contig_length CONTIG_LENGTH] [--single_reads] [--cores CORES]

The script needs the cleaned reads from the previous step as input. You can move the folder containing the cleaned reads for all samples wherever you want, but do not change anything in the file structure within that folder, not even the names of the files or subfolders. The input folder needs to contain a separate subfolder for each sample, named sampleID_clean, which contains at least the cleaned forward read file (sampleID_*_READ1.fastq) and the cleaned reverse read file (sampleID_*_READ2.fastq). If you ran the clean_reads.py script described in the previous chapter on your data, you do not have to worry about the file and folder structure at all: just give the path to the cleaned-reads folder as input for the assembly.
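
If you have moved or assembled the input folder by hand, you can check the layout described above before starting the assembly. The following is a minimal sketch, assuming exactly the naming scheme given in this section (sampleID_clean subfolders containing sampleID_*_READ1.fastq and sampleID_*_READ2.fastq). The function find_sample_reads is hypothetical and not part of the pipeline; the script itself may check its input differently.

```python
import glob
import os


def find_sample_reads(input_dir):
    """Return {sample_id: (read1_path, read2_path)} for every *_clean subfolder.

    Hypothetical helper for illustration. Raises ValueError if a sample
    folder lacks the forward (READ1) or reverse (READ2) read file.
    """
    samples = {}
    for name in sorted(os.listdir(input_dir)):
        folder = os.path.join(input_dir, name)
        if not (os.path.isdir(folder) and name.endswith("_clean")):
            continue  # skip anything that is not a sample folder
        sample_id = name[:-len("_clean")]
        pair = []
        for read in ("READ1", "READ2"):
            pattern = os.path.join(folder, "%s_*_%s.fastq" % (sample_id, read))
            hits = sorted(glob.glob(pattern))
            if not hits:
                raise ValueError("%s: no file matching %s" % (name, pattern))
            pair.append(hits[0])  # take the first match if there are several
        samples[sample_id] = tuple(pair)
    return samples
```

Running find_sample_reads("path/to/cleaned-reads-folder") either returns one entry per sample or stops at the first sample folder with a missing read file.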

You can choose a kmer value, also referred to as word length (--kmer), as well as a minimum contig length (--contig_length). The default for --contig_length is 200, but you will have to decide what the best minimum for your data is, depending on the length of the shortest loci that you tried to capture. In some cases you might want to choose an even lower threshold, e.g. 100, if you are trying to capture very short exons. You can parallelize the computation over multiple cores with the --cores flag.
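
To get a feeling for what a minimum contig length does to your results, you can apply the same kind of threshold to an existing FASTA file of contigs. The sketch below is an illustration only: it assumes that contigs of exactly the minimum length are kept, and it does not show how assemble_reads.py or the assemblers apply the threshold internally. The function filter_contigs is hypothetical.

```python
def filter_contigs(fasta_lines, min_length=200):
    """Yield (header, sequence) for contigs with at least min_length bases.

    Hypothetical helper for illustration; fasta_lines is any iterable of
    lines in FASTA format (e.g. an open file).
    """
    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one if it is long enough.
            if header is not None:
                seq = "".join(chunks)
                if len(seq) >= min_length:
                    yield header, seq
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    # Emit the last record of the file.
    if header is not None:
        seq = "".join(chunks)
        if len(seq) >= min_length:
            yield header, seq
```

Comparing the number of contigs kept with min_length=200 and min_length=100 shows how many short contigs the lower threshold would retain.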

Example:

python2.7 assemble_reads.py --input path/to/cleaned-reads-folder --output path/to/contig-folder --assembler abyss --abyss /installation/path/of/abyss-pe --contig_length 100 --cores 12