In applying these methods, we will follow the reproducible research \cite{ReproducibleResearch2017} and best practice \cite{BestPractices2017} guidelines for computational biology developed through the Better Analysis Faster programme. In particular, we will implement the 4 Pillars that have been shown to be key to developing a "Best Practice".
Our approach comprises a number of components which together will form the analysis toolkit. There are several options for each component, so those options must be identified and evaluated for fitness for purpose. Initially, our approach will leverage existing software tools, data sets, and expert knowledge of biological systems, allowing us to make rapid progress. As each component is sufficiently evaluated, it will be incorporated into the integrated testing platform. To maximise the ability of external collaborators to contribute to the framework, we will develop collaboratively and transparently, using software development best practices and sharing code through our existing GitHub accounts. The framework will be deployable on standard cloud systems with few additional dependencies. Through this work, PFR staff will be upskilled in the use of common cloud computing tools, expanding PFR's computational capability.
A set of variant calling pipelines suitable for this work will be identified and characterised in terms of the relationships of their component parts and their dependencies on other software. This work has been started by the Biospectra-by-Sequencing project and is summarised in the table below. We will work directly with the developers of at least three of these pipelines (FastGBS, ipyRAD, and GBS-SNP-CROP), allowing us to leverage their detailed knowledge of those pipelines. While the table has been constructed for pipelines suited to GBS data, many of their components (e.g. aligners and variant callers) are also used in standard resequencing projects for variant discovery and are applicable to the whole of this project.
Deploying modern, cloud-based software tools will allow us to develop, test, and run analysis frameworks in a reproducible, automatable, and self-documenting manner. Tools such as Docker containers will provide the software environment and the unit of reproducibility. Control systems (e.g. Kubernetes) will manage multiple instances of Docker containers, allowing rapid execution of large numbers of tests. Automation servers such as Jenkins will assist both in evaluating the frameworks as they are developed and in running those frameworks on our testing data sets as well as on arbitrary data sets. PFR staff will work directly with NZ-based cloud computing experts to take advantage of these technologies rapidly and effectively.
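As a concrete illustration of this pattern, a single pipeline step can be pinned to a specific container image and invoked reproducibly from a driver script. The sketch below is a minimal example only; the image name, mount paths, and aligner invocation are hypothetical placeholders, not part of any existing framework:

```python
import subprocess

# Hypothetical version-pinned image providing the aligner; pinning the tag
# (or digest) is what makes the software environment reproducible.
IMAGE = "example.org/variant-pipelines/bwa:0.7.17"

def run_alignment(ref_fasta, reads_fastq, out_sam):
    """Run one containerised pipeline step via `docker run`, capturing SAM output."""
    cmd = [
        "docker", "run", "--rm",
        "-v", "/data:/data",           # share the host data directory with the container
        IMAGE,
        "bwa", "mem", ref_fasta, reads_fastq,
    ]
    with open(out_sam, "w") as sam:
        subprocess.run(cmd, stdout=sam, check=True)  # fail loudly if the step fails

if __name__ == "__main__":
    run_alignment("/data/ref.fa", "/data/sample.fq.gz", "/data/sample.sam")
```

The same image reference can appear unchanged in a Kubernetes Job manifest or a Jenkins job definition, so the unit of reproducibility is identical whether a test runs on a laptop or across many cloud instances.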
Selecting the appropriate framework type for the evaluation framework will be critical. There are over a dozen possible framework types from which to choose \cite{Leipzig_2016}. The selection depends substantially on the nature of the variant calling pipelines and the computational environments in which they will be run. Once the pipelines and computational environments have been characterised as described above, one or more of the available framework types will be selected and a small pilot suitability study will be conducted.
Pipeline Name | DOI | Year Published | De novo / Reference | Demultiplexer | Trimmer / Read QC Filter | Aligner | SNP Caller | Requires special FASTQ filename format | Hard-coded enzyme | Download | Output file types |
---|---|---|---|---|---|---|---|---|---|---|---|
TASSEL UNEAK | 10.1371/journal.pgen.1003215 | 2013 | De novo | Built in | None | N/A | Built in | Yes | Yes | http://www.maizegenetics.net/tassel | HapMap (hmp) |
STACKS | 10.1111/mec.12354 | 2013 | De novo / Reference | Built in | Built in | BWA / Bowtie2 / GSNAP | Built in | No | Yes | http://catchenlab.life.illinois.edu/stacks/ | Database with export |
IGST-GBS | 10.1371/journal.pone.0054603 | 2013 | Reference | FASTX-Toolkit (https://github.com/agordon/fastx_toolkit) | FASTX-Toolkit | BWA | SAMtools (https://github.com/samtools/samtools) | No | Unknown | Not publicly available | VCF |
TASSEL 3 | 10.1371/journal.pone.0090346 | 2014 | Reference | Built in | None | BWA / Bowtie2 | Built in | Yes | Yes | http://www.maizegenetics.net/tassel | HapMap (hmp) |
pyRAD | 10.1093/bioinformatics/btu121 | 2014 | De novo | Built in | Built in (recommended only for very messy single-end GBS or ddRAD) | MUSCLE | Built in | No | No; enzyme cut-site overhang specified in parameter file | https://github.com/dereneaton/pyrad/releases | Text (.loci, .phy, .nex, .snps, .vcf, and others) |
AftrRAD | 10.1111/1755-0998.12378 | 2015 | De novo | Built in | Built in (removes bad reads only) | MAFFT | Built in | No | No; needs recognition sequence | https://github.com/mikesovic/AftrRAD | Text (translation scripts) |
GBS-SNP-CROP | 10.1186/s12859-016-0879-y | 2016 | Both | Built in | Trimmomatic (trims before demultiplexing) | BWA | SAMtools mpileup | Yes; expects specific file names and locations at each step | No; needs recognition sequence | https://github.com/halelab/GBS-SNP-CROP | SNP matrix, TASSEL hmp, PLINK tped |
GibPSs | 10.1111/1755-0998.12510 | 2016 | De novo | Built in | Built in (removes bad reads only) | Built-in matching | Built in | No | No; needs recognition sequence | https://github.com/ahapke/gibpss | Database with export |
NGSEP | 10.1186/s12864-016-2827-7 | 2016 | Reference | Built in | Built in (removes adapters) | Bowtie2 | Built in | No | Unknown | https://sourceforge.net/projects/ngsep/ | VCF |
FastGBS | 10.1186/s12859-016-1431-9 | 2017 | Reference | Sabre (https://github.com/najoshi/sabre) | Cutadapt (removes adapters; https://github.com/marcelm/cutadapt/) | BWA | Platypus (http://www.well.ox.ac.uk/platypus) | Yes | Enzyme cut-site overhang is not checked | https://bitbucket.org/jerlar73/fast-gbs | VCF |
ipyRAD | NA | NA | De novo / Reference / De novo + Reference | Built in | Built in | BWA / SMALT | SAMtools/BEDtools | No | No; enzyme cut-site overhang specified in parameter file | https://github.com/dereneaton/ipyrad | VCF; ipyrad .loci; phylogenetic and structure formats (.phy, .nex, .str); SMARTPCA etc. (.geno); MAP (.snps.map) |
TASSEL 5 | NA | NA | Reference | Built in | Built in | BWA / Bowtie2 | Built in | Yes | Yes | http://www.maizegenetics.net/tassel | Database with export |
In Silico Experiments
Multiple genomic read aligners, assemblers, and variant callers will be assessed on their performance across many sets of simulated read data. The read data will be constructed by permutation from a "known-good" source, such as one of the human chromosomes. The controls will be sets of "perfect reads" (positive) generated as an idealisation of the sequencing process, and sets of random reads (negative). Between these two extremes, many levels of read permutation will be examined, allowing variability in multiple factors including permutation type, insert size, base quality (error rates), mapping quality, and allele frequency. The in silico permutation process mimics the introduction of sequencing errors and sequencing biases, incorporates a range of allele frequencies, and captures other known sources of problems for alignment algorithms. Because we can construct our data sets with fine precision and then critically analyse the results, we will be able to ascertain the downstream effects of each type of problem. Each tool will be run on the sample data sets, and the results evaluated using the appropriate standard criteria, including the ability to handle polyploid data correctly and efficiently.
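As a minimal sketch of the permutation process (all names and parameters here are hypothetical; a production simulator would also model quality strings, indels, and paired-end inserts), substitution errors can be injected at a controlled rate into otherwise perfect reads drawn from a known reference:

```python
import random

BASES = "ACGT"

def perfect_reads(reference, read_len, n_reads, rng):
    """Draw idealised reads uniformly from a reference sequence (positive control)."""
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        yield start, reference[start:start + read_len]

def permute_read(read, error_rate, rng):
    """Introduce independent substitution errors at a fixed per-base rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))  # guaranteed mismatch
        else:
            out.append(base)
    return "".join(out)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed, so the simulated data set is reproducible
    reference = "".join(rng.choice(BASES) for _ in range(10_000))
    for start, read in perfect_reads(reference, read_len=100, n_reads=5, rng=rng):
        print(start, permute_read(read, error_rate=0.01, rng=rng))
```

Because the true origin (`start`) of every permuted read is retained, the alignments reported by each tool can be scored exactly against the known truth.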
Biological Experimental Datasets
The same performance assessments will be completed using biological, experimentally collected data sets. In these analyses, we will examine "real" data that will contain some, but not necessarily all, of the problems manifested in the simulated data. It is critical, however, that we also establish a close coupling between our constructed data and biologically sourced data for purposes of validation.
Here again, we will leverage existing tools to provide a rapid start and then modify or extend them as needed. The software tool used to compare variant calling results in the 2016 paper by Torkamaneh et al. has been made available to us for this project. The tools used at Cornell's Genomic Diversity Facility for reporting and visualising the results of running the TASSEL and UNEAK GBS pipelines have been provided as well.
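As a minimal sketch of the kind of comparison such tools perform (the file names are hypothetical, and only plain-text, exact site matching is shown; real comparisons must also handle genotypes, multi-allelic sites, and compressed files), a called set can be scored against a truth set for precision and recall:

```python
def load_sites(vcf_path):
    """Collect (chrom, pos, ref, alt) site keys from an uncompressed VCF."""
    sites = set()
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            sites.add((chrom, int(pos), ref, alt))
    return sites

def precision_recall(called, truth):
    """Score a call set against a truth set by exact site matching."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

if __name__ == "__main__":
    called = load_sites("pipeline_calls.vcf")  # hypothetical pipeline output
    truth = load_sites("known_variants.vcf")   # hypothetical simulated truth set
    precision, recall = precision_recall(called, truth)
    print(f"precision={precision:.3f} recall={recall:.3f}")
```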
In addition to the software tools mentioned above, the combination of sequence data sets, metadata about those data sets, and our collaborators' knowledge of the genetics of the organisms and populations from which the sequence data were derived will contribute strongly to the assessments conducted.
The following scoping suggestions remain open and will be elaborated as these methods develop:

- IBD and Mendelian sanity checks: could these be built into the simulated reads? (A minimal sketch of a Mendelian consistency check follows this list.)
- Pooling and population genetics.
- Genome-free analysis: what, then, is the reference?
- GBS and resequencing: both could be simulated readily.
- Polyploidy: how will allele frequencies be controlled in the simulated data?
- Making sense of reports: which methods apply here?
- Visualisations: these should address graphs of key metrics.
- Confident curation: scope yet to be defined.
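As a minimal sketch of a Mendelian sanity check (assuming diploid, biallelic genotypes coded as alternate-allele counts of 0, 1, or 2; the trio data below are hypothetical), a child's genotype can be tested for consistency with its parents:

```python
from itertools import product

def mendelian_consistent(child, mother, father):
    """Check a child's alt-allele count against every gamete combination
    the parents could transmit (genotypes coded 0, 1, or 2)."""
    mother_gametes = {0, 1} if mother == 1 else {mother // 2}
    father_gametes = {0, 1} if father == 1 else {father // 2}
    return any(m + f == child for m, f in product(mother_gametes, father_gametes))

if __name__ == "__main__":
    # A het x het cross can yield any child genotype ...
    assert mendelian_consistent(child=2, mother=1, father=1)
    # ... but two homozygous-reference parents cannot transmit an alt allele.
    assert not mendelian_consistent(child=1, mother=0, father=0)
    print("Mendelian sanity checks passed")
```

Applied across simulated trios or pedigrees, the rate of Mendelian-inconsistent calls would give a simple genotype-level error metric that complements the site-level precision and recall described above.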