Michael G. Campana and Ellie E. Armstrong, 2019-2024
Smithsonian Institution
Stanford University
Pipeline to calculate de novo mutation rates from parent-offspring trios
This README provides basic details for installing, configuring and running the pipeline. Please note that as of version 1.0.0, RatesTools has upgraded to Nextflow DSL2. For the original DSL1 pipeline, please see versions <=0.5.16. Detailed documentation is available for the Ruby and R scripts included in this package and for the pipeline's operation. Test data are provided in the Smithsonian Institution Figshare repository and a tutorial is available here.
- Creative Commons 0 Waiver
- Citation
- Conda-Assisted Installation
- Manual Pipeline Installation
- Configure the Pipeline
- Running the Pipeline
- References
To the extent possible under law, the Smithsonian Institution and Stanford University have waived all copyright and related or neighboring rights to RatesTools; this work is published from the United States.
We politely request that this work be cited as:
Armstrong, E.E. & M.G. Campana. 2023. RatesTools: a Nextflow pipeline for detecting de novo germline mutations in pedigree sequence data. Bioinformatics. 39: btac784. DOI: 10.1093/bioinformatics/btac784.
Preprint available on bioRxiv. DOI: 10.1101/2022.07.18.500472.
We provide a configuration profile "conda" in the default configuration file (nextflow.config) that installs all dependencies using Conda. As of RatesTools 1.0.0, we recommend (and default to) the use of Mamba for environment construction. Using this profile, the user only needs to install Nextflow [1], Conda/Mamba and the RatesTools pipeline:
Install Nextflow: curl -s https://get.nextflow.io | bash
Install Conda (and/or Mamba): See installation instructions here and here
Pull the current version of the RatesTools pipeline: nextflow pull campanam/RatesTools -r main
We explicitly list software dependencies here as no installation system (e.g. via Conda or containerization) is universally supported across all computing architectures.
RatesTools requires Nextflow [1] v. >= 23.10.0, Ruby v. >= 3.2.2, R [2] v. 4.0.2 and Bash v. >= 4.2.46(2)-release. Basic instructions for installing these languages are copied below. We recommend installing Ruby using the Ruby Version Manager. See the official language documentation should you need help installing these languages.
Install Nextflow: curl -s https://get.nextflow.io | bash
Install the latest Ruby using Ruby Version Manager: curl -sSL https://get.rvm.io | bash -s stable --ruby
Install R: Use the appropriate precompiled binary/installer available at the Comprehensive R Archive Network (CRAN).
Pull the current version of the pipeline: nextflow pull campanam/RatesTools -r main
To specify another RatesTools release, replace main with the RatesTools release version (e.g. v0.5.7).
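The minimum language versions listed above can be checked before pulling the pipeline. The following is a minimal sketch rather than part of RatesTools: the helper names version_ge and check_min are hypothetical, and the comparison assumes a sort that supports version sorting (-V), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Succeed if version $1 is at least version $2.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a detected version meets the stated minimum.
check_min() {
    local name="$1" found="$2" required="$3"
    if version_ge "$found" "$required"; then
        echo "OK: $name $found (>= $required)"
    else
        echo "TOO OLD: $name $found (< $required)"
        return 1
    fi
}

# Bash reports its own version; the expansion drops the "(2)-release" suffix.
check_min "Bash" "${BASH_VERSION%%\(*}" "4.2.46" || echo "Please upgrade Bash."

# Ruby is queried only if it is installed.
if command -v ruby >/dev/null 2>&1; then
    check_min "Ruby" "$(ruby -e 'print RUBY_VERSION')" "3.2.2" || echo "Please upgrade Ruby."
fi
```

Nextflow and R can be checked the same way once their version strings have been extracted from the output of nextflow -version and R --version.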
RatesTools requires the following external dependencies. See the documentation for these programs for their installation requirements. RatesTools requires the Genome Analysis Toolkit (GATK) [3] v. 3.8-1 or v. >= 4.4.0.0 and Java v. 1.8 (GATK3) or v. 17 (GATK4). Currently, RatesTools is not compatible with other versions of Java. Otherwise, listed versions are those that have been tested and confirmed, but other versions may work. RatesTools can utilize Environment Modules modulefiles to simplify deployment on computing clusters and limit dependency conflicts (see the tutorial).
- gzip
- awk
- sed
- zcat
- BWA [4] v. 0.7.17
- SAMtools [5,6] v. 1.18
- BCFtools [5,6] v. 1.18
- bgzip and tabix from HTSlib [6] v. 1.18
- Java v. 1.8 (GATK3) or v. 17 (GATK4)
- Picard [7] v. 2.23.8 (GATK3) or v. 3.1.07 (GATK4)
- Sambamba [8] v. 0.8.2
- Genome Analysis Toolkit (GATK) v. 3.8-1 or v. 4.4.0.0
- VCFtools [9] v. 0.1.16
- GenMap [10] v. 1.2.0 with SeqAn [11] v. 2.4.1
- RepeatMasker [12] v. 4.1.5
- RepeatModeler [13] v. 2.0.5
- BEDTools [14] v. 2.31.0
- WhatsHap [15,16] v. 2.1
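Before configuring the pipeline, it can help to see which of the command-line dependencies above are already on the PATH. The sketch below is illustrative only: missing_tools is a hypothetical helper, and the executable names are assumptions based on how these programs are commonly installed. Picard and GATK are omitted because they are often deployed as JAR files or wrapper scripts whose names vary between systems.

```shell
#!/usr/bin/env bash
# Print each listed command that is not found on the PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names are assumptions; adjust them to match the local installation.
missing_tools gzip awk sed zcat bwa samtools bcftools bgzip tabix java \
    sambamba vcftools genmap RepeatMasker RepeatModeler bedtools whatshap \
    || echo "Install the tools listed above or load their modulefiles."
```

This only confirms that a command exists; it does not check that the installed version matches the tested versions in the list.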
RatesTools requires the following R packages installed in your R environment:
- tidyverse [17] v. 1.3.1 with dplyr [18] v. 1.0.7 and ggplot2 [19] v. 3.3.5
- data.table [20] v. 1.14.2
- Hmisc [21] v. 5.1-1
To assist installation and execution of the Java dependencies, we provide built-in options to install GATK and Picard through Conda. See the tutorial for details.
Assisted configuration of the RatesTools pipeline can be accomplished using the configure.sh bash script included with this repository. The script copies the nextflow.config included with this repository and modifies the copy for the target system. The configure.sh script detects software installed on the local system and prompts the user to provide modulefiles, paths to undetected files, and program options. The configuration file can also be manually edited using a text editor. However, please note that the configure.sh script requires an unmodified nextflow.config file to work.
NB: The most straightforward method to obtain the configure.sh and nextflow.config files is to clone this repository and move the files to a desired location:
Clone the repository: git clone https://github.com/campanam/RatesTools
Move the files: mv RatesTools/*config* /some/path/
Change to the specified directory: cd /some/path
Execute the script: bash configure.sh
To specify sample and library information to RatesTools, provide a CSV with the following header and information:
Sample,Library,Read1,Read2
<samp1>,<lib1>,<lib1.R1.fq.gz>,<lib1.R2.fq.gz>
<samp2>,<lib2>,<lib2.R1.fq.gz>,<lib2.R2.fq.gz>
<samp2>,<lib3>,<lib3.R1.fq.gz>,<lib3.R2.fq.gz>
...
Sample designates the unique sample name. Library is the unique library name (multiple libraries can correspond to the same sample). Read1 and Read2 are the forward and reverse read files (FASTQ format), respectively.
RatesTools assumes bidirectional sequencing for each library, but allows for multiple sequenced libraries per individual. RatesTools will merge the libraries by sample name assuming the libraries are independent. If an individual library has been sequenced multiple times, concatenate the reads from the library and treat as a single bidirectionally sequenced file.
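The sample sheet rules above can be checked before launching a run. The following is a minimal sketch, not a RatesTools utility: validate_samplesheet is a hypothetical helper that checks for the exact header, four comma-separated fields per row, and library names that are not repeated. All file names in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Validate a sample sheet: exact header, four fields per row,
# and no library name listed more than once.
validate_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "Sample,Library,Read1,Read2") { print "bad header: " $0; bad = 1 }
            next
        }
        NF != 4 { print "line " NR ": expected 4 fields, found " NF; bad = 1; next }
        seen[$2]++ { print "line " NR ": duplicate library " $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Hypothetical usage: library libA was sequenced twice, so its runs are
# concatenated first (gzip files can be joined directly with cat) and the
# library is then listed once in the sample sheet. Keep the same run order
# for the forward and reverse files so that read pairs stay in sync.
# cat libA_run1.R1.fq.gz libA_run2.R1.fq.gz > libA.R1.fq.gz
# cat libA_run1.R2.fq.gz libA_run2.R2.fq.gz > libA.R2.fq.gz
# validate_samplesheet samples.csv && echo "Sample sheet looks consistent."
```

The check does not confirm that the read files exist or are valid FASTQ; it only catches formatting problems in the CSV itself.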
Given the wide variety of computing architectures and operating systems, we cannot provide specific optimized configurations for your computing system. The nextflow.config file includes an example of a 'standard' configuration profile for a local installation using modulefiles and a 'conda' configuration that installs all dependencies using Conda. Example configuration profiles for the analyses described in Armstrong & Campana 2023 are provided in the Figshare repository. Please consult your computing staff to optimize the profile settings for your hardware. We recommend storing configuration profiles in a system-wide central location for access by all users.
Enter nextflow run campanam/RatesTools -r <version> -c <config_file> to run the pipeline, where version is the installed RatesTools release. Append -resume to restart a previous run or -bg to run RatesTools in the background. If you developed platform-specific configuration profiles, you can select one using the -profile <PROFILE> option. See the Nextflow documentation for details. Final data are written to the specified output directory and its subdirectories.
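As an illustration of how these options combine, the sketch below prints launch commands instead of executing them, so they can be inspected before a run. The helper ratestools_cmd is hypothetical, and v1.0.0 and my_run.config are placeholder values for the release and configuration file.

```shell
#!/usr/bin/env bash
# Print (rather than execute) a RatesTools launch command.
# Arguments: release, configuration file, then any extra Nextflow options.
ratestools_cmd() {
    local version="$1" config="$2"
    shift 2
    echo "nextflow run campanam/RatesTools -r $version -c $config${*:+ $*}"
}

# Placeholder release and configuration file names.
ratestools_cmd v1.0.0 my_run.config                          # basic run
ratestools_cmd v1.0.0 my_run.config -profile conda -resume   # resume with the conda profile
ratestools_cmd v1.0.0 my_run.config -bg                      # run in the background
```

Once a printed command looks right, it can be pasted into the shell to start the pipeline.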
- Di Tommaso, P., Chatzou, M., Floden, E.W., Prieto Barja, P., Palumbo, E., Notredame, C. (2017) Nextflow enables reproducible computational workflows. Nat Biotechnol, 35, 316–319. DOI: 10.1038/nbt.3820.
- R Core Team (2020) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. (https://www.r-project.org/).
- McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20, 1297-1303. DOI: 10.1101/gr.107524.110.
- Li, H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 1303.3997v2.
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079. DOI: 10.1093/bioinformatics/btp352.
- Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H. (2021) Twelve years of SAMtools and BCFtools. GigaScience, 10, giab008. DOI: 10.1093/gigascience/giab008.
- Broad Institute (2020). Picard v. 2.23.8 (https://broadinstitute.github.io/picard/).
- Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J., Prins, P. (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31, 2032–2034. DOI: 10.1093/bioinformatics/btv098.
- Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158. DOI: 10.1093/bioinformatics/btr330.
- Pockrandt, C., Alzamel, M., Iliopoulos, C.S., Reinert, K. (2020) GenMap: ultra-fast computation of genome mappability. Bioinformatics, 36, 3687–3692. DOI: 10.1093/bioinformatics/btaa222.
- Reinert, K., Dadi, T.H., Ehrhardt, M., Hauswedell, H., Mehringer, S., Rahn, R., Kim, J., Pockrandt, C., Winkler, J., Siragusa, E., Urgese, G., Weese, D. (2017) The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol, 261, 157-168. DOI: 10.1016/j.jbiotec.2017.07.017.
- Smit, A.F.A., Hubley, R., Green, P. (2013-2015) RepeatMasker Open-4.0. (http://www.repeatmasker.org).
- Flynn, J.M., Hubley, R., Goubert, C., Rosen, J., Clark, A.G., Feschotte, C., Smit, A.F. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A, 117, 9451-9457. DOI: 10.1073/pnas.1921046117.
- Quinlan, A.R., Hall, I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842. DOI: 10.1093/bioinformatics/btq033.
- Martin, M., Patterson, M., Garg, S., Fischer, S.O., Pisanti, N., Klau, G.W., Schoenhuth, A., Marschall, T. (2016) WhatsHap: fast and accurate read-based phasing. bioRxiv. DOI: 10.1101/085050.
- Garg, S., Martin, M., Marschall, T. (2016) Read-based phasing of related individuals. Bioinformatics, 32, i234-i242. DOI: 10.1093/bioinformatics/btw276.
- Wickham, H., Averick, M., Bryan, J., Chang, W., D'Agostino McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H. (2019) Welcome to the Tidyverse. J Open Source Softw, 4, 1686. DOI: 10.21105/joss.01686.
- Wickham, H., François, R., Henry, L., Müller, K. (2021) dplyr: a grammar of data manipulation. R package version 1.0.7 (https://dplyr.tidyverse.org/).
- Wickham, H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York, USA.
- Dowle, M., Srinivasan, A. (2021) data.table: extension of 'data.frame'. R package version 1.14.2. (https://r-datatable.com).
- Harrell, F.E., Jr. (2023) Hmisc: Harrell miscellaneous. R package version 5.1-1. (https://CRAN.R-project.org/package=Hmisc).
Image Credit: Conor Mallon. 2014. Smithsonian's National Zoo & Conservation Biology Institute. Smithsonian Institution. https://nationalzoo.si.edu/object/nzp_NZP-20141024-032CPM.