PredRAD

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes, generally known as restriction-site associated DNA sequencing (RAD-seq), is now one of the most commonly used strategies to generate single nucleotide polymorphism data in eukaryotes. The choice of restriction enzyme is critical for the design of any RAD-seq study, as it determines the number of genetic markers that can be obtained for a given species and, ultimately, the success of a project.

When designing a study that uses RAD-seq or a related methodology, researchers face two fundamental questions: i) which restriction enzyme is best for obtaining a desired number of RAD tags in the organism of interest? and ii) how many markers can be obtained with a particular enzyme in the organism of interest? This software pipeline lets any researcher obtain an approximate answer to these questions and helps guide the design of any study using RAD sequencing and related methods.

This repository contains the software code and output results from Herrera S., P.H. Reyes-Herrera & T.M. Shank (2015) Predicting RAD-seq Marker Numbers across the Eukaryotic Tree of Life.

Requirements

Python 2.7 and above
Biopython
Bowtie

Install

Download the Python and shell scripts.

Make the shell script executable with chmod u+x restriction_site_search.sh.

Usage

restriction_site_search.sh. This shell script searches for every restriction site listed in the pattern file (patternfilename) in every genome listed in the input file (genomefilename). The script produces the following files:
- ALL.count.txt - contains a table with the number of restriction sites found in each genome
- ALL.size.txt - contains a table with the size of each genome
- If bowtieflag is YES, it also produces ALL.aligned.txt, ALL.failed.txt, ALL.processed.txt and ALL.suppressed.txt - each file contains a table summarizing the bowtie output (reads aligned, failed, processed and suppressed) for each genome.
The input arguments are:
- parametersfilename: name of a file with four parameters (see test/params.txt)
  - genomefilename: name of a file containing a table with two columns: (1) species code and (2) link to the whole-genome FASTA file or path to a local FASTA file (for an example with URLs see test/genomeFileExample.txt; for an example with local file paths see test/genomeFileExample_localfile.txt)
  - patternfilename: name of a file containing a table with two columns: (1) restriction site regular expression and (2) restriction site name (see test/Patterns_list.txt)
  - bowtieflag: YES (default value) to align with bowtie; any other value skips bowtie.
  - localfile flag: NO (default value) to download the FASTA files; if the flag is YES, the program looks for a local file at the indicated path.
To run, type in a shell:

./restriction_site_search.sh parametersfilename
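To illustrate the kind of search this step performs, the sketch below counts regular-expression matches in FASTA records using only the Python standard library. It is not the code in restriction_site_search.sh. The function names (read_fasta, count_sites, count_in_genome), the decision to count overlapping matches, and the example patterns are assumptions made for this example. It scans only the strand given in the file.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_sites(sequence, pattern):
    """Count matches of a restriction-site regex, overlapping ones included."""
    # Assumption: a zero-width lookahead is used so overlapping sites count.
    return sum(1 for _ in re.finditer("(?=(?:%s))" % pattern, sequence))


def count_in_genome(fasta_lines, patterns):
    """Sum site counts over all records; patterns maps site name -> regex."""
    counts = dict((name, 0) for name in patterns)
    for _header, sequence in read_fasta(fasta_lines):
        for name, pattern in patterns.items():
            counts[name] += count_sites(sequence, pattern)
    return counts


# Illustrative input; these patterns are not copied from test/Patterns_list.txt.
example_fasta = [">chr1", "GAATTCAAGAATTC", "ccGAATTT", ">chr2", "AAATTC"]
example_patterns = {"EcoRI": "GAATTC", "ApoI": "[AG]AATT[CT]"}
print(count_in_genome(example_fasta, example_patterns))
```

For a real genome, an open file handle can be passed as fasta_lines, since the reader only iterates over lines and processes one record at a time.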

obtain_nucleotides_model.py. This Python script obtains the nucleotide, dinucleotide and trinucleotide distributions for each genome in the input file (genomefilename).

The input arguments are:
- genomefilename: name of a file containing a table with two columns: (1) species code and (2) link to the whole-genome FASTA file or path to a local FASTA file (for an example with URLs see test/genomeFileExample_2.txt; for an example with local file paths see test/genomeFileExample_localfile.txt)
- resultsfile: name of the output file
- localfileflag: yes if the files are local, no otherwise.
To run, type in a shell:

python obtain_nucleotides_model.py genomefilename resultsfile localfileflag

For details of events that occur once the script runs, please check the .log file.
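As a rough picture of what a nucleotide, dinucleotide and trinucleotide distribution looks like, the sketch below computes relative k-mer frequencies for k = 1, 2 and 3 with the standard library. It is an assumption-laden illustration, not the code in obtain_nucleotides_model.py: the function name kmer_frequencies is hypothetical, windows are taken as overlapping, and any window containing a character other than A, C, G or T (for example N) is skipped.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return {kmer: relative frequency} over overlapping windows of length k."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return dict((kmer, n / float(total)) for kmer, n in counts.items())


example_sequence = "ACGTACGT"
nt = kmer_frequencies(example_sequence, 1)      # nucleotide frequencies
dint = kmer_frequencies(example_sequence, 2)    # dinucleotide frequencies
trint = kmer_frequencies(example_sequence, 3)   # trinucleotide frequencies
print(nt)
print(dint)
print(trint)
```

Each returned table sums to 1 over the k-mers that were observed. K-mers that never occur are simply absent from the dictionary.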

sequence_probability.py. This Python script obtains the probability of each restriction site in the input file (patternfilename) for every genome, using the nucleotide (nt), dinucleotide (dint) and trinucleotide (trint) frequencies in distributionfile. The script produces the following files:
- $distributionfile$_nt - contains a table with the sequence probabilities (based on nucleotide probabilities)
- $distributionfile$_dint - contains a table with the sequence probabilities (based on dinucleotide probabilities)
- $distributionfile$_trint - contains a table with the sequence probabilities (based on trinucleotide probabilities)
The input arguments are:
- distributionfile - output from genome_nucleotide_distrib_paper (see test/DistributionFile.txt)
- patternfilename - name of a file containing a table with two columns: (1) restriction site regular expression and (2) restriction site name (see test/Patterns_list.txt)
To run, type in a shell:

python sequence_probability.py distributionfile patternfilename
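One way to turn composition frequencies into a site probability is sketched below. It is offered as an illustration under stated assumptions, not as the formulas used by sequence_probability.py. The nucleotide model treats bases as independent and multiplies their frequencies. The dinucleotide model is written here as a first-order Markov chain whose transition probabilities are derived from the dinucleotide table. All function names are hypothetical, and only patterns made of plain bases and simple character classes such as [AG] are handled.

```python
import itertools
import re

BASES = "ACGT"


def expand_pattern(pattern):
    """Expand e.g. '[AG]AATT[CT]' into the concrete sequences it matches."""
    tokens = re.findall(r"\[([ACGT]+)\]|([ACGT])", pattern)
    choices = [cls or base for cls, base in tokens]
    return ["".join(seq) for seq in itertools.product(*choices)]


def prob_nt(seq, nt):
    """Independent-bases model: product of single-nucleotide frequencies."""
    p = 1.0
    for base in seq:
        p *= nt[base]
    return p


def prob_dint(seq, dint):
    """First-order Markov model built from dinucleotide frequencies."""
    # Marginal probability of each base as the first letter of a dinucleotide.
    start = dict((a, sum(dint.get(a + b, 0.0) for b in BASES)) for a in BASES)
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        if start[prev] == 0.0:
            return 0.0
        # P(cur | prev) = f(prev cur) / sum over x of f(prev x)
        p *= dint.get(prev + cur, 0.0) / start[prev]
    return p


def pattern_probability(pattern, seq_prob, freqs):
    """Sum over the mutually exclusive concrete sequences of a pattern."""
    return sum(seq_prob(seq, freqs) for seq in expand_pattern(pattern))


# Uniform composition, used only to make the example easy to check by hand.
uniform_nt = dict((b, 0.25) for b in BASES)
uniform_dint = dict((a + b, 1.0 / 16) for a in BASES for b in BASES)
print(pattern_probability("GAATTC", prob_nt, uniform_nt))
print(pattern_probability("[AG]AATT[CT]", prob_dint, uniform_dint))
```

Under a uniform composition both models reduce to 0.25 raised to the site length, multiplied by the number of concrete sequences the pattern matches, which gives a simple sanity check. A trinucleotide version would follow the same shape with a second-order chain.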

License

PredRAD is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2.
