Duplex Sequencing

##README

Duplex Sequencing software package
Version 2.0
August 11, 2014
Programs by Scott Kennedy(1), Brendan Kohrn, and Mike Schmitt(1)
Several steps are based on prior work by Joe Hiatt
(1) Department of Pathology, University of Washington School of Medicine, Seattle, WA 98195

Glossary
-Single Stranded Consensus Sequence (SSCS):
A construct created by comparing multiple reads and deciding ambiguities by simple majority. SSCSs are created by ConsensusMaker.py. Quality scores attached to SSCSs are meaningless, although the cigar strings still have meaning.

-Duplex Consensus Sequence (DCS):
A construct created by comparing two SSCSs. Quality scores and cigar strings attached to DCS sequences are meaningless, though cigar strings regain meaning after realignment.

-Duplex tag:
A random sequence of nucleotides that enables the identification of sequences resulting from the same starting molecule.

-Family:
A group of reads that shares the same tag sequence.

-Read:
A DNA sequence which has not been compressed by ConsensusMaker.py. A raw read has not yet been modified by tag_to_header.py, while an SMI read has.
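
The SSCS definition above can be illustrated with a minimal Python sketch. It is not the code in ConsensusMaker.py: the function name `majority_consensus`, the requirement that all reads have equal length, the use of `N` for positions that fail the cutoff, and the choice to skip families outside the size limits are all assumptions made for illustration. The `cutoff`, `min_members`, and `max_members` parameters loosely mirror the `--cut`, `--min`, and `--max` options described under Usage.

```python
from collections import Counter


def majority_consensus(reads, cutoff=0.7, min_members=3, max_members=1000):
    """Collapse one family of equal-length reads into a consensus string.

    Assumption: families smaller than min_members or larger than
    max_members are skipped (None is returned).
    Assumption: a position where no base reaches `cutoff` becomes 'N'.
    """
    if not (min_members <= len(reads) <= max_members):
        return None
    consensus = []
    for column in zip(*reads):  # walk through the reads position by position
        base, count = Counter(column).most_common(1)[0]
        if count / float(len(column)) >= cutoff:
            consensus.append(base)
        else:
            consensus.append("N")
    return "".join(consensus)
```

With the default cutoff of 0.7, a family of three reads needs all three to agree at a position, while a family of four tolerates one dissenting read.
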
Summary of process
These programs are meant to be run in order and result in the transformation of two input FASTQ files from an Illumina sequencing run into a paired-end BAM file containing the final DCS reads. This workflow will also generate a file containing a list of every tag that is present and how many times it occurred, as well as a file containing SSCSs that didn't have a mate and were unable to make a DCS (extraConsensus.bam).
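
As a rough illustration of families and of the tag-count list mentioned above, the sketch below groups reads by tag and counts how often each tag occurs. The `(tag, sequence)` tuples are a hypothetical stand-in for real records; the pipeline itself works on BAM files, and the layout of its tag-count file is not reproduced here.

```python
from collections import Counter, defaultdict


def group_by_tag(tagged_reads):
    """Return {tag: [sequences]} for an iterable of (tag, sequence) pairs."""
    families = defaultdict(list)
    for tag, sequence in tagged_reads:
        families[tag].append(sequence)
    return dict(families)


def tag_counts(tagged_reads):
    """Return a Counter mapping each tag to the number of reads carrying it."""
    return Counter(tag for tag, _ in tagged_reads)


# Hypothetical toy data: two reads share one tag, a third has its own.
example = [("AAACCC", "ACGT"), ("AAACCC", "ACGA"), ("GGGTTT", "TTTT")]
```
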
Dependencies
The following programs and packages must be installed on your computer.

BWA (written with V 0.6.2)
Samtools (written with V 0.1.17)
Python (written with V 2.7.3)
Pysam (written with V 0.7.5)
BioPython (written with V 1.62)
Inputs:
read-1-raw-data.fq
read-2-raw-data.fq

Usage
Create a folder with both your fastq files in it.

PE_BASH_MAKER.py is a script that outputs a bash script that will execute, in order, all the steps in the data processing pipeline that are needed to obtain the final DCS reads.

Run PE_BASH_MAKER.py, making sure to input the correct read length (option --rlength), using the syntax shown below. Although it is recommended that all non-optional inputs be provided, the only inputs that are truly required are --ref, --r1src, --r2src, --rlength, and --runIdentifier. Note that the read type s (single-ended reads) will not work with the default bash template. If you want to write your own template, consult section 9, Creating a Custom Template (Advanced Users).

usage: PE_BASH_MAKER.py [-h] --ref REF --r1src R1SRC --r2src R2SRC --rlength RLENGTH  
                        --runIdentifier RUNID [--min MINMEM] [--max MAXMEM]  
                        [--cut CUTOFF] [--Ncut NCUT] [--blength BLENGTH]  
                        [--slength SLENGTH] [--progInd PROGIND]  
                        [--read_type READ_TYPE] [--isize ISIZE] [--filt FILT]  
                        [--repFilt REPFILT]  
                        [--template TEMPLATE]
                        
optional arguments:  
  -h, --help            show this help message and exit  
  --ref REF             .FASTA file containing the reference genome  
  --r1src R1SRC         .fq file containing the raw read1 data  
  --r2src R2SRC         .fq file containing the raw read2 data  
  --min MINMEM          Minimum members for SSCS consensus  
  --max MAXMEM          Maximum members for SSCS consensus  
  --cut CUTOFF          Mimimum percent matching for base choice in SSCS  
                        consensus  
  --Ncut NCUT           Maxumum percent N's allowed  
  --rlength RLENGTH     Length of a single read  
  --blength BLENGTH     Length of the barcode sequence on a unprocessed single  
                        read.  
  --slength SLENGTH     Length of the spacer sequence in a unprocessed single  
                        read.  
  --progInd PROGIND     How often you want to be told what a program is doing  
  --read_type READ_TYPE  
                        A string specifying which types of read to consider.  
                        Read types: n: Neither read 1 or read 2 mapped. m:  
                        Either read 1 or read 2 mapped, but not both. p: Both  
                        read 1 and read 2 mapped, not a propper pair. d: Both  
                        read 1 and read 2 mapped, propper pair. s: Single  
                        ended reads.   
  --isize ISIZE         Optional: Maximum distance between read pairs   
  --filt FILT           A string indicating which filters should be  
                        implemented. Filters: s: Filter out softclipped reads.  
                        o: Filter out overlapping reads. n: Filter out reads  
                        with too many Ns.   
  --runIdentifier RUNID  
                        An identifier for this particular sample and  
                        sequencing run.  
  --repFilt REPFILT     Remove tags with homomeric runs of nucleotides of  
                        length x.  
  --template TEMPLATE   Template to use with bash maker. If not specified,  
                        defaults to bash_template.sh.
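
For illustration, the sketch below assembles a PE_BASH_MAKER.py command line from the five required options plus any optional ones. The helper `build_bash_maker_command`, the `python` launcher, the reference file name, and the run identifier `sample1` are placeholders chosen for this example; only flags listed in the usage text above are accepted. The sketch is written for Python 3, whereas the package itself was written with Python 2.7.

```python
import shlex

# Optional flags taken from the usage text above.
OPTIONAL_FLAGS = {"min", "max", "cut", "Ncut", "blength", "slength", "progInd",
                  "read_type", "isize", "filt", "repFilt", "template"}


def build_bash_maker_command(ref, r1src, r2src, rlength, run_id, **optional):
    """Return the argument list for one PE_BASH_MAKER.py invocation."""
    unknown = set(optional) - OPTIONAL_FLAGS
    if unknown:
        raise ValueError("unknown option(s): %s" % ", ".join(sorted(unknown)))
    argv = ["python", "PE_BASH_MAKER.py",
            "--ref", ref, "--r1src", r1src, "--r2src", r2src,
            "--rlength", str(rlength), "--runIdentifier", run_id]
    for flag, value in sorted(optional.items()):
        argv += ["--" + flag, str(value)]
    return argv


cmd = build_bash_maker_command("reference.fasta", "read-1-raw-data.fq",
                               "read-2-raw-data.fq", 100, "sample1",
                               min=3, cut=0.7)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps each argument separate, so file names containing spaces survive quoting when the line is printed or passed to a subprocess.
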

The default parameters in the provided BASH script are:

--min 3 --max 1000 --cut 0.7 --blength 12 --slength 5 --progInd 1000000 --read_type dpm --isize -1 --filt os --repFilt 9
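
To make the --repFilt default concrete, here is one way a homopolymer tag check might look. This is a sketch, not the package's implementation: the function name `has_homopolymer_run` is hypothetical, and it assumes that a tag is rejected when any single character repeats `run_length` or more times in a row.

```python
import re


def has_homopolymer_run(tag, run_length=9):
    """Return True if `tag` contains one character repeated
    `run_length` or more times consecutively (assumed rejection rule)."""
    # (.) captures one character; \1{n,} requires n or more further copies.
    pattern = r"(.)\1{%d,}" % (run_length - 1)
    return re.search(pattern, tag) is not None
```
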

Run the bash script from the command line with:

bash runIdentifier.script.sh 3>&1 1>&2 2>&3 | tee -a runIdentifier.se.log.txt

where runIdentifier is the run identifier you fed to the bash maker. This should run the rest of the process through to an output paired-end BAM file, copying the contents of stderr to a log file for documentation and reporting purposes.
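
The redirection `3>&1 1>&2 2>&3` swaps stdout and stderr, so it is the script's stderr that flows through the pipe into `tee` and ends up in the log file. For readers less familiar with shell file descriptors, a rough Python analogue is sketched below. `run_and_log_stderr` is a hypothetical helper, and unlike `tee` it writes the log only after the command has finished rather than while it runs.

```python
import subprocess
import sys


def run_and_log_stderr(argv, log_path):
    """Run argv, append its stderr to log_path, and echo it to our stdout.

    The child's stdout is forwarded to our stderr, mirroring the swap
    performed by the shell redirection. Returns the child's exit code.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    with open(log_path, "ab") as log:  # append mode, like `tee -a`
        log.write(err)
    sys.stdout.write(err.decode("utf-8", "replace"))
    sys.stderr.write(out.decode("utf-8", "replace"))
    return proc.returncode
```

A call such as `run_and_log_stderr(["bash", "runIdentifier.script.sh"], "runIdentifier.se.log.txt")` would correspond to the command shown above.
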

It is strongly suggested that the final sorted BAM file undergo post-processing with picard-tools-1.70/AddOrReplaceReadGroups.jar and GATK/GenomeAnalysisTK.jar before generating statistics.

Data Outputs:
These are only valid when using the PE_BASH_MAKER.py script with the default template. In the file names below, * indicates the run identifier you gave PE_BASH_MAKER.py.

| File description | File name |
| --- | --- |
| BAM file containing position-sorted paired-end reads | `*.pe.bam` |
| BAM file containing paired-end SSCSs | `*.sscs.bam` |
| BAM file containing unpaired SSCSs | `*.sscs_UP.bam` |
| BAM file containing non-mapping or otherwise bad reads | `*.sscs_NM.bam` |
| BAM file containing good reads with less common cigar scores | `*.sscs_LCC.bam` |
| Tagcounts file | `*.pe.tagcounts` |
| Tagstats file | `*.pe.tagstats` |
| Fastq files containing DCSs | DCS...r1.fq and PE...r2.fq |
| BAM file containing paired-end, sorted, aligned DCSs | DCS...aln.sort.bam |
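
As a convenience, the naming scheme above can be turned into a quick check that a run produced its files. The sketch below covers only the outputs whose names are given in full (the DCS file names are abbreviated in this README and are left out), and the helper names `expected_outputs` and `missing_outputs` are hypothetical.

```python
import os

# Suffixes from the list above; the run identifier replaces the "*".
OUTPUT_SUFFIXES = [
    ".pe.bam",
    ".sscs.bam",
    ".sscs_UP.bam",
    ".sscs_NM.bam",
    ".sscs_LCC.bam",
    ".pe.tagcounts",
    ".pe.tagstats",
]


def expected_outputs(run_id):
    """Return the output file names expected for one run identifier."""
    return [run_id + suffix for suffix in OUTPUT_SUFFIXES]


def missing_outputs(run_id, directory="."):
    """Return the expected file names that do not exist in `directory`."""
    return [name for name in expected_outputs(run_id)
            if not os.path.exists(os.path.join(directory, name))]
```
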

Live Outputs

The file Duplex-Process-Numbers.txt describes the number of reads in each file and the live outputs from each step.
Program Details (Advanced Users)

Details of the individual programs can be found by running that program with the -h or --help options.

Creating a Custom Template (Advanced Users)

In order to work with the provided bash maker, all custom templates must contain the following lines before any commands are executed. Feel free to change the default values; the bash maker just needs the variable names to stay the same:

    #DEFAULTS
    DSpath=''
    alignRef=''
    runIdentifier=''
    read1in=seq1.fq
    read2in=seq2.fq
    iSize=-1
    minMem=3
    maxMem=1000
    cutOff=0.7
    nCutOff=1
    readLength=100
    barcodeLength=12
    spacerLength=5
    filtersSet='os'
    readTypes='dpm'
    repFilt=9
    readOut=1000000

    #NONDEFAULTS

    #FINAL_READ_LENGTH
    readLength=$((readLength-barcodeLength-spacerLength))

   
After these lines, the programs should be executed in this order:

*tag_to_header.py*, *bwa*, *samtools sort*, *ConsensusMaker.py*, *samtools sort*, *DuplexMaker.py*, *bwa*
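
Before handing a custom template to the bash maker, it may help to confirm that the required lines are present. The sketch below is one possible check, not part of the package: `missing_template_items` and `final_read_length` are hypothetical helpers, and how PE_BASH_MAKER.py itself locates these lines is not described here. The second function simply repeats the arithmetic of the `#FINAL_READ_LENGTH` line.

```python
import re

REQUIRED_MARKERS = ["#DEFAULTS", "#NONDEFAULTS", "#FINAL_READ_LENGTH"]
REQUIRED_VARIABLES = [
    "DSpath", "alignRef", "runIdentifier", "read1in", "read2in", "iSize",
    "minMem", "maxMem", "cutOff", "nCutOff", "readLength", "barcodeLength",
    "spacerLength", "filtersSet", "readTypes", "repFilt", "readOut",
]


def missing_template_items(template_text):
    """Return the required markers and variable names absent from the text."""
    missing = [m for m in REQUIRED_MARKERS if m not in template_text]
    for name in REQUIRED_VARIABLES:
        # A variable counts as present if some line starts with "name=".
        if not re.search(r"^\s*%s=" % re.escape(name), template_text,
                         re.MULTILINE):
            missing.append(name)
    return missing


def final_read_length(read_length, barcode_length, spacer_length):
    """Read length left after removing the barcode and spacer."""
    return read_length - barcode_length - spacer_length
```

With the default values listed above, `final_read_length(100, 12, 5)` gives 83, which is the read length the downstream steps see.
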

