GitHub - AnJingwd/STRsearch: STRsearch: a new pipeline for targeted profiling of short tandem repeats in amplicon-based sequencing data

Overview

STRsearch is an end-to-end pipeline for targeted profiling of short tandem repeats (STRs) in massively parallel sequencing data. It is implemented in Python and supports both Python 2 and Python 3.

Algorithm description:

Briefly, STRsearch employs an iterative algorithm to obtain the longest continuous interval composed of all motifs of the STR sequence structure, without a priori assumptions about allele size. The actual STR region is determined by comparing the position of the repeat patterns with the best-matching location of the flanking sequences in the reads. Finally, allele size is calculated not only from the repeat patterns but also from indels that actually lie within the STR region.
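As a rough illustration of the "longest continuous interval composed of motifs" idea, the following Python sketch scans a read for the longest uninterrupted run of units drawn from a set of motifs. It is a simplified stand-in, not the STRsearch implementation: the function name `longest_motif_run` is hypothetical, all motifs are assumed to share one length, and flanking-sequence matching and indel handling are left out.

```python
def longest_motif_run(read, motifs):
    """Return the half-open interval (start, end) of the longest stretch of
    `read` made up entirely of back-to-back copies of the given motifs.
    Returns (0, 0) when no motif occurs in the read."""
    motifs = set(motifs)
    period = len(next(iter(motifs)))
    # Simplifying assumption: every motif has the same length.
    assert all(len(m) == period for m in motifs), "motifs must share one length"

    best = (0, 0)
    # Try every reading frame so that runs starting at any offset are found.
    for frame in range(period):
        run_start = None
        pos = frame
        while pos + period <= len(read):
            if read[pos:pos + period] in motifs:
                if run_start is None:
                    run_start = pos          # a new run begins here
                run_end = pos + period
                if run_end - run_start > best[1] - best[0]:
                    best = (run_start, run_end)
            else:
                run_start = None             # the run is interrupted
            pos += period
    return best
```

For example, a read built from the 15-base 5' flank `ACTTGCTTCCTAGAT`, eleven `GATA` units and the 3' flank `TTCCTATAGCCTCAA` yields the interval `(15, 59)`, which corresponds to 11 repeat units of period 4.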

Installation

To obtain STRsearch, use:

git clone https://github.com/AnJingwd/STRsearch.git

or

wget https://github.com/AnJingwd/STRsearch/archive/master.zip

Prerequisite

The following Linux utilities are required, and the full path to each of them on your local machine must be provided in the conf.py file:

bwa (v1.7 or higher)
samtools (v1.7 or higher)
bamToFastq (v2.17.0 or higher)
seqtk (v1.2 or higher)
usearch (v11 or higher)
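Before filling in conf.py, it can help to find out where each utility is installed. The sketch below uses the standard-library `shutil.which` to look the tools up on `PATH`. The executable names are assumptions taken from the list above (a usearch binary, in particular, is often installed under a versioned file name), and the variable names expected by conf.py are not shown here.

```python
import shutil

# Executable names assumed from the prerequisite list; adjust to your install.
REQUIRED_TOOLS = ["bwa", "samtools", "bamToFastq", "seqtk", "usearch"]


def locate_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any path reported here can then be copied into conf.py by hand.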

Additionally, the following Python modules are required.

numpy
argparse
pathlib

Configuration file format:

The first step for STR analysis with STRsearch is to create a configuration file with your custom set of STR loci. One way to do this is to consult the most up-to-date revised forensic STR sequence guide, which is available as a downloadable worksheet. The configuration file must contain the following columns:

Chr	Start	End	Period	Reference allele	Marker	STR	STR sequence structure	Stand	5' Flanking sequence	3' Flanking sequence
chr1	7442891	7442934	4	11	Marker1	D1GATA113	[GATA]n	+	ACTTGCTTCCTAGAT	TTCCTATAGCCTCAA
chr21	20554291	20554417	4	29	Marker2	D21S11	[TCTA]n [TCTG]n [TCTA]n ta [TCTA]n tca [TCTA]n tccata [TCTA]n TA [TCTA]n	+	CCAAGTGAATTGCCT	TCGTCTATCTATCCA
chrX	149710971	149711038	4	15	Marker3	DXS7423	[TGGA]n aggacaga [TGGA]n	+	AAATGAATGAGTATG	TGGGGAGGAAATCTG
chrY	15752608	15752715	3	27	Marker4	DYS612	[CCT]n CTT [TCT]n CCT [TCT]n	+	AGGTTCAGAGGTTTG	GTCACTTTTCCAAAT
chrY	20842518	20842573	4	14	Marker5	DYS385a	[TTTC]n	-	TCCTTTCTTTTTCTC	CCTTCCTTCCTTCCT

Column 1 : chromosome (required)
Column 2 : start coordinate of the STR (required)
Column 3 : end coordinate of the STR (required)
Column 4 : period of the STR (required)
Column 5 : reference copy number (optional)
Column 6 : marker name (optional)
Column 7 : STR name (optional)
Column 8 : summary of the repeat region sequence structure in the reference sequence (required)
Column 9 : strand ("+" means positive strand; "-" means negative strand) (required)
Column 10 : 5' flanking sequence of the repeat region (required)
Column 11 : 3' flanking sequence of the repeat region (required)

Note that some columns are not used. You can put any value in the optional columns, as long as there are at least 11 columns and the required information listed above is present. Importantly, the flanking sequences must be immediately adjacent to the STR repeat region.
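The column rules above can be checked before running the pipeline. The following is a minimal sketch that parses one row of the configuration file and validates the required fields. It assumes the file is tab-separated, as in the example table; the `StrLocus` type and the `parse_locus` function are hypothetical helpers and are not part of STRsearch.

```python
from typing import NamedTuple


class StrLocus(NamedTuple):
    """Required fields of one configuration row (optional columns are dropped)."""
    chrom: str
    start: int
    end: int
    period: int
    structure: str
    strand: str
    flank5: str
    flank3: str


def parse_locus(line):
    """Parse one tab-separated configuration row and validate it."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(
            f"expected at least 11 tab-separated columns, got {len(fields)}")
    strand = fields[8]
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return StrLocus(
        chrom=fields[0],
        start=int(fields[1]),    # raises ValueError on a header row
        end=int(fields[2]),
        period=int(fields[3]),
        structure=fields[7],
        strand=strand,
        flank5=fields[9],
        flank3=fields[10],
    )
```

Because the coordinate columns are converted with `int`, passing the header row to `parse_locus` raises a `ValueError`, so a caller would skip the first line of the file.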

Inputs

FASTQ or BAM files from single-end or paired-end sequencing platforms

Output

genotypes.txt: genotypes on each targeted locus
multiple_alleles.txt: all alleles identified on each targeted locus
qc_matrix.txt: a quality control matrix including several sequence properties (total bases, sequencing quality score, number of allocated reads, distance distribution of STR repeat sequence to end of reads, allele read depth)

Usage examples

1. run with default parameters

for paired-end sequencing

python3 pipeline.py from_fastq \
--working_path example/test_results/ \
--sample test \
--fq1 example/test_data/test_R1.fastq \
--fq2 example/test_data/test_R2.fastq \
--ref ucsc.hg19.fasta

python3 pipeline.py from_bam \
--working_path example/test_results \
--sample test \
--sex male \
--bam example/test_results/alignments/test.bam \
--ref_bed example/ref_test.bed \
--genotypes example/test_results/test_genotypes.txt \
--multiple_alleles example/test_results/test_multiple_alleles.txt \
--qc_matrix example/test_results/test_qc_matrix.txt

for single-end sequencing

python3 pipeline.py \
--type single \
from_fastq \
--working_path example/test_results/ \
--sample test \
--fq1 example/test_data/test_R1.fastq \
--ref ucsc.hg19.fasta

python3 pipeline.py \
--type single \
from_bam \
--working_path example/test_results \
--sample test \
--sex male \
--bam example/test_results/alignments/test.bam \
--ref_bed example/ref_test.bed \
--genotypes example/test_results/test_genotypes.txt \
--multiple_alleles example/test_results/test_multiple_alleles.txt \
--qc_matrix example/test_results/test_qc_matrix.txt

2. run with self-defined parameters

python3 pipeline.py \
--assemble_pairs True \
--reads_threshold 50 \
--stutter_ratio 0.6 \
--num_threads 8 \
--num_processors 8 \
from_bam \
--working_path example/test_results \
--sample test \
--sex male \
--bam example/test_results/alignments/test.bam \
--ref_bed example/ref_test.bed \
--genotypes example/test_results/test_genotypes.txt \
--multiple_alleles example/test_results/test_multiple_alleles.txt \
--qc_matrix example/test_results/test_qc_matrix.txt
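When many samples are processed, the command lines above can be assembled programmatically. The sketch below builds the argument list for `from_bam`, placing the global options before the subcommand as in the examples and deriving the three output paths from the sample name. The helper `build_from_bam_cmd` and the `<sample>_genotypes.txt` naming pattern are assumptions modelled on the example commands, not something the pipeline requires.

```python
import os


def build_from_bam_cmd(working_path, sample, sex, bam, ref_bed, global_opts=None):
    """Return the argv list for a `pipeline.py from_bam` run."""
    cmd = ["python3", "pipeline.py"]
    # Global options (e.g. num_threads) go before the subcommand name.
    for key, value in (global_opts or {}).items():
        cmd += [f"--{key}", str(value)]
    prefix = os.path.join(working_path, sample)
    cmd += [
        "from_bam",
        "--working_path", working_path,
        "--sample", sample,
        "--sex", sex,
        "--bam", bam,
        "--ref_bed", ref_bed,
        "--genotypes", f"{prefix}_genotypes.txt",
        "--multiple_alleles", f"{prefix}_multiple_alleles.txt",
        "--qc_matrix", f"{prefix}_qc_matrix.txt",
    ]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains pipeline.py.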

Options

Default parameters

Option	Value Type	Default	Summary
--help		false	display the help message
--type	str	paired	(optional) The sequencing type
--assemble_pairs	bool	False	(optional) If True, paired-end reads are assembled
--reads_threshold	int	30	(optional) The analytical threshold for reads
--stutter_ratio	float	0.5	(optional) The stutter ratio
--num_threads	int	4	(optional) The number of threads
--num_processors	int	4	(optional) The number of processes

Sub command

from_bam

Option	Value Type	Default	Summary
--help		false	display the help message
--working_path	str	null	(required) The working path
--sample	str	null	(required) The sample name
--sex	str	null	(required) The sample sex
--bam	str	null	(required) The input BAM file
--ref_bed	str	null	(required) The configuration file of STRs
--genotypes	str	null	(required) The output file for STR genotypes
--multiple_alleles	str	null	(required) The output file for multiple alleles
--qc_matrix	str	null	(required) The output file for the quality control matrix

from_fastq

Option	Value Type	Default	Summary
--help		false	display the help message
--working_path	str	null	(required) The working path
--sample	str	null	(required) The sample name
--fq1	str	null	(required) The first FASTQ file (in1.fq)
--fq2	str	null	(optional) The second FASTQ file (in2.fq) for paired-end sequencing
--ref	str	null	(required) The reference genome FASTA file, with its index files in the same path
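The option tables above describe global options followed by a subcommand with its own options. One way to model such an interface is with `argparse` subparsers, as in the sketch below. This is an illustration of the layout only, under the assumption that the interface behaves like a standard subparser setup; it is not the code in pipeline.py, and only the `from_fastq` subcommand is modelled.

```python
import argparse


def str2bool(text):
    """Parse 'True'/'False' style values, since type=bool would accept any string."""
    lowered = text.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="pipeline.py")
    # Global options: these must appear before the subcommand name.
    parser.add_argument("--type", choices=["paired", "single"], default="paired")
    parser.add_argument("--assemble_pairs", type=str2bool, default=False)
    parser.add_argument("--reads_threshold", type=int, default=30)
    parser.add_argument("--stutter_ratio", type=float, default=0.5)
    parser.add_argument("--num_threads", type=int, default=4)
    parser.add_argument("--num_processors", type=int, default=4)

    sub = parser.add_subparsers(dest="command", required=True)
    fastq = sub.add_parser("from_fastq")
    fastq.add_argument("--working_path", required=True)
    fastq.add_argument("--sample", required=True)
    fastq.add_argument("--fq1", required=True)
    fastq.add_argument("--fq2")              # only needed for paired-end data
    fastq.add_argument("--ref", required=True)
    return parser
```

With this layout, `--type single` is accepted only when it is placed before `from_fastq`, which matches the ordering used in the usage examples.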

Run with Docker

To obtain the STRsearch Docker image, use:

docker pull anjing123/strsearch:latest

LOCAL_PATH/app/
├── ref
│   ├── ucsc.hg19.fasta
│   ├── ucsc.hg19.fasta.amb
│   ├── ucsc.hg19.fasta.ann
│   ├── ucsc.hg19.fasta.bwt
│   ├── ucsc.hg19.fasta.pac
│   └── ucsc.hg19.fasta.sa
├── ref_test.bed
└── test_data
    ├── test.bam
    ├── test_R1.fastq
    └── test_R2.fastq
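Before mounting the directory into the container, it may be worth confirming that the reference, its BWA index files and the configuration file are all in place. The sketch below checks for the files shown in the tree above. The function name `missing_inputs` is hypothetical, and the list of index suffixes is simply read off the tree.

```python
from pathlib import Path

# Index file suffixes as they appear next to the reference in the tree above.
BWA_INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]


def missing_inputs(app_dir, ref_name="ucsc.hg19.fasta", bed_name="ref_test.bed"):
    """Return the expected files under `app_dir` that do not exist."""
    app = Path(app_dir)
    ref = app / "ref" / ref_name
    expected = [ref]
    # Index files sit beside the reference and append their suffix to its name.
    expected += [ref.with_name(ref.name + suffix) for suffix in BWA_INDEX_SUFFIXES]
    expected.append(app / bed_name)
    return [str(path) for path in expected if not path.is_file()]
```

An empty list means the layout matches the tree; otherwise the returned paths show what still needs to be copied or indexed.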

docker run -v LOCAL_PATH/app/:/app/ -w /app/ -it anjing123/strsearch:latest from_fastq \
--working_path /app/test_results/ \
--sample test \
--fq1 /app/test_data/test_R1.fastq \
--fq2 /app/test_data/test_R2.fastq \
--ref /app/ref/ucsc.hg19.fasta

docker run -v LOCAL_PATH/app/:/app/ -w /app/ -it anjing123/strsearch:latest from_bam \
--working_path /app/test_results/ \
--sample test \
--sex male \
--bam /app/test_results/alignments/test.bam \
--ref_bed /app/ref_test.bed \
--genotypes /app/test_results/test_genotypes.txt \
--multiple_alleles /app/test_results/test_multiple_alleles.txt \
--qc_matrix /app/test_results/test_qc_matrix.txt

Reference

The STRsearch publication is available here: STRsearch: a new pipeline for targeted profiling of short tandem repeats in massively parallel sequencing data

Contact

Developer: Dong Wang
