SplitFusion - a fast pipeline for detection of gene fusion based on fusion-supporting split alignment.

Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets, such as BCR-ABL in chronic myeloid leukemia, EML4-ALK in lung cancer, and ROS1 fused to any of a number of partners in lung cancer. Accurate detection of gene fusions plays a pivotal role in precision medicine by matching the right drugs to the right patients.

Challenges in the diagnosis of gene fusions include poor sample quality, limited amounts of available clinical specimens, and complicated gene rearrangements. Anchored multiplex PCR (AMP) is a clinically proven technology designed, among other purposes, for robust detection of gene fusions across clinical samples of different types and varied qualities, including RNA extracted from FFPE samples.

SplitFusion is a companion data pipeline for AMP that detects gene fusions based on split alignments, i.e. reads crossing fusion breakpoints, and can accurately infer whether the fusion partners of a given fusion candidate are in-frame or out-of-frame. SplitFusion also outputs example breakpoint-supporting sequences in FASTA format, allowing for further investigation.

Reference publication

Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014

How does SplitFusion work?

The analysis consists of the following computational steps:

1. Retrieve all alignments that have supplementary alignments (the 'SA' tag in SAM format) from BAM files generated by BWA-MEM.
2. Remove alignments with low mapping quality (default 20).
3. Consolidate the SA belonging to the same UMI-tagged read and divide them into three alignment groups (left, middle and right) according to their mapping position relative to the original read.
4. Identify genuine breakpoint-supporting SA using several criteria.
5. Infer the fusion effect of oncogenic fusions using snpEff.
6. Visualize highly reliable fusions on the genome.

Lastly, SplitFusion outputs a summary table and the breakpoint-spanning reads.
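
To make the first two steps concrete, here is a minimal Python sketch of reading the 'SA' tag and applying a mapping-quality cutoff. It is not SplitFusion's own code: the names `SplitAlignment`, `parse_sa_tag` and `filter_by_mapq` are hypothetical, the example tag value is invented for illustration, and treating the default of 20 as an inclusive lower bound is an assumption. The only thing taken from the SAM specification is the layout of the tag value, a semicolon-terminated list of `rname,pos,strand,CIGAR,mapQ,NM` entries.

```python
from typing import List, NamedTuple


class SplitAlignment(NamedTuple):
    """One entry of a SAM 'SA' tag (hypothetical helper type)."""
    rname: str
    pos: int
    strand: str
    cigar: str
    mapq: int
    nm: int


def parse_sa_tag(sa_value: str) -> List[SplitAlignment]:
    """Parse the value of an 'SA' tag, without the 'SA:Z:' prefix."""
    alignments = []
    for entry in sa_value.split(";"):
        if not entry:  # the value ends with a trailing ';'
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        alignments.append(
            SplitAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return alignments


def filter_by_mapq(alignments: List[SplitAlignment], min_mq: int = 20) -> List[SplitAlignment]:
    """Keep alignments whose mapping quality is at least min_mq (assumed inclusive)."""
    return [a for a in alignments if a.mapq >= min_mq]


# Invented tag value: two entries, the second with low mapping quality.
example = "2,29446396,-,60S41M,60,0;2,42492091,+,58M43S,12,1;"
kept = filter_by_mapq(parse_sa_tag(example))
print(kept)
```

In practice the tag would be read from a BAM record by a SAM/BAM library rather than from a literal string; the parsing and filtering logic stays the same.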

Installation

1. Dependencies

R packages ("plyr", "data.table", "parallel")

> install.packages(c("plyr", "data.table", "parallel"))

2. Installing SplitFusion

git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git

R CMD INSTALL SplitFusion

Run

1. Help

python ./SplitFusion/exec/SplitFusion.py -h

usage: SplitFusion.py [-h] --SplitFusionPath SPLITFUSIONPATH --R R --perl PERL
                      --refGenome REFGENOME --sample_id SAMPLE_ID
                      --database_dir DATABASE_DIR [--bam_dir BAM_DIR]
                      [--fastq_dir FASTQ_DIR] [--panel_dir PANEL_DIR]
                      [--r1filename R1FILENAME] [--r2filename R2FILENAME]
                      --output OUTPUT [--panel PANEL] [--steps STEPS]
                      [--AnnotationMethod ANNOTATIONMETHOD]
                      [--snpEff_ref SNPEFF_REF] [--thread THREAD]
                      [--minMQ MINMQ] [--minMQ1 MINMQ1]
                      [--minMapLength MINMAPLENGTH]
                      [--minMapLength2 MINMAPLENGTH2]
                      [--maxQueryGap MAXQUERYGAP] [--maxOverlap MAXOVERLAP]
                      [--minExclusive MINEXCLUSIVE]
                      [--FusionMinStartSite FUSIONMINSTARTSITE]
                      [--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION]
                      [--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION]

Split-Fusion is a fast data analysis pipeline detects gene fusion based on
split reads and/or paired-end reads.

optional arguments:
  -h, --help            show this help message and exit
  --SplitFusionPath SPLITFUSIONPATH
                        the path where Split-Fusion pipeline is installed
                        [required]
  --R R                 the path of R [required]
  --perl PERL           the path of perl [required]
  --refGenome REFGENOME
                        the path where human genome reference is stored
                        [required]
  --sample_id SAMPLE_ID
                        the sample name of running [required]
  --database_dir DATABASE_DIR
                        the path where large databases e.g. reference genome
                        and annotation databases are stored [required]
  --bam_dir BAM_DIR     the path where bam or fastq file is stored.
                        [Kickstart] The bam file of the sameple_id (xxx.bam or
                        xxx.consolidated.bam) will be used. Either fastq_dir
                        or bam_dir should be specified
  --fastq_dir FASTQ_DIR
                        the path where fastq file is stored. Either fastq_dir
                        or bam_dir should be specified
  --panel_dir PANEL_DIR
                        For Target mode: the path where panel specific files
                        are stored.
  --r1filename R1FILENAME
                        Read 1 fastq filename. Can be in gzipped format. If
                        not specified, $fastq_dir/$sample_id.R1.fq will be
                        used.
  --r2filename R2FILENAME
                        Read 2 fastq filename. Can be in gzipped format. If
                        not specified, $fastq_dir/$sample_id.R2.fq will be
                        used.
  --output OUTPUT       the path where output is stored [required]
  --panel PANEL         the path where target genes panel file is stored
  --steps STEPS         specify steps to run
  --AnnotationMethod ANNOTATIONMETHOD
                        the name of annotation tools (annovar or snpEff)
  --snpEff_ref SNPEFF_REF
                        the version of snpEff reference
  --thread THREAD       number of threads for computing
  --minMQ MINMQ         minimum mapping quality
  --minMQ1 MINMQ1       minimum mapping quality of a leftmost of Read1
                        (rightmost of Read2
  --minMapLength MINMAPLENGTH
                        minimum read mapping length
  --minMapLength2 MINMAPLENGTH2
                        minimum mapping length of rightmost of Read1 (leftmost
                        of Read2)
  --maxQueryGap MAXQUERYGAP
                        maximum gap length on a query read of split alignments
  --maxOverlap MAXOVERLAP
                        maximum overlap bases of two split alignments
  --minExclusive MINEXCLUSIVE
                        minimum exclusive length between two split alignments
  --FusionMinStartSite FUSIONMINSTARTSITE
                        minimum number of Adaptor Ligation Read Starting Sites
                        to call Structure Variation/Fusion. Should be less or
                        equal minPartnerEnds_BothExonJunction
  --minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION
                        minimum number of fusion partner ends (ligation site),
                        when both breakpoints are at exon junctions, to call
                        Structure Variation/Fusion
  --minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION
                        minimum number of fusion partner ends (ligation site),
                        when one breakpoint is at exon junction, to call
                        Structure Variation/Fusion
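
The `--maxQueryGap` and `--maxOverlap` options describe how two split alignments may sit relative to each other on the read. The Python sketch below shows one plausible reading of those two options; it is an illustration under stated assumptions, not the pipeline's implementation. The function names are hypothetical, the thresholds passed in the example are invented (the help text above does not list defaults), and the convention of flipping reverse-strand alignments back onto the original read orientation is an assumption.

```python
import re
from typing import Iterable, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def _clipped(ops: Iterable[Tuple[int, str]]) -> int:
    """Sum the clipped bases (S or H) at the start of an operation sequence."""
    total = 0
    for length, op in ops:
        if op not in "SH":
            break
        total += length
    return total


def query_interval(cigar: str, strand: str) -> Tuple[int, int]:
    """Return the 0-based, half-open interval of the read covered by an alignment.

    For reverse-strand alignments the CIGAR is written relative to the
    reverse-complemented read, so the trailing clip gives the offset on
    the original read (assumed convention).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    lead = _clipped(ops)
    trail = _clipped(reversed(ops))
    aligned = sum(n for n, op in ops if op in "MI=X")  # query-consuming ops
    start = trail if strand == "-" else lead
    return start, start + aligned


def gap_and_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """Return (gap, overlap) in read bases between two query intervals."""
    first, second = sorted([a, b])
    gap = max(0, second[0] - first[1])
    overlap = max(0, min(first[1], second[1]) - second[0])
    return gap, overlap


def is_compatible_pair(a, b, max_query_gap: int, max_overlap: int) -> bool:
    """Check a pair of split alignments against the two thresholds."""
    gap, overlap = gap_and_overlap(a, b)
    return gap <= max_query_gap and overlap <= max_overlap


# Invented example: a 101-base read split into a left and a right part.
left = query_interval("60M41S", "+")
right = query_interval("46M55S", "-")
print(left, right, gap_and_overlap(left, right))
print(is_compatible_pair(left, right, max_query_gap=0, max_overlap=6))
```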

2. Run SplitFusion

An example command:

#python ./SplitFusion/exec/SplitFusion.py --SplitFusionPath SplitFusionPath --refGenome refGenome --bam_dir bam_dir --sample_id sample_id --output output --R R --perl perl --database_dir database_dir --panel_dir SplitFusionPath/data/panel/
# I installed SplitFusion under
#       /home/zz/repo/
# Example run:

##=========================================================
## Start from FASTQ files, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id Lib001 \
        --fastq_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --r1filename "Lib001".R1.fq \
        --r2filename "Lib001".R2.fq \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##=========================================================
## Kickstart mode, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id "Lib001" \
        --bam_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##===============================
## TARGET mode, with panel info
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id Lib001 \
        --fastq_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --panel_dir /home/zz/repo/panel \
        --panel LungFusion \
        --r1filename "Lib001".R1.fq \
        --r2filename "Lib001".R2.fq \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##===============================
## Selecting only some steps to run
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id "Lib001" \
        --bam_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --panel_dir /home/zz/repo/panel \
        --panel LungFusion \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --steps "3_breakpoint-filter,4_breakpoint-anno,5_breakpoint-anno-post" \
        --thread 6 &
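
When many samples need to be processed, the command lines above can be assembled programmatically. The following Python sketch builds the argument list for one sample using only options that appear in the help text. The function `build_splitfusion_command` is hypothetical, and the paths in the example are the ones used in the commands above. The check that at least one of `bam_dir` and `fastq_dir` is given reflects the help text's "Either fastq_dir or bam_dir should be specified".

```python
import shlex
from typing import List, Optional


def build_splitfusion_command(
    script: str,
    splitfusion_path: str,
    sample_id: str,
    database_dir: str,
    output: str,
    ref_genome: str,
    r_path: str,
    perl_path: str,
    bam_dir: Optional[str] = None,
    fastq_dir: Optional[str] = None,
    thread: Optional[int] = None,
) -> List[str]:
    """Assemble a SplitFusion.py command line as a list of arguments."""
    if bam_dir is None and fastq_dir is None:
        raise ValueError("specify bam_dir or fastq_dir")
    cmd = [
        "python", script,
        "--SplitFusionPath", splitfusion_path,
        "--sample_id", sample_id,
        "--database_dir", database_dir,
        "--output", output,
        "--refGenome", ref_genome,
        "--R", r_path,
        "--perl", perl_path,
    ]
    if bam_dir is not None:
        cmd += ["--bam_dir", bam_dir]
    if fastq_dir is not None:
        cmd += ["--fastq_dir", fastq_dir]
    if thread is not None:
        cmd += ["--thread", str(thread)]
    return cmd


# Kickstart-mode example, mirroring the second command above.
cmd = build_splitfusion_command(
    script="/home/zz/repo/SplitFusion/exec/SplitFusion.py",
    splitfusion_path="/home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion",
    sample_id="Lib001",
    database_dir="/home/zz/repo/database",
    output="/home/zz/repo/test",
    ref_genome="Homo_sapiens_assembly19.fasta",
    r_path="/home/zz/R/bin/R",
    perl_path="/usr/bin/perl",
    bam_dir="/home/zz/repo/test",
    thread=6,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the pipeline from Python instead of the shell.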

Output

An example brief output table:

SampleID	GeneExon5_GeneExon3	frame	num_partner_ends	num_unique_reads	exon.junction	breakpoint	transcript_5	transcript_3	function_5	function_3	gene_5	cdna_5	gene_3	cdna_3
Lib009	EML4_intronic---ALK_exon20	N.A.	7	298	One	2_29446396__2_42492091	NM_019063	NM_004304	intronic	exonic	EML4	-	ALK	3171
Lib009	EML4_exon6---ALK_exon20	in-frame	14	481	Both	2_29446396__2_42491871	NM_019063	NM_004304	exonic	exonic	EML4	667	ALK	3171
Lib009	EML4_exon5---ALK_exon20	out-frame	14	83	One	2_29446396__2_42490447	NM_019063	NM_004304	exonic	exonic	EML4	596	ALK	3171
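
The summary table is tab-separated, so it can be post-processed with standard tools. As a small illustration, the Python sketch below parses the three example rows and keeps the in-frame candidates. The helper names are hypothetical, the table is rebuilt inline rather than read from a real output file, and the interpretation of the `breakpoint` column as two `chromosome_position` pairs joined by a double underscore is an assumption based on the example values.

```python
import csv
import io
from typing import Dict, List, Tuple

HEADER = [
    "SampleID", "GeneExon5_GeneExon3", "frame", "num_partner_ends",
    "num_unique_reads", "exon.junction", "breakpoint", "transcript_5",
    "transcript_3", "function_5", "function_3", "gene_5", "cdna_5",
    "gene_3", "cdna_3",
]
ROWS = [
    ["Lib009", "EML4_intronic---ALK_exon20", "N.A.", "7", "298", "One",
     "2_29446396__2_42492091", "NM_019063", "NM_004304", "intronic",
     "exonic", "EML4", "-", "ALK", "3171"],
    ["Lib009", "EML4_exon6---ALK_exon20", "in-frame", "14", "481", "Both",
     "2_29446396__2_42491871", "NM_019063", "NM_004304", "exonic",
     "exonic", "EML4", "667", "ALK", "3171"],
    ["Lib009", "EML4_exon5---ALK_exon20", "out-frame", "14", "83", "One",
     "2_29446396__2_42490447", "NM_019063", "NM_004304", "exonic",
     "exonic", "EML4", "596", "ALK", "3171"],
]
# Stand-in for the contents of a real summary file.
TABLE_TEXT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def read_summary(text: str) -> List[Dict[str, str]]:
    """Parse a tab-separated summary table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def parse_breakpoint(value: str) -> List[Tuple[str, int]]:
    """Split 'chrA_posA__chrB_posB' into [(chrA, posA), (chrB, posB)] (assumed layout)."""
    sides = []
    for side in value.split("__"):
        chrom, pos = side.rsplit("_", 1)
        sides.append((chrom, int(pos)))
    return sides


rows = read_summary(TABLE_TEXT)
in_frame = [r for r in rows if r["frame"] == "in-frame"]
for r in in_frame:
    print(r["GeneExon5_GeneExon3"], r["num_unique_reads"], parse_breakpoint(r["breakpoint"]))
```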

An example output FASTA file for the EML4_intronic---ALK_exon20 fusion of sample Lib009 is:

>NS500673:45:HHK2HAFXX:1:21106:26233:4628:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGATTAAAGATGTCATCATT
>NS500673:45:HHK2HAFXX:1:21108:4972:6200:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTTACATTTTTGCTTGGTTGATT
>NS500673:45:HHK2HAFXX:2:11207:14331:12205:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:2:11301:14903:19850:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:2:21111:24355:8828:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11406:15146:11569:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCCGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11606:2779:2081:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTT
>NS500673:45:HHK2HAFXX:4:21409:22050:11159:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTATTTTCGCGAGTAGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:4:21508:24201:16676:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
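
The breakpoint-supporting reads are plain FASTA records, which makes them easy to inspect further. Below is a minimal Python sketch that parses such records and reports each read's length. The function `parse_fasta` is a hypothetical helper written for this illustration, and the input is the first two records from the example above, embedded as a string instead of being read from an output file.

```python
from typing import List, Tuple

# First two records of the example output, embedded for illustration.
FASTA_TEXT = """\
>NS500673:45:HHK2HAFXX:1:21106:26233:4628:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGATTAAAGATGTCATCATT
>NS500673:45:HHK2HAFXX:1:21108:4972:6200:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTTACATTTTTGCTTGGTTGATT
"""


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs; sequences may span several lines."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


records = parse_fasta(FASTA_TEXT)
for name, seq in records:
    print(name, len(seq))
```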

Visualization

[Figure: visualization of the example output FASTA for the fusion]
