SplitFusion - a fast pipeline for detection of gene fusion based on fusion-supporting split alignment.
Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets, such as BCR-ABL in chronic myeloid leukemia, EML4-ALK in lung cancer, and ROS1 fused to any of a number of partners in lung cancer. Accurate detection of gene fusions plays a pivotal role in precision medicine by matching the right drugs to the right patients.
Challenges in the diagnosis of gene fusions include poor sample quality, limited amounts of available clinical specimens, and complicated gene rearrangements. Anchored multiplex PCR (AMP) is a clinically proven technology designed in part for robust detection of gene fusions across clinical samples of different types and varied quality, including RNA extracted from FFPE samples.
SplitFusion is a companion data pipeline for AMP that detects gene fusions from split alignments, i.e. reads crossing fusion breakpoints, and can accurately infer whether the partners of a given fusion candidate are joined in-frame or out-of-frame. SplitFusion also outputs example breakpoint-supporting sequences in FASTA format, allowing for further investigation.
Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014.
The analysis consists of the following computational steps:
- Retrieve all alignments that have supplementary alignments (the 'SA' tag in SAM format) from BAM files generated by BWA MEM.
- Remove alignments with low mapping quality (default threshold: 20).
- Divide the consolidated split alignments belonging to the same UMI-tagged read into three alignment groups (left, middle and right) according to their mapping position relative to the original read.
- Define genuine breakpoint-supporting split alignments using several criteria.
- Infer the fusion effect of oncogenic fusions using snpEff.
- Visualize highly reliable fusions on the genome.
Lastly, SplitFusion outputs a summary table and breakpoint-spanning reads.
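
The first three steps can be illustrated with a minimal Python sketch. It is not SplitFusion's own code: the function names, the example SA string, and the choice to keep alignments with mapping quality greater than or equal to the threshold are assumptions made for illustration. The only thing taken from the SAM specification is the layout of the `SA` tag value (`rname,pos,strand,CIGAR,mapQ,NM;` per entry). The sketch parses an SA value, drops low-quality entries, and orders the rest by where they start on the original read, which is the information needed to tell left from right alignments.

```python
import re
from typing import List, NamedTuple, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class SplitAlignment(NamedTuple):
    chrom: str
    pos: int
    strand: str
    cigar: str
    mapq: int
    nm: int


def parse_sa_tag(sa_value: str) -> List[SplitAlignment]:
    """Parse the value of a SAM SA:Z tag into one record per entry."""
    records = []
    for entry in sa_value.strip().split(";"):
        if not entry:
            continue  # the tag value ends with a trailing ';'
        chrom, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(
            SplitAlignment(chrom, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records


def query_interval(cigar: str, strand: str) -> Tuple[int, int]:
    """Return the 0-based [start, end) span covered on the original read."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    lead = 0
    for n, op in ops:
        if op not in "SH":
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    aligned = sum(n for n, op in ops if op in "MI=X")
    # A reverse-strand CIGAR describes the reverse-complemented read,
    # so its trailing clip is the offset on the read as sequenced.
    start = lead if strand == "+" else trail
    return start, start + aligned


def filter_and_order(sa_value: str, min_mapq: int = 20) -> List[SplitAlignment]:
    """Keep entries with mapQ >= min_mapq (assumed rule), ordered by read position."""
    kept = [a for a in parse_sa_tag(sa_value) if a.mapq >= min_mapq]
    return sorted(kept, key=lambda a: query_interval(a.cigar, a.strand)[0])


# Hypothetical SA value for a 100-bp read; the third entry has low mapQ.
example = "2,29446396,-,40M60S,60,0;2,42492091,+,60M40S,60,1;7,55000,+,30S20M50S,3,2;"
for aln in filter_and_order(example):
    print(aln.chrom, aln.pos, query_interval(aln.cigar, aln.strand))
```

Ordering by read position rather than by genomic coordinate matters because the two halves of a fusion read can map to different chromosomes or strands.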
- R packages ("plyr", "data.table", "parallel")
> install.packages(c("plyr", "data.table"))  # "parallel" ships with base R and needs no installation
git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git
R CMD INSTALL SplitFusion
python ./SplitFusion/exec/SplitFusion.py -h
usage: SplitFusion.py [-h] --SplitFusionPath SPLITFUSIONPATH --R R --perl PERL
--refGenome REFGENOME --sample_id SAMPLE_ID
--database_dir DATABASE_DIR [--bam_dir BAM_DIR]
[--fastq_dir FASTQ_DIR] [--panel_dir PANEL_DIR]
[--r1filename R1FILENAME] [--r2filename R2FILENAME]
--output OUTPUT [--panel PANEL] [--steps STEPS]
[--AnnotationMethod ANNOTATIONMETHOD]
[--snpEff_ref SNPEFF_REF] [--thread THREAD]
[--minMQ MINMQ] [--minMQ1 MINMQ1]
[--minMapLength MINMAPLENGTH]
[--minMapLength2 MINMAPLENGTH2]
[--maxQueryGap MAXQUERYGAP] [--maxOverlap MAXOVERLAP]
[--minExclusive MINEXCLUSIVE]
[--FusionMinStartSite FUSIONMINSTARTSITE]
[--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION]
[--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION]
Split-Fusion is a fast data analysis pipeline detects gene fusion based on
split reads and/or paired-end reads.
optional arguments:
-h, --help show this help message and exit
--SplitFusionPath SPLITFUSIONPATH
the path where Split-Fusion pipeline is installed
[required]
--R R the path of R [required]
--perl PERL the path of perl [required]
--refGenome REFGENOME
the path where human genome reference is stored
[required]
--sample_id SAMPLE_ID
the sample name of running [required]
--database_dir DATABASE_DIR
the path where large databases e.g. reference genome
and annotation databases are stored [required]
--bam_dir BAM_DIR the path where bam or fastq file is stored.
[Kickstart] The bam file of the sameple_id (xxx.bam or
xxx.consolidated.bam) will be used. Either fastq_dir
or bam_dir should be specified
--fastq_dir FASTQ_DIR
the path where fastq file is stored. Either fastq_dir
or bam_dir should be specified
--panel_dir PANEL_DIR
For Target mode: the path where panel specific files
are stored.
--r1filename R1FILENAME
Read 1 fastq filename. Can be in gzipped format. If
not specified, $fastq_dir/$sample_id.R1.fq will be
used.
--r2filename R2FILENAME
Read 2 fastq filename. Can be in gzipped format. If
not specified, $fastq_dir/$sample_id.R2.fq will be
used.
--output OUTPUT the path where output is stored [required]
--panel PANEL the path where target genes panel file is stored
--steps STEPS specify steps to run
--AnnotationMethod ANNOTATIONMETHOD
the name of annotation tools (annovar or snpEff)
--snpEff_ref SNPEFF_REF
the version of snpEff reference
--thread THREAD number of threads for computing
--minMQ MINMQ minimum mapping quality
--minMQ1 MINMQ1 minimum mapping quality of a leftmost of Read1
(rightmost of Read2
--minMapLength MINMAPLENGTH
minimum read mapping length
--minMapLength2 MINMAPLENGTH2
minimum mapping length of rightmost of Read1 (leftmost
of Read2)
--maxQueryGap MAXQUERYGAP
maximum gap length on a query read of split alignments
--maxOverlap MAXOVERLAP
maximum overlap bases of two split alignments
--minExclusive MINEXCLUSIVE
minimum exclusive length between two split alignments
--FusionMinStartSite FUSIONMINSTARTSITE
minimum number of Adaptor Ligation Read Starting Sites
to call Structure Variation/Fusion. Should be less or
equal minPartnerEnds_BothExonJunction
--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION
minimum number of fusion partner ends (ligation site),
when both breakpoints are at exon junctions, to call
Structure Variation/Fusion
--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION
minimum number of fusion partner ends (ligation site),
when one breakpoint is at exon junction, to call
Structure Variation/Fusion
#python ./SplitFusion/exec/SplitFusion.py --SplitFusionPath SplitFusionPath --refGenome refGenome --bam_dir bam_dir --sample_id sample_id --output output --R R --perl perl --database_dir database_dir --panel_dir SplitFusionPath/data/panel/
# I installed SplitFusion under
# /home/zz/repo/
# Example run:
##=========================================================
## Start from FASTQ files, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
--SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
--sample_id Lib001 \
--fastq_dir /home/zz/repo/test \
--database_dir /home/zz/repo/database \
--r1filename "Lib001".R1.fq \
--r2filename "Lib001".R2.fq \
--output /home/zz/repo/test \
--refGenome Homo_sapiens_assembly19.fasta \
--R /home/zz/R/bin/R \
--perl /usr/bin/perl \
--thread 6 &
##=========================================================
## Kickstart mode, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
--SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
--sample_id "Lib001" \
--bam_dir /home/zz/repo/test \
--database_dir /home/zz/repo/database \
--output /home/zz/repo/test \
--refGenome Homo_sapiens_assembly19.fasta \
--R /home/zz/R/bin/R \
--perl /usr/bin/perl \
--thread 6 &
##===============================
## TARGET mode, with panel info
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
--SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
--sample_id Lib001 \
--fastq_dir /home/zz/repo/test \
--database_dir /home/zz/repo/database \
--panel_dir /home/zz/repo/panel \
--panel LungFusion \
--r1filename "Lib001".R1.fq \
--r2filename "Lib001".R2.fq \
--output /home/zz/repo/test \
--refGenome Homo_sapiens_assembly19.fasta \
--R /home/zz/R/bin/R \
--perl /usr/bin/perl \
--thread 6 &
##===============================
## Selecting only some steps to run
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
--SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
--sample_id "Lib001" \
--bam_dir /home/zz/repo/test \
--database_dir /home/zz/repo/database \
--panel_dir /home/zz/repo/panel \
--panel LungFusion \
--output /home/zz/repo/test \
--refGenome Homo_sapiens_assembly19.fasta \
--R /home/zz/R/bin/R \
--perl /usr/bin/perl \
--steps "3_breakpoint-filter,4_breakpoint-anno,5_breakpoint-anno-post" \
--thread 6 &
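
When many samples need to be processed, the command lines above can be assembled programmatically. The following is a minimal sketch, not part of SplitFusion: `build_command` and `REQUIRED` are hypothetical names, and the checks only mirror what the help text states (the flags marked `[required]`, the need for `fastq_dir` or `bam_dir`, and `FusionMinStartSite` being less than or equal to `minPartnerEnds_BothExonJunction`). The paths are those of the Kickstart example.

```python
import shlex
from typing import Dict, List

# Flags marked [required] in the help text.
REQUIRED = ("SplitFusionPath", "R", "perl", "refGenome",
            "sample_id", "database_dir", "output")


def build_command(script_path: str, options: Dict[str, object]) -> List[str]:
    """Validate options against the help text and return an argument list."""
    missing = [name for name in REQUIRED if name not in options]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    if "fastq_dir" not in options and "bam_dir" not in options:
        raise ValueError("specify fastq_dir or bam_dir")
    start_sites = options.get("FusionMinStartSite")
    partner_ends = options.get("minPartnerEnds_BothExonJunction")
    if (start_sites is not None and partner_ends is not None
            and int(start_sites) > int(partner_ends)):
        raise ValueError(
            "FusionMinStartSite must not exceed minPartnerEnds_BothExonJunction"
        )
    command = ["python", script_path]
    for name, value in options.items():
        command += ["--" + name, str(value)]
    return command


options = {
    "SplitFusionPath": "/home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion",
    "sample_id": "Lib001",
    "bam_dir": "/home/zz/repo/test",
    "database_dir": "/home/zz/repo/database",
    "output": "/home/zz/repo/test",
    "refGenome": "Homo_sapiens_assembly19.fasta",
    "R": "/home/zz/R/bin/R",
    "perl": "/usr/bin/perl",
    "thread": 6,
}
command = build_command("/home/zz/repo/SplitFusion/exec/SplitFusion.py", options)
print(shlex.join(command))
# To launch the run from Python: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with sample names or paths that contain spaces.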
An example brief output table:
SampleID | GeneExon5_GeneExon3 | frame | num_partner_ends | num_unique_reads | exon.junction | breakpoint | transcript_5 | transcript_3 | function_5 | function_3 | gene_5 | cdna_5 | gene_3 | cdna_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lib009 | EML4_intronic---ALK_exon20 | N.A. | 7 | 298 | One | 2_29446396__2_42492091 | NM_019063 | NM_004304 | intronic | exonic | EML4 | - | ALK | 3171 |
Lib009 | EML4_exon6---ALK_exon20 | in-frame | 14 | 481 | Both | 2_29446396__2_42491871 | NM_019063 | NM_004304 | exonic | exonic | EML4 | 667 | ALK | 3171 |
Lib009 | EML4_exon5---ALK_exon20 | out-frame | 14 | 83 | One | 2_29446396__2_42490447 | NM_019063 | NM_004304 | exonic | exonic | EML4 | 596 | ALK | 3171 |
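
For downstream filtering, a table like the one above can be read into Python. This is a minimal sketch under two assumptions that the source does not confirm: that the table is available as pipe-delimited text exactly as shown, and that the `breakpoint` column encodes two `chromosome_position` pairs joined by a double underscore. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

TABLE = """
SampleID | GeneExon5_GeneExon3 | frame | num_partner_ends | num_unique_reads | exon.junction | breakpoint | transcript_5 | transcript_3 | function_5 | function_3 | gene_5 | cdna_5 | gene_3 | cdna_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lib009 | EML4_intronic---ALK_exon20 | N.A. | 7 | 298 | One | 2_29446396__2_42492091 | NM_019063 | NM_004304 | intronic | exonic | EML4 | - | ALK | 3171 |
Lib009 | EML4_exon6---ALK_exon20 | in-frame | 14 | 481 | Both | 2_29446396__2_42491871 | NM_019063 | NM_004304 | exonic | exonic | EML4 | 667 | ALK | 3171 |
Lib009 | EML4_exon5---ALK_exon20 | out-frame | 14 | 83 | One | 2_29446396__2_42490447 | NM_019063 | NM_004304 | exonic | exonic | EML4 | 596 | ALK | 3171 |
"""


def split_row(line: str) -> List[str]:
    """Split one pipe-delimited row into stripped cells."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_table(text: str) -> List[Dict[str, str]]:
    """Return one dict per data row, keyed by the header names."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # markdown separator row
        rows.append(dict(zip(header, cells)))
    return rows


def parse_breakpoint(value: str) -> Tuple[Tuple[str, int], Tuple[str, int]]:
    """Split 'chrA_posA__chrB_posB' into two (chromosome, position) pairs."""
    def side(text: str) -> Tuple[str, int]:
        chrom, pos = text.rsplit("_", 1)
        return chrom, int(pos)

    first, second = value.split("__")
    return side(first), side(second)


rows = parse_table(TABLE)
in_frame = [row for row in rows if row["frame"] == "in-frame"]
for row in in_frame:
    print(row["GeneExon5_GeneExon3"], parse_breakpoint(row["breakpoint"]))
```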
An example output FASTA file for the EML4_intronic---ALK_exon20 fusion of sample Lib009 is:
>NS500673:45:HHK2HAFXX:1:21106:26233:4628:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGATTAAAGATGTCATCATT
>NS500673:45:HHK2HAFXX:1:21108:4972:6200:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTTACATTTTTGCTTGGTTGATT
>NS500673:45:HHK2HAFXX:2:11207:14331:12205:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:2:11301:14903:19850:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:2:21111:24355:8828:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11406:15146:11569:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCCGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11606:2779:2081:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTT
>NS500673:45:HHK2HAFXX:4:21409:22050:11159:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTATTTTCGCGAGTAGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:4:21508:24201:16676:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
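
Breakpoint-supporting reads in this format can be loaded for further inspection with a few lines of Python. The sketch below is only an illustration: `read_fasta` is a hypothetical helper, and the embedded text is the first two records shown above rather than a real output file path.

```python
from typing import Dict, Iterable

FASTA_TEXT = """\
>NS500673:45:HHK2HAFXX:1:21106:26233:4628:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGATTAAAGATGTCATCATT
>NS500673:45:HHK2HAFXX:1:21108:4972:6200:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTTACATTTTTGCTTGGTTGATT
"""


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each read name to its sequence; multi-line sequences are joined."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


records = read_fasta(FASTA_TEXT.splitlines())
lengths = {name: len(seq) for name, seq in records.items()}
for name, length in lengths.items():
    print(name, length)
```

To read an actual output file instead, pass an open file handle to `read_fasta` in place of the split text.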