A pipeline for calling chimeric events (gene fusion/structure variation/viral-host integration) based on split-reads.

chloefishstar/SplitFusion

 
 

SplitFusion - a fast pipeline for detection of gene fusion based on fusion-supporting split alignment.

Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets, such as BCR-ABL in chronic myeloid leukemia, EML4-ALK in lung cancer, and ROS1 fused to any of a number of partners in lung cancer. Accurate detection of gene fusions plays a pivotal role in precision medicine by matching the right drugs to the right patients.

Challenges in the diagnosis of gene fusions include poor sample quality, limited amounts of available clinical specimens, and complicated gene rearrangements. Anchored multiplex PCR (AMP) is a clinically proven technology designed, among other purposes, for robust detection of gene fusions across clinical samples of different types and varied qualities, including RNA extracted from FFPE samples.

SplitFusion is a companion data pipeline for AMP that detects gene fusions based on split alignments, i.e. reads crossing fusion breakpoints, and can accurately infer whether the partners of a given fusion candidate are joined in-frame or out-of-frame. SplitFusion also outputs example breakpoint-supporting sequences in FASTA format, allowing for further investigation.

Reference publication

Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014

How does SplitFusion work?

The analysis consists of six computational steps:

  1. Retrieve all alignments that have supplementary alignments (the 'SA' tag in SAM format) from BAM files generated by BWA MEM.

  2. Remove alignments with low mapping quality (default minimum: 20).

  3. Consolidate the split alignments (SA) belonging to the same UMI-tagged read and divide them into three alignment groups (left, middle and right) according to their relative mapping position on the original read.

  4. Define genuine breakpoint-supporting SA by several criteria.

  5. Infer the fusion effect of oncogenic fusions using snpEff.

  6. Visualize highly reliable fusions on the genome.

Finally, SplitFusion outputs a summary table and the breakpoint-spanning reads.
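
Steps 1 to 3 can be illustrated with a minimal Python sketch. This is not SplitFusion's own code: the function names and the example SAM record are hypothetical, and the only thing taken from the SAM specification is the layout of the SA tag (`rname,pos,strand,CIGAR,mapQ,NM;` per entry). The sketch collects the primary alignment and its SA entries, drops pieces below the mapping-quality threshold, and orders the rest by where they start on the original read.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def query_span(cigar, strand):
    """Return (start, end) of the aligned part on the original read (0-based, half-open)."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    lead = 0
    for n, op in ops:                 # clipped bases before the aligned part
        if op not in "SH":
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):       # clipped bases after the aligned part
        if op not in "SH":
            break
        trail += n
    aligned = sum(n for n, op in ops if op in "MI=X")
    # A reverse-strand CIGAR describes the reverse-complemented read,
    # so its trailing clip is the leading part of the original read.
    start = trail if strand == "-" else lead
    return start, start + aligned

def parse_sa(sa_value):
    """Parse the value of an SA:Z: tag into a list of alignment pieces."""
    pieces = []
    for entry in sa_value.rstrip(";").split(";"):
        rname, pos, strand, cigar, mapq, _nm = entry.split(",")
        pieces.append({"rname": rname, "pos": int(pos), "strand": strand,
                       "cigar": cigar, "mapq": int(mapq)})
    return pieces

def split_pieces(sam_line, min_mq=20):
    """Return the pieces of one split read, filtered by MAPQ and ordered left to right."""
    fields = sam_line.rstrip("\n").split("\t")
    sa = next((f[5:] for f in fields[11:] if f.startswith("SA:Z:")), None)
    if sa is None:                    # step 1: keep only reads carrying an SA tag
        return []
    primary = {"rname": fields[2], "pos": int(fields[3]),
               "strand": "-" if int(fields[1]) & 16 else "+",
               "cigar": fields[5], "mapq": int(fields[4])}
    kept = [p for p in [primary] + parse_sa(sa) if p["mapq"] >= min_mq]  # step 2
    for p in kept:
        p["qstart"], p["qend"] = query_span(p["cigar"], p["strand"])
    return sorted(kept, key=lambda p: p["qstart"])                       # step 3

# Hypothetical 100-base read: 60 bases map to one locus, 40 to another,
# plus a low-quality third piece that the filter removes.
EXAMPLE = ("read1\t0\t2\t42491871\t60\t60M40S\t*\t0\t0\t*\t*\t"
           "SA:Z:2,29446396,-,40M60S,60,0;7,100,+,80S20M,5,1;")

pieces = split_pieces(EXAMPLE)
for p in pieces:
    print(p["rname"], p["pos"], p["strand"], p["qstart"], p["qend"])
```

Sorting by the start position on the read is what makes a left/middle/right grouping possible; how SplitFusion itself assigns the groups is not shown here.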

Installation

1. Dependencies

  • R packages ("plyr", "data.table", "parallel")
> install.packages(c("plyr", "data.table", "parallel"))

2. Installing SplitFusion

git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git

R CMD INSTALL SplitFusion

Run

1. Help

python ./SplitFusion/exec/SplitFusion.py -h

usage: SplitFusion.py [-h] --SplitFusionPath SPLITFUSIONPATH --R R --perl PERL
                      --refGenome REFGENOME --sample_id SAMPLE_ID
                      --database_dir DATABASE_DIR [--bam_dir BAM_DIR]
                      [--fastq_dir FASTQ_DIR] [--panel_dir PANEL_DIR]
                      [--r1filename R1FILENAME] [--r2filename R2FILENAME]
                      --output OUTPUT [--panel PANEL] [--steps STEPS]
                      [--AnnotationMethod ANNOTATIONMETHOD]
                      [--snpEff_ref SNPEFF_REF] [--thread THREAD]
                      [--minMQ MINMQ] [--minMQ1 MINMQ1]
                      [--minMapLength MINMAPLENGTH]
                      [--minMapLength2 MINMAPLENGTH2]
                      [--maxQueryGap MAXQUERYGAP] [--maxOverlap MAXOVERLAP]
                      [--minExclusive MINEXCLUSIVE]
                      [--FusionMinStartSite FUSIONMINSTARTSITE]
                      [--minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION]
                      [--minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION]

Split-Fusion is a fast data analysis pipeline detects gene fusion based on
split reads and/or paired-end reads.

optional arguments:
  -h, --help            show this help message and exit
  --SplitFusionPath SPLITFUSIONPATH
                        the path where Split-Fusion pipeline is installed
                        [required]
  --R R                 the path of R [required]
  --perl PERL           the path of perl [required]
  --refGenome REFGENOME
                        the path where human genome reference is stored
                        [required]
  --sample_id SAMPLE_ID
                        the sample name of running [required]
  --database_dir DATABASE_DIR
                        the path where large databases e.g. reference genome
                        and annotation databases are stored [required]
  --bam_dir BAM_DIR     the path where bam or fastq file is stored.
                        [Kickstart] The bam file of the sameple_id (xxx.bam or
                        xxx.consolidated.bam) will be used. Either fastq_dir
                        or bam_dir should be specified
  --fastq_dir FASTQ_DIR
                        the path where fastq file is stored. Either fastq_dir
                        or bam_dir should be specified
  --panel_dir PANEL_DIR
                        For Target mode: the path where panel specific files
                        are stored.
  --r1filename R1FILENAME
                        Read 1 fastq filename. Can be in gzipped format. If
                        not specified, $fastq_dir/$sample_id.R1.fq will be
                        used.
  --r2filename R2FILENAME
                        Read 2 fastq filename. Can be in gzipped format. If
                        not specified, $fastq_dir/$sample_id.R2.fq will be
                        used.
  --output OUTPUT       the path where output is stored [required]
  --panel PANEL         the path where target genes panel file is stored
  --steps STEPS         specify steps to run
  --AnnotationMethod ANNOTATIONMETHOD
                        the name of annotation tools (annovar or snpEff)
  --snpEff_ref SNPEFF_REF
                        the version of snpEff reference
  --thread THREAD       number of threads for computing
  --minMQ MINMQ         minimum mapping quality
  --minMQ1 MINMQ1       minimum mapping quality of a leftmost of Read1
                        (rightmost of Read2
  --minMapLength MINMAPLENGTH
                        minimum read mapping length
  --minMapLength2 MINMAPLENGTH2
                        minimum mapping length of rightmost of Read1 (leftmost
                        of Read2)
  --maxQueryGap MAXQUERYGAP
                        maximum gap length on a query read of split alignments
  --maxOverlap MAXOVERLAP
                        maximum overlap bases of two split alignments
  --minExclusive MINEXCLUSIVE
                        minimum exclusive length between two split alignments
  --FusionMinStartSite FUSIONMINSTARTSITE
                        minimum number of Adaptor Ligation Read Starting Sites
                        to call Structure Variation/Fusion. Should be less or
                        equal minPartnerEnds_BothExonJunction
  --minPartnerEnds_BothExonJunction MINPARTNERENDS_BOTHEXONJUNCTION
                        minimum number of fusion partner ends (ligation site),
                        when both breakpoints are at exon junctions, to call
                        Structure Variation/Fusion
  --minPartnerEnds_OneExonJunction MINPARTNERENDS_ONEEXONJUNCTION
                        minimum number of fusion partner ends (ligation site),
                        when one breakpoint is at exon junction, to call
                        Structure Variation/Fusion
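
The help text states a few constraints that are easy to miss: seven options are marked [required], either --fastq_dir or --bam_dir should be specified, and FusionMinStartSite should not exceed minPartnerEnds_BothExonJunction. The following Python sketch checks these before a run is launched. The `check_args` helper is hypothetical and not part of SplitFusion; it only restates the rules printed above.

```python
REQUIRED = ["SplitFusionPath", "R", "perl", "refGenome",
            "sample_id", "database_dir", "output"]

def check_args(args):
    """Return a list of problems found in a dict of SplitFusion.py option values."""
    problems = []
    for name in REQUIRED:
        if not args.get(name):
            problems.append(f"--{name} is required")
    if not args.get("bam_dir") and not args.get("fastq_dir"):
        problems.append("either --fastq_dir or --bam_dir should be specified")
    start_sites = args.get("FusionMinStartSite")
    partner_ends = args.get("minPartnerEnds_BothExonJunction")
    if start_sites is not None and partner_ends is not None and start_sites > partner_ends:
        problems.append("--FusionMinStartSite should be less than or equal to "
                        "--minPartnerEnds_BothExonJunction")
    return problems

# Option values taken from the example commands in this README;
# the two threshold values are made up for illustration.
example = {
    "SplitFusionPath": "/home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion",
    "R": "/home/zz/R/bin/R",
    "perl": "/usr/bin/perl",
    "refGenome": "Homo_sapiens_assembly19.fasta",
    "sample_id": "Lib001",
    "database_dir": "/home/zz/repo/database",
    "output": "/home/zz/repo/test",
    "FusionMinStartSite": 3,
    "minPartnerEnds_BothExonJunction": 1,
}
for problem in check_args(example):
    print(problem)
```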

2. Run SplitFusion

An example command:

#python ./SplitFusion/exec/SplitFusion.py --SplitFusionPath SplitFusionPath --refGenome refGenome --bam_dir bam_dir --sample_id sample_id --output output --R R --perl perl --database_dir database_dir --panel_dir SplitFusionPath/data/panel/
# I installed SplitFusion under
#       /home/zz/repo/
# Example run:

##=========================================================
## Start from FASTQ files, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id Lib001 \
        --fastq_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --r1filename "Lib001".R1.fq \
        --r2filename "Lib001".R2.fq \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##=========================================================
## Kickstart mode, no panel info
## , compatible with RNA-seq whole transcriptome analysis
##=========================================================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id "Lib001" \
        --bam_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##===============================
## TARGET mode, with panel info
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id Lib001 \
        --fastq_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --panel_dir /home/zz/repo/panel \
        --panel LungFusion \
        --r1filename "Lib001".R1.fq \
        --r2filename "Lib001".R2.fq \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --thread 6 &


##===============================
## Selecting only some steps to run
##===============================
python /home/zz/repo/SplitFusion/exec/SplitFusion.py \
        --SplitFusionPath /home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion \
        --sample_id "Lib001" \
        --bam_dir /home/zz/repo/test \
        --database_dir /home/zz/repo/database \
        --panel_dir /home/zz/repo/panel \
        --panel LungFusion \
        --output /home/zz/repo/test \
        --refGenome Homo_sapiens_assembly19.fasta \
        --R /home/zz/R/bin/R \
        --perl /usr/bin/perl \
        --steps "3_breakpoint-filter,4_breakpoint-anno,5_breakpoint-anno-post" \
        --thread 6 &
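
When several samples share the same paths, the kickstart-mode command can be assembled programmatically. The Python sketch below is an assumption-laden convenience wrapper, not part of SplitFusion: `build_command` and the `PATHS` dictionary are hypothetical names, and only options listed in the help output are used. It prints the commands instead of running them.

```python
import shlex

def build_command(sample_id, paths, panel=None, steps=None, thread=6):
    """Assemble a kickstart-mode SplitFusion.py command as an argument list."""
    cmd = [
        "python", paths["script"],
        "--SplitFusionPath", paths["SplitFusionPath"],
        "--sample_id", sample_id,
        "--bam_dir", paths["bam_dir"],
        "--database_dir", paths["database_dir"],
        "--output", paths["output"],
        "--refGenome", paths["refGenome"],
        "--R", paths["R"],
        "--perl", paths["perl"],
        "--thread", str(thread),
    ]
    if panel is not None:             # TARGET mode needs both panel options
        cmd += ["--panel_dir", paths["panel_dir"], "--panel", panel]
    if steps:                         # run only the named steps
        cmd += ["--steps", ",".join(steps)]
    return cmd

# Paths copied from the example runs above.
PATHS = {
    "script": "/home/zz/repo/SplitFusion/exec/SplitFusion.py",
    "SplitFusionPath": "/home/zz/R/x86_64-pc-linux-gnu-library/3.5/SplitFusion",
    "bam_dir": "/home/zz/repo/test",
    "database_dir": "/home/zz/repo/database",
    "panel_dir": "/home/zz/repo/panel",
    "output": "/home/zz/repo/test",
    "refGenome": "Homo_sapiens_assembly19.fasta",
    "R": "/home/zz/R/bin/R",
    "perl": "/usr/bin/perl",
}

for sample in ["Lib001", "Lib009"]:
    # shlex.join (Python 3.8+) quotes each argument for safe pasting into a shell.
    print(shlex.join(build_command(sample, PATHS, panel="LungFusion")))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the pipeline; keeping the arguments as a list avoids shell-quoting problems with sample names or paths.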

Output

An example brief output table:

SampleID  GeneExon5_GeneExon3         frame      num_partner_ends  num_unique_reads  exon.junction  breakpoint              transcript_5  transcript_3  function_5  function_3  gene_5  cdna_5  gene_3  cdna_3
Lib009    EML4_intronic---ALK_exon20  N.A.       7                 298               One            2_29446396__2_42492091  NM_019063     NM_004304     intronic    exonic      EML4    -       ALK     3171
Lib009    EML4_exon6---ALK_exon20     in-frame   14                481               Both           2_29446396__2_42491871  NM_019063     NM_004304     exonic      exonic      EML4    667     ALK     3171
Lib009    EML4_exon5---ALK_exon20     out-frame  14                83                One            2_29446396__2_42490447  NM_019063     NM_004304     exonic      exonic      EML4    596     ALK     3171
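
For downstream filtering, the table can be read into dictionaries keyed by column name. The Python sketch below works on the three example rows only and rests on two assumptions that the README does not confirm: that the summary file is whitespace-delimited, and that the breakpoint column is laid out as `chrom_pos__chrom_pos`. The helper names are hypothetical.

```python
TABLE = """\
SampleID GeneExon5_GeneExon3 frame num_partner_ends num_unique_reads exon.junction breakpoint transcript_5 transcript_3 function_5 function_3 gene_5 cdna_5 gene_3 cdna_3
Lib009 EML4_intronic---ALK_exon20 N.A. 7 298 One 2_29446396__2_42492091 NM_019063 NM_004304 intronic exonic EML4 - ALK 3171
Lib009 EML4_exon6---ALK_exon20 in-frame 14 481 Both 2_29446396__2_42491871 NM_019063 NM_004304 exonic exonic EML4 667 ALK 3171
Lib009 EML4_exon5---ALK_exon20 out-frame 14 83 One 2_29446396__2_42490447 NM_019063 NM_004304 exonic exonic EML4 596 ALK 3171
"""

def read_rows(text):
    """Split a whitespace-delimited table into one dict per data row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

def parse_breakpoint(value):
    """Split 'chrom_pos__chrom_pos' into two (chrom, position) pairs (assumed layout)."""
    sides = []
    for side in value.split("__"):
        chrom, pos = side.rsplit("_", 1)
        sides.append((chrom, int(pos)))
    return tuple(sides)

rows = read_rows(TABLE)

# Keep in-frame candidates, the ones most likely to encode a functional fusion protein.
in_frame = [r for r in rows if r["frame"] == "in-frame"]
for r in in_frame:
    print(r["GeneExon5_GeneExon3"], r["num_unique_reads"], parse_breakpoint(r["breakpoint"]))
```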

An example output FASTA file for the EML4_intronic---ALK_exon20 fusion of sample Lib009 is:

>NS500673:45:HHK2HAFXX:1:21106:26233:4628:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGATTAAAGATGTCATCATT
>NS500673:45:HHK2HAFXX:1:21108:4972:6200:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTTACATTTTTGCTTGGTTGATT
>NS500673:45:HHK2HAFXX:2:11207:14331:12205:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:2:11301:14903:19850:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:2:21111:24355:8828:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11406:15146:11569:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCCGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11606:2779:2081:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTGCTTGGTTGATGATGACATCTTT
>NS500673:45:HHK2HAFXX:4:21409:22050:11159:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTGGCTGTTATTTTCGCGAGTAGACATTTTTGCTTGGTTGATG
>NS500673:45:HHK2HAFXX:4:21508:24201:16676:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
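
The breakpoint-supporting reads can be inspected with a few lines of Python. The sketch below is a minimal FASTA reader written for illustration (the function names are hypothetical); it loads two of the reads listed above and reports where they differ, which is a quick way to spot sequencing errors among reads that support the same breakpoint.

```python
FASTA = """\
>NS500673:45:HHK2HAFXX:2:11301:14903:19850:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCTGTTTTTTTCGCGAGTTGACATTTTTG
>NS500673:45:HHK2HAFXX:4:11406:15146:11569:
ATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTACACTTGGCCGTTTTTTTCGCGAGTTGACATTTTTG
"""

def read_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)       # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def mismatches(a, b):
    """Return the 0-based positions at which two sequences differ (up to the shorter length)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

records = read_fasta(FASTA)
(first_name, first_seq), (second_name, second_seq) = records
diff = mismatches(first_seq, second_seq)
print(len(records), "reads;", len(diff), "mismatching position(s):", diff)
```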

Visualization

A visualization of the example output FASTA for the fusion: [image]
