Skip to content

CMUSchwartzLab/TUSV-INT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deconvolution and Phylogeny Inference of Diverse Variant Types Integrating Bulk DNA-seq with Single-cell RNA-seq

We develop TUSV-INT, a tool for clonal evolution studies integrating bulk DNA-seq and scRNA-seq with diverse variant types (SNV, CNA, and SV). The work uses a general integer linear programming (ILP) framework for clonal lineage reconstruction.

Contents

  1. Installation
  2. Running TUSV-INT
  3. Input Settings
  4. Example

Installation

TUSV-INT is built with python 2.7. We provide the following commands to set up the environment -

conda create -n tusvint python=2.7
conda activate tusvint
conda config --add channels conda-forge
conda config --add channels bioconda

Then, you will need the following packages in the tusvint environment.
- numpy
- pandas
- ete2
- gurobipy
- graphviz
- biopython=1.76
- scipy
- PyVCF

Running TUSV-INT

Input

The method requires two types of inputs. The first is a directory with the bulk DNAseq samples containing SNVs, CNAs and SVs. The second is the allele-specific CNA calls from scRNA-seq. The details of the inputs are given below -

  • Bulk DNA-seq samples: A directory containing the processed variant calls of the bulk DNAseq samples in VCF format. An example can be found in simulation_data/input/samples/.
  • ScRNA-seq: The allele-specific clonal copy numbers from scRNA-seq in .tsv format. For each scRNA clone, the file will have one row. The first r columns will contain the major copy numbers, the later r columns will contain the minor copy numbers. Here is a tab-separated version of the file where the first line is the header and rows correspond to scRNA clones -
chr_start_end_p .. chr_start_end_p .. chr_start_end_m .. chr_start_end_m
1 .. 1 .. 2 .. 1
2 .. 1 .. 1 .. 1
1 .. 1 .. 1 .. 1

An example of the inputs can be found in the simulation_data/input/ folder.

Output

  • T.dot: Output tree with the clone assignments in the nodes and phylogenetic cost/number of SNV and SV mapped in the branches.
  • M.tsv: Bulk DNA-seq clone in the tree to ScRNA-seq clonal assignment matrix.
  • C.tsv: The variant copy number profile matrix (Size: clones * variants)
  • U.tsv: The clonal Mixture fraction matrix (Size: sample * clones)

Input Settings

Following inputs are mandatory:

  • -i : input directory containing bulk DNA-seq VCF files
  • -f : input .tsv file containing scRNA-seq CNAs
  • -o : output directory
  • -n : number of leaves
  • -c : maximum copy number allowed for any breakpoint or segment on any node
  • -t : maximum number of coordinate-descent iterations
  • -r : number of random initializations of the coordinate-descent algorithm
  • -col : binary flag whether to collapse the redundant nodes
  • -sv_ub : the number of subsampled SV breakpoints
  • -const : number of total subsampled breakpoints and SNVs
  • -m : maximum time (seconds) in each coordinate descent iteration

Optional parameters:

  • -x : cell consensus percentage within each clone (default = 34)
  • -b : binary flag for the regularization parameters to be set automatically
  • -l : lambda regularization parameter for weighting the phylogenetic cost
  • -p : number of processors to use (uses all the available cores by default)
  • -s : number of segments (in addition to those containing breakpoints) that are randomly kept (default keeps all the segments)

Example

python -u tusv-int.py -i simulation_data/input/sample/ -f simulation_data/input/C_scRNA_CNVs.tsv -o simulation_data/output/ -n 2 -c 10 -t 3 -r 3 -m 1000 -b -C 120 -sv_ub 80

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages