This repository contains code and instructions for simulating data generated by CRISPR regulatory screens. It contains the following sections:
This software can be used to simulate data that would be generated by a standard CRISPR regulatory screen.
The simulation parameters mimic the experimental procedure of a CRISPR regulatory screen by taking the following variables into account:
Experimental step | Simulated step |
---|---|
NA | Size and number of true enhancers |
NA | Enhancer strength |
NA | Guide efficiency |
Transduction of cells with guide library | Generate guide distribution |
Load sorter with cells* | Specify number of cells to sort |
Sort cells** | Specify sorting probabilities for sorting using Dirichlet-Multinomial distribution |
Specify sequencing depth | Specify sequencing depth |
Sequence pools | Specify whether PCR duplicates are accounted for |
* If a selection screen is specified, the user must provide the simulation with the number of cells in the input pool prior to sorting.
** dropout rate
The simulations are executed in R. Please make sure you have R version 3.5.1 or higher installed on your computer.
Clone the CRSsim repository to your computer with the following command:
git clone https://github.com/patfiaux/CRSsim.git
or manually download the repository.
To simulate data, you will need the packages listed below. If you don't have them, install them using the following commands in R. Installations should take about 5 minutes on a standard laptop.
- MCMCpack
install.packages('MCMCpack')
- transport
install.packages('transport')
- splines
install.packages('splines')
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
- IRanges
BiocManager::install("IRanges")
- GenomicRanges
BiocManager::install("GenomicRanges")
Here is a walkthrough of simulating data for a selection screen with sample data provided in ./CRSsim/Example_data
.
Running a simulation will generate three .csv files:
- A guide information file, containing information about all the simulated guides (chromosome, start, end, label). The output file will be named
{output_name}_info.csv
. A guide will have one of the following labels:
- 'exon': guides overlapping exons of the gene of interest
- 'neg': non-targeting negative control guides
- 'pos': guides overlapping enhancers. These are true positives
- 'chr': guides targeting the region of interest but not overlapping with either an exon or an enhancer Note: Guides not overlapping any regulatory element are all generated using the same probability distribution. A small, random subset (default 300) is labelled as 'neg', the rest as 'chr'. See Advanced Simulations for how to increase or decrease the number of negative controls
-
A counts file, containing the counts for each guide in each pool. The output file will be named
{output_name}_counts.csv
. Each row corresponds to a guide, each column to a pool, and each number to the number of times a guide was observed in a pool. -
An enhancer file, containing the locations of the simulated regulatory regions (e.g. enhancers). The output file will be named
{output_name}_enhancers.csv
. Each row corresponds to an enhancer.
Follow the steps below to simulate data for a selection screen.
We recommend that you navigate into the ./CRSsim/Example_simulations
directory (included in the repository) and generate all output files there. After navigating into that directory, run the following commands in R to access all functions called by the simulation script:
source('/path/to/script/CRSsim.r')
Running a simulation will generate a guide information file, a counts file and an enhancer file (described above).
After sourcing the CRSsim script you can begin setting the option flags. There are several different flags which have no default arguments and must be supplied by the user. Below is an outline on how to set these required flags.
- Flags are set up as an R list object. Run the following command in R to create this object:
sim.flags <- list()
- Set the output name for the simulation. All output files generated by this run of the simulation will contain this prefix in the filehandle. Here, we will name all outputs
Example_simulation
:
sim.flags$simName <- 'Example_simulation'
- Provide information about the intended guide targets. Either supply them directly by giving the .csv location as a string, as is demonstrated here, or generate them within the script (see details under Advanced Simulations). For details about file format see the file format section.
sim.flags$guideFile <- '../Example_data/Example_selectionScreen_info.csv'
- If guide targets are provided as in step 3, then the marker gene of interest should also be provided. This assumes that some of the guides provided are targeting the gene of interest and can serve as positive controls. The input should be a string to the location of a .csv file. For file format see section file format section.
sim.flags$exon <- '../Example_data/Example_gene.csv'
- Provide the original guide distribution for each replicate. The guide counts can be supplied by an existing data set, as demonstrated below. Here, we supply the guide distribution data from
../Example_data/Example_selectionScreen_counts.csv
. Specifically, we use the guide counts from the 'before' pools. For details about file format see the file format section. Another option is to generate the distributions using a zero-inflated negative binomial (ZINB) distribution. See Advanced Simulations for details about this option.
example.counts <- read.csv('../Example_data/Example_selectionScreen_counts.csv', stringsAsFactors = F)
# note here: 'before_1' and 'before_2' are just placeholder names. They could be called something else.
sim.flags$inputGuideDistr <- cbind(before_1 = example.counts$before_repl1,
before_2 = example.counts$before_repl2)
- Specify the screen type. The user can specify one of two screen types: either a
selectionScreen
as shown here or aFACSscreen
, where cells are sorted into different pools. See FACS screen simulation for an example of simulating a FACS screen.
sim.flags$screenType <- 'selectionScreen'
- Provide names for the different pools in each replicate. In this case, each replicate will have a before and an after selection pool. Their names will be:
before_repl1
,after_repl1
,before_repl2
, etc.
sim.flags$poolNames <- c('before', 'after')
- Specify the CRISPR system used to generate the data. For CRISPRi, the default range of the perturbation effect is assumed to be 1kb. However, this can be manually specified. See the Advanced Simulations section for more details on how to simulate data generated by other CRISPR systems.
sim.flags$crisprSystem <- 'CRISPRi'
- Specify the number and the size of enhancers to be simulated. They will be placed at random genomic positions throughout the screen.
sim.flags$nrEnhancers <- 25
sim.flags$enhancerSize <- 50 # base pairs
- Specify the sequencing depth for each of the pools. Here, the parameters have been set such that the average guide count is 15. A sequencing depth for each pool in each replicate must be defined:
repl1.seqDepth <- c(nrow(example.counts) * 15, nrow(example.counts) * 15) # set depths for 'before' and 'after' pools of replicate 1
repl2.seqDepth <- c(nrow(example.counts) * 15, nrow(example.counts) * 15) # set depths for 'before' and 'after' pools of replicate 2
To set the seqDepth
flag, you must provide a list object where each named list entry represents a replicate and its corresponding sequencing depth(s).
sim.flags$seqDepth <- list(repl1 = repl1.seqDepth, repl2 = repl2.seqDepth)
- Additionally, the simulations can simulate data sets where PCR duplicates are either accounted for or not. If you would like to generate a data set where duplicates are accounted for, set the
pcrDupl
flag toFALSE
orF
.
sim.flags$pcrDupl <- TRUE
- You must also specify:
- selection strength: how strong the effect of disrupting the gene of interest is (
very-high
,high
orlow
). - guide efficiency: what proportion of the guides successfully perturb their targets (
high
,medium
, orlow
) - enhancer strength: how should the enhancer signal is (
high
,medium
, orlow
)
Each of these parameters can be manually provided as strings. See Advanced Simulations for details.
sim.flags$selectionStrength <- 'high'
sim.flags$guideEfficiency <- 'medium'
sim.flags$enhancerStrength <- 'medium'
- Specify the area of effect (AoE) of guides (for details see section 1.2.4). We recommend using
normal
as this is more likely to reflect the behavior of CRISPR.
sim.flags$areaOfEffect_type <- 'normal'
- Specify the relationship between counts and dispersion (for details see section 1.2.5). We recommend either
exponential
(this example) orspline
.
sim.flags$dispersionType <- 'exponential'
- Run the simulation! ✨
simulate_data(sim.flags)
Here is a demonstration for simulating data generated by a FACS screen. Highlighted below are the main option differences from a selection screen simulation.
- Specify the type of screen as
FACSscreen
in your argument flags.
sim.flags$screenType <- 'FACSscreen'
- As in the selection screen example, each pool in each replicate must be named. The difference in this example is that there must be more than one pool. Here is an example where cells are sorted from an input pool into either a high, medium, or low expression pool.
sim.flags$poolNames <- c('input', 'high', 'medium', 'low')
- The sequencing depth must also be specified for each pool in each replicate.
sim.flags$seqDepth <- list(repl1 = rep(nrow(example.counts) * 15, 4),
repl2 = rep(nrow(example.counts) * 15, 4) )
An average guide count of 15 vs. 100 vs. 500 has a major effect on power to detect true signal when everything else is held constant. Make sure to adjust seqDepth
when changing the number of guides used for simulating the data.
When CRISPR is directed to disrup a target site, then guide efficiency specifies how likely this disruption is to take place. However, the size of this disruption won't always be the same. For example, the effects of a CRISPRi perturbation can be up to 1kb but it's unlikely that every CRISPRi perturbation actually results in the silencing of an entire 1kb region. It's more likely that regions close to the target site have a higher probability of being silenced then sites far away. This can be modeled with a normal distribution. For comparison we also left the option of a uniform
AoE.
sim.flags$areaOfEffect_type <- 'normal' # alternatively 'uniform'
Similarly, in case of a dual-guide experiment (dualCRISPR
), it's more likely that each guide creates a short InDel instead of deleting the entire region. We model this with a normal AoE for the regions immediately around the target site of the two guides as well as a deletion probabilitiy (deletionProb
) across the entire targeted region. By default, deletionProb
is set to 0.2
sim.flags$areaOfEffect_type <- 'normal' # alternatively 'uniform'
sim.flags$deletionProb <- 0.2
See 3. Advanced flags for details on how to change the range of the AoE.
For most biological data there's usually a relationship between guide counts and dispersion (variance). CRSsim models this using either an exponential relationship or with splines (functions defined as piecewise polynomials which allows them to have great flexibility). CRSsim has suggested distribution parameters for each (exponential parameters from Fulco et al. replicate 2, spline parameters from Simeonov et al. IL2RA replicate 1). In case of splines, the user can provide a model object for making the dispersion predictions. Alternatively, the relationship between counts and dispersion can also be removed with an independent
flag at dispersionType
.
sim.flags$dispersionType <- 'spline' # alternatively 'exponential' or 'uniform'
sim.flags$splineModelLoc <- 'path-to-CRSsim/Spline_models/IL2RA_r1_df3_splineModel'
2.1.1 Generate guides and their targets. Both single-guide as well as dual-guide screens can be simulated. For both of them, the number of guides (nrGuides
) must be specified, as well as the screen system (crisprSystem
) and the step size between guides (stepSize
). Additionally, if a dual CRISPR screen is selected, the deletion size must be specified (deletionSize
).
Possible crisprSystem
options include: CRISPRi
, CRISPRa
, CRISPRcas9
and dualCRISPR
If this option is chosen, all guides are arbitrarily chosen to be located on chromosome 1 and ~5% of the guides will be selected to serve as positive controls.
sim.flags$nrGuides <- 10000
sim.flags$crisprSystem <- 'dualCRISPR'
sim.flags$stepSize <- 20
sim.flags$deletionSize <- 1000 # only used if $crisprSystem is 'dualCRISPR'
2.1.2 To change the default number of negative controls set the nrNeg
flag.
sim.flags$nrNeg <- 300
2.1.3 Generate guide distributions. The input count distribution for the different replicates can be taken from an existing data set. It is also possible to generalize existing distributions using the zero-inflated negative binomial distribution (ZINB). The ZINB has both a mean (rate) and a dispersion parameter, as well as a parameter describing the fraction of the distribution originating from the zero mass (eta). Below are the steps to obtain and use the parameters from a ZINB:
# obtain ZINB parameters which describe the distribution
before.repl1.par <- obtain_ZINB_pars(example.counts$before_repl1)
example.rate <- before.repl1.par$rate # rate = 76.5
example.dispersion <- before.repl1.par$dispersion # dispersion = 2.6
example.eta <- before.repl1.par$eta # eta = 1e-4
# to generate a ZINB distribution with 10000 guides
before.repl1.simulated <- create_ZINB_shape(10000, example.eta, example.rate, example.dispersion)
before.repl2.simulated <- create_ZINB_shape(10000, example.eta, example.rate, example.dispersion)
# combine the two simulated replicates and set them as input distributions
sim.flags$inputGuideDistr <- cbind(before_1 = before.repl1.simulated, before_2 = before.repl2.simulated)
2.1.4 Set effect range. Currently, four different CRISPR systems can be simulated: CRISPRi
, CRISPRa
, CRISPRcas9
, and dualCRISPR
.
By default, CRISPRi and CRISPRa are assumed to have an effect range of 1kb and Cas9 of 20bp. However, it is also possible to manually set this range with the crisprEffectRange
flag.
For dualCRISPR, the effect range is equivalent to the deletion size. The deletion size introduced by two guides must be represented by 'start' set as the target site of guide 1 and 'end' as the target site of guide 2. Use the deletionSize
flag when using dualCRISPR
.
# example for how to change the effect range of a CRISPR system used
sim.flags$crisprSystem <- 'CRISPRi'
sim.flags$crisprEffectRange <- 500
2.1.5 Enhancer strength and guide efficiency. Both the guide efficiency and the enhancer strength are simulated from a beta distribution. The two parameters can be specified by setting guideEfficiency
and enhancerStrength
to either high
, medium
or low
. It is also possible to directly specify the two shape parameters of the beta distribution. As a general rule, if the shape parameters provided are large, the observed distribution variance is reduced. The larger shape 1 parameter is compared to shape 2 parameter, the more the distribution will be skewed towards 1. To visualize this phenomenon, you can also plot the histogram by randomly sampling from a beta-distribution and then subsequently changing the shape parameters. The default parameters are:
- high: enhancerShape1 = 7, enhancerShape2 = 2
- medium: enhancerShape1 = 5, enhancerShape2 = 5
- low: enhancerShape1 = 2, enhancerShape2 = 7
This also applies for guideShape1
and guideShape2
.
hist(rbeta(10000, shape1 = 8, shape2 = 1)) # randomly generate 10000 instances of the beta distribution
sim.flags$guideEfficiency <- 'medium'
sim.flags$enhancerStrenth <- 'medium'
# the above is equivalent to what's below
sim.flags$enhancerShape1 <- 5
sim.flags$enhancerShape2 <- 5
sim.flags$guideShape1 <- 5
sim.flags$guideShape2 <- 5
2.1.6 Selection strength. To understand the details of the selection strength it is helpful to have some understanding of the Dirichlet distribution. Both for the selection screen as well as for the FACS screen the probability of each guide being either selected or sorted into a given pool is given by a Dirichlet random variate. As an example:
Cells are sorted into three pools: high, medium and low gene expression. The expected probability that a negative control will be sorted into each of the given pools are 0.48, 0.48, and 0.04, respectively. For each guide, this probability shifts slightly. The Dirichlet random variable provides this variation. All cells containing a negative control guide are subsequently assigned to each of the three pools with following probabilities: rdirichlet(1, c(48, 48, 4))
.
Continuing the example above: assume the expected probabilities that a positive control is sorted into each of the three pools - high, medium and low gene expression - are 0.45, 0.45, and 0.1, respectively. All cells containing a positive control guide is subsequently assigned to these pools by observing a Dirichlet random variable as follows: rdirichlet(1, c(45, 45, 10))
.
To manually set the Dirichlet probabilities, use the posSortingFrequency
and the negSortingFrequency
flags. To continue the example from above:
sim.flags$posSortingFrequency <- c(45, 45, 10)
sim.flags$negSortingFrequency <- c(48, 48, 4)
Note:
-
The sum of the frequencies does not have to equal 1.
-
The larger the numbers chosen, the less variable the sorting becomes.
The defaults for the high
flags were chosen based on their ability to accurately represent the sorting parameters for either a selection screen or a FACS screen. The default low
flags were chosen as an arbitrary fraction of the high
selection.
The default flags used for high
:
# for a selection screen:
sim.flags$posSortingFrequency <- c(1)
sim.flags$negSortingFrequency <- c(5)
# in a FACS screen sorted into 3 pools:
# '97' is repeated for all pools except the last one
sim.flags$posSortingFrequency <- c(97, 97, 13) * 0.5
sim.flags$negSortingFrequency <- c(97, 97, 3) * 0.5
The default flags used for low
:
# for a selection screen:
sim.flags$posSortingFrequency <- c(4)
sim.flags$negSortingFrequency <- c(5)
# in a FACS screen sorted into 3 pools:
# '97' is repeated for all pools except the last one
sim.flags$posSortingFrequency <- c(97, 97, 5) * 0.5
sim.flags$negSortingFrequency <- c(97, 97, 3) * 0.5
2.1.7 Area of Effect
Under default settings, the normal
AoE for Cas9
assumes that 50% of all perturbation happen within 10bp and by 95% within 21bp using a standard deviation of 8.5. In the case of CRISPRi or CRISPRa 50% of the activity is within 200bp from the target site and 95% within 415bp from the target site using a standard deviation of 170.
To change these settings both the crisprEffectRange
and the normal_areaOfEffect_sd
flags have to be adjusted. It's recommended to use the crisprEffectRange
as the 95% cutoff point.
# in the case of CRISPRi or CRISPRa
sim.flags$normal_areaOfEffect_sd <- 170
sim.flags$crisprEffectRange <- 415
2.2.1 alphaRRA. To analyze data with MAGeCK as used by Diao et al. 2017, specify use of alphaRRA
and the bins within which to tile the analyzed region. Note, alphaRRA
will be applied to per-guide scores from all methods given by the Method
flag.
analysis.specs$postScoreAlphaRRA <- TRUE
analysis.specs$binSize <- 50
2.2.2 Sliding Window. To combine guide scores using a sliding window as in Fulco et al. 2016 or Simeonov et al. 2017, specify the number of guides to include per sliding window as well as the maximum window size (if the tiled deletion generates a 10KB gap it would not make sense to consider the entire region as one score).
analysis.specs$postScoreSlidingWindow <- TRUE
analysis.specs$guidePerSlidingWindow <- 15
analysis.specs$maxWindowSize <- 8000
If guides are directly provided by the user, the .csv file should have columns for chromosome, start position and end position labeled chrom
, start
, and end
, respectively. All chromosome are given as a string. For chrom
the spelling of the individual chromosomes not matter as long as the same chromosome has the same spelling ('chr8' and 'Chr8' will be recognized as two different chromosomes). start
and end
values are given as numbers. For Cas9, CRISPRi, and CRISPRa screens, the distance between the start and end sites should be set to something small, such as start = target site - 20 and end = target site. Here the top from the file provided at Example_selectionScreen_info.csv
:
chrom | start | end |
---|---|---|
chr8 | 128703371 | 128703391 |
chr8 | 128703511 | 128703531 |
chr8 | 128703521 | 128703541 |
chr8 | 128703539 | 128703559 |
If guide targets are provided for the simulations, then the marker gene of interest should also be provided. This assumes that some of the guides provided are targeting the gene of interest and can serve as positive controls. The input should be a string to the location of a .csv file containing the chromosome, start, and end sites of all exons of the gene of interest. The column names of this data frame should be chrom
, start
, and end
. All chromosome are given as a string. For chrom
the spelling of the individual chromosomes not matter as long as the same chromosome has the same spelling ('chr8' and 'Chr8' will be recognized as two different chromosomes). start
and end
values are given as numbers. Here, we supply the exon information from ../Example_data/Example_gene.csv
.
chrom | start | end |
---|---|---|
chr8 | 128748314 | 128748869 |
chr8 | 128750493 | 128751265 |
chr8 | 128752641 | 128753680 |
The input for inputGuideDistr
is a data frame object and each column will be used to generate the guide count distribution for a replicate. Here, we supply the guide distribution data from ../Example_data/Example_selectionScreen_counts.csv
. Specifically, we use the guide counts from the 'before' pools.
example.counts <- read.csv('../Example_data/Example_selectionScreen_counts.csv', stringsAsFactors = F)
# note here: 'before_1' and 'before_2' are just placeholder names. They could be called something else.
sim.flags$inputGuideDistr <- cbind(before_1 = example.counts$before_repl1,
before_2 = example.counts$before_repl2)
In the case above, inputGuideDistr
looks like the following:
before_1 | before_2 |
---|---|
303 | 342 |
110 | 149 |
145 | 178 |
305 | 248 |