diff --git a/completed-notebooks/.gitkeep b/completed-notebooks/.gitkeep deleted file mode 100644 index e69de29..0000000 diff --git a/completed-notebooks/intro-to-R-tidyverse/01-intro_to_base_R.nb.html b/completed-notebooks/intro-to-R-tidyverse/01-intro_to_base_R.nb.html new file mode 100644 index 0000000..efd9579 --- /dev/null +++ b/completed-notebooks/intro-to-R-tidyverse/01-intro_to_base_R.nb.html @@ -0,0 +1,4031 @@ + + + + +
+ + + + + + + + + + +This notebook will demonstrate how to:
+R is a statistical computing language that is +open source, meaning the underlying code for the language is +freely available to anyone. You do not need a special license or set of +permissions to use and develop code in R.
+R itself is an interpreted computer language and comes with
+built-in functionality bundled with the language, known as
+“base R”. But there is also rich additional
+functionality provided by external packages, or
+libraries of code that assist in accomplishing certain tasks and can be
+freely downloaded and loaded for use.
+In the next notebook and subsequent modules, we will be using a suite
+of packages collectively known as The Tidyverse. The
+tidyverse
is geared towards intuitive data science
+applications that follow a shared data philosophy. But there are still
+many core features of base R which are important to be aware of, and we
+will be using concepts from both base R and the tidyverse in our
+analyses, as well as task-specific packages for analyses such as gene
+expression.
RStudio is a graphical environment (“integrated development +environment” or IDE) for writing and developing R code. RStudio is NOT a +separate programming language - it is an interface we use to facilitate +R programming. In other words, you can program in R without RStudio, but +you can’t use the RStudio environment without R.
+For more information about RStudio than you ever wanted to know, see +this RStudio +IDE Cheatsheet (pdf).
+The RStudio environment has four main panes, each of +which may have a number of tabs that display different information or +functionality. (their specific location can be changed under Tools -> +Global Options -> Pane Layout).
The Editor pane is where you can write R scripts
+and other documents. Each tab here is its own document. This is your
+text editor, which will allow you to save your R code for
+future use. Note that code you change here will not run automatically until
+you run it.
The Console pane is where you can +interactively run R code.
The Environment pane primarily displays the
+variables, sometimes known as objects, that are defined during a
+given R session, and what data or values they might hold.
The final pane, Files, Plots, Help, …, has +several pretty important tabs:
+The most basic use of R is as a regular calculator:
+Operation | +Symbol | +
---|---|
Add | ++ |
+
Subtract | +- |
+
Multiply | +* |
+
Divide | +/ |
+
Exponentiate | +^ or ** |
+
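+As a quick sketch, each operator from the table can be typed directly into
+the console (expected results shown as comments):
+7 + 3 # 10
+7 - 3 # 4
+7 * 3 # 21
+7 / 3 # 2.333333
+7 ^ 2 # 49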
For example, we can do some simple multiplication like this. When you +execute code within the notebook, the results appear beneath the code. +Try executing this chunk by clicking the Run button within the +chunk or by placing your cursor inside it and pressing +Cmd+Shift+Enter on a Mac, or Ctrl+Shift+Enter on a +PC.
+ + + +5 * 6
+
+
+[1] 30
+
+
+
+Use the console to calculate other expressions. Standard order of
+operations applies (mostly), and you can use parentheses ()
+as you might expect (but not brackets []
or
+braces{}
, which have special meanings). Note however, that
+you must always specify multiplication with
+*
; implicit multiplication such as 10(3 + 4)
+or 10x
will not work and will generate an error, or
+worse.
10 * (3 + 4)^2
+
+
+[1] 490
+
+
+
+To define a variable, we use the assignment operator which
+looks like an arrow: <-
, for example
+x <- 7
takes the value on the right-hand side of the
+operator and assigns it to the variable name on the left-hand side.
# Define a variable x to equal 7, and print out the value of x
+x <- 7
+
+# We can have R repeat back to us what `x` is by just using `x`
+x
+
+
+[1] 7
+
+
+
+Some features of variables, considering the example
+x <- 7
: Every variable has a name, a
+value, and a type. This variable’s
+name is x
, its value is 7
, and its type is
+numeric
(7 is a number!). Re-defining a variable will
+overwrite the value.
x <- 5.5
+
+x
+
+
+[1] 5.5
+
+
+
+We can modify an existing variable by reassigning it to its same
+name. Here we’ll add 2
to x
and reassign the
+result back to x
.
x <- x + 2
+
+x
+
+
+[1] 7.5
+
+
+
+As best you can, it is a good idea to make your variable names
+informative (e.g. x
doesn’t mean anything, but
+sandwich_price
is meaningful… if we’re talking about the
+cost of sandwiches, that is).
We can use pre-built computation methods called “functions” for other
+operations. Functions have the following format, where the
+argument is the information we are providing to the function
+for it to run.
function_name(argument)
+To learn about functions, we’ll examine one called log()
+first.
To know what a function does and how to use it, use the question mark
+which will reveal documentation in the help pane:
+?log
The documentation tells us that log()
is derived from
+{base}
, meaning it is a function that is part of base R. It
+provides a brief description of what the function does and shows several
+examples of how to use it.
In particular, the documentation tells us about what argument(s) to +provide:
+x: the number we want to take the logarithm of
+base: the base of the logarithm, which by default is the mathematical constant
+e.
.Functions also return values for us to use. In the case of
+log()
, the returned value is the logarithm the function
+computed.
log(73)
+
+
+[1] 4.290459
+
+
+
+Here we can specify an argument of base
to
+calculate log base 3.
log(81, base = 3)
+
+
+[1] 4
+
+
+
+If we don’t specify the argument names, it assumes they are
+in the order that log
defines them. See ?log
+to see more about its arguments.
log(8, 2)
+
+
+[1] 3
+
+
+
+We can switch the order if we specify the argument names.
+ + + +log(base = 10, x = 4342)
+
+
+[1] 3.63769
+
+
+
+We can also provide variables as arguments in the same way as the raw +values.
+ + + +meaning <- 42
+log(meaning)
+
+
+[1] 3.73767
+
+
+
+Variable types in R can sometimes be coerced (converted) +from one type to another.
+ + + +# Define a variable with a number
+x <- 15
+
+
+
+The function class()
will tell us the variable’s
+type.
class(x)
+
+
+[1] "numeric"
+
+
+
+
+
+Let’s coerce it to a character.
+ + + +x <- as.character(x)
+class(x)
+
+
+[1] "character"
+
+
+
+
+
+See it now has quotes around it? It’s now a character and will behave +as such.
+ + + +x
+
+
+[1] "15"
+
+
+
+
+
+Use this chunk to try to perform calculations with x
,
+now that it is a character, what happens?
# Try to perform calculations on `x`
+
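+# A sketch of what happens (R refuses arithmetic on character values):
+# x + 2
+# Error in x + 2 : non-numeric argument to binary operator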
+
+
+But we can’t coerce everything:
+ + + +# Let's create a character variable
+x <- "look at my character variable"
+
+
+
+Let’s try making this a numeric variable:
+ + + +x <- as.numeric(x)
+
+
+Warning: NAs introduced by coercion
+
+
+
+Print out x
.
x
+
+
+[1] NA
+
+
+
+R is telling us it doesn’t know how to convert this to a numeric
+variable, so it has returned NA
instead.
For reference, here’s a summary of some of the most important +variable types.
+Variable Type | +Definition | +Examples | +Coercion | +
---|---|---|---|
numeric |
+Any number value | +5 7.5 -1 |
+as.numeric() |
+
integer |
+Any whole number value (no decimals) | +5 -100 |
+as.integer() |
+
character |
+Any collection of characters defined within quotation +marks. Also known as a “string”. | +"a" (a single letter)
+"stringofletters" (a whole bunch of characters put
+together as one) "string of letters and spaces" + "5" 'single quotes are also good' |
+as.character() |
+
logical |
+A value of TRUE , FALSE , or
+NA |
+TRUE FALSE NA (not
+defined) |
+as.logical() |
+
factor |
+A special type of variable that denotes specific categories of a +categorical variable | +(stay tuned..) | +as.factor() |
+
You will have noticed that all your computations tend to pop up with
+a [1]
preceding them in R’s output. This is because, in
+fact, all (ok mostly all) variables are by default vectors, and
+our answers are the first (in these cases only) value in the vector. As
+vectors get longer, new index indicators will appear at the start of new
+lines.
# This is actually a vector that has one item in it.
+x <- 7
+
+
+
+
+
+
+# The length() function tells us how long a vector is:
+length(x)
+
+
+[1] 1
+
+
+
+We can define vectors with the function c()
, which
+stands for “combine”. This function takes a comma-separated set of
+values to place in the vector, and returns the vector itself:
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
+my_numeric_vector
+
+
+[1] 1 1 2 3 5 8 13 21
+
+
+
+We can build on vectors in place by redefining them:
+ + + +# add the next two Fibonacci numbers to the series.
+my_numeric_vector <- c(my_numeric_vector, 34, 55)
+my_numeric_vector
+
+
+ [1] 1 1 2 3 5 8 13 21 34 55
+
+
+
We can pull out specific items from a vector using a process called
+indexing, which uses brackets []
to specify the
+position of an item.
# Grab the fourth value from my_numeric_vector
+# This gives us a vector of length 1
+my_numeric_vector[4]
+
+
+[1] 3
+
+
+
Colons are also a nice way to quickly make ordered numeric vectors.
+Use a colon to specify an inclusive range of indices. This will return a
+vector with the values at positions 2, 3, 4, and 5.
+ + + +my_numeric_vector[2:5]
+
+
+[1] 1 2 3 5
+
+
+
One major benefit of vectors is the concept of
+vectorization, where R by default performs operations
+on the entire vector at once. For example, we can get the log
+of all numbers 1-20 with a single, simple call.
+ + + +values_1_to_20 <- 1:20
+
+
+
+
+
+
+# calculate the log of values_1_to_20
+log(values_1_to_20)
+
+
+ [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
+ [8] 2.0794415 2.1972246 2.3025851 2.3978953 2.4849066 2.5649494 2.6390573
+[15] 2.7080502 2.7725887 2.8332133 2.8903718 2.9444390 2.9957323
+
+
+
Finally, we can apply logical expressions to vectors, just as we can
+do for single values. The output here is a logical vector telling us
+whether each value in values_1_to_20 satisfies the expression (TRUE) or
+not (FALSE).
+ + + +# Which values are <= 3?
+values_1_to_20 <= 3
+
+
+ [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
+[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
+
+
+
+There are several key functions which can be used on vectors +containing numeric values, some of which are below.
+mean()
: The average value in the vectormin()
: The minimum value in the vectormax()
: The maximum value in the vectorsum()
: The sum of all values in the vectorWe can try out these functions on the vector
+values_1_to_20
we’ve created.
mean(values_1_to_20)
+
+
+[1] 10.5
+
+
+# Try out some of the other functions we've listed above
+
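+# A sketch of the remaining functions (results are easy to verify by hand):
+# min(values_1_to_20) # 1
+# max(values_1_to_20) # 20
+# sum(values_1_to_20) # 210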
+
+
We have learned functions such as c(),
+length(), and sum(), among others. Imagine defining a
+variable called c: this will work, but it will lead to a
+lot of unintended bugs, so it’s best to avoid naming variables after functions.
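+As a quick hypothetical sketch of the confusion this invites (not part of
+the original exercise):
+c <- c(1, 2, 3) # works: R still finds the *function* c() for the call
+c # but the variable `c` now shadows the name everywhere else
+rm(c) # remove our variable so `c` unambiguously means the function again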
%in%
logical operator%in%
is useful for determining whether a given item(s)
+are in a vector.
# is `7` in our vector?
+7 %in% values_1_to_20
+
+
+[1] TRUE
+
+
+
+
+
+
+# is `50` in our vector?
+50 %in% values_1_to_20
+
+
+[1] FALSE
+
+
+
We can test whether each value in one vector appears within another vector of
+values.
+ + + +question_values <- c(1:3, 7, 50)
+# Are these values in our vector?
+question_values %in% values_1_to_20
+
+
+[1] TRUE TRUE TRUE TRUE FALSE
+
+
+
+Data frames are one of the most useful tools for data analysis in
+R. They are tables which consist of rows and columns, much like a
+spreadsheet. Each column is a variable which behaves as a
+vector, and each row is an observation. We will begin our
+exploration with a dataset of measurements from three penguin species,
+which we can find in the palmerpenguins
+package. We’ll talk more about packages soon! To use this dataset,
+we will load it from the palmerpenguins
package using a
+::
(more on this later) and assign it to a variable named
+penguins
in our current environment.
penguins <- palmerpenguins::penguins
+
+
+
+Artwork by @allison_horst
+The first step to using any data is to look at it!!! RStudio contains
+a special function View()
which allows you to literally
+view a variable. You can also click on the object in the environment
+pane to see its overall properties, or click the table icon on the
+object’s row to automatically view the variable.
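+As a sketch (View() is interactive, so it is usually run in the console
+rather than in a rendered notebook):
+# View(penguins)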
Some useful functions for exploring our data frame include the following
+(a quick sketch of their use follows this list):
+head()
to see the first 6 rows of a data frame.
+Additional arguments supplied can change the number of rows.tail()
to see the last 6 rows of a data frame.
+Additional arguments supplied can change the number of rows.names()
to see the column names of the data frame.nrow()
to see how many rows are in the data framencol()
to see how many columns are in the data
+frame.We can additionally explore overall properties of the data
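+As a quick sketch of these in action (the dimensions below match the
+str() output later in this notebook):
+# head(penguins) # first 6 rows
+# nrow(penguins) # 344
+# ncol(penguins) # 8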
+frame with two different functions: summary()
and
+str()
.
This provides summary statistics for each column:
+ + + +summary(penguins)
+
+
+ species island bill_length_mm bill_depth_mm
+ Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
+ Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
+ Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
+ Mean :43.92 Mean :17.15
+ 3rd Qu.:48.50 3rd Qu.:18.70
+ Max. :59.60 Max. :21.50
+ NA's :2 NA's :2
+ flipper_length_mm body_mass_g sex year
+ Min. :172.0 Min. :2700 female:165 Min. :2007
+ 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
+ Median :197.0 Median :4050 NA's : 11 Median :2008
+ Mean :200.9 Mean :4202 Mean :2008
+ 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
+ Max. :231.0 Max. :6300 Max. :2009
+ NA's :2 NA's :2
+
+
+
+This provides a short view of the structure and +contents of the data frame.
+ + + +str(penguins)
+
+
+tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
+ $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
+ $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
+ $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
+ $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
+ $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
+ $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
+ $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
+ $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
+
+
+
+You’ll notice that the column species
is a
+factor: This is a special type of character variable that
+represents distinct categories known as “levels”. We have learned here
+that there are three levels in the species
column: Adelie,
+Chinstrap, and Gentoo. We might want to explore individual columns of
+the data frame more in-depth. We can examine individual columns using
+the dollar sign $
to select one by name:
# Extract bill_length_mm as a vector
+penguins$bill_length_mm
+
+
+ [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
+ [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
+ [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
+ [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
+ [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
+ [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
+ [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2
+ [ reached getOption("max.print") -- omitted 244 entries ]
+
+
+# indexing operators can be used on these vectors too
+penguins$bill_length_mm[1:10]
+
+
+ [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0
+
+
+
+We can perform our regular vector operations on columns directly.
+ + + +# calculate the mean of the bill_length_mm column
+mean(penguins$bill_length_mm,
+ na.rm = TRUE) # remove missing values before calculating the mean
+
+
+[1] 43.92193
+
+
+
+We can also calculate the full summary statistics for a single column +directly.
+ + + +# show a summary of the bill_length_mm column
+summary(penguins$bill_length_mm)
+
+
+ Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
+ 32.10 39.23 44.45 43.92 48.50 59.60 2
+
+
+
+Extract species
as a vector and subset it to see a
+preview.
# get the first 10 values of the species column
+penguins$species[1:10]
+
+
+ [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
+Levels: Adelie Chinstrap Gentoo
+
+
+
+And view its levels with the levels()
+function.
levels(penguins$species)
+
+
+[1] "Adelie" "Chinstrap" "Gentoo"
+
+
+
+
+
+In many situations, we will be reading in tabular data from a file
+and using it as a data frame. To practice, we will read in a file we
+will be using in the next notebook as well,
+gene_results_GSE44971.tsv
, in the data
folder.
+File paths are relative to the location where this notebook file (.Rmd)
+is saved.
Here we will use a function, read_tsv()
from the
+readr
package. Before we are able to use the function, we
+have to load the package using library()
.
library(readr)
+
+
+
+file.path()
creates a properly formatted file path by
+adding a path separator (/
on Mac and Linux operating
+systems, the latter of which is the operating system that our RStudio
+Server runs on) between separate folders or directories. Because file
+path separators can differ between your computer and the computer of
+someone who wants to use your code, we use file.path()
+instead of typing out "data/gene_results_GSE44971.tsv"
.
+Each argument to file.path()
is a directory or
+file name. You’ll notice each argument is in quotes. We specify
+data
first because the file,
+gene_results_GSE44971.tsv
is in the data
+folder.
file.path("data", "gene_results_GSE44971.tsv")
+
+
+[1] "data/gene_results_GSE44971.tsv"
+
+
+
+
+
+As you can see above, the result of running file.path()
+is that it creates a string with an accurately-formatted path
+for your file system. This string can be used moving forward when you
+need to refer to the path to your file. Let’s go ahead and store this
+file path as a variable in our environment.
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
+
+
+
+Now we are ready to use read_tsv()
to read the file into
+R. The resulting data frame will be stored in a variable named
+stats_df
. Note the <-
(assignment
+operator!) is responsible for saving this to our global environment.
# read in the file `gene_results_GSE44971.tsv` from the data directory
+stats_df <- read_tsv(gene_file_path)
+
+
+Rows: 6804 Columns: 8
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (3): ensembl_id, gene_symbol, contrast
+dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Take a look at your environment panel to see what
+stats_df
looks like. We can also print out a preview of the
+stats_df
data frame here.
# display stats_df
+stats_df
+
+At the end of every notebook, you will see us print out
+sessionInfo
. This aids in the reproducibility of your code
+by showing exactly what packages and versions were being used the last
+time the notebook was run.
sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] readr_2.1.5 optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] crayon_1.5.2 vctrs_0.6.5 cli_3.6.2
+ [4] knitr_1.46 rlang_1.1.3 xfun_0.43
+ [7] stringi_1.8.3 jsonlite_1.8.8 bit_4.0.5
+[10] glue_1.7.0 htmltools_0.5.8.1 sass_0.4.9
+[13] hms_1.1.3 fansi_1.0.6 rmarkdown_2.26
+[16] evaluate_0.23 jquerylib_0.1.4 tibble_3.2.1
+[19] tzdb_0.4.0 fastmap_1.1.1 yaml_2.3.8
+[22] lifecycle_1.0.4 palmerpenguins_0.1.1 stringr_1.5.1
+[25] compiler_4.4.0 getopt_1.20.4 pkgconfig_2.0.3
+[28] digest_0.6.35 R6_2.5.1 tidyselect_1.2.1
+[31] utf8_1.2.4 parallel_4.4.0 vroom_1.6.5
+[34] pillar_1.9.0 magrittr_2.0.3 bslib_0.7.0
+[37] bit64_4.0.5 tools_4.4.0 cachem_1.0.8
+
+
+This notebook will demonstrate how to:
+ggplot2
to plot and visualize dataggplot2
We’ll use a real gene expression dataset to get comfortable making +visualizations using ggplot2. We’ve performed differential expression +analyses on a pre-processed astrocytoma +microarray dataset. We’ll start by making a volcano plot of +differential gene expression results from this experiment. We performed +three sets of contrasts:
+sex
category contrasting: Male
vs
+Female
tissue
category contrasting :
+Pilocytic astrocytoma tumor
samples vs
+normal cerebellum
samplessex
and
+tissue
.More ggplot2 resources:
+ +We saved these results to a tab separated values (TSV) file called
+gene_results_GSE44971.tsv
. It’s been saved to the
+data
folder. File paths are relative to where this notebook
+file (.Rmd) is saved. So we can reference it later, let’s make a
+variable with our data directory name.
data_dir <- "data"
+
+
+
+Let’s declare our output folder name as its own variable.
+ + + +plots_dir <- "plots"
+
+
+
+We can also create a directory if it doesn’t already exist.
There are a couple of ways that we can create that directory from within
+R. One way is to use the base R function dir.create()
,
+which (as the name suggests) will create a directory. But this function
+assumes that the directory does not yet exist, and it will throw an
+error if you try to create a directory that already exists. To avoid
+this error, we can place the directory creation inside an
+if
statement, so the code will only run if the directory
+does not yet exist:
# The if statement here tests whether the plot directory exists and
+# only executes the expressions between the braces if it does not.
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+
+
+In this notebook we will be using functions from the Tidyverse set of
+packages, so we need to load in those functions using
+library()
. We could load the individual packages we need
+one at a time, but it is convenient for now to load them all with the
+tidyverse
“package,” which groups many of them together as
+a shortcut. Keep an eye out for where we tell you which individual
+package different functions come from.
library(tidyverse)
+
+
+── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+✔ dplyr 1.1.4 ✔ readr 2.1.5
+✔ forcats 1.0.0 ✔ stringr 1.5.1
+✔ ggplot2 3.5.1 ✔ tibble 3.2.1
+✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+✔ purrr 1.0.2
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::lag() masks stats::lag()
+ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+
+
+
+Here we are using a tidyverse
function
+read_tsv()
from the readr
package. Like we did
+in the previous notebook, we will store the resulting data frame as
+stats_df
.
# read in the file `gene_results_GSE44971.tsv` from the data directory
+stats_df <- read_tsv(file.path(
+ data_dir,
+ "gene_results_GSE44971.tsv"
+))
+
+
+Rows: 6804 Columns: 8
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (3): ensembl_id, gene_symbol, contrast
+dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+We can take a look at a column individually by using a
+$
. Note we are using head()
so the whole thing
+doesn’t print out.
head(stats_df$contrast)
+
+
+[1] "male_female" "male_female" "male_female" "male_female" "male_female"
+[6] "male_female"
+
+
+
+
+
+If we want to see a specific set of values, we can use brackets with +the indices of the values we’d like returned.
+ + + +stats_df$avg_expression[6:10]
+
+
+[1] 19.084011 8.453933 5.116563 6.345609 25.473133
+
+
+
+Let’s look at some basic statistics from the data set using
+summary()
# summary of stats_df
+summary(stats_df)
+
+
+ ensembl_id gene_symbol contrast log_fold_change
+ Length:6804 Length:6804 Length:6804 Min. :-180.8118
+ Class :character Class :character Class :character 1st Qu.: -1.6703
+ Mode :character Mode :character Mode :character Median : 0.1500
+ Mean : 0.2608
+ 3rd Qu.: 2.1049
+ Max. : 129.3009
+ avg_expression t_statistic p_value adj_p_value
+ Min. : 5.003 Min. :-32.84581 Min. :0.00000 Min. :0.00000
+ 1st Qu.: 6.304 1st Qu.: -1.16444 1st Qu.:0.01309 1st Qu.:0.05657
+ Median : 8.482 Median : 0.10619 Median :0.18919 Median :0.41354
+ Mean : 13.847 Mean : -0.00819 Mean :0.31223 Mean :0.44833
+ 3rd Qu.: 14.022 3rd Qu.: 1.46589 3rd Qu.:0.57634 3rd Qu.:0.82067
+ Max. :190.708 Max. : 10.48302 Max. :0.99979 Max. :0.99988
+
+
+
+The statistics for contrast
are not very informative, so
+let’s do that again with just the contrast
column after
+converting it to a factor
# summary of `stats_df$contrast` as a factor
+summary(as.factor(stats_df$contrast))
+
+
+astrocytoma_normal interaction male_female
+ 2268 2268 2268
+
+
+
+Before we make our plot, we want to calculate a set of new values for
+each row; transformations of the raw statistics in our table. To do this
+we will use a function from the dplyr
package called
+mutate()
to make a new column of -log10 p values.
# add a `neg_log10_p` column to the data frame
+stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
+ neg_log10_p = -log10(p_value) # column name and values
+)
+
+
+
+Let’s filter to only male_female
contrast data. First
+let’s try out a logical expression:
stats_df$contrast == "male_female"
+
+
+
+Now we can try out the filter()
function. Notice that we
+are not assigning the results to a variable, so this filtered dataset
+will not be saved to the environment.
# filter stats_df to "male_female" only
+filter(stats_df, contrast == "male_female")
+
+Now we can assign the results to a new data frame:
+male_female_df
.
# filter and save to male_female_df
+male_female_df <- filter(stats_df, contrast == "male_female")
+
+
+
+Let’s make a volcano plot with this data. First let’s take a look at +only the tumor vs. normal comparison. Let’s save this as a separate data +frame by assigning it a new name.
+ + + +tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
+
+
+
+To make this plot we will be using functions from the
+ggplot2
package, the main plotting package of the
+tidyverse. We use the first function, ggplot()
to define
+the data that will be plotted. Remember, the name of this package is
+ggplot2
, but the function we use is called
+ggplot()
without the 2
. ggplot()
+takes two main arguments:
data
, which is the data frame that contains the data we
+want to plot.mapping
, which is a special list made with the
+aes()
function to describe which values will be used for
+each aesthetic component of the plot, such as the x and
+y coordinates of each point. (If you find calling things like the x and
+y coordinates “aesthetics” confusing, don’t worry, you are not alone.)
+Specifically, the aes()
function is used to specify that a
+given column (variable) in your data frame be mapped to a given
+aesthetic component of the plot.ggplot(
+ tumor_normal_df, # This first argument is the data frame with the data we want to plot
+ aes(
+ x = log_fold_change, # This is the column name of the values we want to use
+ # for the x coordinates
+ y = neg_log10_p
+ ) # This is the column name of the data we want for the y-axis
+)
+
+
+
+
+
+
+You’ll notice this plot doesn’t have anything on it because we
+haven’t specified a plot type yet. To do that, we will add another
+ggplot layer with +
which will specify exactly what we want
+to plot. A volcano plot is a special kind of scatter plot, so to make
+that we will want to plot individual points, which we can do with
+geom_point()
.
# This first part is the same as before
+ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p
+ )
+) +
+ # Now we are adding on a layer to specify what kind of plot we want
+ geom_point()
+
+
+
+
+
+
+Here’s a brief summary of ggplot2 structure.
+Now that we have a base plot that shows our data, we can add layers
+on to it and adjust it. We can adjust the color of points using the
+color
aesthetic.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ ) # We added this argument to color code the points!
+) +
+ geom_point()
+
+
+
+
+
+
+Because we have so many points overlapping one another, we will want
+to adjust the transparency, which we can do with an alpha
+argument.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) # We are using the `alpha` argument to make our points transparent
+
+
+
+
+
+
+Notice that we added the alpha within the geom_point()
+function, not to the aes()
. We did this because we want all
+of the points to have the same level of transparency, and it will not
+vary depending on any variable in the data. We can also change the
+background and appearance of the plot as a whole by adding a
+theme
.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ # Add on this set of appearance presets to make it pretty
+ theme_bw()
+
+
+
+
+
+
+We are not limited to a single plotting layer. For example, if we
+want to add a horizontal line to indicate a significance cutoff, we can
+do that with geom_hline()
. For now, we will choose the
+value of 5.5 (that is close to a Bonferroni correction) and add that to
+the plot.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") # we can specify colors by names here
+
+
+
+
+
+
+We can change the x and y labels using a few different strategies.
+One approach is to use functions xlab()
and
+ylab()
individually to set, respectively, the x-axis label
+and the y-axis label.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ # Add labels with separate functions:
+ xlab("log2 Fold Change Tumor/Normal") +
+ ylab("-log10 p value")
+
+
+
+
+
+
+Alternatively, we can use the ggplot2
function
+labs()
, which takes individual arguments for each label we
+want to set. We can also include the argument title
to
+add an overall plot title.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ # Add x and y labels and overall plot title with arguments to labs():
+ labs(
+ x = "log2 Fold Change Tumor/Normal",
+ y = "-log10 p value",
+ title = "Astrocytoma Tumor vs Normal Cerebellum"
+ )
+
+
+
+
+
+
+Something great about the labs()
function is you can
+also use it to specify labels for your legends derived from
+certain aesthetics. In this plot, our legend is derived from a color
+aesthetic, so we can specify the keyword “color” to update the
+legend title.
ggplot(
+ tumor_normal_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ # Add x and y labels and overall plot title (and more!) with arguments to labs():
+ labs(
+ x = "log2 Fold Change Tumor/Normal",
+ y = "-log10 p value",
+ title = "Astrocytoma Tumor vs Normal Cerebellum",
+ # Use the color keyword to label the color legend
+ color = "Average expression"
+ )
+
+
+
+
+
+
+Use this chunk to make the same kind of plot as the previous chunk
+but instead plot the male-female contrast data, which is stored in
+male_female_df
.
# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.
+ggplot(
+ male_female_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ labs(
+ x = "log2 Fold Change Male/Female",
+ y = "-log10 p value",
+ color = "Average expression"
+ )
+
+
+
+
+
+
It turns out we don’t have to plot each contrast separately; instead,
+we can use the original data frame that contains all three contrasts’
+data, stats_df
, and add a facet_wrap
to make
+each contrast its own plot.
ggplot(
+ stats_df, # Switch to the bigger data frame with all three contrasts' data
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ facet_wrap(vars(contrast)) +
+ labs(
+ # Now that this includes the other contrasts,
+ # we'll make the x-axis label more general
+ x = "log2 Fold Change",
+ y = "-log10 p value",
+ color = "Average expression"
+ ) +
+ coord_cartesian(xlim = c(-25, 25)) # zoom in on the x-axis
+
+
+
+
+
+
We can store the plot as an object in the global environment by using the
+<-
operator. Here we will call this
+volcano_plot
.
# We are saving this plot to a variable named `volcano_plot`
+volcano_plot <- ggplot(
+ stats_df,
+ aes(
+ x = log_fold_change,
+ y = neg_log10_p,
+ color = avg_expression
+ )
+) +
+ geom_point(alpha = 0.2) +
+ geom_hline(yintercept = 5.5, color = "darkgreen") +
+ theme_bw() +
+ facet_wrap(vars(contrast)) +
+ labs(
+ x = "log2 Fold Change",
+ y = "-log10 p value",
+ color = "Average expression"
+ ) +
+ coord_cartesian(xlim = c(-25, 25))
+
+
+
+When we are happy with our plot, we can save the plot using
+ggsave
. It’s a good idea to also specify width
+and height
arguments (units in inches) to ensure the saved
+plot is always the same size every time you run this code. Here, we’ll
+save a 6”x6” plot.
ggsave(
+ plot = volcano_plot,
+ filename = file.path(plots_dir, "volcano_plot.png"),
+ width = 6,
+ height = 6
+)
+
+
+
+# Print out the versions and packages we are using in this session
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+ [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
+ [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
+ [9] ggplot2_3.5.1 tidyverse_2.0.0 optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
+ [5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_0.23
+ [9] grid_4.4.0 timechange_0.3.0 fastmap_1.1.1 jsonlite_1.8.8
+[13] fansi_1.0.6 scales_1.3.0 textshaping_0.3.7 getopt_1.20.4
+[17] jquerylib_0.1.4 cli_3.6.2 crayon_1.5.2 rlang_1.1.3
+[21] bit64_4.0.5 munsell_0.5.1 withr_3.0.0 cachem_1.0.8
+[25] yaml_2.3.8 parallel_4.4.0 tools_4.4.0 tzdb_0.4.0
+[29] colorspace_2.1-0 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
+[33] bit_4.0.5 vroom_1.6.5 ragg_1.3.0 pkgconfig_2.0.3
+[37] pillar_1.9.0 bslib_0.7.0 gtable_0.3.5 glue_1.7.0
+[41] systemfonts_1.0.6 highr_0.10 xfun_0.43 tidyselect_1.2.1
+[45] knitr_1.46 farver_2.1.1 htmltools_0.5.8.1 labeling_0.4.3
+[49] rmarkdown_2.26 compiler_4.4.0
+
+
+This notebook will demonstrate how to:
+|>
) to combine multiple operationsapply()
function to apply functions across rows
+or columns of a matrixWe’ll use the same gene expression dataset we used in the previous notebook. It is a +pre-processed astrocytoma +microarray dataset that we performed a set of differential expression analyses +on.
+More tidyverse resources:
+ +The tidyverse is a collection of packages that are handy for general +data wrangling, analysis, and visualization. Other packages that are +specifically handy for different biological analyses are found on Bioconductor. If we want to use +a package’s functions we first need to install them.
+Our RStudio Server already has the tidyverse
group of
+packages installed for you. But if you needed to install it or other
+packages available on CRAN, you do it using the
+install.packages()
function like this:
+install.packages("tidyverse")
.
library(tidyverse)
+
+
+── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+✔ dplyr 1.1.4 ✔ readr 2.1.5
+✔ forcats 1.0.0 ✔ stringr 1.5.1
+✔ ggplot2 3.5.1 ✔ tibble 3.2.1
+✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+✔ purrr 1.0.2
+── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+✖ dplyr::filter() masks stats::filter()
+✖ dplyr::lag() masks stats::lag()
+ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+
+
+
+::
Note that if we had not imported the tidyverse set of packages using
+library()
like above, and we wanted to use a tidyverse
+function like read_tsv()
, we would need to tell R what
+package to find this function in. To do this, we would use
+::
to tell R to load in this function from the
+readr
package by using readr::read_tsv()
. You
+will see this ::
method of referencing functions within
+packages throughout the course. We like to use it in part to remove any
+ambiguity in which version of a function we are using; it is not too
+uncommon for different packages to use the same name for very different
+functions!
Before we can import the data we need, we should double check where R
+is looking for files, aka the current working
+directory. We can do this by using the getwd()
+function, which will tell us what folder we are in.
# Let's check what directory we are in:
+getwd()
+
+
+[1] "/__w/training-modules/training-modules/intro-to-R-tidyverse"
+
+
+
+
+
+For Rmd files, the working directory is wherever the file is located, +but commands executed in the console may have a different working +directory.
+We will want to make a directory for our output and we will call this
+directory: results
. But before we create the directory, we
+should check if it already exists. We will show two ways that we can do
+this.
First, we can use the dir()
function to have R list the
+files in our working directory.
# Let's check what files are here
+dir()
+
+
+ [1] "00a-rstudio_guide.Rmd"
+ [2] "00b-debugging_resources.Rmd"
+ [3] "00c-good_scientific_coding_practices.Rmd"
+ [4] "01-intro_to_base_R-live.Rmd"
+ [5] "01-intro_to_base_R.nb.html"
+ [6] "01-intro_to_base_R.Rmd"
+ [7] "02-intro_to_ggplot2-live.Rmd"
+ [8] "02-intro_to_ggplot2.nb.html"
+ [9] "02-intro_to_ggplot2.Rmd"
+[10] "03-intro_to_tidyverse-live.Rmd"
+[11] "03-intro_to_tidyverse.nb.html"
+[12] "03-intro_to_tidyverse.Rmd"
+[13] "data"
+[14] "diagrams"
+[15] "exercise_01-intro_to_base_R.Rmd"
+[16] "exercise_02-intro_to_R.Rmd"
+[17] "exercise_03a-intro_to_tidyverse.Rmd"
+[18] "exercise_03b-intro_to_tidyverse.Rmd"
+[19] "plots"
+[20] "README.md"
+[21] "screenshots"
+[22] "scripts"
+
+
+
+
+
+This shows us there is no folder called “results” yet.
+If we want to more pointedly look for “results” in our working
+directory we can use the dir.exists()
function.
# Check if the results directory exists
+dir.exists("results")
+
+
+[1] FALSE
+
+
+
+If the above says FALSE
that means we will need to
+create a results
directory. We’ve previously seen that we
+can make directories in R using the base R function
+dir.create()
. But we’ve also seen that this function will
+throw an error if you try to create a directory that already exists,
+which can be frustrating if you are re-running code! A different option
+is to use the fs
+package, which provides functions for you to interact with your
+computer’s file system with a more consistent behavior than the base R
+functions. One function from this package is
+fs::dir_create()
(note that it has an underscore,
+not a period), and much like the base R dir.create()
, it
+creates directories. It has some other helpful features too:
+- It will simply do nothing if that directory already exists; no errors,
+and nothing will get overwritten.
+- It allows creating nested directories by default, i.e. in one call it can
+make directories inside of other directories.
Let’s go ahead and use it to create our results
+directory:
# Make a directory within the working directory called 'results'
+fs::dir_create("results")
+
+
+
+After creating the results directory above, let’s re-run
+dir.exists()
to see if now it exists.
# Re-check if the results directory exists
+dir.exists("results")
+
+
+[1] TRUE
+
+
+
+The dir.exists()
function will not work on files
+themselves. In that case, there is an analogous function called
+file.exists()
.
Try using the file.exists()
function to see if the file
+gene_results_GSE44971.tsv
exists in the current directory.
+Use the code chunk we set up for you below. Note that in our notebooks
+(and sometimes elsewhere), wherever you see a
+<FILL_IN_THE_BLANK>
like in the chunk below, that is
+meant for you to replace (including the angle brackets) with the correct
+phrase before you run the chunk (otherwise you will get an error).
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
+# Remember to use quotes to make it a character string
+file.exists(<PUT_FILE_NAME_HERE>)
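+# A filled-in sketch (the file name comes from the prose above):
+# file.exists("gene_results_GSE44971.tsv") # FALSE in this directory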
+
+
+
+It doesn’t seem that file exists in our current directory,
+but that doesn’t mean it doesn’t exist at all. In fact, this file is
+inside the relative path data/
, so let’s check
+again if the whole relative path to that file exists.
# This time, use file.path() to form your argument to file.exists()
+file.exists(<PUT_PATH_TO_FILE_HERE>)
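+# A filled-in sketch using file.path(), as the comment above suggests:
+# file.exists(file.path("data", "gene_results_GSE44971.tsv")) # TRUE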
+
+
+
+With the right relative path, we can confirm this file exists.
+Declare the name of the directory where we will read in the data.
+ + + +data_dir <- "data"
+
+
+
+Although base R has functions to read in data files, the functions in
+the readr
package (part of the tidyverse) are faster and
+more straightforward to use so we are going to use those here. Because
+the file we are reading in is a TSV (tab separated values) file we will
+be using the read_tsv
function. There are analogous
+functions for CSV (comma separated values) files
+(read_csv()
) and other file types.
stats_df <- readr::read_tsv(
+ file.path(data_dir,
+ "gene_results_GSE44971.tsv")
+ )
+
+
+Rows: 6804 Columns: 8
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (3): ensembl_id, gene_symbol, contrast
+dbl (5): log_fold_change, avg_expression, t_statistic, p_value, adj_p_value
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Following the template of the previous chunk, use this chunk to read
+in the file GSE44971.tsv
that is in the data
+folder and save it in the variable gene_df
.
# Use this chunk to read in data from the file `GSE44971.tsv`
+gene_df <- readr::read_tsv(
+ file.path(data_dir,
+ "GSE44971.tsv")
+ )
+
+
+Rows: 20056 Columns: 59
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (1): Gene
+dbl (58): GSM1094814, GSM1094815, GSM1094816, GSM1094817, GSM1094818, GSM109...
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Use this chunk to explore what gene_df
looks like.
# Explore `gene_df`
+
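+# A sketch of quick checks (dimensions come from the read_tsv message above):
+# head(gene_df) # a `Gene` character column plus 58 numeric sample columns
+# dim(gene_df) # 20056 rows, 59 columns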
+
+
+What information is contained in gene_df
?
R
pipesOne nifty feature that was added to R
in version 4.1 is
+the pipe: |>
. Pipes are very handy things that allow you
+to funnel the result of one expression to the next, making your code
+more streamlined and fluently expressing the flow of data through a
+series of operations.
Note: If you are using a version of R
prior to
+4.1 (or looking at older code), pipe functionality was available through
+the magrittr
package, which used a pipe that looked like
+this: %>%
. That pipe was the inspiration for the native
+R pipe we are using here. While there are some minor differences, you
+can mostly treat them interchangeably as long as you load the
+magrittr
package or dplyr
, which also loads
+that version of the pipe.
For example, the output from this:
+ + + +filter(stats_df, contrast == "male_female")
+
+…is the same as the output from this:
+ + + +stats_df |> filter(contrast == "male_female")
+
+This can make your code cleaner and easier to follow a series of +related commands. Let’s look at an example with our stats of of how the +same functions look with or without pipes:
+Example 1: without pipes:
+ + + +stats_arranged <- arrange(stats_df, t_statistic)
+stats_filtered <- filter(stats_arranged, avg_expression > 50)
+stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
+
+
+
+UGH, we have to keep track of all of those different intermediate +data frames and type their names so many times here! We could maybe +streamline things by using the same variable name at each stage, but +even then there is a lot of extra typing, and it is easy to get confused +about what has been done where. It’s annoying and makes it harder for +people to read.
+Example 2: Same result as 1 but with pipes!
+ + + +# Example of the same modifications as above but with pipes!
+stats_pipe <- stats_df |>
+ arrange(t_statistic) |>
+ filter(avg_expression > 50) |>
+ select(contrast, log_fold_change, p_value)
+
+
+
+What the |>
(pipe) is doing here is feeding the
+result of the expression on its left into the first argument of the next
+function (to its right, or on the next line here). We can then skip that
+first argument (the data in these cases), and move right on to the part
+we care about at that step: what we are arranging, filtering, or
+selecting in this case. The key insight that makes the pipe work here is
+to recognize that each of these functions (arrange
,
+filter
, and select
) are fundamental
+dplyr
(a tidyverse package) functions which work as “data
+in, data out.” In other words, these functions operate on data frames,
+and return data frames; you give them a data frame, and they give you
+back a data frame. Because these functions all follow a “data in, data
+out” framework, we can chain them together with pipe and send data all
+the way through the…pipeline!
Let’s double check that these versions with and without pipe yield
+the same solution by using the base R function
+all.equal()
.
all.equal(stats_nopipe, stats_pipe)
+
+
+[1] TRUE
+
+
+
+all.equal()
is letting us know that these two objects
+are the same.
Now that hopefully you are convinced that the tidyverse can help you
+make your code neater and easier to use and read, let’s go through some
+of the popular tidyverse functions so we can create pipelines like
+this.
+Let’s say we wanted to filter this gene expression dataset to
+particular sample groups. In order to do this, we would use the function
+filter()
as well as a logic statement (usually one that
+refers to a column or columns in the data frame).
# Here let's filter stats_df to only keep the gene_symbol "SNCA"
+stats_df |>
+ filter(gene_symbol == "SNCA")
+
+We can use filter()
similarly for numeric
+statements.
# Here let's filter the data to rows with average expression values above 50
+stats_df |>
+ filter(avg_expression > 50)
+
+We can apply multiple filters at once, which will require all of them +to be satisfied for every row in the results:
+ + + +# filter to highly expressed genes with contrast "male_female"
+stats_df |>
+ filter(contrast == "male_female",
+ avg_expression > 50)
+
+When we are filtering, the %in%
operator can come in
+handy if we have multiple items we would like to match. Let’s take a
+look at what using %in%
does.
genes_of_interest <- c("SNCA", "CDKN1A")
+# Are these genes present in the `gene_symbol` column in stats_df?
+stats_df$gene_symbol %in% genes_of_interest
+
+
+
+%in%
returns a logical vector that now we can use in
+dplyr::filter
.
# filter to keep only genes of interest
+stats_df |>
+ filter(gene_symbol %in% c("SNCA", "CDKN1A"))
+
+Let’s return to our first filter()
and build on to it.
+This time, let’s keep only some of the columns from the data frame using
+the select()
function. Let’s also save this as a new data
+frame called stats_filtered_df
.
# filter to highly expressed "male_female"
+# and select gene_symbol, log_fold_change and t_statistic
+stats_filtered_df <- stats_df |>
+ filter(contrast == "male_female",
+ avg_expression > 50) |>
+ select(log_fold_change, t_statistic)
+
+
+
+Let’s say we wanted to arrange this dataset so that the genes are
+arranged by the smallest p values to the largest. In order to do this,
+we would use the function arrange()
as well as the column
+we would like to sort by (in this case p_value
).
stats_df |>
+ arrange(p_value)
+
+What if we want to sort from largest to smallest? Like if we want to
+see the genes with the highest average expression? We can use the same
+function, but instead use the desc()
function and now we
+sort on the avg_expression
column.
# arrange descending by avg_expression
+stats_df |>
+ arrange(desc(avg_expression))
+
+What if we would like to create a new column of values? For that we
+use the mutate()
function.
stats_df |>
+ mutate(log10_p_value = -log10(p_value))
+
+What if we want to obtain summary statistics for a column or columns?
+The summarize
function allows us to calculate summary
+statistics for a column. Here we will use summarize to calculate two
+summary statistics of log-fold change across all genes: mean (function
+mean()
) and standard deviation (function
+sd()
).
stats_df |>
+ summarize(mean(log_fold_change),
+ sd(log_fold_change))
+
What if we’d like to obtain summary statistics but have them for
+various groups? Conveniently named, there’s a function called
+group_by()
that seamlessly allows us to do this. Also note
+that group_by()
allows us to group by multiple variables at
+a time if you want to.
stats_summary_df <- stats_df |>
+ group_by(contrast) |>
+ summarize(mean(log_fold_change),
+ sd(log_fold_change))
+
+
+
+Let’s look at a preview of what we made:
+ + + +stats_summary_df
+
+Here we have the mean log fold change expression per each contrast we +made.
+apply
family of functionsIn base R, the apply
family of functions can be an
+alternative methods for performing transformations across a data frame,
+matrix or other object structures.
One of this family is (shockingly) the function apply()
,
+which operates on matrices.
A matrix is similar to a data frame in that it is a rectangular table +of data, but it has an additional constraint: rather than each column +having a type, ALL data in a matrix has the same type.
+The first argument to apply()
is the data object we want
+to work on. The third argument is the function we will apply to each row
+or column of the data object. The second argument in specifies whether
+we are applying the function across rows or across columns (1 for rows,
+2 for columns).
Remember that gene_df
is a gene x sample gene expression
+data frame that has columns of two different types, character and
+numeric. Converting it to a matrix will require us to make them all the
+same type. We can coerce it into a matrix using
+as.matrix()
, in which case R will pick a type that it can
+convert everything to. What does it choose?
# Coerce `gene_df` into a matrix
+gene_matrix <- as.matrix(gene_df)
+
+
+
+
+
+
+# Explore the structure of the `gene_matrix` object
+str(gene_matrix)
+
+
+ chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
+ - attr(*, "dimnames")=List of 2
+ ..$ : NULL
+ ..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...
+
+
+
+While that worked, it is rare that we want numbers converted to text, +so we are going to select only the columns with numeric values before +converting it to a matrix. We can do this most easily by removing the +first column, which contains the gene names stored as character +values.
+ + + +# Let's save a new matrix object names `gene_num_matrix` containing only
+# the numeric values
+gene_num_matrix <- as.matrix(gene_df[, -1])
+
+# Explore the structure of the `gene_num_matrix` object
+str(gene_num_matrix)
+
+
+ num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
+ - attr(*, "dimnames")=List of 2
+ ..$ : NULL
+ ..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...
+
+
+
Why do we have a [, -1] after gene_df in
+the above chunk? (The -1 index drops the first column, Gene, which holds
+the character gene names, leaving only the numeric expression columns.)
Now that the matrix is all numbers, we can do things like calculate
+the column or row statistics using apply()
.
# Calculate row means
+gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here
+
+# How long will `gene_means` be?
+length(gene_means)
+
+
+[1] 20056
+
+
+
+Note that we can obtain the same results if we select just the
+columns with numeric values from the gene_df
data frame.
+This allows R to do the as.matrix() coercion automatically, and can be a
+handy shortcut if you have a mostly numeric data frame.
# Calculate row means using the `gene_df` object after removing the character column
+# apply() converts this to a matrix internally
+gene_means_from_df <- apply(gene_df[, -1], 1, mean)
+
+# Let's check that the two gene means objects are equal
+all.equal(gene_means, gene_means_from_df)
+
+
+[1] TRUE
+
+
+
+Now let’s investigate the same set up, but use 2 to
+apply
over the columns of our matrix.
# Calculate sample means
+sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here
+
+# How long will `sample_means` be?
+length(sample_means)
+
+
+[1] 58
+
+
+
+We can put the gene names back into the numeric matrix object by +assigning them as rownames.
+ + + +# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
+# the `rownames()` function
+rownames(gene_num_matrix) <- gene_df$Gene
+
+# Explore the `gene_num_matrix` object
+head(gene_num_matrix)
+
+
+ GSM1094814 GSM1094815 GSM1094816 GSM1094817 GSM1094818
+ENSG00000000003 9.59510150 8.4785070 12.6802129 8.677614838 10.75552946
+ GSM1094819 GSM1094820 GSM1094821 GSM1094822 GSM1094823
+ENSG00000000003 6.37470691 9.10028584 7.3546860 8.51847190 9.4216113
+ GSM1094824 GSM1094825 GSM1094826 GSM1094827 GSM1094828
+ENSG00000000003 5.0239629 7.89737460 8.1126876 7.03444640 9.6984918
+ GSM1094829 GSM1094830 GSM1094831 GSM1094832 GSM1094833
+ENSG00000000003 13.98689230 10.5868331 7.6836223 8.3862587 11.18932763
+ GSM1094834 GSM1094835 GSM1094836 GSM1094837 GSM1094838
+ENSG00000000003 9.7562003 9.6984918 10.56891510 9.9391025 7.8738131
+ GSM1094839 GSM1094840 GSM1094841 GSM1094842 GSM1094843
+ENSG00000000003 8.6311353 8.58077557 9.1579585 6.3317019 10.1939387
+ GSM1094844 GSM1094845 GSM1094846 GSM1094847 GSM1094848
+ENSG00000000003 10.44364159 9.62435722 16.05075944 6.9334508 8.55180910
+ GSM1094849 GSM1094850 GSM1094851 GSM1094852 GSM1094853
+ENSG00000000003 9.29497760 7.5027098 6.9593119 8.33588532 8.16826110
+ GSM1094854 GSM1094855 GSM1094856 GSM1094857 GSM1094858
+ENSG00000000003 9.8020077 7.92580451 8.5122426 8.5300217 6.45774124
+ GSM1094859 GSM1094860 GSM1094861 GSM1094862 GSM1094863
+ENSG00000000003 8.06834906 6.29704946 9.59510150 8.2198571 6.0207988
+ GSM1094864 GSM1094865 GSM1094866 GSM1094867 GSM1094868
+ENSG00000000003 10.60783176 10.23536609 9.031964212 7.66629540 1.06494502
+ GSM1094869 GSM1094870 GSM1094871
+ENSG00000000003 1.0408332 1.7262079 1.0292255
+ [ reached getOption("max.print") -- omitted 5 rows ]
+
+
+
+Row names like this can be very convenient for keeping matrices +organized, but row names (and column names) can be lost or misordered if +you are not careful, especially during input and output, so treat them +with care.
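+As a concrete example of that caution, here is a small sketch (the data
+frame and file names here are just illustrations): readr’s write functions
+keep only columns, so row names silently disappear unless you move them
+into a real column first.
+# Row names are dropped by readr's write_*() functions, so move them
+# into a regular column before writing (sketch; names are examples)
+gene_out_df <- data.frame(gene_num_matrix) |>
+  tibble::rownames_to_column("gene_id")
+# readr::write_tsv(gene_out_df, file.path("results", "gene_matrix.tsv"))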
+Although the apply
functions may not be as easy to use
+as the tidyverse functions, for some applications, apply
+methods can be better suited. In this workshop, we will not delve too
+deeply into the various other apply functions (tapply()
,
+lapply()
, etc.) but you can read more information about
+them here.
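+To give a small taste anyway, here is a self-contained sketch (with
+made-up data, not our gene results) of how lapply() and sapply()
+differ:
+# lapply() applies a function to each element of a list, returning a list;
+# sapply() does the same but simplifies the result to a vector when it can
+measurements <- list(a = c(1, 2, 3), b = c(10, 20))
+lapply(measurements, mean)  # a list of means
+sapply(measurements, mean)  # a named numeric vector of means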
Let’s say we have a scenario where we have two data frames that we
+would like to combine. Recall that stats_df
and
+gene_df
are data frames that contain information about some
+of the same genes. The dplyr::join
+family of functions is useful for various scenarios of combining
+data frames. For a visual explanation, the tidyexplain
+project has some helpful
+animations of joins.
For now, we will focus on inner_join()
, which will
+combine data frames by only keeping information about matching rows that
+are in both data frames. We need to use the by
argument to
+designate what column(s) should be used as a key to match the data
+frames. In this case we want to match the gene information between the
+two, so we will specify that we want to compare values in the
+ensembl_id
column from stats_df
to the
+Gene
column from gene_df
.
stats_df |>
+ # Join based on their shared column
+ # Called ensembl_id in stats_df and called Gene in gene_df
+ inner_join(gene_df, by = c('ensembl_id' = 'Gene'))
+
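+For contrast, a left_join() would keep every row of the first (left)
+data frame whether or not it has a match, filling in NA for unmatched
+columns; a quick sketch using the same key columns:
+# Keep all rows of stats_df; genes absent from gene_df get NA values
+stats_df |>
+  left_join(gene_df, by = c('ensembl_id' = 'Gene'))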
+Let’s write some of the data frames we created to a file. To do this,
+we can use the readr
library of write_()
+functions. The first argument of write_tsv()
is the data we
+want to write, and the second argument is a character string that
+describes the path to the new file we would like to create. Remember
+that we created a results
directory to put our output in,
+but if we want to save our data to a directory other than our working
+directory, we need to specify this. This is what we will use the
+file.path()
function for. Let’s look in a bit more detail
+what file.path()
does, by examining the results of the
+function in the examples below.
# Which of these file paths is what we want to use to save our data to the
+# results directory we created at the beginning of this notebook?
+file.path("docker-install", "stats_summary.tsv")
+
+
+[1] "docker-install/stats_summary.tsv"
+
+
+docker-install/stats_summary.tsv
+
+
+file.path("results", "stats_summary.tsv")
+
+
+[1] "results/stats_summary.tsv"
+
+
+results/stats_summary.tsv
+
+
+file.path("stats_summary.tsv", "results")
+
+
+[1] "stats_summary.tsv/results"
+
+
+stats_summary.tsv/results
+
+
+
+Replace <NEW_FILE_PATH>
below with the
+file.path()
statement from above that will successfully
+save our file to the results
folder.
# Write our data frame to a TSV file
+readr::write_tsv(stats_summary_df, file.path("results", "stats_summary.tsv"))
+
+
+
+Check in your results
directory to see if your new file
+has successfully saved.
For this example we have been working with data frames, which are
+conveniently represented as TSV or CSV tables. However, in other
+situations we may want to save more complicated or very large data
+structures. In those cases, RDS (R Data Serialized/Single) files may be a better option
+for saving our data. RDS is R’s special file format for holding data
+exactly as you have it in your R environment. RDS files can also be
+compressed, meaning they will take up less space on your computer. Let’s
+save our data to an RDS file in our results
folder. You
+will need to replace the .tsv
with .RDS
, but
+you can use what we determined as our file path for the last chunk as
+your template.
# Write your object to an RDS file
+readr::write_rds(stats_summary_df, file.path("results", "stats_summary.RDS"))
+
+
+
+Now that you have learned the readr functions read_tsv(),
+write_tsv(), and write_rds(), what do you suppose the function you will
+need to read your RDS file is called? Use that function here to
+re-import your data in the chunk we set up for you below.
# Read in your RDS file
+reimport_df <- readr::read_rds(file.path("results", "stats_summary.RDS"))
+
+
+
+As is good practice, we will end this session by printing out our +session info.
+# Print out the versions and packages we are using in this session
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+ [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
+ [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
+ [9] ggplot2_3.5.1 tidyverse_2.0.0 optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
+ [5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_0.23
+ [9] grid_4.4.0 timechange_0.3.0 fastmap_1.1.1 jsonlite_1.8.8
+[13] fansi_1.0.6 scales_1.3.0 getopt_1.20.4 jquerylib_0.1.4
+[17] cli_3.6.2 crayon_1.5.2 rlang_1.1.3 bit64_4.0.5
+[21] munsell_0.5.1 withr_3.0.0 cachem_1.0.8 yaml_2.3.8
+[25] parallel_4.4.0 tools_4.4.0 tzdb_0.4.0 colorspace_2.1-0
+[29] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 bit_4.0.5
+[33] fs_1.6.4 vroom_1.6.5 pkgconfig_2.0.3 pillar_1.9.0
+[37] bslib_0.7.0 gtable_0.3.5 glue_1.7.0 xfun_0.43
+[41] tidyselect_1.2.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
+[45] compiler_4.4.0
+
+
+Single-cell RNA-seq (scRNA-seq) technologies can be divided into two +categories, tag-based and full-length, based on their capture methods +and quantitative nature.
+In tag-based scRNA-seq, cells are separated by +emulsion/droplets, and individual cells are given a unique cell barcode +prior to sequencing. An example of tag-based scRNA-seq is 10x Genomics +(Zheng et +al. 2017).
+In full-length scRNA-seq, cells are physically separated +into individual wells of a plate and are often also sorted by other +means (e.g., Fluorescence Activated Cell Sorting). With full-length +scRNA-seq, each cell is sequenced individually and has its own fastq +file. An example of full-length scRNA-seq is Smart-seq2 (Picelli et +al. 2014).
+For the purposes of this tutorial, we will focus on tag-based +scRNA-seq, but it is important to keep in mind that the pre-processing +steps and the biases to look out for in post-processing vary based on +technology and how the cells are sorted.
+For more extensive background on single-cell experimental methods,
+Predeus et al. also have a very good tutorial for scRNA-seq. We
+will also refer extensively to the book Orchestrating
+Single-Cell Analysis with Bioconductor (Amezquita et al.).
+Example: 10x Genomics (Zheng et +al. 2017) Individual cells are separated by emulsion/droplets +prior to cell lysis. Transcripts from each cell are then tagged with two +barcodes: a cell-specific barcode and a Unique Molecular Identifier +(UMI) (Islam et +al. 2014). All transcripts from all cells are then pooled +together and undergo PCR amplification and sequencing as if they are one +sample.
+Tagging of each transcript with a different UMI before amplification +allows the identification of PCR duplicates, allowing control for PCR +amplification errors and biases. Individual samples have two fastq +files: one for the cell and UMI barcodes (R1) and another with the +transcript sequence reads (R2).
+This notebook will demonstrate how to:
+In this notebook, we will be running through the basics of processing +raw single-cell RNA-seq data.
+We will be using a tag-based scRNA-seq sample from the Tabula
+Muris project. This dataset is made up of data from 20 mouse organs
+that were sequenced using 10x Genomics Chromium single cell sequencing
+methods. For 10x Genomics scRNA-seq data, cells are separated by
+emulsion/droplets, and individual cells are given barcodes (often
+abbreviated ‘CB’ in documentation). Each transcript will also contain a
+Unique Molecular Identifier (UMI), which allows us to examine PCR
+amplification errors and biases.
+We obtained these data from the Tabula Muris project’s Figshare.
+The BAM files that were on Figshare were converted to fastq
+files using bamtofastq
+from 10x Genomics. We will process a fastq
file from mouse
+bladder for this as an example.
+To limit the amount of time this takes to run in the context of this
+workshop, we are only running part of the sample’s reads.
Note: Depending on the format of the data you are working
+with, i.e., if you have a set of .bcl
files, you may need
+to use cellranger mkfastq
+to create .fastq
files for each sample. However, most
+public data is available in fastq
format, and most
+sequencing cores will generate the .fastq
files, so that is
+where we will start.
If you have opened the scRNA-seq.Rproj
file, your
+Terminal should already be set to the scRNA-seq
directory,
+but it is worth checking with the pwd
command in the
+Terminal (or by looking at the path shown in the command prompt or at
+the top of the Terminal pane).
If you are in a different directory, we will want to use
+cd
to change to the training-modules/scRNA-seq
+directory.
Copy and paste the text in the code blocks below into your
+Terminal
window in RStudio. It should be in the lower left
+hand corner as a tab next to Console
.
cd ~/training-modules/scRNA-seq
+Once you are there, you should be able to run the following command
+in the Terminal to look at the contents of the
+data/tabula-muris
directory:
ls data/tabula-muris
+Here you will see the fastq
directory, which is actually
+a link to a shared folder with the raw fastq files, split by sample. We
+will use these files, but we will not write to this directory.
We can look inside the contents of the fastq
directory,
+and we should see 16 subfolders corresponding to 16 different samples.
+Within each of these folders should be the fastq
files
+associated with that sample. Again, we can use the ls
+command to show the contents of each of the directories.
In this scenario, 10X_P4_3
refers to the sample name
+that we will be processing, which contains data from mouse bladder
+cells.
ls data/tabula-muris/fastq/10X_P4_3
+You should see a list of multiple fastq
files all
+starting with 10X_P4_3
, indicating the sample name.
If you notice, each fastq
file name contains either
+R1
or R2
. These correspond to the two
+sequencing reads of a paired-end library. For 10x data, the first read
+(the R1
file) will contain the cell barcode and UMI
+sequence, and the second read (the R2
file) will contain a
+cDNA sequence corresponding to the captured transcript. We will need
+both of these files to quantify our data.
10X_P4_3_L001_R1_001.fastq.gz 10X_P4_3_L002_R1_001.fastq.gz
+10X_P4_3_L001_R1_002.fastq.gz 10X_P4_3_L002_R1_002.fastq.gz
+10X_P4_3_L001_R1_003.fastq.gz 10X_P4_3_L002_R1_003.fastq.gz
+10X_P4_3_L001_R2_001.fastq.gz 10X_P4_3_L002_R2_001.fastq.gz
+10X_P4_3_L001_R2_002.fastq.gz 10X_P4_3_L002_R2_002.fastq.gz
+10X_P4_3_L001_R2_003.fastq.gz 10X_P4_3_L002_R2_003.fastq.gz
+Sequencing runs are often split into multiple fastq
+files, both when a sample was run across multiple lanes and to keep the
+individual file sizes down. This was the case for the Tabula
+Muris data we are using here as well. The files that you see in the
+data/tabula-muris/fastq/10X_P4_3
directory shown above
+represent 2 lanes worth of data, with three R1 and three R2 files per
+lane.
You will also see the file TM_droplet_metadata.csv
,
+which contains metadata for the Tabula Muris experiments.
Now that we are in scRNA-seq
, we’ll make a directory for
+us to store our quantification files in. In Terminal
, run
+the following command:
mkdir -p data/tabula-muris/alevin-quant/10X_P4_3_subset
+Alevin +is run from the command line (Terminal) to perform mapping and +quantification of tag-based single cell expression data.
+Before we can quantify with Salmon and Alevin, we need to index the
+transcriptome for the species we will be mapping to. This step is the
+same as for mapping bulk RNA-seq data, and you can use the same
+transcriptome indexes. However, due to the shorter read lengths in 10x
+sequencing, we may want to use a shorter k-mer length than salmon’s
+default. In this instance, we used a -k of 23.
In the interest of time, we have already run the command below and +have the index built and ready for you in a shared directory.
+But for your own reference, here is how you might do it yourself:
+# salmon --threads=16 --no-version-check index \
+# -t Mus_musculus.GRCm38.cdna.all.fa.gz \
+# -i index/Mus_musculus/short_index \
+# -k 23
+Scripts to build the indexes like those we are using here (and +others) can be found in this +repository.
+Copy and paste this in your Terminal
to run the Alevin
+quantification. This will take about 20 minutes to run, so we will start
+now, then talk about the options.
Note that here we are only giving the full paths to one of the
+R1
files and one of the R2
files. For the sake
+of time, we are only going to be running this on a subset of reads, but
+will also show you how to run it on the full sample.
salmon alevin \
+ -i index/Mus_musculus/short_index \
+ -l ISR \
+ -1 data/tabula-muris/fastq/10X_P4_3/10X_P4_3_L001_R1_001.fastq.gz \
+ -2 data/tabula-muris/fastq/10X_P4_3/10X_P4_3_L001_R2_001.fastq.gz \
+ -o data/tabula-muris/alevin-quant/10X_P4_3_subset \
+ -p 4 \
+ --chromium \
+ --tgMap index/Mus_musculus/Mus_musculus.GRCm38.95.versioned_tx2gene.tsv \
+ --dumpFeatures
+For detailed information about all options available see the Alevin +documentation and Salmon +documentation.
+Many of the options for the salmon alevin
command are
+the same as those you would see when mapping and quantifying bulk
+RNA-seq data with salmon quant
:
-i
gives the location of the transcriptome index-1
and -2
are the paired read input
+files-o
designates the output folder-p
allows us to specify how many processors to use; in
+this case we will use 4-l
The -l
option is for designating the library format. For
+most single-cell quantification, you will want to use the
+ISR
library type. See Salmon’s
+documentation for more information on fragment library types (and
+all of the other options available). Note that this option must come
+before the read files.
--chromium
Because we are using 10x v2 chromium data, we have to use this flag
+to tell alevin
where to expect the barcodes, UMIs and
+sequence data. If we were using 10x v3 data, we would need the
+--chromiumV3
flag instead. Drop-seq data is also supported,
+for which we would use the --dropseq
flag instead of
+this.
--tgMap
The transcriptome file that we are mapping to has separate sequences
+for each transcript of a gene, but due to the sparse nature of
+single-cell data, we are not likely to be able to meaningfully
+distinguish among different transcripts. For this reason,
+alevin
will quantify our results at the gene level, so we
+need to provide a file that maps each transcript to its gene. For this
+example, we’ve pre-made the file
+Mus_musculus.GRCm38.95.versioned_tx2gene.tsv
from the
+Ensembl transcriptome that we indexed above. The file is a TSV
+(tab-separated values) file with 2 columns: one of transcripts and the
+other the gene that each comes from.
--dumpFeatures
This option will print out information that we will use for quality +checks later on, including files with information on the UMIs and cell +barcodes.
+See the Alevin +documentation for a complete list of the Alevin options. There are +also a number of example analyses at the Alevin +tutorial website.
+When we took a look at the
+data/tabula-muris/fastq/10X_P4_3
directory earlier, we
+noticed that there were multiple files representing 2 lanes worth of
+data, with three R1 and three R2 files per lane:
We should really run all of these through Salmon, though that will
+take about six times as long as the single pair of reads we used. To do
+this, we could list each R1 and R2 file (space separated) after the
+-1
and -2
arguments, respectively. But that is
+a lot of typing, so a nice shortcut is to use a *
character
+to represent a wildcard that will be filled in with whatever characters
+are present in the files at the given path. In the pattern
+10X_P4_3_L*_R1_*.fastq.gz
, this would allow any lane number
+and any subset, so would match all of the following files (all the R1
+files in this case):
10X_P4_3_L001_R1_001.fastq.gz 10X_P4_3_L002_R1_001.fastq.gz
+10X_P4_3_L001_R1_002.fastq.gz 10X_P4_3_L002_R1_002.fastq.gz
+10X_P4_3_L001_R1_003.fastq.gz 10X_P4_3_L002_R1_003.fastq.gz
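+If you want to double-check what a pattern will match, you can expand
+the same wildcard from R with Sys.glob() (a quick sketch, assuming you
+run it from the scRNA-seq directory):
+# List the R1 files matched by the wildcard pattern
+Sys.glob(file.path("data", "tabula-muris", "fastq", "10X_P4_3",
+                   "10X_P4_3_L*_R1_*.fastq.gz"))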
+For this directory, that would make our full
+salmon alevin
command look like this (don’t run this
+now!):
# salmon alevin \
+# -i index/Mus_musculus/short_index \
+# -l ISR \
+# -1 data/tabula-muris/fastq/10X_P4_3/10X_P4_3_L*_R1_*.fastq.gz \
+# -2 data/tabula-muris/fastq/10X_P4_3/10X_P4_3_L*_R2_*.fastq.gz \
+# -o data/tabula-muris/alevin-quant/10X_P4_3 \
+# -p 4 \
+# --chromium \
+# --tgMap index/Mus_musculus/Mus_musculus.GRCm38.95.versioned_tx2gene.tsv \
+# --dumpFeatures
+In general, you will want to run all lanes and all files for a given
+sample together. But DO NOT combine multiple
+samples into a single alevin
quantification! Keep
+separate samples (and replicates) separate!
alevinQC
Now that we have quantified our data with Alevin, we are ready to +perform initial quality control checks.
+In order to perform these quality control checks, we’ll use the
+alevinQC
R package. Note that alevinQC
depends
+on files that we get using the --dumpFeatures
option in
+Alevin.
About the alevinQCReport()
function: The first argument
+needs to be where the sample’s output data was put when Alevin was run
+(as a character string, aka using quotes). The rest of
+alevinQCReport()
’s arguments tell R where to put the output
+QC report.
# First, define path to alevin output:
+alevin_path <- file.path("data", "tabula-muris", "alevin-quant", "10X_P4_3_subset")
+
+# Produce a QC report of results found in the `alevin_path` directory
+alevinQC::alevinQCReport(alevin_path,
+ sampleId = "10X_P4_3_subset",
+ outputFile = "10X_P4_3_subset-qc_report.html",
+ outputDir = "qc-reports")
+
+
+
+Look for the 10X_P4_3_subset-qc_report.html
file created
+in the qc-reports
directory to examine the quality of your
+data and performance of Alevin.
We have also placed an example of a poor quality sample alevinQC
+report in the qc-reports
directory, with the name
+Bad_Example_10X_P4_2_qc_report.html
.
This report will show a few key metrics that inform you about the
+quality of your sample. There is a lot of information included in the
+report, so some key metrics to note are included in the
+Summary tables
:
The fraction of reads in whitelist barcodes is particularly important +as a low percentage here means the library contains many reads that do +not contain the expected cell barcodes. This is indicative of poor +single-cell capture during library construction.
+The mean number of reads per cell and median number of detected genes +per cell can be helpful in understanding how deeply the library was +sequenced. The higher these numbers are, the more information you will +obtain per cell.
+The knee plot shows the number of distinct UMIs for +each possible cell barcode on the y-axis, with the barcodes ranked from +the most UMIs to the fewest along the x-axis.
+Cell barcodes with low UMI counts are likely to be empty droplets +that did not contain a cell. These droplets must be filtered out so we +only consider true cells for downstream analysis.
+To do this, we can look for a “knee” on the curve where the number of +UMIs per barcode starts to drop off rapidly, with the intuition that +this is where we are reaching the end of the UMIs per cell distribution +for true cells. We can then choose a threshold below the knee and only +include barcodes above this threshold in the final cell barcode +list.
+This “knee” method, which is implemented by alevin
, is
+fairly effective and does not require any read mapping or quantification
+before filtering. More recent versions of Cell Ranger use a somewhat
+different method based on the “empty drops” method of Lun et al.
+(2019), that is applied after initial gene quantification. This
+allows filtering to retain cells with low counts that are nonetheless
+likely to represent real cells.
After we have successfully quantified our tag-based scRNA-seq data
+(and done some QC), we will want to read it into R to start to analyze
+it. The easiest way to do this is to use the tximeta
+package, which we will introduce in the next notebook.
sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats graphics grDevices utils datasets methods base
+
+other attached packages:
+[1] optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] digest_0.6.35 R6_2.5.1 fastmap_1.1.1 xfun_0.43
+ [5] magrittr_2.0.3 cachem_1.0.8 getopt_1.20.4 glue_1.7.0
+ [9] stringr_1.5.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
+[13] lifecycle_1.0.4 cli_3.6.2 sass_0.4.9 vctrs_0.6.5
+[17] jquerylib_0.1.4 compiler_4.4.0 tools_4.4.0 bslib_0.7.0
+[21] evaluate_0.23 yaml_2.3.8 jsonlite_1.8.8 rlang_1.1.3
+[25] stringi_1.8.3
+
+
+This notebook will demonstrate how to:
+tximeta
We will continue with the Tabula Muris data set that we started with +in the previous notebook.
+# tximeta for importing alevin results
+library(tximeta)
+
+
+Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
+'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
+
+
+# SingleCellExperiment package for organizing our results
+library(SingleCellExperiment)
+
+
+Loading required package: SummarizedExperiment
+
+
+Loading required package: MatrixGenerics
+
+
+Loading required package: matrixStats
+
+
+
+Attaching package: 'MatrixGenerics'
+
+
+The following objects are masked from 'package:matrixStats':
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: GenomicRanges
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+ colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+ get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
+ match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
+ Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
+ tapply, union, unique, unsplit, which.max, which.min
+
+
+Loading required package: S4Vectors
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:utils':
+
+ findMatches
+
+
+The following objects are masked from 'package:base':
+
+ expand.grid, I, unname
+
+
+Loading required package: IRanges
+
+
+Loading required package: GenomeInfoDb
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: 'Biobase'
+
+
+The following object is masked from 'package:MatrixGenerics':
+
+ rowMedians
+
+
+The following objects are masked from 'package:matrixStats':
+
+ anyMissing, rowMedians
+
+
+# GGPlot2 for the plots
+library(ggplot2)
+
+
+
+The data files we will be using for this part of the project are in
+the data/tabula-muris
subdirectory of the
+scRNA-seq
directory where this notebook is located.
The main files we will be using at this stage are the results from
+our earlier quantification, located in the alevin-quant
+subdirectory. Rather than just the subset, we will use the full data in
+order to get a somewhat more realistic view of a 10x data set. This data
+set is still a few years old though: newer datasets will tend to have
+more cells!
# main data directory
+data_dir <- file.path("data", "tabula-muris")
+
+# reference files
+ref_dir <- file.path("data", "reference")
+
+# Path to the single-sample alevin results
+alevin_file <- file.path(data_dir, "alevin-quant",
+ "10X_P4_3", "alevin", "quants_mat.gz")
+
+# Mitochondrial gene table
+mito_file <- file.path(ref_dir,
+ "mm_mitochondrial_genes.tsv")
+
+# create the output directory using fs::dir_create()
+filtered_dir <- file.path(data_dir, "filtered")
+fs::dir_create(filtered_dir)
+
+# Output file
+filtered_sce_file <- file.path(filtered_dir, "filtered_sce.rds")
+
+
+
+tximeta
+needs a data frame with at least these two columns:
+- a files column with the file paths to the quants_mat.gz files
+- a names column with the sample names
In this case, we are only importing a single experiment, so we will +create a data frame with only one row.
+ + + +coldata <- data.frame(files = alevin_file,
+ names = "10X_P4_3")
+
+
+
+Using the coldata
data frame that we set up, we can now
+run the tximeta()
to import our expression data while
+automatically finding and associating the transcript annotations that
+were used when we performed the quantification.
The first time you run tximeta()
you may get a message
+about storing downloaded transcriptome data in a cache directory so that
+it can retrieve the data more quickly the next time. We recommend you
+use the cache, and accept the default location.
# Read in alevin results with tximeta
+bladder_sce <- tximeta(coldata, type = "alevin")
+
+
+importing quantifications
+
+
+importing alevin data is much faster after installing 'eds'
+
+
+reading in alevin gene-level counts across cells
+
+
+found matching transcriptome:
+[ Ensembl - Mus musculus - release 95 ]
+
+
+useHub=TRUE: checking for EnsDb via 'AnnotationHub'
+
+
+found matching EnsDb via 'AnnotationHub'
+
+
+downloading 1 resources
+
+
+retrieving 1 resource
+
+
+loading from cache
+
+
+require("ensembldb")
+
+
+generating gene ranges
+
+
+generating gene ranges
+
+
+
+A quick aside! When we ran alevinQC
on this data in the
+last notebook, we saw that salmon alevin
had identified a
+“whitelist” of barcodes that passed its quality control standards. We
+could use this filtered list directly, but salmon alevin
+can be quite strict, and methods for filtering are quite variable. Instead,
+we will use the default behavior of tximeta()
and read in
+all of the barcodes for which there is a non-zero UMI count (after
+barcode correction). If you wanted instead to include only barcodes
+that passed salmon alevin
’s filter, you could supply the
+additional argument alevinArgs = list(filterBarcodes=TRUE)
+to the tximeta()
function. Even if you do choose to read in
+pre-filtered data, it’s still important to explore the data as we’re
+about to do here and potentially filter further based on your
+observations, in particular since mapping software’s quality control
+measures (spoilers!) don’t always filter based on mitochondrial gene
+content.
In the intro-to-R-tidyverse module notebook,
+01-intro-to-base_R.Rmd
, we discuss base R object types, but
+there are some ‘special’ object types that are package-specific.
+tximeta
creates a SummarizedExperiment
object
+(or more specifically a RangedSummarizedExperiment
object),
+which is used by many Bioconductor packages to store and process results
+from gene expression studies.
# Explore the SummarizedExperiment data
+bladder_sce
+
+
+class: RangedSummarizedExperiment
+dim: 35429 344
+metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
+assays(1): counts
+rownames(35429): ENSMUSG00000000001 ENSMUSG00000000003 ...
+ ENSMUSG00000117649 ENSMUSG00000117651
+rowData names(8): gene_id gene_name ... symbol entrezid
+colnames(344): CGGAGTCAGTACGCCC TTGGCAACATGATCCA ... ACGTCAAGTGTAATGA
+ ATTACTCAGAGAACAG
+colData names(0):
+
+
+
+The main component we are concerned with for now is the
+counts
matrix, which is stored as an “assay”, with a row
+for each gene and a column for each cell. In this case, we can see there
+is information for 35,429 genes, and Alevin reports data for 344
+cells.
tximeta
also automatically added some annotation
+information about each gene, which can be seen by extracting the
+rowData
table.
# Examine row (gene) metadata
+rowData(bladder_sce)
+
+
+DataFrame with 35429 rows and 8 columns
+ gene_id gene_name gene_biotype
+ <character> <character> <character>
+ENSMUSG00000000001 ENSMUSG00000000001 Gnai3 protein_coding
+ENSMUSG00000000003 ENSMUSG00000000003 Pbsn protein_coding
+ENSMUSG00000000028 ENSMUSG00000000028 Cdc45 protein_coding
+ENSMUSG00000000037 ENSMUSG00000000037 Scml2 protein_coding
+ENSMUSG00000000049 ENSMUSG00000000049 Apoh protein_coding
+... ... ... ...
+ENSMUSG00000117643 ENSMUSG00000117643 AC122453.2 processed_pseudogene
+ENSMUSG00000117644 ENSMUSG00000117644 AC108777.1 processed_pseudogene
+ENSMUSG00000117646 ENSMUSG00000117646 AC122271.3 processed_pseudogene
+ENSMUSG00000117649 ENSMUSG00000117649 AC165087.2 processed_pseudogene
+ENSMUSG00000117651 ENSMUSG00000117651 CT485613.6 processed_pseudogene
+ seq_coord_system description
+ <character> <character>
+ENSMUSG00000000001 chromosome guanine nucleotide b..
+ENSMUSG00000000003 chromosome probasin [Source:MGI..
+ENSMUSG00000000028 chromosome cell division cycle ..
+ENSMUSG00000000037 chromosome Scm polycomb group p..
+ENSMUSG00000000049 chromosome apolipoprotein H [So..
+... ... ...
+ENSMUSG00000117643 chromosome Wilms tumour 1-assoc..
+ENSMUSG00000117644 chromosome gametocyte specific ..
+ENSMUSG00000117646 chromosome developmental plurip..
+ENSMUSG00000117649 chromosome heterogeneous nuclea..
+ENSMUSG00000117651 chromosome NSE1 homolog, SMC5-S..
+ gene_id_version symbol entrezid
+ <character> <character> <list>
+ENSMUSG00000000001 ENSMUSG00000000001.4 Gnai3 14679
+ENSMUSG00000000003 ENSMUSG00000000003.15 Pbsn 54192
+ENSMUSG00000000028 ENSMUSG00000000028.15 Cdc45 12544
+ENSMUSG00000000037 ENSMUSG00000000037.16 Scml2 107815
+ENSMUSG00000000049 ENSMUSG00000000049.11 Apoh 11818
+... ... ... ...
+ENSMUSG00000117643 ENSMUSG00000117643.1 AC122453.2 NA
+ENSMUSG00000117644 ENSMUSG00000117644.1 AC108777.1 NA
+ENSMUSG00000117646 ENSMUSG00000117646.1 AC122271.3 NA
+ENSMUSG00000117649 ENSMUSG00000117649.1 AC165087.2 NA
+ENSMUSG00000117651 ENSMUSG00000117651.1 CT485613.6 NA
+
+
+
+We could leave the object as it is, but we can unlock some extra
+functionality by converting this from a
+SummarizedExperiment
object to a
+SingleCellExperiment
, so we will go ahead and do that next.
+SingleCellExperiment
objects are a subtype of
+SummarizedExperiment
objects that a lot of single-cell
+analysis R packages use, so we will try to get acquainted with them.
For more information on SingleCellExperiment
objects, as
+well as many other topics related to this course, we highly recommend
+the e-book Orchestrating
+Single-Cell Analysis with Bioconductor (OSCA) and/or Amezquita
+et al. (2020).
Below is a figure from OSCA that shows the general structure of
+SingleCellExperiment
objects.
+Note that there are slots for raw data, metadata about cells,
+metadata about genes or features, and slots for various transformations
+of the input data. Many of these will not be filled in when we first
+create the object, but as we proceed through the workshop we will add in
+more data to these slots as we compute new summaries and
+transformations.
+To perform the conversion to a SingleCellExperiment
, we
+will use the R function as()
, which “coerces” objects from
+one type to another.
# Convert the SummarizedExperiment to a SingleCellExperiment
+bladder_sce <- as(bladder_sce, "SingleCellExperiment")
+bladder_sce
+
+
+class: SingleCellExperiment
+dim: 35429 344
+metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
+assays(1): counts
+rownames(35429): ENSMUSG00000000001 ENSMUSG00000000003 ...
+ ENSMUSG00000117649 ENSMUSG00000117651
+rowData names(8): gene_id gene_name ... symbol entrezid
+colnames(344): CGGAGTCAGTACGCCC TTGGCAACATGATCCA ... ACGTCAAGTGTAATGA
+ ATTACTCAGAGAACAG
+colData names(0):
+reducedDimNames(0):
+mainExpName: NULL
+altExpNames(0):
+
+
+
+Doing this added a couple of (currently empty) slots for things like +dimensionality reduction results and alternative feature experiments. +Foreshadowing!
+For a first pass at the data, we will extract just the counts matrix
+from the SingleCellExperiment
object, and use some base R
+functions to look at our results.
We can extract the gene by cell count matrix using the
+counts()
function. This actually returns a special format
+of matrix called a “sparse” matrix. Since single cell count data is
+mostly zeros, this format (a dgCMatrix
object) allows R to
+save a lot of memory. This object takes up about 6.4 MB, but if we
+stored it in the normal format, it would be closer to 100 MB!
+Thankfully, most of the functions that we use to work with regular
+matrices work just fine with these as well.
sc_counts <- counts(bladder_sce)
+
+
+
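+If you are curious about those memory savings, you can check them for
+yourself (exact sizes will vary a bit with package versions):
+# Compare the sparse matrix to its dense equivalent
+format(object.size(sc_counts), units = "MB")
+format(object.size(as.matrix(sc_counts)), units = "MB")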
+Let’s look at the mean expression of the genes in this dataset. We
+will use apply()
in order to calculate things across our
+count matrix. The second argument in apply()
specifies
+whether we are calculating by rows or columns. (1 = rows, 2 =
+columns).
In the code chunk below, use apply()
with the correct
+arguments to calculate the gene means.
# Let's calculate the gene means (by row)
+gene_means <- apply(sc_counts, 1, mean)
+
+
+
+This works just fine, but you may have noticed it is a bit slow. For
+a few common summary functions like means and sums, R has much more
+efficient functions to calculate across rows or columns. In this case,
+we can use rowMeans()
to do the same calculation much more
+quickly.
# use rowMeans() to calculate gene means
+gene_means <- rowMeans(sc_counts)
+
+
+
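+You can see the speed difference for yourself with system.time()
+(timings will vary by machine, but rowMeans() should win comfortably):
+# Time both approaches to the same calculation
+system.time(apply(sc_counts, 1, mean))
+system.time(rowMeans(sc_counts))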
+Let’s make our first density plot with these data. We will use
+ggplot()
as you have seen before, but since the object we
+want to plot, gene_means
, is a vector not a data frame, we
+will skip the data
argument and go straight to the
+mapping
aesthetics. The remainder of the
+ggplot
code should look familiar.
# Plot the density of the means using ggplot2
+ggplot(mapping = aes(x = gene_means)) +
+ geom_density() +
+ labs(x = "Mean gene count")
+
+
+
+
+
+
+That plot is not quite as informative as we might like, as a few
+genes with high expression are making the scale just a bit
+wide. Let’s zoom in on the left part of the graph by adding an
+xlim()
argument. (Note that xlim()
will remove
+points outside the specified range, so you will get a warning.)
# Plot the density of the means using ggplot2
+ggplot(mapping = aes(x = gene_means)) +
+ geom_density() +
+ labs(x = "Mean gene count") +
+ xlim(0, 5)
+
+
+Warning: Removed 203 rows containing non-finite outside the scale range
+(`stat_density()`).
+
+
+
+
+
+
+Even as we zoom in, the counts data has many zeroes, which we +actually expect in a single cell RNA-seq experiment.
+Let’s calculate what proportion of the count data is zeros:
+ + + +sum(sc_counts == 0)/(nrow(sc_counts) * ncol(sc_counts))
+
+
+[1] 0.9447591
+
+
+
+The small amount of RNA in a single cell results in higher chances of +errors and biases in RNA isolation, amplification, and sequencing. We +should check that the overall data we observe for each sample/cell are +reasonable before proceeding too far.
+The next section explores some of the ways we can filter the data set +to clean things up before we continue to downstream analysis.
+First, let’s look at the total number of counts per cell, across all
+genes. For this we will use colSums()
, as each column
+represents a different sampled cell.
# Make a vector of total_counts number of counts per sample using colSums()
+total_counts <- colSums(sc_counts)
+
+
+
+
+
+
+# Take a look at the summary statistics for the total counts
+summary(total_counts)
+
+
+ Min. 1st Qu. Median Mean 3rd Qu. Max.
+ 1.0 287.8 3971.5 8089.9 12008.8 62446.0
+
+
+
+Yikes, at least one of the cells has only 1 read!, compared to the +median of ~4000! It’s highly likely that this ‘cell’ is either an empty +well or did not get sequenced properly.
+Let’s visualize the distribution of total counts to see if this is +the only cell we might want to exclude.
+In following graphs, we will use vertical red lines to indicate +possible cutoffs.
+ + + +# Let's use the same kind of plot as above but add more layers
+ggplot(mapping = aes(x = total_counts)) +
+ geom_density(fill = "lightblue") +
+ geom_vline(xintercept = 1000, color = "red") +
+ labs(x = "Counts per cell")
+
+
+
+
+
+
+How many cells would be removed with this (or other cutoffs) for +counts per sample?
+ + + +# Calculate the number of cells that would be removed with a given cutoff
+count_cutoff <- 1000
+sum(total_counts <= count_cutoff)
+
+
+[1] 133
+
+
+
+What if a single gene accounted for all counts in a particular cell?
+This cell would not have helpful data for us, so we should look to
+remove any cells we suspect might not have a useful amount of their
+transcriptome measured.
+But before we can count how many genes a particular cell expresses, we
+need to choose a numeric cutoff for what we consider a detected gene.
+How many counts must there be for you to consider a gene expressed?
+Here let’s go for a simple detection cutoff of > 0.
+ + + +# make a detection_mat matrix that is TRUE when a gene is expressed in a sample
+detection_mat <- sc_counts > 0
+
+
+
+Now that we have turned our data into a matrix of
+TRUE/FALSE
for detection, we can sum this data by column to
+effectively get a vector of how many genes were measured in each
+cell.
# Make a vector that contains the number of genes expressed by a particular cell
+num_genes_exp <- colSums(detection_mat)
+
+
+
+Let’s plot this using the same style and type of graph as above.
+ + + +ggplot(mapping = aes(x = num_genes_exp)) +
+ geom_density(fill = "lightblue") +
+ labs(x = "Number of genes expressed") +
+ theme_classic()
+
+
+
+
+
+
+This plot helps us visualize the distribution of genes per cell and
+can help inform how we choose the cutoff. It’s important to remember
+that different cell types can have quite different patterns with regard
+to the number of genes expressed. If we were to use strict cutoffs to
+select which cells are “valid”, there is the possibility that we could
+bias our results, so this is something we want to be careful about.
+Let’s see what happens if we only keep cells with > 500 expressed +genes. Just like when we looked at total counts, we can add in a +vertical line to the previous plot where the possible cutoff would +be.
+ + + +ggplot(mapping = aes(x = num_genes_exp)) +
+ geom_density(fill = "lightblue") +
+ labs(x = "Number of genes expressed") +
+ theme_classic() +
+ geom_vline(xintercept = 500, color = "red")
+
+
+
+
+
+
+How many cells would be removed with this cutoff?
+ + + +# Calculate the number of cells that would be removed with a given cutoff
+gene_cutoff <- 500
+sum(num_genes_exp <= gene_cutoff)
+
+
+[1] 145
+
+
+
+If a cell is dead or dying, its mRNA will tend to leak out of the
+cell, leaving an overabundance of mitochondrial RNA, which tends to
+remain inside the mitochondria. To look for this, we
+would like to calculate the fraction of mitochondrial expression for
+each cell as well. First, we will need a list of the mitochondrial
+genes, which we have prepared in a tsv file
+mm_mitochondrial_genes.tsv
that we will now read in, and
+filter to just the genes that are found in the data set.
# read `mm_mitochondrial_genes.tsv` from ref_dir and
+# create from it a single vector containing only the gene ids
+mito_genes <- readr::read_tsv(mito_file) |>
+ # filter to only gene in the sce object
+ dplyr::filter(gene_id %in% rownames(bladder_sce)) |>
+ # pull takes this column out of the data frame as a stand-alone vector
+ dplyr::pull(gene_id)
+
+
+Rows: 37 Columns: 13
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (9): gene_id, gene_name, seqnames, strand, gene_biotype, seq_coord_syste...
+dbl (4): start, end, width, entrezid
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Now we can use the genes from that list to select only the rows of +the count matrix that correspond to the mitochondrial genes and sum +their expression for each sample.
+ + + +# create a mito_rows vector that is TRUE for mitochondrial genes in our dataset
+mito_rows <- rownames(sc_counts) %in% mito_genes
+
+# sum the counts from just those genes for all samples
+mito_counts <- colSums(sc_counts[mito_rows, ])
+
+# calculate mito_fraction for all samples
+mito_fraction <- mito_counts/total_counts
+
+
+
+Lets make a plot of this distribution as well!
+ + + +ggplot(mapping = aes(x = mito_fraction)) +
+ geom_density(fill = "lightblue") +
+ labs(x = "Mitchondrial fraction") +
+ geom_vline(xintercept = 0.2, color = "red") +
+ theme_classic()
+
+
+
+
+
+
+Here, we want to keep cells with a low fraction of reads
+corresponding to mitochondrial genes and remove any cells with a high
+mitochondrial fraction. Again, it’s important to take this step even if
+you started with filtered data, since mapping software like
+salmon alevin
and Cell Ranger do not usually consider
+mitochondrial read percentages when filtering.
Lets put all of the QC measures we have calculated into a single data +frame, so we can look at how they might relate to one another.
+ + + +# make a data frame with number of genes expressed, total counts, and mito fraction
+qc_df <- data.frame(barcode = names(num_genes_exp),
+ genes_exp = num_genes_exp,
+ total_counts = total_counts,
+ mito_fraction = mito_fraction)
+
+
+
+Now we can plot these measures all together, along with some possible +cutoffs.
+ + + +ggplot(qc_df, aes (x = total_counts,
+ y = genes_exp,
+ color = mito_fraction)) +
+ geom_point(alpha = 0.5) +
+ scale_color_viridis_c() +
+ geom_vline(xintercept = 1000, color = "red") +
+ geom_hline(yintercept = 500, color = "red") +
+ labs(x = "Total Count",
+ y = "Number of Genes Expressed",
+ color = "Mitochondrial\nFraction") +
+ theme_bw()
+
+
+
+
+
+
+If we want to filter our data based on these measures and the cutoffs
+we chose, we can do this with dplyr::filter()
and then select
+the resulting columns from the matrix.
# create a filtered_samples data frame from qc_df
+filtered_samples <- qc_df |>
+ dplyr::filter(total_counts > 1000,
+ genes_exp > 500,
+ mito_fraction < 0.2)
+# select only passing samples for bladder_sce_filtered
+sc_counts_filtered <- sc_counts[, filtered_samples$barcode]
+
+
+
+SingleCellExperiment
directlyscater
The methods above were nice for demonstrating the kinds of filtering
+we might do, but all the steps would certainly be repetitive if we had
+to do them for each sample. Thankfully, there are some nice methods that
+have been developed in packages like scater
to perform them
+all at once and add the results to the SingleCellExperiment
+object. The advantages of using functions like this are that we can keep
+all of the metadata together, filter directly on the object of interest,
+avoid a lot of repetition, and in doing so avoid many potential
+errors.
We will start with the function addPerCellQC()
, which
+takes a SingleCellExperiment
and a list of gene sets that
+that we might want to calculate subset information for. In our case, we
+will just look at mitochondrial genes again.
bladder_sce <- scater::addPerCellQC(
+ bladder_sce,
+ # a list of named gene subsets that we want stats for
+ # here we are using mitochondrial genes
+ subsets = list(mito = mito_genes)
+ )
+
+
+
+The results of these calculations are now stored as a data frame in
+the colData
slot of the SingleCellExperiment
+object, which we can pull out with the colData()
function.
+(Unfortunately, it is not quite a regular data frame, but we can easily
+convert it to one.) Even nicer, we can access the QC data in those
+columns directly with just the familiar $
syntax!
The calculated statistics include sum
, the total UMI
+count for the cell, detected
, the number of genes detected,
+and a few different statistics for each subset that we gave, including
+the percent (not fraction!) of all UMIs from the subset. Since the
+subset we used was named mito
, this column is called
+subsets_mito_percent
.
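+For example, you can peek at one of these new columns directly with the
+$ shortcut (a quick sanity check, not a required step):
+# Summarize the per-cell mitochondrial UMI percentages
+summary(bladder_sce$subsets_mito_percent)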
Using these, we can recreate the plot from before:
+ + + +# extract the column data and convert to a data frame
+bladder_qc <- data.frame(colData(bladder_sce))
+
+# plot with the qc data frame
+ggplot(bladder_qc, aes (x = sum,
+ y = detected,
+ color = subsets_mito_percent)) +
+ geom_point(alpha = 0.5) +
+ scale_color_viridis_c() +
+ labs(x = "Total Count",
+ y = "Number of Genes Expressed",
+ color = "Mitochondrial\nFraction") +
+ theme_bw()
+
+
+
+
+
+
+SingleCellExperiment
Filtering the SingleCellExperiment
object is done as if
+it were just the counts matrix, with brackets and indexes. While this
+will look much like what we did before, it is better, because it will
+also keep the filtered QC stats alongside, in case we wanted to revisit
+them later. Otherwise, we would have to filter our QC results
+separately, which is an easy place for errors to creep in.
# create a boolean vector of QC filters
+cells_to_keep <- bladder_sce$sum > 1000 &
+ bladder_sce$detected > 500 &
+ bladder_sce$subsets_mito_percent < 20
+
+# filter the sce object (cells are columns)
+bladder_sce_filtered <- bladder_sce[, cells_to_keep]
+
+
+
+Just to check, we should have the same number of cells in
+bladder_sce_filtered
as our previous
+sc_counts_filtered
.
ncol(sc_counts_filtered) == ncol(bladder_sce_filtered)
+
+
+[1] TRUE
+
+
+
+Now we have an idea of what cells we probably want to get rid of. But +what if our data contains genes that we can’t reliably measure in these +cells?
+We could use our earlier detection_mat
to add up how
+many cells express each gene, but we will skip straight to the
+scater
function this time, which is called
+addPerFeatureQC()
. This will add QC statistics to the
+rowData
for each gene (alongside the annotation data we
+already had there) The columns it adds are the average expression level
+of each gene (mean
) and the percentage of cells in which it
+was detected (detected
).
bladder_sce_filtered <- scater::addPerFeatureQC(bladder_sce_filtered)
+
+
+
+Let’s make another density plot with the percentage of samples that +express each gene:
+ + + +# extract the gene information with
+gene_info <- data.frame(rowData(bladder_sce_filtered))
+
+# Plot the detected percentage
+ggplot(gene_info, aes(x = detected) )+
+ geom_density(fill = "lightblue") +
+ labs(x = "Percent of Cells Expressing Each Gene") +
+ theme_classic()
+
+
+
+
+
+
+How many genes will be excluded if we draw our cutoff at 5% of +cells?
+ + + +sum(gene_info$detected < 5)
+
+
+[1] 23960
+
+
+
+That’s a lot! How do we feel about that?
+ + + +cutoff <- 2
+# filter bladder_sce_filtered to only genes above a cutoff value
+bladder_sce_filtered <- bladder_sce_filtered[gene_info$detected >= cutoff, ]
+
+
+
+How big is the SingleCellExperiment
object now?
dim(bladder_sce_filtered)
+
+
+[1] 13648 186
+
+
+
+We will save the filtered SingleCellExperiment
object as
+a .rds
file for later use.
# Save object to the file filtered_sce_file, which
+# we defined at the top of this notebook
+readr::write_rds(bladder_sce_filtered, file = filtered_sce_file)
+
+
+
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] ensembldb_2.28.0 AnnotationFilter_1.28.0
+ [3] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
+ [5] ggplot2_3.5.1 SingleCellExperiment_1.26.0
+ [7] SummarizedExperiment_1.34.0 Biobase_2.64.0
+ [9] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
+[11] IRanges_2.38.0 S4Vectors_0.42.0
+[13] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
+[15] matrixStats_1.3.0 tximeta_1.22.0
+[17] optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] jsonlite_1.8.8 tximport_1.32.0
+ [3] magrittr_2.0.3 ggbeeswarm_0.7.2
+ [5] farver_2.1.1 rmarkdown_2.26
+ [7] fs_1.6.4 BiocIO_1.14.0
+ [9] zlibbioc_1.50.0 vctrs_0.6.5
+ [11] memoise_2.0.1 Rsamtools_2.20.0
+ [13] DelayedMatrixStats_1.26.0 RCurl_1.98-1.14
+ [15] htmltools_0.5.8.1 S4Arrays_1.4.0
+ [17] progress_1.2.3 AnnotationHub_3.12.0
+ [19] curl_5.2.1 BiocNeighbors_1.22.0
+ [21] SparseArray_1.4.0 sass_0.4.9
+ [23] bslib_0.7.0 httr2_1.0.1
+ [25] cachem_1.0.8 GenomicAlignments_1.40.0
+ [27] mime_0.12 lifecycle_1.0.4
+ [29] pkgconfig_2.0.3 rsvd_1.0.5
+ [31] Matrix_1.7-0 R6_2.5.1
+ [33] fastmap_1.1.1 GenomeInfoDbData_1.2.12
+ [35] digest_0.6.35 colorspace_2.1-0
+ [37] scater_1.32.0 irlba_2.3.5.1
+ [39] RSQLite_2.3.6 beachmat_2.20.0
+ [41] filelock_1.0.3 labeling_0.4.3
+ [43] fansi_1.0.6 httr_1.4.7
+ [45] abind_1.4-5 compiler_4.4.0
+ [47] bit64_4.0.5 withr_3.0.0
+ [49] BiocParallel_1.38.0 viridis_0.6.5
+ [51] DBI_1.2.2 highr_0.10
+ [53] biomaRt_2.60.0 rappdirs_0.3.3
+ [55] DelayedArray_0.30.0 rjson_0.2.21
+ [57] tools_4.4.0 vipor_0.4.7
+ [59] beeswarm_0.4.0 glue_1.7.0
+ [61] restfulr_0.0.15 grid_4.4.0
+ [63] generics_0.1.3 gtable_0.3.5
+ [65] tzdb_0.4.0 hms_1.1.3
+ [67] ScaledMatrix_1.12.0 BiocSingular_1.20.0
+ [69] xml2_1.3.6 utf8_1.2.4
+ [71] XVector_0.44.0 ggrepel_0.9.5
+ [73] BiocVersion_3.19.1 pillar_1.9.0
+ [75] stringr_1.5.1 vroom_1.6.5
+ [77] dplyr_1.1.4 getopt_1.20.4
+ [79] BiocFileCache_2.12.0 lattice_0.22-6
+ [81] rtracklayer_1.64.0 bit_4.0.5
+ [83] tidyselect_1.2.1 Biostrings_2.72.0
+ [85] scuttle_1.14.0 knitr_1.46
+ [87] gridExtra_2.3 ProtGenerics_1.36.0
+ [89] xfun_0.43 stringi_1.8.3
+ [91] UCSC.utils_1.0.0 lazyeval_0.2.2
+ [93] yaml_2.3.8 evaluate_0.23
+ [95] codetools_0.2-20 tibble_3.2.1
+ [97] BiocManager_1.30.22 cli_3.6.2
+ [99] munsell_0.5.1 jquerylib_0.1.4
+ [ reached getOption("max.print") -- omitted 17 entries ]
+
+
+This notebook will demonstrate how to:
+In this notebook, we’ll continue with processing the same dataset
+that we have been working with, moving on to normalization of the
+scRNA-seq count data that we have already performed quality-control
+analyses on.
+For this tutorial, we will be using a pair of R packages built
+specifically for single-cell analysis, scater and scran, to work
+with our data. This tutorial is in part based on the scran
+tutorial.
Load the libraries we will be using, and set the random number +generation seed value for reproducibility.
+ + + +# Set seed for reproducibility
+set.seed(1234)
+
+# GGPlot2 for the plots
+library(ggplot2)
+
+# Packages for single cell processing
+library(scater)
+
+
+Loading required package: SingleCellExperiment
+
+
+Loading required package: SummarizedExperiment
+
+
+Loading required package: MatrixGenerics
+
+
+Loading required package: matrixStats
+
+
+
+Attaching package: 'MatrixGenerics'
+
+
+The following objects are masked from 'package:matrixStats':
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: GenomicRanges
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+ colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+ get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
+ match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
+ Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
+ tapply, union, unique, unsplit, which.max, which.min
+
+
+Loading required package: S4Vectors
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:utils':
+
+ findMatches
+
+
+The following objects are masked from 'package:base':
+
+ expand.grid, I, unname
+
+
+Loading required package: IRanges
+
+
+Loading required package: GenomeInfoDb
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: 'Biobase'
+
+
+The following object is masked from 'package:MatrixGenerics':
+
+ rowMedians
+
+
+The following objects are masked from 'package:matrixStats':
+
+ anyMissing, rowMedians
+
+
+Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
+'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
+
+
+Loading required package: scuttle
+
+
+library(scran)
+
+
+
+Now let’s set up the files we will be using:
+ + + +# main data directory
+data_dir <- file.path("data", "tabula-muris")
+
+# Filtered count matrix file from previous notebook
+filtered_sce_file <- file.path(data_dir, "filtered", "filtered_sce.rds")
+
+# Metadata file location
+metadata_file <- file.path(data_dir, "TM_droplet_metadata.csv")
+
+# Output directory for normalized data
+norm_dir <- file.path(data_dir, "normalized")
+fs::dir_create(norm_dir)
+
+
+
+bladder_sce <- readr::read_rds(filtered_sce_file)
+sc_metadata <- readr::read_csv(metadata_file)
+
+
+Rows: 70118 Columns: 9
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: ","
+chr (9): cell, channel, mouse.id, tissue, subtissue, mouse.sex, cell_ontolog...
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Because the Tabula Muris project is well studied, we actually have some cell type information for this data set that we can refer to.
+Note that we would normally NOT have this
+information until later in the analysis pipeline! Nonetheless, adding it
+here will be useful for visualizing the results of our normalization
+(and demonstrating how one might add metadata to the
+SingleCellExperiment
object).
# get the column (cell) metadata (this includes earlier QC stats!)
+# and convert to a data frame
+cell_info <- data.frame(colData(bladder_sce)) |>
+ # convert the row names of this data frame to a separate column
+ tibble::rownames_to_column("barcode")
+
+cell_metadata <- sc_metadata |>
+ # filter to just the sample we are working with
+ dplyr::filter(channel == "10X_P4_3") |>
+ # extract the 16 nt cell barcodes from the `cell` column
+ dplyr::mutate(barcode = stringr::str_sub(cell, start= -16)) |>
+ # choose only the columns we want to add
+ dplyr::select(barcode, cell_ontology_class, free_annotation)
+
+# Join the tables together, using `left_join()` to preserve all rows in cell_info
+cell_info <- cell_info |>
+ dplyr::left_join(cell_metadata)
+
+
+Joining with `by = join_by(barcode)`
+
+
+
+Check that the cell barcodes are still the same as the column names of our data.
+ + + +all.equal(cell_info$barcode, colnames(bladder_sce))
+
+
+[1] TRUE
+
+
+
+Now we can add that data back to the
+SingleCellExperiment
object. To keep with the format of
+that object, we have to convert our table to a DataFrame
+object in order for this to work. Just to keep things confusing, a
+DataFrame
is not the same as a data.frame
that
+we have been using throughout. We also need to be sure to include the
+row.names
argument to keep those properly attached.
Note that this will replace all of the previous column (cell) +metadata, which is part of the reason that we pulled out all previous +column data content first.
+ + + +# add new metadata data back to `bladder_sce`
+colData(bladder_sce) <- DataFrame(cell_info, row.names = cell_info$barcode)
+
+
+
+In whatever data we are working with, we are always looking to maximize biological variance and minimize technical variance. A primary source of technical variation we are concerned with is the variation in library sizes among our samples. While different cells may truly differ in their total transcript content, most of the variation in total counts that we see is likely technical, arising from library construction, amplification, and sequencing.
+This is where normalization methods usually come into the workflow. The distribution of the counts that we saw in the previous notebook, and in particular the fact that the count data is noisy with many zero counts, makes normalization particularly tricky. To handle this noise, we normalize cells in groups with other cells like them, a method introduced in Lun et al. (2016).
+Briefly, we first cluster the cells to find groups of similar cells,
+then compute normalization factors based on the sums of expression in
+those groups. The group normalization is then applied back to the
+individual cells within the group to create a normalized count matrix.
+In this case, we will also log-transform the normalized counts to get a
+less skewed distribution of expression measures. Note that because of
+the zero counts, the logNormCounts()
function will add a
+pseudocount of 1 to each value before computing the log.
# Step 1) Group cells with other like cells by clustering.
+qclust <- scran::quickCluster(bladder_sce)
+
+
+Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
+collapsing to unique 'x' values
+
+
+# Step 2) Compute sum factors for each cell cluster grouping.
+bladder_sce <- scran::computeSumFactors(bladder_sce, clusters = qclust)
+
+# Step 3) Normalize using these pooled sum factors and log transform.
+bladder_sce <- scater::logNormCounts(bladder_sce)
+
+
+
+One way to determine whether our normalization yields biologically relevant results is to plot it and see if similarly labeled samples and cells end up together. Because plotting expression for thousands of genes together isn't practical, we will reduce the dimensions of our data using Principal Components Analysis (PCA).
+We will also make the same plot with our unnormalized data, to visualize the effect of normalization on our sample. We'll do this comparison twice: once coloring cells by their total UMI counts, and once coloring by their cell type labels.
+Before plotting the unnormalized data, we will log transform the raw counts to make their scaling more comparable to the normalized data. To do this we will use the log1p() function, which is specifically designed for the case where we want to add 1 to all of our values before taking the log, as we do here. (We could do something like log(counts + 1), but log1p() is both more efficient and more accurate.)
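+To see the precision point concretely, here is a quick base R illustration (an aside, not part of the workflow):
+x <- 1e-10
+log(1 + x)  # forming 1 + x first loses most of x's precision
+log1p(x)    # computes log(1 + x) accurately for tiny x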
# Use PCA for dimension reduction of cells' scran normalized data
+norm_pca <- scater::calculatePCA(bladder_sce)
+
+# PCA on the raw counts, log transformed
+log_pca <- counts(bladder_sce) |> # get the raw counts
+ log1p() |> # log transform to make these more comparable to the normalized values
+ scater::calculatePCA() # calculate PCA scores
+
+
+
+Note that we are using scater::calculatePCA()
two
+different ways here: once on the full bladder_sce
object,
+and once on just the counts
matrix. When we use
+calculatePCA()
on the object, it automatically uses the log
+normalized matrix from inside the object.
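+In other words, the following two calls should produce essentially the same scores (a sketch, not run here; sign flips and small numerical differences are possible because the solver is approximate):
+# scater::calculatePCA(bladder_sce)            # uses logcounts() from inside the object
+# scater::calculatePCA(logcounts(bladder_sce)) # the same matrix, passed explicitly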
Next we will arrange the PCA scores for plotting, adding columns for the total UMI counts and the cell type labels so we can color each point of the plot.
+# Set up the PCA scores for plotting
+norm_pca_scores <- data.frame(norm_pca,
+                              barcode = rownames(norm_pca),
+                              total_umi = bladder_sce$sum,
+                              cell_type = bladder_sce$cell_ontology_class)
+log_pca_scores <- data.frame(log_pca,
+                             barcode = rownames(log_pca),
+                             total_umi = bladder_sce$sum,
+                             cell_type = bladder_sce$cell_ontology_class)
+
+
+
+First, we will plot the unnormalized PCA scores with their total UMI +counts:
+ + + +# Now plot counts pca
+ggplot(log_pca_scores, aes(x = PC1, y = PC2, color = total_umi)) +
+ geom_point() +
+ labs(title = "Log counts (unnormalized) PCA scores",
+ color = "Total UMI count") +
+ scale_color_viridis_c() +
+ theme_bw()
+
+
+
+
+
+
+We’ve plotted the unnormalized data for you. Knowing that we want the +same graph, but different data, use the above template to plot the +normalized data. Feel free to customize the plot with a different theme +or color scheme!
+Let’s plot the norm_pca_scores
data:
ggplot(norm_pca_scores, aes(x = PC1, y = PC2, color = total_umi)) +
+ geom_point() +
+ labs(title = "Normalized log counts PCA scores",
+ color = "Total UMI count") +
+ scale_color_viridis_c() +
+ theme_bw()
+
+
+
+
+
+
+Do you see an effect from the normalization when comparing these +plots?
+Now, let’s plot these two sets of PCA scores again, but colored by +cell type. Do you see an effect from the normalization when comparing +these plots?
+ + + +# First, plot the normalized pca
+ggplot(norm_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
+ geom_point() +
+ labs(title = "Normalized log counts PCA scores",
+ color = "Cell Type") +
+ scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
+ theme_bw()
+
+
+
+
+
+# Next, plot log count pca
+ggplot(log_pca_scores, aes(x = PC1, y = PC2, color = cell_type)) +
+ geom_point() +
+ labs(title = "Log counts (unnormalized) PCA scores",
+ color = "Cell Type") +
+ scale_color_brewer(palette = "Dark2", na.value = "grey70") + # add a visually distinct color palette
+ theme_bw()
+
+
+
+
+
+
+In case we wanted to return to this data later, let’s save the
+normalized data to a tsv file. In order to do this we need to extract
+our normalized counts from bladder_sce
. Refer back to the
+SingleCellExperiment
figure above to determine why we are
+using this logcounts()
function.
Recall that readr::write_tsv() requires a data frame, so we need to convert the logcounts matrix to a data frame. We will actually have to do this in two steps: first converting the sparse matrix to a standard R matrix, then converting that to a data frame.
# Save this gene matrix to a tsv file
+logcounts(bladder_sce) |>
+ as.matrix() |>
+ as.data.frame() |>
+ readr::write_tsv(file.path(norm_dir, "scran_norm_gene_matrix.tsv"))
+
+
+
+We may want to return to our normalized bladder_sce
+object in the future, so we will also save our data in an RDS file so
+that we can re-load it into our R environment as a
+SingleCellExperiment
object.
# Save the data as an RDS
+readr::write_rds(bladder_sce, file.path(norm_dir, "normalized_bladder_sce.rds"))
+
+
+
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] scran_1.32.0 scater_1.32.0
+ [3] scuttle_1.14.0 SingleCellExperiment_1.26.0
+ [5] SummarizedExperiment_1.34.0 Biobase_2.64.0
+ [7] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
+ [9] IRanges_2.38.0 S4Vectors_0.42.0
+[11] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
+[13] matrixStats_1.3.0 ggplot2_3.5.1
+[15] optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] gridExtra_2.3 rlang_1.1.3
+ [3] magrittr_2.0.3 compiler_4.4.0
+ [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
+ [7] stringr_1.5.1 pkgconfig_2.0.3
+ [9] crayon_1.5.2 fastmap_1.1.1
+[11] XVector_0.44.0 labeling_0.4.3
+[13] utf8_1.2.4 rmarkdown_2.26
+[15] tzdb_0.4.0 UCSC.utils_1.0.0
+[17] ggbeeswarm_0.7.2 bit_4.0.5
+[19] xfun_0.43 bluster_1.14.0
+[21] zlibbioc_1.50.0 cachem_1.0.8
+[23] beachmat_2.20.0 jsonlite_1.8.8
+[25] highr_0.10 DelayedArray_0.30.0
+[27] BiocParallel_1.38.0 irlba_2.3.5.1
+[29] parallel_4.4.0 cluster_2.1.6
+[31] R6_2.5.1 RColorBrewer_1.1-3
+[33] bslib_0.7.0 stringi_1.8.3
+[35] limma_3.60.0 jquerylib_0.1.4
+[37] Rcpp_1.0.12 knitr_1.46
+[39] readr_2.1.5 Matrix_1.7-0
+[41] igraph_2.0.3 tidyselect_1.2.1
+[43] abind_1.4-5 yaml_2.3.8
+[45] viridis_0.6.5 codetools_0.2-20
+[47] lattice_0.22-6 tibble_3.2.1
+[49] withr_3.0.0 evaluate_0.23
+[51] getopt_1.20.4 pillar_1.9.0
+[53] generics_0.1.3 vroom_1.6.5
+[55] hms_1.1.3 sparseMatrixStats_1.16.0
+[57] munsell_0.5.1 scales_1.3.0
+[59] glue_1.7.0 metapod_1.12.0
+[61] tools_4.4.0 BiocNeighbors_1.22.0
+[63] ScaledMatrix_1.12.0 locfit_1.5-9.9
+[65] fs_1.6.4 grid_4.4.0
+[67] edgeR_4.2.0 colorspace_2.1-0
+[69] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
+[71] BiocSingular_1.20.0 vipor_0.4.7
+[73] cli_3.6.2 rsvd_1.0.5
+[75] fansi_1.0.6 S4Arrays_1.4.0
+[77] viridisLite_0.4.2 dplyr_1.1.4
+[79] gtable_0.3.5 sass_0.4.9
+[81] digest_0.6.35 SparseArray_1.4.0
+[83] ggrepel_0.9.5 dqrng_0.3.2
+[85] farver_2.1.1 htmltools_0.5.8.1
+[87] lifecycle_1.0.4 httr_1.4.7
+[89] statmod_1.5.0 bit64_4.0.5
+
+
+This notebook will demonstrate how to:
+Filter droplet-based expression data using emptyDropsCellRanger()
+Normalize and transform raw count data
+Reduce dimensions of the data with PCA, UMAP, and t-SNE
In this notebook, we’ll try out some dimension reduction techniques +on single-cell RNA-seq data.
+Visualizing highly dimensional data is a common challenge in +genomics, and especially with RNA-seq data. The expression of every gene +we look at is another dimension describing a sample. When we also have +hundreds or thousands of individual samples, as in the case of +single-cell analysis, figuring out how to clearly display all of the +data in a meaningful way is difficult.
+It is common practice to use dimension reduction techniques to get the data into a more manageable form for plotting, clustering, and other downstream analyses.
+# Load libraries
+library(ggplot2)
+library(scater)
+
+
+(Package startup messages omitted; they are identical to those shown when these packages were loaded above.)
+
+
+library(scran)
+
+# Setting the seed for reproducibility
+set.seed(12345)
+
+
+
+The data we will be using for this module comes from a 10x Genomics data set of expression data from a Hodgkin's lymphoma tumor. The data was generated with the 10x v3.1 chemistry and processed with Cell Ranger and the standard 10x Genomics pipeline.
+There are a variety of files that you will often see as part of the standard output from Cell Ranger, which are described in detail in the 10x Genomics documentation. We have included some of these in the data/hodgkins/cellranger directory, including the web_summary.html file that includes some QC statistics similar to those we generated with alevinQC. The main files we will be working with are the feature-by-barcode matrices. Cell Ranger does some filtering on its own, but we will start with the raw data.
# main data directory
+data_dir <- file.path("data", "hodgkins")
+# reference files
+ref_dir <- file.path("data", "reference")
+
+# Path to the Cell Ranger matrix
+raw_matrix_dir <- file.path(data_dir, "cellranger",
+ "raw_feature_bc_matrix")
+
+# Path to mitochondrial genes table
+mito_file <- file.path(ref_dir, "hs_mitochondrial_genes.tsv")
+
+# Directory and file to save output
+normalized_dir <- file.path(data_dir, "normalized")
+fs::dir_create(normalized_dir)
+
+output_sce_file <- file.path(normalized_dir, "normalized_hodgkins_sce.rds")
+
+
+
Cell Ranger output includes count data in two main formats. The first is a folder with a feature list, a barcode list, and a sparse matrix in "Market Exchange" (MEX) format. The DropletUtils::read10xCounts() function takes this directory and reads in the data from these three files, assembling the SingleCellExperiment object we have worked with before.
Alternatively, we could use the HDF5
format file that
+Cell Ranger outputs as a file with the .h5
extension, which
+contains the same data. For whatever reason, the way you read the data
+affects how it is stored in R. Reading from the directory results in
+smaller objects in R, so that is what we will do here.
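+For reference, reading the HDF5 version would look something like the sketch below (commented out; the raw_feature_bc_matrix.h5 file name follows the usual Cell Ranger convention, and we have not confirmed that it is included in our data directory):
+# h5_file <- file.path(data_dir, "cellranger", "raw_feature_bc_matrix.h5")
+# hodgkins_sce <- DropletUtils::read10xCounts(h5_file)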
Cell Ranger also outputs both filtered and raw matrices; today we +will start with the raw matrix and perform our own filtering.
+hodgkins_sce <- DropletUtils::read10xCounts(raw_matrix_dir)
+
+
+Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
+'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
+
+
+
+How many potential cells are there here?
+ + + +dim(hodgkins_sce)
+
+
+[1] 36601 6794880
+
+
+
+That is a lot of cells! In fact, it is really every possible barcode, +whether there were reads associated with it or not. We should probably +do something about that.
+We will start by calculating the basic QC stats as we have done
+previously, adding those to our SingleCellExperiment
+object.
The first step again is reading in our table of mitochondrial genes and finding the ones that were quantified in our data set.
+ + + +mito_genes <- readr::read_tsv(mito_file) |>
+ dplyr::filter(gene_id %in% rownames(hodgkins_sce)) |>
+ dplyr::pull(gene_id)
+
+
+Rows: 37 Columns: 13
+── Column specification ────────────────────────────────────────────────────────
+Delimiter: "\t"
+chr (9): gene_id, gene_name, seqnames, strand, gene_biotype, seq_coord_syste...
+dbl (4): start, end, width, entrezid
+
+ℹ Use `spec()` to retrieve the full column specification for this data.
+ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+
+Next we will calculate the QC stats that we used before. Note that this is much slower than before, as the unfiltered set contains millions of barcodes!
+ + + +hodgkins_sce <- scater::addPerCellQC(
+ hodgkins_sce,
+ subsets = list(mito = mito_genes))
+
+
+
+We can now do the most basic level of filtering: getting rid of +“cells” with no reads.
+ + + +hodgkins_sce <- hodgkins_sce[, hodgkins_sce$total > 0]
+dim(hodgkins_sce)
+
+
+[1] 36601 549931
+
+
+
+emptyDropsCellRanger()
The DropletUtils
package that we used to read in the 10x
+data has a number of other useful features. One is the
+emptyDropsCellRanger()
function, which uses the overall
+gene expression patterns in the sample to identify droplets that are
+likely to not contain an intact cell, but may simply have contained
+loose ambient RNA released during cell separation. This method was
+originally developed by Lun et al.
+(2019) and implemented as the function emptyDrops()
,
+but has since been adapted as the main filtering method used by Cell
+Ranger. The emptyDropsCellRanger()
function emulates the
+variant of this function that is used by Cell Ranger, making the results more comparable between the two methods.
The Empty Drops method uses the droplets with very low UMI counts to estimate the "ambient" expression pattern of RNA content outside of cells. It then scores the remaining droplets based on how much their expression deviates from that ambient pattern, assigning a small P value to those that deviate strongly. Because it uses the low UMI count droplets, this method should not be used when other filtering has already been performed (which is unfortunately the case with the version of salmon alevin we used).
This method seems to perform well to exclude false “cells” while +retaining cells with distinct expression profiles but low counts that +might have failed a simple cutoff. Note that this method also requires +that the data has already been quantified, with reads assigned to genes, +as compared to a simple total UMI count filter which can be performed +much earlier in the pipeline.
+The emptyDropsCellRanger()
function takes the
+counts
matrix from our SingleCellExperiment, and returns a
+data frame with the statistics it calculates. This will take a few
+minutes to run, but we can speed it up by allowing parallel
+processing.
droplet_stats <- DropletUtils::emptyDropsCellRanger(
+ counts(hodgkins_sce),
+ BPPARAM = BiocParallel::MulticoreParam(4)) # use multiprocessing
+
+
+
+We will use a false discovery rate (FDR) of 0.01 as our cutoff for
+“real” cells. Since emptyDropsCellRanger()
uses low count
+cells to estimate the “ambient” expression pattern, those cells are not
+assigned an FDR value, and have a value of NA. These NAs can be a
+problem for filtering with a Boolean vector, as we did above, so instead
+we will use the which()
function to get the
+positions of the cells that pass our filter and select the
+columns we want using that.
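+As a quick base R aside, this is why which() helps when NAs are present:
+fdr <- c(0.001, NA, 0.5)
+fdr <= 0.01         # TRUE NA FALSE; the NA would cause problems when subsetting
+which(fdr <= 0.01)  # 1; returns positions only, silently dropping NAs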
cells_to_retain <- which(droplet_stats$FDR <= 0.01)
+
+filtered_sce <- hodgkins_sce[, cells_to_retain]
+dim(filtered_sce)
+
+
+[1] 36601 3401
+
+
+
How does this compare to the number of cells in the Cell Ranger filtered data? Looking at the web_summary.html report from Cell Ranger, it seems that it would have kept 3,394 cells, so we seem to be getting broadly similar results.
While emptyDropsCellRanger()
should have filtered out
+droplets containing no cells, it will not necessarily filter out damaged
+cells. For that we will still want to look at mitochondrial content, as
+we did previously. The statistics we calculated earlier with
+addPerCellQC()
are retained in our new object, so we can
+plot those directly.
# Plot the mitochondrial percents stored in `filtered_sce`
+ggplot(mapping = aes(x = filtered_sce$subsets_mito_percent)) +
+ geom_histogram(bins = 100)
+
+
+
+
+
+
+There are certainly some cells with high mitochondrial percentages! +For now, we will use a cutoff of 20% to filter out the worst of the +cells.
+ + + +filtered_sce <- filtered_sce[, filtered_sce$subsets_mito_percent < 20]
+
+
+
We can also filter by features (genes in our case) using scater::addPerFeatureQC(), which computes, for each gene, the percentage of cells in which it is detected and its mean count across all cells. We can then use those data (stored in rowData) to filter by row to only the genes that are detected in at least 5% of cells, and with a mean count > 0.1.
filtered_sce <- scater::addPerFeatureQC(filtered_sce)
+detected <- rowData(filtered_sce)$detected > 5
+expressed <- rowData(filtered_sce)$mean > 0.1
+
+# filter the genes (rows) this time
+filtered_sce <- filtered_sce[detected & expressed, ]
+
+
+
+How many cells do we have now?
+ + + +dim(filtered_sce)
+
+
+[1] 7445 2747
+
+
+
+Now we will perform the same normalization steps we did in a previous
+dataset, using scran::computeSumFactors()
and
+scater::logNormCounts()
. You might recall that there is a
+bit of randomness in some of these calculations, so we should be sure to
+have used set.seed()
earlier in the notebook for
+reproducibility.
# Cluster similar cells
+qclust <- scran::quickCluster(filtered_sce)
+
+# Compute sum factors for each cell cluster grouping.
+filtered_sce <- scran::computeSumFactors(filtered_sce, clusters = qclust, positive = FALSE)
+
+
+Warning in (function (x, sizes, min.mean = NULL, positive = FALSE, scaling =
+NULL) : encountered non-positive size factor estimates
+
+
+
+It turns out in this case we end up with some negative size factors. +This is usually an indication that our filtering was not stringent +enough, and there remain a number of cells or genes with nearly zero +counts. This probably happened when we removed the +infrequently-expressed genes; cells which had high counts from those +particular genes (and few others) could have had their total counts +dramatically reduced.
+To account for this, we will recalculate the per-cell stats and
+filter out low counts. Unfortunately, to do this, we need to first
+remove the previously calculated statistics, which we will do by setting
+them to NULL
.
# remove previous calculations
+filtered_sce$sum <- NULL
+filtered_sce$detected <- NULL
+filtered_sce$total <- NULL
+filtered_sce$subsets_mito_sum <- NULL
+filtered_sce$subsets_mito_detected <- NULL
+filtered_sce$subsets_mito_percent <- NULL
+
+# recalculate cell stats
+filtered_sce <- scater::addPerCellQC(filtered_sce, subsets = list(mito = mito_genes))
+
+# print the number of cells with fewer than 500 UMIs
+sum(filtered_sce$sum < 500)
+
+
+[1] 13
+
+
+
+Now we can filter again. In this case, we will keep cells with at +least 500 UMIs after removing the lowly expressed genes. Then we will +redo the size factor calculation, hopefully with no more warnings.
+ + + +filtered_sce <- filtered_sce[, filtered_sce$sum >= 500]
+
+qclust <- scran::quickCluster(filtered_sce)
+
+filtered_sce <- scran::computeSumFactors(filtered_sce, clusters = qclust)
+
+
+
+Looks good! Now we’ll do the normalization.
+ + + +# Normalize and log transform.
+normalized_sce <- scater::logNormCounts(filtered_sce)
+
+
+
+At this point, we have a few different versions of our
+SingleCellExperiment
object. The original (mostly)
+unfiltered version is in hodgkins_sce
, the filtered version
+in filtered_sce
, and the normalized version in
+normalized_sce
. We can clean those up a bit to save memory,
+keeping only the latest normalized_sce
version, which now
+has two assay
s: counts
with the raw data and
+logcounts
with the normalized and transformed data.
assayNames(normalized_sce)
+
+
+[1] "counts" "logcounts"
+
+
+
+
+rm(hodgkins_sce, filtered_sce)
+
+
+
+Principal component analysis (PCA) is a dimensionality reduction +technique that allows us to identify the largest components of variation +in a complex dataset. Our expression data can be thought of as mapping +each sample in a multidimensional space defined by the expression level +of each gene. The expression of many of those genes are correlated, so +we can often get a better, simpler picture of the data by combining the +information from those correlated genes.
+PCA rotates and transforms this space so that each axis is now a +combination of multiple correlated genes, ordered so the first axes +capture the most variation from the data. These new axes are the +“principal components.” If we look at the first few components, we can +often get a nice overview of relationships among the samples in the +data.
+We will store the PCA results in our
+SingleCellExperiment
object, as we will want to use them
+later. To do this, we will use the runPCA()
function from
+scater
, which performs the PCA calculations and returns a
+new object with the results stored in the reducedDim
slot.
+If we wanted to, we could get the raw results as a matrix instead with
+calculatePCA()
function, as we did in a previous
+notebook.
We will also use the ntop
argument to calculate the PCA
+using 2000 genes with the highest variance. The default is
+ntop = 500
.
# calculate PCA using the top 2000 genes
+normalized_sce <- runPCA(normalized_sce, ntop = 2000)
+
+
+
+We can see what reduced dimensionality matrices are stored in the
+object with the reducedDimNames()
function.
# print the reduced dimensionality tables available
+reducedDimNames(normalized_sce)
+
+
+[1] "PCA"
+
+
+
+
+
+To extract them by name, we use the reducedDim()
+function, much like the assay()
function to extract
+original data.
# print the top corner of the PCA matrix
+reducedDim(normalized_sce, "PCA")[1:10, 1:5]
+
+
+ PC1 PC2 PC3 PC4 PC5
+ [1,] 15.379889 7.929537 11.790972 3.9845847 8.232603
+ [2,] -1.220424 6.053443 2.100161 -10.7532071 -6.769888
+ [3,] 0.647626 -7.566098 -12.004547 -0.9862535 8.936328
+ [4,] 7.922958 -21.975167 4.156634 -10.0798546 5.383158
+ [5,] 1.577685 -26.510992 11.089218 10.5259441 -11.660812
+ [6,] 5.676435 -27.571708 13.655945 -9.3006045 2.594320
+ [7,] -3.623704 -2.052362 -9.061048 -5.7941810 9.112434
+ [8,] 9.577588 12.977445 -4.125885 -0.1949252 -10.757970
+ [9,] 9.393483 -7.288896 -12.661552 6.4502827 -2.371315
+[10,] -7.917621 -8.052548 -9.061915 0.8675157 9.034390
+
+
+
+If we have the PCA results stored in the
+SingleCellExperiment
object, we can use the
+scater::plotReducedDim()
function to plot it with some nice
+defaults easily. One nice thing about this function is that it uses
+ggplot2
under the hood, so if we wanted to customize it
+later, we could.
# plot PCA results
+plotReducedDim(normalized_sce, "PCA")
+
+
+
+
+
+
+PCA gives us a matrix with more than just two dimensions, and we
+might want to look at some higher dimensions too. We can do that with
+the ncomponents
argument.
# plot PC3 and PC4
+plotReducedDim(normalized_sce, "PCA", ncomponents = c(3,4))
+
+
+
+
+
+
+The variation in gene expression we see among cells comes from a
+combination of variation due to technical effects and the biology we
+really care about. In order to roughly account for this we could just
+take the largest variance genes, on the assumption that low variance
+genes are mostly just noise. This is the default approach that
+runPCA()
and calculatePCA()
take, using the
+genes with the greatest variance across cells to calculate the PCA
+matrix.
If we want to be a bit more careful about it, we can model the +variance in expression of each gene as a function of the mean expression +for that gene. This is useful because we generally expect the variance +to increase as mean expression increases, even if there is no biological +signal in the expression variation.
+We will do this modeling of variance by expression with the
+scran::modelGeneVar()
function, saving the results to a new
+variable.
gene_variance <- scran::modelGeneVar(normalized_sce)
+
+
+
+Now let’s plot the relationship between gene expression and variance
+we were discussing. Here we will also add the fitting curve that
+scran::modelGeneVar()
created, which is stored as function
+in the $trend
slot of the gene_variance
+object. We can add a function like that curve to a ggplot
+with a stat_function
layer.
ggplot(as.data.frame(gene_variance), aes(x = mean, y = total)) +
+ geom_point(alpha = 0.1) +
+ stat_function(fun = metadata(gene_variance)$trend, color = "blue") +
+ labs(
+ x = "Mean log-expression",
+ y = "Variance") +
+ theme_bw()
+
+
+
+
+
+
+Now we can use scran::getTopHVGs()
to select the genes
+that have the most biological variation (according to the model) and
+recalculate PCA scores using only those genes. (In practice, we are
+selecting the genes with the largest residual variation after removing
+technical variation modeled by the mean/variance relationship.)
Here we are picking the 2000 top genes to match the number of genes +from our earlier calculations.
+ + + +# select the most variable genes
+highvar_genes <- scran::getTopHVGs(gene_variance, n = 2000)
+# calculate a PCA matrix using those genes
+normalized_sce <- runPCA(normalized_sce, subset_row = highvar_genes)
+
+
+
+Now we can plot our new PCA values for comparison. You might realize
+that our old PCA values were replaced when we ran runPCA()
+again, so we can’t recreate the earlier plots at this stage of the
+notebook. You will have to scroll up to your earlier plots to
+compare.
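+If we had wanted to keep both sets of results, runPCA() has a name argument controlling where the matrix is stored; a sketch with the hypothetical name "PCA_hvg" (not run here):
+# normalized_sce <- runPCA(normalized_sce, subset_row = highvar_genes, name = "PCA_hvg")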
# plot the new PCA results
+plotReducedDim(normalized_sce, "PCA")
+
+
+
+
+
+plotReducedDim(normalized_sce, "PCA", ncomponents = c(3,4))
+
+
+
+
+
+
+UMAP (Uniform Manifold Approximation and Projection) +is a machine learning technique designed to provide more detail in +highly dimensional data than a typical principal components analysis. +While PCA assumes that the variation we care about has a particular +distribution (normal, broadly speaking), UMAP allows more complicated +distributions that it learns from the data. The underlying mathematics +are beyond me, but if you are more ambitious than I, you can look at the +paper by McInnes, Healy, +& Melville (2018). The main advantage of this change in +underlying assumptions is that UMAP can do a better job separating +clusters, especially when some of those clusters may be more similar to +each other than others.
+Another dimensionality reduction technique that you may have heard of +is t-SNE (t-distributed Stochastic Neighbor Embedding), +which has similar properties to UMAP, and often produces similar +results. There is some ongoing debate about which of these two +techniques is superior, and whether the differences are due to the +underlying algorithm or to implementation and parameter initialization +defaults. Regardless of why, in our experience, UMAP seems to produce +slightly better results and run a bit faster, but the differences can be +subtle.
For ease of use with this data, we will be using the scater::calculateUMAP() and scater::runUMAP() functions to apply UMAP to our single cell data, but similar functions from the uwot package (notably uwot::umap()) can be used to apply UMAP to any numerical matrix.
UMAP can be slow for a large data set with lots of parameters. It is
+worth noting that the scater::calculateUMAP()
+implementation actually does PCA first, and then runs UMAP on the top 50
+PCs. If we have already calculated PCA (as we have) we can tell it to
+use those results with the dimred
argument.
As with PCA, there are two functions we could use:
+scater::calculateUMAP()
will return a matrix of results,
+with one row for each sample, and a column for each of the UMAP
+dimensions returned. scater::runUMAP()
performs the same
+function, but returns the results in a SingleCellExperiment object.
Let’s see how it looks with the (mostly) default parameters:
+ + + +# Run UMAP
+normalized_sce <- runUMAP(normalized_sce,
+ dimred = "PCA") # use already stored PCA results
+
+
+
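+As an aside, if we only wanted the UMAP coordinates as a matrix rather than an updated object, a call like this sketch would do it (not run here):
+# umap_matrix <- scater::calculateUMAP(normalized_sce, dimred = "PCA")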
+Now we can plot with the same plotReducedDim()
function,
+specifying we want to plot the UMAP results this time. We will also add
+some color this time with the color_by
argument, using the
+number of genes detected in each cell to assign a hue.
# make a UMAP plot with `plotReducedDim()`
+plotReducedDim(normalized_sce, "UMAP", color_by = "detected")
+
+
+
+
+
+
+There is clearly a lot of structure in there, but is it meaningful? +Do the clusters we see differentiate cell types? How should we divide +them up?
+We will come back to this question later!
Now that we have an idea of what a UMAP plot with the default parameters looks like, let's try experimenting with the n_neighbors parameter. First, we should see what this parameter is and what its default value is. In the console, run ?scater::calculateUMAP to see what this (and other) parameters are. For even more parameters, you can look at the underlying implementation code that calculateUMAP() uses, which is the function uwot::umap().
In order to make our experimentation easier, we will create a
+function that allows us to rerun the same code easily, but
+create an argument that allows us to change one variable: the
+n_neighbors
variable. Here we are saving only a line of
+code, but we could apply this to a much more complex series of
+operations if we wanted to!
UMAP_plot_wrapper <- function(sce = normalized_sce, nn_param = 15) {
+ # Purpose: Run UMAP and plot the output
+ # Args: nn_param: a single numeric argument that will change the
+ # n_neighbors variable in the calculateUMAP() function.
+  # Output: a scatterplot of the two UMAP coordinates, with points
+  #   colored by the number of genes detected in each cell.
+
+ # Run UMAP with a specified n_neighbors parameter
+ sce_umap <- scater::runUMAP(sce, dimred = "PCA", n_neighbors = nn_param)
+ scater::plotReducedDim(sce_umap, "UMAP", color_by = "detected") +
+ # make the legend label more informative (this is ggplot2 code!)
+ guides(color = guide_colorbar(title="genes\nexpressed"))
+}
+
+
+
+Let’s make sure that works and gives the same result as before when +we use the default parameters.
+ + + +UMAP_plot_wrapper(nn_param = 15)
+
+
+
+
+
+
+Kind of?
+This isn’t your fault! UMAP is a non-deterministic function, which
+means that there is a random component to the results. We can use
+set.seed()
to be sure that an individual run (or set of
+runs) is the same every time you run your analysis, but it is important
+to check your results a few times with different random starting points
+to be sure that the random component is not giving you anomalous
+results. Setting a different random number seed with
+set.seed()
is one way to do this, or you can run the
+analysis multiple times in the same session, as we have done here.
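+For example, one quick stability check using the wrapper we just defined might be (the seed value here is arbitrary):
+set.seed(867)
+UMAP_plot_wrapper(nn_param = 15)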
Fill in the next few code chunks with the function and the
+n_neighbors
argument you would like to use for each. (Feel
+free to add more tests!) Then run the chunks and compare your output
+graphs.
# Try something low?
+UMAP_plot_wrapper(nn_param = 3)
+
+
+
+
+
+
+
+
+
+# Try something high?
+UMAP_plot_wrapper(nn_param = 100)
+
+
+
+
+
+
+
+
+
+# Try whatever you like!
+UMAP_plot_wrapper(nn_param = 5)
+
+
+
+
+
+
+Analyses such as UMAP have various limitations for +interpretability. The coordinates of UMAP output for any given cell can +change dramatically depending on parameters, and even run to run with +the same parameters. This probably means that you shouldn’t rely on the +exact values of UMAP’s output.
+Playing with parameters so you can fine-tune them is a good way +to give you more information about a particular analysis as well as the +data itself.
Where results are consistent, they are more likely to have +meaning. While we do not have labeled cell types in this case, there +does seem to be some consistency of the overall patterns that we see (if +not precise values), and this likely reflects biological information (or +technical artifacts).
In summary, if the results of an analysis can be completely changed +by changing its parameters, you should be more cautious when it comes to +the conclusions you draw from it as well as having good rationale for +the parameters you choose.
+In the block below is a similar analysis and plot with t-SNE +(t-distributed Stochastic Neighbor Embedding). Note that this analysis +also uses PCA before moving on to the fancy machine learning.
+ + + +# Run TSNE
+normalized_sce <- runTSNE(normalized_sce, dimred = "PCA")
+
+# plot with scater function
+plotReducedDim(normalized_sce, "TSNE", color_by = "detected")
+
+
+
+
+
+
+Different! (Slower!) Is it better or worse? Hard to say! Different +people like different things, and one plot might illustrate a particular +point better than another.
+We are going to use this data more in the next notebook, so let’s
+save it as an RDS
file.
readr::write_rds(normalized_sce, file = output_sce_file)
+
+
+
+(Note for the exercises: where UMAP has an n_neighbors parameter, t-SNE instead uses perplexity.)
+perplexity
.)sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] scran_1.32.0 scater_1.32.0
+ [3] scuttle_1.14.0 SingleCellExperiment_1.26.0
+ [5] SummarizedExperiment_1.34.0 Biobase_2.64.0
+ [7] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
+ [9] IRanges_2.38.0 S4Vectors_0.42.0
+[11] BiocGenerics_0.50.0 MatrixGenerics_1.16.0
+[13] matrixStats_1.3.0 ggplot2_3.5.1
+[15] optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] gridExtra_2.3 rlang_1.1.3
+ [3] magrittr_2.0.3 compiler_4.4.0
+ [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
+ [7] stringr_1.5.1 pkgconfig_2.0.3
+ [9] crayon_1.5.2 fastmap_1.1.1
+ [11] XVector_0.44.0 labeling_0.4.3
+ [13] utf8_1.2.4 rmarkdown_2.26
+ [15] tzdb_0.4.0 UCSC.utils_1.0.0
+ [17] ggbeeswarm_0.7.2 bit_4.0.5
+ [19] xfun_0.43 bluster_1.14.0
+ [21] zlibbioc_1.50.0 cachem_1.0.8
+ [23] beachmat_2.20.0 jsonlite_1.8.8
+ [25] highr_0.10 rhdf5filters_1.16.0
+ [27] DelayedArray_0.30.0 Rhdf5lib_1.26.0
+ [29] BiocParallel_1.38.0 irlba_2.3.5.1
+ [31] parallel_4.4.0 cluster_2.1.6
+ [33] R6_2.5.1 bslib_0.7.0
+ [35] stringi_1.8.3 limma_3.60.0
+ [37] jquerylib_0.1.4 Rcpp_1.0.12
+ [39] knitr_1.46 R.utils_2.12.3
+ [41] FNN_1.1.4 readr_2.1.5
+ [43] Matrix_1.7-0 igraph_2.0.3
+ [45] tidyselect_1.2.1 abind_1.4-5
+ [47] yaml_2.3.8 viridis_0.6.5
+ [49] codetools_0.2-20 lattice_0.22-6
+ [51] tibble_3.2.1 withr_3.0.0
+ [53] Rtsne_0.17 evaluate_0.23
+ [55] getopt_1.20.4 pillar_1.9.0
+ [57] generics_0.1.3 vroom_1.6.5
+ [59] hms_1.1.3 sparseMatrixStats_1.16.0
+ [61] munsell_0.5.1 scales_1.3.0
+ [63] glue_1.7.0 metapod_1.12.0
+ [65] tools_4.4.0 BiocNeighbors_1.22.0
+ [67] ScaledMatrix_1.12.0 locfit_1.5-9.9
+ [69] fs_1.6.4 cowplot_1.1.3
+ [71] rhdf5_2.48.0 grid_4.4.0
+ [73] DropletUtils_1.24.0 edgeR_4.2.0
+ [75] colorspace_2.1-0 GenomeInfoDbData_1.2.12
+ [77] beeswarm_0.4.0 BiocSingular_1.20.0
+ [79] HDF5Array_1.32.0 vipor_0.4.7
+ [81] cli_3.6.2 rsvd_1.0.5
+ [83] fansi_1.0.6 S4Arrays_1.4.0
+ [85] viridisLite_0.4.2 dplyr_1.1.4
+ [87] uwot_0.2.2 gtable_0.3.5
+ [89] R.methodsS3_1.8.2 sass_0.4.9
+ [91] digest_0.6.35 SparseArray_1.4.0
+ [93] ggrepel_0.9.5 dqrng_0.3.2
+ [95] farver_2.1.1 htmltools_0.5.8.1
+ [97] R.oo_1.26.0 lifecycle_1.0.4
+ [99] httr_1.4.7 statmod_1.5.0
+[101] bit64_4.0.5
+
+
This notebook will demonstrate how to:
+Cluster cells with k-means and graph-based methods
+Visualize and compare cluster assignments
+Identify potential marker genes for the clusters we find
+# Load libraries
+library(ggplot2)
+library(scater)
+
+
+(Package startup messages omitted; they are identical to those shown when these packages were loaded above.)
+
+
+library(scran)
+
+# clustering tools
+library(bluster)
+
+# Setting the seed for reproducibility
+set.seed(12345)
+
+
+
+
+
+
+# main data directory
+data_dir <- file.path("data", "hodgkins")
+
+# normalized data file
+normalized_rds <- file.path(data_dir, "normalized", "normalized_hodgkins_sce.rds")
+
+# Output directory for markers
+marker_dir <- file.path("analysis", "hodgkins", "markers")
+fs::dir_create(marker_dir)
+
+
+
+
+
+
+hodgkins_sce <- readr::read_rds(normalized_rds)
+
+
+
When we performed dimensionality reduction on our single cell data, we could see visually that the cells tended to cluster together into groups. To the extent that such clustering is a real biological phenomenon, representing cells with similar patterns of gene expression, we might like to identify distinct groups that we can name and assign a label to. Ultimately, we would hope that these labels correspond to previously identified (or newly identified!) cell types, and that we can use that information to provide more insight into the results of our experiment.
+There are a number of methods to identify clusters and assign cells +to those in multidimensional data like the single cell data we have. We +will explore a couple of the more common methods here.
+The first method we will try is k-means clustering. The
+k
here refers to the number of clusters that we will
+create, and must be chosen before we start. This clustering method seeks
+to find a way to divide the cells into k
clusters such that
+the cells within each cluster are as similar as possible and the
+differences among clusters is as large as possible.
It turns out that is a pretty hard problem to solve exactly, but we can do pretty well with an algorithm that starts with a random guess at where the clusters will be:
+1. Pick k random points as the starting cluster centers.
+2. Assign each cell to its nearest cluster center.
+3. Recompute each center as the mean of the cells assigned to it.
+4. Repeat steps 2 and 3 until the cluster assignments stop changing.
+For an intuitive visualization of the general k-means method, you +might find this +StatQuest video useful, and for more discussion of the method in a +single-cell context, the Orchestrating +Single-Cell Analysis book section on k-means is a good +reference.
+We are going to use the function clusterRows()
from the
+Bioconductor bluster
package for our clustering. This
+function takes a matrix where each sample (cell in our case) is a row
+and each column is a feature. The matrix of counts (or normalized
+counts) by cells in our SingleCellExperiment
object is the
+wrong orientation, so at a minimum we would have to transpose that
+matrix before proceeding.
However, clustering algorithms like k-means can be a bit slow with as +many features as the number of genes that we have in our data set, so we +would rather not use the raw data. There is also a potential concern +that noise in the raw data might disrupt the clustering algorithm, so it +would be best to use some kind of dimensionality reduction algorithm +first. We still want to maintain a good number of dimensions, so our old +friend PCA is a good (and very standard) choice.
+Thankfully, we already computed and stored a matrix with
+reduced dimensions with the runPCA()
function. We will
+extract that from the SingleCellExperiment
object with the
+reducedDim()
function, which conveniently returns a matrix
+with the cells as rows, so we can use that directly!
The other argument we need for clusterRows()
will tell
+it which clustering algorithm to use, and any additional parameters
+associated with that algorithm. This has to be made with a special kind
+of function from the bluster
package. In this case, we will
+use KmeansParam()
to specify that we want k-means
+clustering, with the centers
parameter to set how many
+clusters we will assign (k
).
# set the number of clusters
+k <- 7
+
+# extract the principal components matrix
+hodgkins_pca <- reducedDim(hodgkins_sce, "PCA")
+
+# perform the clustering
+kclusters <- clusterRows(hodgkins_pca, KmeansParam(centers = k))
+
+
+
+The clusterRows()
function returned a vector of cluster
+assignments as integers, but the numerical values have no inherent
+meaning. For plotting we will want to convert those to a factor, so R is
+not tempted to treat them as a continuous variable.
We can also store them back into the column (cell) information table +of the original object for convenient storage and later use.
+ + + +# save clusters in the SCE object as a factor
+hodgkins_sce$kcluster <- factor(kclusters)
+
+
+
+Now we can plot the results and see how the clustering looks, using
+the scater
function plotReducedDim()
that we
+have used before, coloring the points by our clustering results. We will
+start by using the UMAP coordinates for the plot. Note that this does
+require that the cluster identities were stored in the
+SingleCellExperiment
object, as we just did.
# plot clustering results
+plotReducedDim(hodgkins_sce, "UMAP", color_by = "kcluster")
+
+
+
+
+
+
+How would the results look if we plotted them using the PCA or TSNE coordinates instead?
+Another common type of clustering method for single cell data is +graph-based clustering. This algorithm follows the following general +steps:
+There is a lot of hidden detail in those three steps!
+To apply this clustering algorithm, we will use the same
+bluster::clusterRows()
function as before, but we will
+change the second argument from KmeansParam()
to
+NNGraphParam()
to tell it that we want to use a
+nearest-neighbor graph-based method. We can then supply additional
+parameters to NNGraphParam()
to adjust the details of the
+algorithm. Here we will use k
to specify the number of
+neighbors to use when building the graph and cluster.fun
to
+specify the algorithm for identifying the clusters within the graph.
Despite sharing a letter, k
here and the one from
+k-means clustering are not the same thing! In this case, we are telling
+the algorithm how many neighbor connections to make for each cell, not
+the final number of clusters, which will be determined by the algorithm
+we use for the cluster building step.
The options for cluster.fun
describe the algorithm
+for the cluster building step described above. These include
+walktrap
(the default), leiden
, and
+louvain
, which is the default algorithm in Seurat
, another
+common package for single cell analysis that you may have seen.
In the example below, we will use the default values for these two +arguments.
+ + + +# run the clustering algorithm
+nnclusters <- clusterRows(
+ hodgkins_pca,
+ NNGraphParam(k = 10,
+ cluster.fun = "walktrap")
+ )
+# store cluster results in the SCE object
+hodgkins_sce$nncluster <- factor(nnclusters)
+
+
+
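+If we wanted to try a different community detection algorithm, only cluster.fun changes; for example, a sketch with the louvain option (not run here):
+# nnclusters_louvain <- clusterRows(
+#   hodgkins_pca,
+#   NNGraphParam(k = 10, cluster.fun = "louvain")
+# )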
+Now we can plot the results of our graph-based clustering. This time
+we will also use the text_by
argument to include the
+cluster ids directly on the plot.
plotReducedDim(hodgkins_sce,
+ "UMAP",
+ color_by = "nncluster",
+ text_by = "nncluster")
+
+
+
+
+
+
+Again, you will have time to explore these more in the exercise +notebook, and of course with your own data! Sadly, there are not always +good answers to which set of inferred clusters is best! Which method and +parameters you use may depend on the kind of question you are trying to +answer.
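+One simple, concrete way to compare the two clusterings we computed is to cross-tabulate their assignments; a quick sketch:
+# rows are k-means clusters, columns are graph-based clusters
+table(kmeans = hodgkins_sce$kcluster, graph = hodgkins_sce$nncluster)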
For more detailed information about the methods presented here, including some ways to assess the "quality" of the clustering, I encourage you to explore the relevant chapter of the Orchestrating Single-Cell Analysis book. A recent review by Kiselev et al. (2019) also goes into some depth about the differences among algorithms and the general challenges associated with clustering single cell data.
+Assigning clusters is nice for visualization, but we would also like +to be able to move toward a biological interpretation of the clusters +and identifying the cell types in each cluster. To that end, we can +identify marker genes that are differentially expressed among +clusters.
+It is worth noting here that the statistical calculations here are +more than a bit circular: we identified clusters first based on gene +expression, then we are using those same clusters to find differences in +gene expression. The result is that even if there were no true +clusters, we would always find marker genes! For a much more technical +exploration of this circularity (and a method to correct for it), see a +preprint by Gao et +al. (2020). In light of this, it is better to think about marker +gene identification as an aid in interpreting the clustering results +(and possibly extending insights to new data sets), rather than results +that should be interpreted on their own, and we should be extremely wary +of justifying cluster assignments solely based on these results! With +that caveat, let’s proceed.
To identify marker genes, we will use the scran::findMarkers() function, which will rank genes by their differential expression, calculating pairwise statistics among clusters. We have a few options for how to determine the gene rankings and the marker gene list for each cluster. At one extreme, we could include genes that are differentially expressed in any pairwise comparison against our focal cluster; at the other, we could include only genes that are differentially expressed in all comparisons with that cluster. We could also do something in between, including genes that differentiate the focal cluster from some fraction of the other clusters. For now, we will use the findMarkers() function to rank the genes in each cluster by their combined scores against all other clusters, using the pval.type argument.
findMarkers()
will return a list (technically a
+list-like object) of tables, one for each cell type, with statistics for
+each gene showing how well it differentiates that cell type against
+other types.
# use `findMarkers()` to calculate how well each gene
+# differentiates each cluster from *all* other clusters
+markers <- scran::findMarkers(hodgkins_sce,
+ groups = hodgkins_sce$nncluster,
+ pval.type = "all")
+
+
+
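+If we instead wanted genes that distinguish a cluster from at least one other cluster, we could use pval.type = "any" (the function's default); a sketch, not run here:
+# markers_any <- scran::findMarkers(hodgkins_sce,
+#                                   groups = hodgkins_sce$nncluster,
+#                                   pval.type = "any")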
Next we can look at one of those tables. We will start with the first cluster, which we will select from the list using the standard R double bracket [[1]] notation. We are also doing a bit of transformation here to pull the gene name into a column of its own.
markers[[1]] |>
+ as.data.frame() |> # convert to a data frame
+ tibble::rownames_to_column("gene") # make gene a column
+
+You can see that this table includes values for all genes, so we +would like to make a shorter list.
+Because we tend to like tidy data, here we use
+a tidyverse
function from the purrr
package to
+apply the same operations as above to every element of the
+markers
list. We will introduce purrr
briefly
+here, but if you want more information and background, we recommend the
+purrr
+cheatsheet (PDF) and Jenny Bryan’s great purrr
+tutorial.
The main functions in purrr
are the map()
+functions, which take as their main arguments a list
+and a function to apply to each element of the list.
+The main function is purrr::map()
; if you are familiar with
+the base R lapply()
function, it is very similar, but with
+some different defaults. We will use it to get the top rows from each
+table by applying the head()
function to each element of
+the list. The results are returned as a new list.
purrr::map(
+ as.list(markers[1:3]), # select the first 3 clusters and convert to a 'regular' list for purrr
+ head # the function to apply (note: no parentheses)
+ )
+
+
+
+This returns a list of data frames, which isn’t quite what we +want.
+There is no built-in function that will give us just the first few
+row names, so we will have to define one. As of version 4.1, R
+introduced a new approach to defining anonymous functions -
+that is, functions you can quickly define “on-the-fly” without formally
+assigning them to a function name. They are handy when you need to do a
+very short task that requires a function, but it isn’t really a function
+you need beyond this context. This new anonymous syntax looks like this:
+\(x)...
(or for slightly longer code, use curly braces as
+in \(x) {...}
). This defines a function that takes one
+argument, x
, with ...
indicating where you
+would put the expression to calculate.
purrr::map()
will then apply the expression in our
+anonymous function to each element of the list, and return the results
+as a new list.
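+For reference, the backslash shorthand is just a compact spelling of the classic function() syntax; a small illustration with hypothetical names:
+# these two definitions behave identically
+first_rows_new <- \(x) head(rownames(x))
+first_rows_old <- function(x) head(rownames(x))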
# Get the first few row names of each table with a purrr function.
+purrr::map(
+ # convert markers to a 'regular' list for purrr
+ as.list(markers),
+ # our custom function:
+ \(x) head( rownames(x) )
+)
+
+
+$`1`
+[1] "ENSG00000153064" "ENSG00000247982" "ENSG00000042980" "ENSG00000224137"
+[5] "ENSG00000211898" "ENSG00000163534"
+
+$`2`
+[1] "ENSG00000213809" "ENSG00000172543" "ENSG00000153563" "ENSG00000104660"
+[5] "ENSG00000164120" "ENSG00000182871"
+
+$`3`
+[1] "ENSG00000124882" "ENSG00000137462" "ENSG00000136689" "ENSG00000103569"
+[5] "ENSG00000123689" "ENSG00000043462"
+
+$`4`
+[1] "ENSG00000120875" "ENSG00000115935" "ENSG00000100629" "ENSG00000136754"
+[5] "ENSG00000082074" "ENSG00000163599"
+
+$`5`
+[1] "ENSG00000229117" "ENSG00000231500" "ENSG00000147403" "ENSG00000109475"
+[5] "ENSG00000105372" "ENSG00000156508"
+
+$`6`
+[1] "ENSG00000266088" "ENSG00000235576" "ENSG00000204866" "ENSG00000156234"
+[5] "ENSG00000174946" "ENSG00000111796"
+
+$`7`
+[1] "ENSG00000111678" "ENSG00000275385" "ENSG00000173369" "ENSG00000164754"
+[5] "ENSG00000159189" "ENSG00000197249"
+
+$`8`
+[1] "ENSG00000164236" "ENSG00000054219" "ENSG00000128487" "ENSG00000086758"
+[5] "ENSG00000181163" "ENSG00000133112"
+
+$`9`
+[1] "ENSG00000156508" "ENSG00000205542" "ENSG00000171863" "ENSG00000181163"
+[5] "ENSG00000186468" "ENSG00000145592"
+
+$`10`
+[1] "ENSG00000008517" "ENSG00000128340" "ENSG00000026025" "ENSG00000092820"
+[5] "ENSG00000102879" "ENSG00000054267"
+
+$`11`
+[1] "ENSG00000141753" "ENSG00000085063" "ENSG00000148175" "ENSG00000115306"
+[5] "ENSG00000182871" "ENSG00000085733"
+
+$`12`
+[1] "ENSG00000132465" "ENSG00000185507" "ENSG00000156675" "ENSG00000051108"
+[5] "ENSG00000135916" "ENSG00000101057"
+
+Another variant is purrr::imap()
, which allows us to use
+the names of the list elements in our function. (Try
+names(markers)
to see the names for the list we are working
+with now.) We will use that here to name output files where we will
+print each of the marker tables, one for each cell type. We are again
+defining a custom function within the call to purrr::imap()
+using the \(x)
syntax, but this time we need two variables:
+we will use table
for the list elements (each a table of
+results) and id
for their names. So, we’ll actually start
+by defining the function as \(table, id)
, since there will
+be two input arguments. Because we don’t know the identities of the
+clusters we identified, these are just the cluster numbers for now.
Making file names from numbers can be a bit fraught, as we really
+want them to sort in numerical order, but many systems will sort by
+alphabetical order. Unfortunately, that would tend to sort 10-19 before
+2, 20-29 before 3, etc. To solve this, we are using the
+sprintf()
function, which allows us to specify the format
+of a printed string. In this case, we are using the formatting syntax of
+%02d
to tell it that we will want to insert
+(%
) a number (d
), with two digits and leading
+zeros. To see what this does a bit more concretely, let’s look at a
+simple example:
sprintf("%02d", 1:10)
+
+
+ [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
+
+
+In addition to writing the tables out, we are saving the data frames +we created as a new list that we can use in the next step.
+ + + +marker_df_list <- purrr::imap(
+ as.list(markers), # convert markers to a 'regular' list for purrr
+ # purrr function: table is the list element, id is the element name (the cluster number here)
+ \(table, id) {
+ as.data.frame(table) |> # first convert to a data frame
+ tibble::rownames_to_column("gene") |> # make genes a column
+ dplyr::arrange(FDR) |> # sort to be sure small FDR genes are first
+ readr::write_tsv( # write each data frame to a file
+ file.path(
+ marker_dir, # construct the output path
+ sprintf("cluster%02d_markers.tsv", as.integer(id)) # format cluster numbers in file names with leading zeros
+ )
+ )
+ }
+)
+
+
+
+One thing we can do with this list of marker genes is to see how they
+look across the cells and clusters. The
+scater::plotReducedDim()
 function makes this easy! Earlier we
+colored points by a cell statistic, like the number of
+expressed genes, but it is just as easy to color by the expression of a
+single gene by using the gene identifier as the color_by
+argument.
The first step is to get the gene information for the genes we might +be interested in.
+ + + +# get gene ids for top 10 cluster 1 markers
+gene_ids <- marker_df_list[[1]] |>
+ head(n = 10) |>
+ dplyr::pull(gene)
+
+# look at the gene info for these
+gene_info <- rowData(hodgkins_sce)[gene_ids, ]
+data.frame(gene_info)
+
+Now we can pick one of the genes for plotting and go!
+ + + +# get gene id and gene symbol for nicer plotting
+rank <- 1
+gene_id <- gene_info$ID[rank]
+symbol <- gene_info$Symbol[rank]
+
+# Plot UMAP results colored by expression
+plotReducedDim(hodgkins_sce, "UMAP",
+ color_by = gene_id) +
+ # label the guide with the gene symbol
+ guides(color = guide_colorbar(title = symbol))
+
+
+
+
+
+
+Hopefully that expression pattern aligns at least in part with your +expectations!
+So far we have identified clusters of cells (if you believe them), +and found some genes that are associated with each cluster. What you +might want to know at this point is what cell types comprise +each cluster. Setting aside the thorny question of “what is a cell +type?”, this is still a challenging problem, and we’ll explore some +approaches to perform cell type annotation in the next notebook!
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] bluster_1.14.0 scran_1.32.0
+ [3] scater_1.32.0 scuttle_1.14.0
+ [5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
+ [7] Biobase_2.64.0 GenomicRanges_1.56.0
+ [9] GenomeInfoDb_1.40.0 IRanges_2.38.0
+[11] S4Vectors_0.42.0 BiocGenerics_0.50.0
+[13] MatrixGenerics_1.16.0 matrixStats_1.3.0
+[15] ggplot2_3.5.1 optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] gridExtra_2.3 rlang_1.1.3
+ [3] magrittr_2.0.3 compiler_4.4.0
+ [5] DelayedMatrixStats_1.26.0 vctrs_0.6.5
+ [7] stringr_1.5.1 pkgconfig_2.0.3
+ [9] crayon_1.5.2 fastmap_1.1.1
+[11] XVector_0.44.0 labeling_0.4.3
+[13] utf8_1.2.4 rmarkdown_2.26
+[15] tzdb_0.4.0 UCSC.utils_1.0.0
+[17] ggbeeswarm_0.7.2 bit_4.0.5
+[19] purrr_1.0.2 xfun_0.43
+[21] zlibbioc_1.50.0 cachem_1.0.8
+[23] beachmat_2.20.0 jsonlite_1.8.8
+[25] highr_0.10 DelayedArray_0.30.0
+[27] BiocParallel_1.38.0 irlba_2.3.5.1
+[29] parallel_4.4.0 cluster_2.1.6
+[31] R6_2.5.1 bslib_0.7.0
+[33] stringi_1.8.3 limma_3.60.0
+[35] jquerylib_0.1.4 Rcpp_1.0.12
+[37] knitr_1.46 readr_2.1.5
+[39] Matrix_1.7-0 igraph_2.0.3
+[41] tidyselect_1.2.1 abind_1.4-5
+[43] yaml_2.3.8 viridis_0.6.5
+[45] codetools_0.2-20 lattice_0.22-6
+[47] tibble_3.2.1 withr_3.0.0
+[49] evaluate_0.23 getopt_1.20.4
+[51] pillar_1.9.0 generics_0.1.3
+[53] vroom_1.6.5 hms_1.1.3
+[55] sparseMatrixStats_1.16.0 munsell_0.5.1
+[57] scales_1.3.0 glue_1.7.0
+[59] metapod_1.12.0 tools_4.4.0
+[61] BiocNeighbors_1.22.0 ScaledMatrix_1.12.0
+[63] locfit_1.5-9.9 fs_1.6.4
+[65] cowplot_1.1.3 grid_4.4.0
+[67] edgeR_4.2.0 colorspace_2.1-0
+[69] GenomeInfoDbData_1.2.12 beeswarm_0.4.0
+[71] BiocSingular_1.20.0 vipor_0.4.7
+[73] cli_3.6.2 rsvd_1.0.5
+[75] fansi_1.0.6 S4Arrays_1.4.0
+[77] viridisLite_0.4.2 dplyr_1.1.4
+[79] gtable_0.3.5 sass_0.4.9
+[81] digest_0.6.35 SparseArray_1.4.0
+[83] ggrepel_0.9.5 dqrng_0.3.2
+[85] farver_2.1.1 htmltools_0.5.8.1
+[87] lifecycle_1.0.4 httr_1.4.7
+[89] statmod_1.5.0 bit64_4.0.5
+
+
+This notebook will demonstrate how to:
+Annotate cell types with SingleR
+Apply SingleR classification to groups of cells
+In this notebook, we will attempt to assign cell types to each of +the cells in a dataset, using some of the automated tools that are +available within the Bioconductor universe.
+Much of the material in this notebook is directly inspired by, and +draws heavily on, material presented in the book Orchestrating Single +Cell Analysis with Bioconductor.
+The data we will use for this notebook is derived from a 10x
+Genomics dataset of human peripheral blood mononuclear cells
+(PBMCs). These data include both single cell RNA-seq counts and
+quantification of antibody-derived tags (ADTs) performed by sequencing
+short DNA barcodes attached to specific antibodies. This type of ADT
+sequencing with single cells is commonly known as CITE-seq, after the
+protocol developed by Stoeckius et al.
+(2017).
+The antibodies used here are the
+TotalSeq™-B Human TBNK Cocktail, a set of antibodies designed to
+react with immune cell surface markers.
The data here have already been filtered, normalized, and had +dimension reductions calculated for the single-cell RNA-seq data. The +ADT data has also been separately filtered and normalized. For details +about how to perform these tasks with data that has been processed with +Cell Ranger, you may want to look at the “Integrating +with protein abundance” chapter of OSCA.
+The processed gene expression and ADT data were saved into a combined
+SingleCellExperiment
(SCE) object, and we will start with
+that object for our exploration here.
To start, we will load some of the libraries we will need later, and +set a random number seed for reproducibility.
+ + + +# Load libraries
+library(ggplot2) # plotting functions
+library(SingleCellExperiment) # Bioconductor single-cell data class
+
+
+Loading required package: SummarizedExperiment
+
+
+Loading required package: MatrixGenerics
+
+
+Loading required package: matrixStats
+
+
+
+Attaching package: 'MatrixGenerics'
+
+
+The following objects are masked from 'package:matrixStats':
+
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
+
+
+Loading required package: GenomicRanges
+
+
+Loading required package: stats4
+
+
+Loading required package: BiocGenerics
+
+
+
+Attaching package: 'BiocGenerics'
+
+
+The following objects are masked from 'package:stats':
+
+ IQR, mad, sd, var, xtabs
+
+
+The following objects are masked from 'package:base':
+
+ anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+ colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+ get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
+ match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
+ Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
+ tapply, union, unique, unsplit, which.max, which.min
+
+
+Loading required package: S4Vectors
+
+
+
+Attaching package: 'S4Vectors'
+
+
+The following object is masked from 'package:utils':
+
+ findMatches
+
+
+The following objects are masked from 'package:base':
+
+ expand.grid, I, unname
+
+
+Loading required package: IRanges
+
+
+Loading required package: GenomeInfoDb
+
+
+Loading required package: Biobase
+
+
+Welcome to Bioconductor
+
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
+
+
+
+Attaching package: 'Biobase'
+
+
+The following object is masked from 'package:MatrixGenerics':
+
+ rowMedians
+
+
+The following objects are masked from 'package:matrixStats':
+
+ anyMissing, rowMedians
+
+
+Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
+'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
+
+
+# Setting the seed for reproducibility
+set.seed(12345)
+
+
+
+As mentioned, our input file here is a single normalized and
+processed SCE object, stored as an rds
file. That should be
+all we need to read in!
Our output will be a table of per-cell information, which will
+include the cell type assignments we have made throughout this notebook.
+We aren’t planning any significant modifications of the underlying data,
+so we won’t bother re-saving the whole SCE object as a new
+.rds
file this time.
# directory for the input data
+data_dir <- file.path("data",
+ "PBMC-TotalSeqB",
+ "normalized")
+
+# the input file itself
+sce_file <- file.path(data_dir,
+ "PBMC_TotalSeqB_normalized_sce.rds")
+
+# A directory to store outputs
+analysis_dir <- file.path("analysis",
+ "PBMC-TotalSeqB")
+
+# Create directory if it doesn't exist
+fs::dir_create(analysis_dir)
+
+# output table path
+cellinfo_file <- file.path(analysis_dir,
+ "PBMC_TotalSeqB_cellinfo.tsv")
+
+
+
+SingleCellExperiment
Now that the preliminary setup is out of the way, we can get started.
+First we will read in the SingleCellExperiment
from the
+input file we defined earlier.
# read in the SCE file
+sce <- readr::read_rds(sce_file)
+# print a summary of the SCE
+sce
+
+
+class: SingleCellExperiment
+dim: 36601 7924
+metadata(1): Samples
+assays(2): counts logcounts
+rownames(36601): ENSG00000243485 ENSG00000237613 ... ENSG00000278817
+ ENSG00000277196
+rowData names(3): ID Symbol Type
+colnames: NULL
+colData names(13): Sample Barcode ... prob_compromised sizeFactor
+reducedDimNames(2): PCA UMAP
+mainExpName: Gene Expression
+altExpNames(1): ADT
+
+
+
+This should look similar to the SCE objects that we have seen before,
+containing counts
and logcounts
assays where
+each cell is a column and each row is a gene. We also have some of the
+rowData
, colData
and reduced dimension
+matrices that we have seen before.
+But where are the data from the ADTs? We wouldn’t necessarily want +those stored in the main data matrices, as the characteristics of ADT +barcode data are going to be quite different from gene expression +data.
+To keep the ADT data separate from the RNA gene expression data, we
+have split this data off into an alternative experiment
+(altExp
) slot. You can see the name of this
+altExp
on the line altExpNames
above. We
+could have more than one type of alternative experiment (such
+as spike-in or ATAC-seq), but in this case, just the one.
To access the contents of the altExp
slot, we can use
+the altExp()
function. Let’s look at what we have in that
+slot:
# print a summary of the 'ADT' altExp
+altExp(sce, "ADT")
+
+
+class: SingleCellExperiment
+dim: 10 7924
+metadata(1): Samples
+assays(2): counts logcounts
+rownames(10): CD3 CD4 ... CD45 IgG1
+rowData names(3): ID Symbol Type
+colnames: NULL
+colData names(1): sizeFactor
+reducedDimNames(0):
+mainExpName: NULL
+altExpNames(0):
+
+
+
+It is another SingleCellExperiment
! Inception! Let’s
+look at that embedded SCE more closely.
The first thing to note is that this altExp
has the same
+number of columns as did the main SCE object. Those corresponded to the
+individual cells before, and still do!
There are only 10 rows, however, and these correspond to the ADTs
+that were assayed by this particular experiment. Just as we did with the
+full SCE, we can use rowData()
to view the table containing
+metadata associated with each of these rows. We’ll add the
+altExp()
function to point it to the embedded object we are
+interested in. Since there is only one altExp
, we don’t
+need the second (name) argument ("ADT"
) that we used above;
+the default behavior of altExp()
is to just give us the
+first altExp
, and that is the one (and only) that we
+need.
# What proteins were assayed?
+rowData(altExp(sce))
+
+
+DataFrame with 10 rows and 3 columns
+ ID Symbol Type
+ <character> <character> <character>
+CD3 CD3 CD3 Antibody Capture
+CD4 CD4 CD4 Antibody Capture
+CD8 CD8 CD8 Antibody Capture
+CD11c CD11c CD11c Antibody Capture
+CD14 CD14 CD14 Antibody Capture
+CD16 CD16 CD16 Antibody Capture
+CD19 CD19 CD19 Antibody Capture
+CD56 CD56 CD56 Antibody Capture
+CD45 CD45 CD45 Antibody Capture
+IgG1 IgG1 IgG1_control_TotalSeqC Antibody Capture
+
+
+
+You can see here the names and symbols of the tags used, along with
+the designation that all have an “Antibody Capture” type (as opposed to
+“Gene Expression” for the RNA data). One that stands out
+is the IgG1
control, which is actually a mouse antibody
+used as a negative control.
While dimension reduction was performed on this data, we have not yet +performed any clustering.
+Let’s assign some clusters to our cells, using graph-based clustering +and default parameters, taking as input the PCA matrix that was +previously calculated. Note that this PCA matrix and the UMAP built from +it were derived from the gene expression data, so the clustering is +going to reflect the gene expression data only. While we have the ADT +data, it is not being used for this stage of the analysis.
+ + + +# perform clustering
+nn_clusters <- bluster::clusterRows(
+ # PCA input
+ reducedDim(sce, "PCA"),
+ # graph clustering & parameters
+ bluster::NNGraphParam()
+)
+
+# add clusters to colData
+sce$nn_cluster <- nn_clusters
+
+
+
+Now we can plot the clusters we have identified with
+scater::plotUMAP()
. This is a shortcut for
+scater::plotReducedDim(dimred = "UMAP", ...)
, which can
+save us a lot of typing as we do this repeatedly!
# plot clusters
+scater::plotUMAP(sce, color_by = "nn_cluster") +
+ # rename the legend
+ guides(color = guide_legend(title = "Cluster"))
+
+
+
+
+
+
+But what are these clusters, really? Do they correspond to particular +cell types that we are interested in?
+Does it bother you that we just used the default nearest-neighbor +graph clustering parameters? Do you know what those were?
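+If not, one easy way to check (a quick aside, not part of the original analysis) is to print the parameter object itself, which displays the settings it will use:
+# print the default nearest-neighbor graph clustering parameters
+bluster::NNGraphParam()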
+The first way we will identify cell types of individual cells is to +use the ADT normalized counts. These antibody markers were (hopefully) +chosen for their relevance to the sequenced cell population.
+The first marker we will look at is CD3
, which is a
+protein complex that is found on the surface of T cells. We can again
+use the plotUMAP()
function to color cells by
+CD3
ADT levels.
Note that this function can plot data from the colData
+table (as we used it above when plotting clusters), in the main gene
+expression matrix (as we used it in the previous notebook), AND
+in altExp
tables and matrices! So to color by the ADT
+levels (as normalized in the logcounts
matrix) we only need
+to provide the tag name that we want to plot in the
+color_by
argument.
# plot CD3 expression
+scater::plotUMAP(sce, color_by = "CD3")
+
+
+
+
+
+
+It appears that we have a number of potential T cells down in the +lower left!
+Let’s look at a couple of other markers to try to break those up more +specifically.
+Two other markers of relevance to the T cells are CD4
+and CD8
. The CD4
complex is present in helper
+T cells (hence their other common name, CD4+ T cells). By contrast, the
+CD8
complex is found on killer T cells (CD8+ cells).
Let’s plot the ADT results for those two markers as well below:
+ + + +# plot CD4 marker
+scater::plotUMAP(sce,
+ color_by = "CD4")
+
+
+
+
+
+
+
+
+
+# plot CD8 marker
+scater::plotUMAP(sce,
+ color_by = "CD8")
+
+
+
+
+
+
+Plotting the levels of the ADTs provides a nice visual +representation, but what we really want to do is to turn these values +into specific cell-type assignments for each cell. Such classification +could be considered as analogous to a cell-sorter assay, where we would +set up some rules to look at a few markers for each cell and use those +to assign a cell type. The simplest type of rule might be one where we +use a threshold to call a marker as present or absent, and then use the +presence of a marker to indicate a specific cell type.
+To do this, we will need to make some decisions, such as the +thresholds we should use to determine whether a cell is or is not +expressing a particular marker. In general, markers that are useful for +this cell-typing approach will have a bimodal distribution of expression +levels which can be used to separate the population into two groups of +cells. One group of cells will have only a background level signal for +each marker (due to non-specific binding or other factors), while the +other group, those that express the protein, will have a much higher +level of binding and higher counts.
+To assess whether the ADTs we have chosen have a useful distribution +of expression values, and to identify thresholds we might use, we would +like to plot each ADT tag. To do this, we will pull out the expression +values for these markers from the SCE object and do some data +wrangling.
+We are interested in the normalized counts for the ADT tags, which
+are stored in the logcounts
assay of the
+altExp
. If you recall, this matrix is stored with the
+columns as cells and rows as markers, but we really want it with each
+row a cell and each column a marker. So we will first transpose the
+data, then convert it to a data frame for our next steps. Because the
+SCE object stores the assay data matrices in a specialized format, we
+have to do one extra step to first convert it to a “regular” R matrix, or R
+won’t know how to convert it to a data frame.
# convert logcounts data to a data frame
+adt_df <- logcounts(altExp(sce)) |>
+ t() |> # transpose
+ as.matrix() |> # convert to matrix
+ as.data.frame() # convert to data frame
+
+# view the data frame
+head(adt_df)
+
+If we just wanted to plot one of these tags, we could do so right
+away, but with a bit more data wrangling, we can convert these results
+into a “tidier” format that will allow us to take full advantage of
+tidyverse
tools! In particular, it will let us plot them
+all at once with ggplot2
faceting.
+Right now the data is in a “wide” format, such that each column is a +different tag. But the data in all of the columns is the same type, and +measures something similar: the normalized count of an ADT. One could +even argue that each row contains 10 different observations, whereas the +“tidy” data ideal, as espoused by Wickham (2014), +requires a single observation per row: a “long” format. This long format +will have one column that tells us which ADT was measured and a second +column with the measurement value itself.
+We can perform this conversion using the tidyr::pivot_longer()
+function, which allows us to convert our data frame with one column per
+tag into a data frame with separate columns for the tag id
+(ADT
) and the expression value (logcount
).
+Following conversion, we will filter to just the ADTs that we care
+about.
adt_df_long <- adt_df |>
+ # pivot to long format
+ tidyr::pivot_longer(
+ everything(), # use all columns
+ names_to = "ADT", # convert row names to a column called "ADT"
+ values_to = "logcount" # name the value column "logcount"
+ ) |>
+ # filter to tags we are interested in
+ dplyr::filter(ADT %in% c("CD3", "CD4", "CD8"))
+
+# look at the resulting df
+head(adt_df_long)
+
+Now we can make a density plot with ggplot2
for all
+three ADTs we are interested in at once.
# plot logcounts by ADT
+ggplot(adt_df_long, aes(x = logcount, fill = ADT)) +
+ geom_density() + # density plot
+ facet_grid(rows = vars(ADT)) + # facet by ADT
+ theme_bw() + # nicer theme
+ theme(legend.position = "none") # no legend needed
+
+
+
+
+
+
+These look pretty good! Each of these markers has a bimodal +distribution: A lower peak consisting of cells that do not express the +protein but which still have a background level of antibody binding, and +an upper peak of cells that do express the protein of interest. The +background level does vary by antibody marker, so we will need a +different threshold value for each one.
+We can now use the values from these plots to construct a set of +rules to classify the T cells. We will do this using the “wide” data +frame from earlier.
+The thresholds we are using here were identified just “by eye”, so
+this is not a particularly principled method of cell type assignment,
+but it can be fairly effective. Here we are assigning only three cell
+types; cells that do not fit any of these criteria will be set as
+NA
.
# add cell type column by thresholding
+adt_df <- adt_df |>
+ dplyr::mutate(
+ celltype = dplyr::case_when(
+ CD3 > 6.7 & CD4 > 8 ~ "CD4+ T-cell",
+ CD3 > 6.7 & CD8 > 6 ~ "CD8+ T-cell",
+ CD3 > 6.7 ~ "T-cell"
+ )
+ )
+
+adt_df
+
+Now we will want to add the cell types we have assigned back to our
+original SCE object. We can do that by defining a new column name,
+threshold_celltype
that will be added to the
+colData
object. Creating and assigning values to this
+column can be done with the $
shortcut, and then we can
+plot our results with the plotUMAP()
function as
+before.
sce$threshold_celltype <- adt_df$celltype
+scater::plotUMAP(sce,
+ color_by = "threshold_celltype") +
+ guides(color = guide_legend(title = "Cell type"))
+
+
+
+
+
+
+How did we do?
+Note that while we applied this technique to assign cell types using +the ADT data, we could use the same type of procedure using gene +expression data alone, or a combination of gene expression data and tag +data.
+However, what we did here was very ad-hoc and quite manual! We didn’t +calculate any statistics, and we had to look at every tag we were +interested in to pick thresholds. A different dataset might have +different background levels, which would require different +thresholds.
+While this technique might be good for some simple experiments, and +can be useful for manual curation, it might not translate well to more +complex datasets with multiple samples. We also looked at each marker +separately, which might not be the most efficient or robust method of +analysis.
+For a more principled approach that allows identification of cell
+types by looking at the expression of sets of genes that are known to
+characterize each cell type, you might look at the AUCell
+package. For more on that method, the OSCA section Assigning
+cell labels from gene sets is a very good reference.
SingleR
An alternative approach to using known marker genes for +classification is to instead classify cells by comparing them to a +reference expression dataset. To do this, we will find a well-curated +gene expression dataset that contains samples with known cell types. We +can then train a model based on this dataset and look at each of the +cells in our new dataset to determine which (if any) of the known cell +types has the most similar expression pattern. The details of how such a +model may be constructed and trained will vary by the specific method, +but this overall approach is widely applied.
+For this section, we will focus on the SingleR
package
+and its methods, which are described in detail in The SingleR
+Book.
Selecting a reference dataset is one of the more critical steps for +this enterprise. At the most basic level, if the reference dataset does +not include the types of cells that we expect to see in our sample, it +won’t be useful. So we will want a reference dataset that has as many as +possible of the cell types that we expect to find in our dataset, at a +level of granularity that aligns with our goals.
+For SingleR
that reference data can be from bulk RNA
+sequencing or from other single-cell experiments. SingleR
+is also fairly robust to the method used for gene expression
+quantification, which means that we can use either RNA-seq datasets or
+microarrays, if those are more readily available.
One convenient source of cell reference data is the
+celldex
package, which is what we will use here. This
+package includes functions to download a variety of well-annotated
+reference datasets in a common format.
+For more information on the datasets available, you will want to refer
+to the
+celldex
summary vignette.
We will start by using a reference dataset of sorted immune cells +from GSE107011 +(Monaco et al. 2019). This particular reference was chosen +because it is well-suited to PBMC datasets, with a good level of +granularity.
+The celldex
functions also have a convenient option to
+convert gene symbols to Ensembl ids, which we will use here so that our
+reference data uses the same gene identifiers as the single-cell
+data.
# Bioconductor "Hub" packages provide the option to cache
+# downloads, but the interactive prompt can be annoying
+# when working with notebooks.
+# These options disable the prompt by giving permission
+# to create the cache automatically
+ExperimentHub::setExperimentHubOption("ASK", FALSE)
+AnnotationHub::setAnnotationHubOption("ASK", FALSE)
+
+# Get Monaco 2019 data from celldex with Ensembl ids.
+monaco_ref <- celldex::MonacoImmuneData(ensembl = TRUE)
+
+
+Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
+'DelayedArray::makeNindexFromArrayViewport' when loading 'HDF5Array'
+
+
+downloading 1 resources
+
+
+retrieving 1 resource
+
+
+loading from cache
+
+
+require("ensembldb")
+
+
+
+What is this monaco_ref
object?
monaco_ref
+
+
+class: SummarizedExperiment
+dim: 46077 114
+metadata(0):
+assays(1): logcounts
+rownames(46077): ENSG00000121410 ENSG00000268895 ... ENSG00000159840
+ ENSG00000074755
+rowData names(0):
+colnames(114): DZQV_CD8_naive DZQV_CD8_CM ... G4YW_Neutrophils
+ G4YW_Basophils
+colData names(3): label.main label.fine label.ont
+
+
+
+A SummarizedExperiment
is very similar to a
+SingleCellExperiment
, except rather than having one column
+per cell, each column is a sample. Otherwise, the components
+are very similar: each row is still a gene, for example, and additional
+data about the samples are stored in the colData
. In fact,
+the SingleCellExperiment
object is derived from a
+SummarizedExperiment
, with some extra slots that are more
+relevant to single-cell data.
What information do we have for the samples?
+ + + +colData(monaco_ref)
+
+
+DataFrame with 114 rows and 3 columns
+ label.main label.fine label.ont
+ <character> <character> <character>
+DZQV_CD8_naive CD8+ T cells Naive CD8 T cells CL:0000900
+DZQV_CD8_CM CD8+ T cells Central memory CD8 T.. CL:0000907
+DZQV_CD8_EM CD8+ T cells Effector memory CD8 .. CL:0000913
+DZQV_CD8_TE CD8+ T cells Terminal effector CD.. CL:0001062
+DZQV_MAIT T cells MAIT cells CL:0000940
+... ... ... ...
+G4YW_NK NK cells Natural killer cells CL:0000623
+G4YW_pDC Dendritic cells Plasmacytoid dendrit.. CL:0000784
+G4YW_mDC Dendritic cells Myeloid dendritic ce.. CL:0000782
+G4YW_Neutrophils Neutrophils Low-density neutroph.. CL:0000096
+G4YW_Basophils Basophils Low-density basophils CL:0000043
+
+
+
+There are three main columns for the sample data:
+label.main
is a more general cell type
+assignment.
label.fine
is a fine-level cell type with more
+specific labels. The exact level of granularity of these
+main
and fine
designations (and indeed the
+label names themselves) will vary among datasets, so it is important to
+look at the reference to see whether it is suitable for your
+application.
label.ont
is a standardized Cell Ontology
+identifier. Using the cell ontology can allow for more complex
+representations of the relationships among different cell types, but
+investigating that is beyond the scope of this workshop.
Another component we would like to explore is how many of each of
+these cell types we have in the reference dataset. A bit of quick
+dplyr
wrangling can give us the answer.
colData(monaco_ref) |>
+ as.data.frame() |>
+ dplyr::count(label.main, label.fine)
+
+This is pretty good! Most cell types have 4 replicates, which is more +replicates than we often find.
+SingleR
do?As mentioned earlier, SingleR
builds a model from a set
+of training data, and then uses that model to classify cells (or groups
+of cells) in new datasets.
SingleR
works by first identifying a set of marker genes
+that can be used to differentiate among the cell types in the reference
+dataset. It does this by performing pairwise comparisons among all of
+the cell types, and retaining the top set of genes differentiating each
+pair. The idea is that this set of genes will be the most informative
+for differentiating cell types.
Then, for each cell, SingleR
calculates the Spearman
+correlation between expression of that cell and each cell type (using
+the only the genes chosen earlier). Notably, this is a non-parametric
+correlation, so the scaling and normalization that we apply (or don’t)
+should not matter! Note that if you used a single-cell technology that
+produces full-length transcripts (i.e., SMART-seq), you will probably
+want to convert your counts to Transcripts per Million (TPM), to allow
+more consistent ranking among transcripts of different lengths.
The reference cell type with the highest correlation is then chosen +as the cell type assignment for that cell. If there are multiple cell +types with high scores, an optional fine-tuning step repeats the process +using only the most relevant genes for those cell types.
+SingleR
For our first run, we will do the marker gene selection (training)
+and classification in a single step, using the convenience function
+SingleR::SingleR()
. For this we need only supply three main
+arguments: Our SCE object, a reference matrix (here in
+SummarizedExperiment
format), and the labels for each of
+the samples in the reference that we want to use. We also need to be
+sure that our sample and the reference data use the same gene IDs, which
+is why we requested the Ensembl IDs when getting the reference
+dataset.
Because this function is doing many repetitive calculations (lots of
+correlations!), we can speed it up by including the BPPARAM
+argument. This is a common argument in Bioconductor
+packages where BP
stands for the BiocParallel
+package, which provides multiprocessing capabilities to many
+Bioconductor functions. In this case, we will use the argument
+BiocParallel::MulticoreParam(4)
to specify we want to use
+local multicore processing with 4 “workers”.
# calculate SingleR results in one step
+singler_result <- SingleR::SingleR(
+ sce, # our query SCE
+ ref = monaco_ref, # reference dataset
+ labels = monaco_ref$label.main, # reference labels to use
+ BPPARAM = BiocParallel::MulticoreParam(4) # multiprocessing
+)
+
+
+
+SingleR
provides a few nice visualizations for
+evaluating the statistics it calculated and the assignments it makes.
+One is a heatmap of the scores for each cell, arranged by the cell type
+that was assigned to each. This is created with the
+SingleR::plotScoreHeatmap()
function.
SingleR::plotScoreHeatmap(singler_result)
+
+
+
+
+
+
+We can also pull out individual components of the results object for
+plotting in the context of our input SCE object. Here we will save the
+pruned labels (where low-quality assignments have been given an
+NA
label), storing them back in our SCE object
+(specifically to a new column of the colData
table).
sce$celltype_main <- singler_result$pruned.labels
+
+
+
+Now we can plot the cell type assignments onto our UMAP to see how +they compare to the patterns we saw there before.
+ + + +scater::plotUMAP(sce, color_by = "celltype_main")
+
+
+
+
+
+
+Annoyingly, the NA
and T cells
labels are
+quite close in color, and the scater
and
+SingleR
packages don’t agree on color choices. Luckily,
+since plotUMAP()
returns a ggplot
object, we
+can modify the color palette using ggplot2
functions. Still
+annoyingly, however, when we change the palette, the legend title
+defaults to the uninformative name "colour_by"
, so we’ll
+also specify a matching legend title with our new color palette.
scater::plotUMAP(sce, color_by = "celltype_main") +
+ scale_color_brewer(name = "Cell type", # legend title
+ palette = "Dark2", # color palette
+ na.value = "gray80") # use light gray for NA values
+
+
+Scale for colour is already present.
+Adding another scale for colour, which will replace the existing scale.
+
+
+
+
+
+
+We seem to have a pretty good set of cell type assignments, with most +falling into groupings consistent with what we see in the UMAP plot.
+We can thank the fact that this is a PBMC sample and that we have a +good reference dataset for these cell types for the cleanliness of this +plot. Quite often with other kinds of samples (especially cancer cells!) +things will be much less clean!
+We can also look to see how the cell type assignments are distributed
+using the base R function table()
. Since we like to keep
+track of the cells that ended up as NA
in the pruned
+labels, we will include the useNA = "ifany"
argument.
table(singler_result$pruned.labels, useNA = "ifany")
+
+
+
+ B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes
+ 692 1291 904 232 3622
+ NK cells Progenitors T cells <NA>
+ 345 47 646 145
+
+
+
+In the previous cell typing, we used the label.main
+column, but we also had label.fine
, so let’s use that to
+explore the dataset in a bit more detail.
We will also take this time to dive a bit deeper into the steps that
+SingleR
performed. As mentioned, the first step is training
+the model, during which we identify the genes that will be used for the
+correlation analysis later. While this step is not particularly slow, if
+we were classifying multiple samples, we would not want to have to
+repeat it for every sample.
To do the training, we will use the trainSingleR()
+function. For this we will start with our reference and the labels we
+want to train the model with.
We can then specify the method used to select the genes that will be
+used for classification. The default method is "de"
, which
+performs a differential expression analysis for each pair of labels, but
+we could also use "sd"
to select the genes which are most
+variable across labels, or "all"
to use all genes. If we
+want to get really fancy, we could even provide a specific list of genes
+to use.
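+For example, a variant that selects the most variable genes might look like the sketch below (we will stick with the default "de" for the real model):
+# hypothetical variant: select genes by variability across labels
+sd_model <- SingleR::trainSingleR(
+  monaco_ref,
+  labels = monaco_ref$label.fine,
+  genes = "sd"
+)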
We should note here that the reference dataset for
+SingleR
does not need to be from a compendium like
+celldex
! If you have any well-classified dataset that you
+want to use as a reference, you can, as long as you can create a gene by
+sample expression matrix and a vector of cell types for each sample. You
+will want to ensure that the cell types you expect to see in your sample
+are present in the reference dataset, and data should be normalized, but
+otherwise the method can be quite flexible. You can even use a
+previously-annotated SingleCellExperiment
as a reference
+for a new dataset. For more details about custom references, see the OSCA
+chapter on cell type annotation.
We do want to be sure that the genes selected for the model will be
+among those present in our SCE object, so we will use the
+restrict
argument with a vector of the genes in our SCE.
+This step would happen automatically with the
+SingleR::SingleR()
function, but we need to add it manually
+for this use case.
# build fine model
+singler_finemodel <- SingleR::trainSingleR(
+ monaco_ref, # reference dataset
+ labels = monaco_ref$label.fine, # labels for training dataset
+ # use DE to select genes (default)
+ genes = "de",
+ # only use genes in the sce object
+ restrict = rownames(sce),
+ # parallel processing
+ BPPARAM = BiocParallel::MulticoreParam(4)
+)
+
+
+
+Now we can perform the classification step, using our SCE object and
+the SingleR
model that we just created.
# classify with fine model
+singler_result_fine <- SingleR::classifySingleR(
+ sce, # our SCE object
+ singler_finemodel, # the trained model object
+ # perform fine tuning (default)
+ fine.tune = TRUE,
+ # parallel processing
+ BPPARAM = BiocParallel::MulticoreParam(4)
+)
+
+
+
+What labels were assigned, and how many of each?
+ + + +table(singler_result_fine$pruned.labels, useNA = "ifany")
+
+
+
+ Central memory CD8 T cells Classical monocytes
+ 121 2926
+ Effector memory CD8 T cells Exhausted B cells
+ 31 48
+ Follicular helper T cells Intermediate monocytes
+ 135 424
+ Low-density basophils MAIT cells
+ 5 112
+ Myeloid dendritic cells Naive B cells
+ 162 311
+ Naive CD4 T cells Naive CD8 T cells
+ 600 752
+ Natural killer cells Non classical monocytes
+ 320 189
+ Non-switched memory B cells Non-Vd2 gd T cells
+ 250 163
+ Plasmablasts Plasmacytoid dendritic cells
+ 3 93
+ Progenitor cells Switched memory B cells
+ 36 81
+ T regulatory cells Terminal effector CD4 T cells
+ 163 39
+Terminal effector CD8 T cells Th1 cells
+ 115 97
+ Th1/Th17 cells Th17 cells
+ 135 133
+ Th2 cells Vd2 gd T cells
+ 144 146
+ <NA>
+ 190
+
+
+
+
+
+
+# add fine labels to SCE
+sce$celltype_fine <- singler_result_fine$pruned.labels
+# plot UMAP with fine labels
+scater::plotUMAP(sce, color_by = "celltype_fine")
+
+
+Warning: Removed 190 rows containing missing values or values outside the scale range
+(`geom_point()`).
+
+
+
+
+
+
+That’s a pretty messy plot. Mostly that is because there are
+lots of cell types here, and not enough colors to represent
+them all. The NA
cells also got taken off completely, which
+is not ideal.
One thing we can do is to use some functions from the
+tidyverse
package forcats
, which can
+be very handy for dealing with categorical variables like these cell
+types.
We will use two of these functions in the chunk below: First we will
+use fct_collapse
to take some of the finer labels that we
+might not be as interested in and collapse them into logical groupings
+(in this case, the main
label that they were part of).
+After that, we will use fct_relevel
to put the remaining
+factor levels in the order we would like them to appear for
+plotting.
collapsed_labels <- singler_result_fine$pruned.labels |>
+ forcats::fct_collapse(
+ "Monocytes" = c(
+ "Classical monocytes",
+ "Intermediate monocytes",
+ "Non classical monocytes"),
+ "Dendritic cells" = c(
+ "Myeloid dendritic cells",
+ "Plasmacytoid dendritic cells"),
+ "T cells" = c(
+ "MAIT cells",
+ "Non-Vd2 gd T cells",
+ "Vd2 gd T cells"),
+ "Helper T cells" = c(
+ "Th1 cells",
+ "Th1/Th17 cells",
+ "Th17 cells",
+ "Th2 cells",
+ "Follicular helper T cells"),
+ "B cells" = c(
+ "Naive B cells",
+ "Switched memory B cells",
+ "Non-switched memory B cells",
+ "Exhausted B cells",
+ "Plasmablasts"
+ )
+ ) |>
+ # order for plotting
+ forcats::fct_relevel(
+ "Helper T cells",
+ "T regulatory cells",
+ "Naive CD4 T cells",
+ "Terminal effector CD4 T cells",
+ "Naive CD8 T cells",
+ "Central memory CD8 T cells",
+ "Effector memory CD8 T cells",
+ "Terminal effector CD8 T cells",
+ "T cells",
+ "Natural killer cells",
+ "B cells",
+ "Monocytes",
+ "Dendritic cells",
+ "Progenitor cells",
+ "Low-density basophils"
+ )
+
+
+
+Now that we have that set up, we can plot using our collapsed and +ordered cell type labels.
+ + + +sce$celltype_collapsed <- collapsed_labels
+scater::plotUMAP(sce,
+ color_by = "celltype_collapsed")
+
+
+
+
+
+
+Let’s look at how the cell type assignments we obtained using
+SingleR
compare to the clusters that we found using the
+unsupervised clustering at the start of this notebook.
To do this, we will again use the table()
function, but
+now with two vectors as input, to build a contingency table of the cell
+types and clusters that each cell was classified with.
# create a table of clusters & cell type counts
+type_cluster_tab <- table(sce$celltype_fine, sce$nn_cluster, useNA = "ifany")
+
+# look at the top corner of the results
+type_cluster_tab[1:5, 1:5]
+
+
+
+ 1 2 3 4 5
+ Central memory CD8 T cells 81 0 0 0 0
+ Classical monocytes 0 0 2195 698 0
+ Effector memory CD8 T cells 26 0 0 0 0
+ Exhausted B cells 0 0 0 0 0
+ Follicular helper T cells 93 0 0 0 0
+
+
+
+As you can see, this produced a table with rows for each cell type +and columns for each cluster number. The values are the count of cells +for each cluster/cell type combination. However, these raw counts are +not quite what we’ll want for visualization. Since the total number of +cells differs across clusters, we’d like to convert these counts into +the proportions of each cell type in each cluster.
+We’ll do this by going through the table column by column and
+dividing each value by the sum for that cluster. This will give us
+normalized values where the values in each column now sum to 1. To do
+that, we will use the apply
function, which allows us to
+operate on a matrix row by row or column by column, applying a function
+to each “slice”. Since the function we want to apply is very short, we
+will use R’s new (as of version 4.1) anonymous function shorthand:
+\(x) ...
 can be used to define a function that takes
+as input values x
(where the ...
is where you
+would put the expression to calculate). Here we will apply the
+expression x/sum(x)
, which will divide each element of a
+vector x
by the sum of its values.
# normalize by the number of cells in each cluster (columns)
+type_cluster_tab <- apply(
+ type_cluster_tab,
+ 2, # apply function to columns
+ \(x) x/sum(x) # function to apply
+)
+# print the normalized values
+type_cluster_tab[1:5, 1:5]
+
+
+
+ 1 2 3 4 5
+ Central memory CD8 T cells 0.08617021 0 0.0000000 0.000000 0
+ Classical monocytes 0.00000000 0 0.9825425 0.656015 0
+ Effector memory CD8 T cells 0.02765957 0 0.0000000 0.000000 0
+ Exhausted B cells 0.00000000 0 0.0000000 0.000000 0
+ Follicular helper T cells 0.09893617 0 0.0000000 0.000000 0
+
+
+
+Now we can plot these results as a heatmap, using the
+pheatmap
package. There is a lot of customization we could
+do here, but pheatmap
(pretty heatmap) has good defaults,
+so we won’t spend too much time on it for now.
# plot with pheatmap
+pheatmap::pheatmap(type_cluster_tab)
+
+
+
+
+
+
+We can see that most of our clusters are indeed defined by a single +cell type, though there are some clusters (e.g., 1 & 9) that have a +number of (related) cell types within them. There are also some places +where single cell types are spread across a few different clusters +(Classical monocytes, for example).
+While most of the time we will want to classify single cells, +sometimes the sparseness of the data may mean that individual cells do +not provide reliable estimates of cell types.
+An alternative approach is to classify the clusters as a whole, +assuming that the clusters we have identified represent a single cell +state. If that is the case, then we should be able to combine the data +for all cells across each cluster, then apply our cell typing method to +this group of cells. This is similar to an approach we will return to +later in the context of differential expression.
+The first step here is to create a new matrix where we sum the counts
+across cells that are from the same type according to our clustering.
+Because SingleR
is a non-parametric approach, we can
+perform this step with the raw counts matrix. There are a few different
+ways to do this, but we will use the function
+DelayedArray::colsum()
, which can work directly on the
+sparse matrices that are often found in SCE objects. We will provide it
+with the matrix we need, and then a vector of the cluster assignments
+for each column of the matrix. The function will then sum expression
+values for each gene across all of the columns that have that value.
# sum count matrix by cluster
+cluster_mat <- DelayedArray::colsum(counts(sce), sce$nn_cluster)
+# print new dimensions
+dim(cluster_mat)
+
+
+[1] 36601 20
+
+
+
+You can see that the resulting matrix still has the same number of +rows we have seen before, but now only has as many columns as the number +of clusters that the cells were assigned to.
+Now we can apply the same SingleR
model to these
+results, using the new matrix as input along with the previously trained
+model. As there are only 20 clusters to classify, this will be very
+quick, and we don’t need to parallelize it!
# run SingleR classification with previously trained model
+singler_cluster <- SingleR::classifySingleR(
+ cluster_mat, # cluster expression matrix
+ singler_finemodel # pre-trained model
+)
+
+# view results
+head(singler_cluster)
+
+
+DataFrame with 6 rows and 4 columns
+ scores labels delta.next
+ <matrix> <character> <numeric>
+1 0.612108:0.431222:0.615571:... Th1/Th17 cells 0.01424091
+2 0.310607:0.468907:0.306530:... Naive B cells 0.55307985
+3 0.275063:0.805908:0.308321:... Classical monocytes 0.35835004
+4 0.276687:0.776140:0.317954:... Classical monocytes 0.08904561
+5 0.289710:0.708883:0.317147:... Myeloid dendritic ce.. 0.07394915
+6 0.462245:0.395730:0.451555:... Naive B cells 0.00785926
+ pruned.labels
+ <character>
+1 Th1/Th17 cells
+2 Naive B cells
+3 Classical monocytes
+4 Classical monocytes
+5 Myeloid dendritic ce..
+6 Naive B cells
+
+
+
+The result is a fairly small table of results, but we are most
+interested in the labels, which we would like to associate with each
+cell in our SCE object for visualization. Since the cluster labels are
+the row names of that table, we can perform a cute little trick to
+assign labels back to each cell based on the name of the cluster that it
+was assigned to. (In this case the cluster names are all numbers, but
+that might not always be the case.) We’ll select values repeatedly from
+the singler_cluster
table, using the cluster assignment to
+pick a row, and then always picking the pruned.labels
+column.
sce$celltype_cluster <- singler_cluster[sce$nn_cluster, "pruned.labels"]
+
+
+
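+If that indexing trick seems opaque, this tiny standalone example (with toy values) shows the idea:
+# indexing by a vector of row names repeats rows as needed
+toy <- data.frame(label = c("A", "B"), row.names = c("1", "2"))
+toy[c("1", "1", "2", "1"), "label"]
+# [1] "A" "A" "B" "A"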
+Now we can plot these cluster-based cell type assignments using the
+now familiar plotUMAP()
function.
scater::plotUMAP(sce, color_by = "celltype_cluster")
+
+
+
+
+
+
+This sure looks nice and clean, but what have we really done here? We +are assuming that each cluster has only a single cell type, +which is a pretty bold assumption, as we really aren’t sure that the +clusters we created were correct. You may recall that clustering +algorithms are quite sensitive to parameter choice, so a different +parameter choice could quite likely give a different result.
+As a middle ground between the potentially messy single-cell cell +type assignment and the almost-certainly overconfident cluster-based +assignment above, we can take an approach inspired by Baran et al. +(2019) using something they called metacells. The idea is +that we can perform fine-scaled clustering to identify groups of very +similar cells, then sum the counts within those clusters as “metacells” +to use for further analysis. The original paper includes a number of +optimizations to make sure that the metacell clusters have desirable +properties for downstream analysis. We won’t go into that depth here, +but we can apply similar ideas.
+To begin, we will perform some fine-scale clustering, using a simpler
+clustering algorithm: K-means clustering. We will use the same
+bluster
package, clustering based on the PCA results we
+have from earlier, but this algorithm allows us to specify the number of
+clusters we want to end up with. We have about 8000 cells, so let’s
+cluster those into groups of approximately 80 cells, which works out to
+100 clusters. While this is almost certainly more clusters than are
+“real” in this dataset, our goal here is not to find differences among
+clusters, just to get homogeneous groups of cells.
# perform k-means clustering
+kclusters <- bluster::clusterRows(
+ reducedDim(sce, "PCA"),
+ bluster::KmeansParam(
+ centers = 100, # the number of clusters
+ iter.max = 100 # more iterations to be sure of convergence
+ )
+)
+
+
+
+Now we can apply exactly the same approach we did when we had the 20 +clusters we had identified with the earlier graph-based clustering.
+ + + +# create a "metacell" matrix by summing fine-scale clusters
+metacell_mat <- DelayedArray::colsum(counts(sce), kclusters)
+
+# apply SingleR model to metacell matrix
+metacell_singler <- SingleR::classifySingleR(
+ metacell_mat,
+ singler_finemodel
+)
+
+# apply metacell cell type assignments to individual cells
+sce$celltype_metacell <- metacell_singler[kclusters, "pruned.labels"]
+
+
+
+Now we can plot the results as we have done before.
+ + + +scater::plotUMAP(sce, color_by = "celltype_metacell")
+
+
+Warning: Removed 208 rows containing missing values or values outside the scale range
+(`geom_point()`).
+
+
+
+
+
+
+What do you think of this plot? Is this more or less useful than the +original cell-based clustering?
+To save disk space (and time), we won’t write out the whole SCE
+object, as we haven’t changed any of the core data there. Instead we
+will just write out the cell information table (colData
) as
+a TSV file.
colData(sce) |>
+ as.data.frame() |>
+ readr::write_tsv(file = cellinfo_file)
+
+
+
+sessionInfo()
+
+
+R version 4.4.0 (2024-04-24)
+Platform: x86_64-pc-linux-gnu
+Running under: Ubuntu 22.04.4 LTS
+
+Matrix products: default
+BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
+LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
+
+locale:
+ [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+ [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+ [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+ [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+ [9] LC_ADDRESS=C LC_TELEPHONE=C
+[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+
+time zone: Etc/UTC
+tzcode source: system (glibc)
+
+attached base packages:
+[1] stats4 stats graphics grDevices utils datasets methods
+[8] base
+
+other attached packages:
+ [1] ensembldb_2.28.0 AnnotationFilter_1.28.0
+ [3] GenomicFeatures_1.56.0 AnnotationDbi_1.66.0
+ [5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
+ [7] Biobase_2.64.0 GenomicRanges_1.56.0
+ [9] GenomeInfoDb_1.40.0 IRanges_2.38.0
+[11] S4Vectors_0.42.0 BiocGenerics_0.50.0
+[13] MatrixGenerics_1.16.0 matrixStats_1.3.0
+[15] ggplot2_3.5.1 optparse_1.7.5
+
+loaded via a namespace (and not attached):
+ [1] RColorBrewer_1.1-3 jsonlite_1.8.8
+ [3] magrittr_2.0.3 ggbeeswarm_0.7.2
+ [5] gypsum_1.0.0 farver_2.1.1
+ [7] rmarkdown_2.26 BiocIO_1.14.0
+ [9] fs_1.6.4 zlibbioc_1.50.0
+ [11] vctrs_0.6.5 Rsamtools_2.20.0
+ [13] memoise_2.0.1 DelayedMatrixStats_1.26.0
+ [15] RCurl_1.98-1.14 forcats_1.0.0
+ [17] htmltools_0.5.8.1 S4Arrays_1.4.0
+ [19] AnnotationHub_3.12.0 curl_5.2.1
+ [21] BiocNeighbors_1.22.0 Rhdf5lib_1.26.0
+ [23] SparseArray_1.4.0 rhdf5_2.48.0
+ [25] sass_0.4.9 alabaster.base_1.4.0
+ [27] bslib_0.7.0 httr2_1.0.1
+ [29] cachem_1.0.8 GenomicAlignments_1.40.0
+ [31] igraph_2.0.3 mime_0.12
+ [33] lifecycle_1.0.4 pkgconfig_2.0.3
+ [35] rsvd_1.0.5 Matrix_1.7-0
+ [37] R6_2.5.1 fastmap_1.1.1
+ [39] GenomeInfoDbData_1.2.12 digest_0.6.35
+ [41] colorspace_2.1-0 paws.storage_0.5.0
+ [43] scater_1.32.0 irlba_2.3.5.1
+ [45] ExperimentHub_2.12.0 RSQLite_2.3.6
+ [47] beachmat_2.20.0 filelock_1.0.3
+ [49] labeling_0.4.3 fansi_1.0.6
+ [51] httr_1.4.7 abind_1.4-5
+ [53] compiler_4.4.0 bit64_4.0.5
+ [55] withr_3.0.0 BiocParallel_1.38.0
+ [57] viridis_0.6.5 DBI_1.2.2
+ [59] highr_0.10 alabaster.ranges_1.4.0
+ [61] HDF5Array_1.32.0 alabaster.schemas_1.4.0
+ [63] rappdirs_0.3.3 DelayedArray_0.30.0
+ [65] rjson_0.2.21 bluster_1.14.0
+ [67] tools_4.4.0 vipor_0.4.7
+ [69] beeswarm_0.4.0 glue_1.7.0
+ [71] restfulr_0.0.15 rhdf5filters_1.16.0
+ [73] grid_4.4.0 cluster_2.1.6
+ [75] generics_0.1.3 gtable_0.3.5
+ [77] tzdb_0.4.0 tidyr_1.3.1
+ [79] hms_1.1.3 xml2_1.3.6
+ [81] BiocSingular_1.20.0 ScaledMatrix_1.12.0
+ [83] utf8_1.2.4 XVector_0.44.0
+ [85] ggrepel_0.9.5 BiocVersion_3.19.1
+ [87] pillar_1.9.0 stringr_1.5.1
+ [89] vroom_1.6.5 dplyr_1.1.4
+ [91] getopt_1.20.4 BiocFileCache_2.12.0
+ [93] lattice_0.22-6 rtracklayer_1.64.0
+ [95] bit_4.0.5 tidyselect_1.2.1
+ [97] paws.common_0.7.2 Biostrings_2.72.0
+ [99] scuttle_1.14.0 knitr_1.46
+ [ reached getOption("max.print") -- omitted 35 entries ]
+
+
+
+
+
Comments
+Arguably the most important aspect of your coding is +comments: small pieces of explanatory text you leave in your code to +explain what the code is doing and/or leave notes to yourself or others. +Comments are invaluable for communicating your code to others, but they +are most important for Future You. Future You comes +into existence about one second after you write code, and has no idea +what on earth Past You was thinking.
+Comments in R code are indicated with pound signs (aka +hashtags, octothorps). R will ignore any text in a line after +the pound sign, so you can put whatever text you like there.
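+For instance (a made-up snippet):
+# This entire line is a comment, so R ignores it
+total <- 5 * 6  # a comment can also follow code on the same line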
+ + + + + + + + + + + + + + + + +Help out Future You by adding lots of comments! Future You next week +thinks Today You is an idiot, and the only way you can convince Future +You that Today You is reasonably competent is by adding comments in your +code explaining why Today You is actually not so bad.
+