-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathdecoupleR.Rmd
340 lines (279 loc) · 11.4 KB
/
decoupleR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
title: "Introduction"
author:
- name: Pau Badia-i-Mompel
affiliation:
- Heidelberg Universiy
- name: Jesús Vélez-Santiago
affiliation:
- National Autonomous University of Mexico
output:
BiocStyle::html_document:
self_contained: true
toc: true
toc_float: true
toc_depth: 3
code_folding: show
package: "`r pkg_ver('decoupleR')`"
vignette: >
%\VignetteIndexEntry{Introduction}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r chunk_setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r vignette_setup, echo=FALSE, message=FALSE, warning = FALSE}
# Track time spent on making the vignette.
start_time <- Sys.time()
# Bib setup.
library(RefManageR)
# Write bibliography information
bib <- c(
decoupleR = citation("decoupleR")[1],
AUCell = citation("AUCell")[1],
fgsea = citation("fgsea")[1],
GSVA = citation("GSVA")[1],
viper = citation("viper")[1]
)
```
# Installation
`r Biocpkg("decoupleR")` is an R package distributed as part of the Bioconductor
project. To install the package, start R and enter:
```{r bioconductor_install, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("decoupleR")
```
Alternatively, you can instead install the latest development version from [GitHub](https://github.com/) with:
```{r github_install, eval=FALSE}
BiocManager::install("saezlab/decoupleR")
```
# Usage
`r Biocpkg("decoupleR")` `r Citep(bib[["decoupleR"]])` contains different
statistical methods to extract biological activities from omics data using prior
knowledge. Some of them are:
* AUCell: `r Citep(bib[["AUCell"]])`
* Fast GSEA: `r Citep(bib[["fgsea"]])`
* GSVA: `r Citep(bib[["GSVA"]])`
* viper: `r Citep(bib[["viper"]])`
In this vignette we showcase how to use it with some toy data.
## Libraries
`r Biocpkg("decoupleR")` can be imported as:
```{r load_library, message=FALSE}
library(decoupleR)
# Extra libraries
library(dplyr)
library(pheatmap)
```
## Input data
`r Biocpkg("decoupleR")` needs a matrix (`mat`) of any molecular readouts (gene
expression, logFC, p-values, etc.) and a `network` that relates target
features (genes, proteins, etc.) to "source" biological entities (pathways,
transcription factors, molecular processes, etc.). Some methods also require
the mode of regulation (MoR) for each interaction, defined as negative or
positive weights.
To get an example data-set, run:
```{r read_example_data}
data <- decoupleR::get_toy_data()
mat <- data$mat
head(mat,5)[,1:5]
network <- data$network
network
```
This example consists of two small populations of samples (S, cols) with
different gene expression patterns (G, rows):
```{r show_matrix, message=TRUE}
# Color scale
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu"))
colors.use <- grDevices::colorRampPalette(colors = colors)(100)
# Heatmap
pheatmap::pheatmap(mat = mat,
color = colors.use,
border_color = "white",
cluster_rows = FALSE,
cluster_cols = FALSE,
cellwidth = 15,
cellheight = 15,
treeheight_row = 0,
treeheight_col = 0)
```
Here we can see that some genes seem to be more expressed in one group of
samples than in the other and vice-versa. Ideally, we would like to capture
these differences in gene programs into interpretable biological entities.
In this example we will do it by summarizing gene expression into transcription
factor activities.
The toy data also contains a simple net consisting of 3 transcription factors
(Ts) with specific regulation to target genes (either positive or negative).
This network can be visualized like a graph. Green edges are positive regulation
(activation), red edges are negative regulation (inactivation):
<img src="https://github.com/saezlab/decoupleR/blob/master/inst/figures/net_plot.png?raw=1" align="center" width="600">
According to this network, the first population of samples should show high
activity for T1 and T3, while the second one only for T2.
## Methods
`r Biocpkg("decoupleR")` contains several methods. To check how many are
available, run:
```{r usage-show_methods, message=TRUE}
decoupleR::show_methods()
```
Each method models biological activities in a different manner, sometimes
returning more than one estimate or providing significance of the estimation.
To know what each method returns, please check their documentation like this
`?run_mlm`.
To have a unified framework, methods have these shared arguments:
* `mat` : input matrix of molecular readouts.
* `network` : input prior knowledge information relating molecular features to
biological entities.
* `.source`,`.target` and `.mor` : column names where to extract the information
from `network`.
* `.source` refers to the biological entities.
* `.target` refers to the molecular features.
* `.mor` refers to the "strength" of the interaction (if available, else 1s
will be used). Only available for methods that can model interaction weights.
* `minsize` : Minimum of target features per biological entity (5 by default).
If less, sources are removed. This filtering prevents obtaining noisy activities
from biological entities with very few matching target features in `matrix`. For
this example data-set we will have to keep it to 0 though.
## Running methods
### Individual methods
As an example, let’s first run the Gene Set Enrichment Analysis method (`gsea`),
one of the most well-known statistics:
```{r usage-fgsea, message=TRUE}
res_gsea <- decoupleR::run_fgsea(mat = mat,
network = network,
.source = 'source',
.target = 'target',
nproc = 1,
minsize = 0)
res_gsea
```
Methods return a result data-frame containing:
* `statistic`: name of the statistic. Depending on the method, there can be more than one per method.
* `source`: name of the biological entity.
* `condition`: sample name.
* `score`: inferred biological activity.
* `p_value`: if available, significance of the inferred activity.
In the case of `gsea`, it returns a simple estimate of activities (`fgsea`),
a normalized estimate (`norm_fgsea`) and p-values after doing permutations.
Other methods can return different things, for example Univariate Linear Model
(`ulm`):
```{r usage-ulm, message=TRUE}
res_ulm <- decoupleR::run_ulm(mat = mat,
network = network,
.source = 'source',
.target = 'target',
.mor = 'mor',
minsize = 0)
res_ulm
```
In this case, `ulm` returns just an estimate (`ulm`) and its associated p-values.
Each method can return different statistics, we recommend to check their
documentation to know more about them.
Let us plot the obtained results, first for `gsea`:
```{r res_gsea, message=TRUE}
# Transform to matrix
mat_gsea <- res_gsea %>%
dplyr::filter(statistic == 'fgsea') %>%
decoupleR::pivot_wider_profile(id_cols = source,
names_from = condition,
values_from = score) %>%
as.matrix()
# Color scale
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu"))
colors.use <- grDevices::colorRampPalette(colors = colors)(100)
# Heatmap
pheatmap::pheatmap(mat = mat_gsea,
color = colors.use,
border_color = "white",
cluster_rows = FALSE,
cluster_cols = FALSE,
cellwidth = 20,
cellheight = 20,
treeheight_row = 0,
treeheight_col = 0)
```
We can observe that for transcription factors T1 and T2, the obtained activities
correctly distinguish the two sample populations. T3, on the other hand, should
be down for the second population of samples since it is a repressor.
This mislabeling of activities happens because `gsea` cannot model weights when
inferring biological activities.
When weights are available in the prior knowledge, we definitely recommend using
any of the methods that take them into account to get better estimates,
one example is `ulm`:
```{r res_ulm, message=TRUE}
# Transform to matrix
mat_ulm <- res_ulm %>%
dplyr::filter(statistic=='ulm') %>%
decoupleR::pivot_wider_profile(id_cols = source,
names_from = condition,
values_from = score) %>%
as.matrix()
# Color scale
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu"))
colors.use <- grDevices::colorRampPalette(colors = colors)(100)
# Heatmap
pheatmap::pheatmap(mat = mat_ulm,
color = colors.use,
border_color = "white",
cluster_rows = FALSE,
cluster_cols = FALSE,
cellwidth = 20,
cellheight = 20,
treeheight_row = 0,
treeheight_col = 0)
```
Since `ulm` models weights when estimating biological activities, it correctly
assigns T3 as inactive in the second population of samples.
### Multiple methods
`r Biocpkg("decoupleR")` also allows to run multiple methods at the same time.
Moreover, it computes a consensus score based on the obtained activities across
methods, called `consensus`.
By default, `deocuple` runs only the top performer methods in our benchmark (`mlm`,
`ulm` and `wsum`), and estimates a consensus score across them. Specific
arguments to specific methods can be passed using the variable `args`. For more
information check `?decouple`.
```{r usage-decouple, message=TRUE}
res_decouple <- decoupleR::decouple(mat,
network,
.source ='source',
.target ='target',
minsize = 0)
res_decouple
```
Let us see the result for the consensus score in the previous `decouple` run:
```{r res_decouple, message=TRUE}
# Transform to matrix
mat_consensus <- res_decouple %>%
dplyr::filter(statistic == 'consensus') %>%
decoupleR::pivot_wider_profile(id_cols = source,
names_from = condition,
values_from = score) %>%
as.matrix()
# Color scale
colors <- rev(RColorBrewer::brewer.pal(n = 11, name = "RdBu"))
colors.use <- grDevices::colorRampPalette(colors = colors)(100)
# Heatmap
pheatmap::pheatmap(mat = mat_consensus,
color = colors.use,
border_color = "white",
cluster_rows = FALSE,
cluster_cols = FALSE,
cellwidth = 20,
cellheight = 20,
treeheight_row = 0,
treeheight_col = 0)
```
We can observe that the consensus score correctly predicts that T1 and T3 should
be active in the first population of samples while T2 in the second one.
# Session information
```{r session_info, echo=FALSE}
options(width = 120)
sessioninfo::session_info()
```
# Bibliography
```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```