This vignette demonstrates how you can plug custom functions into the MS-DAP pipeline.
note: in the code blocks shown below (grey areas), the lines that start
with #>
are the respective output that would be printed to the console
if you run these code snippets on your computer
-
load the Skyline output of the LFQbench study (PMID:27701404 ~ this file is bundled with the MS-DAP package, you don’t have to download anything).
-
extract the respective group/condition of each sample by matching a regular expression against the filenames. This is just to demonstrate a more advanced utility function that may come in handy for bioinformatic analyses. In typical MS-DAP workflows, the sample metadata would be loaded from file (a csv or Excel table), as demonstrated in the user guide.
-
finally, we define the contrast: group A versus group B.
note; the sample-to-condition assignments were taken from
process_hye_samples.R
@
https://www.ebi.ac.uk/pride/archive/projects/PXD002952/files
library(msdap)
f <- system.file("extdata", "Skyline_HYE124_TTOF5600_64var_it2.tsv.gz", package = "msdap")
dataset = import_dataset_skyline(f, confidence_threshold = 0.01, return_decoys = F, acquisition_mode = "dia")
#> info: reading Skyline report...
#> info: input file: C:/VU/code/R/msdap/inst/extdata/Skyline_HYE124_TTOF5600_64var_it2.tsv.gz
#> info: 4 unique target (plain)sequences ambiguously mapped to multiple proteins and thus removed. Examples; TTDVTGTIELPEGVEMVMPGDNIK, LNIISNLDCVNEVIGIR, LMDLSINK, EVDEQMLNVQNK
#> info: 34263/35943 precursors remain after selecting the 'best' precursor for each modified sequence
dataset = sample_metadata_custom(dataset, group_regex_array = c(A = "007|009|011", B = "008|010|012") )
dataset = setup_contrasts(dataset, contrast_list = list(c("A", "B")))
#> info: numeric variable: condition
#> info: contrast: A vs B # condition_variable: group
print(dataset$samples %>% select(sample_id, group))
#> # A tibble: 6 × 2
#> sample_id group
#> <chr> <chr>
#> 1 lgillet_L150206_007 A
#> 2 lgillet_L150206_009 A
#> 3 lgillet_L150206_011 A
#> 4 lgillet_L150206_008 B
#> 5 lgillet_L150206_010 B
#> 6 lgillet_L150206_012 B
We here implement a simple normalization approach to illustrate the technique for providing MS-DAP with a custom normalization function. The code below implements quantile normalization.
Note that in MS-DAP, normalization is always applied to peptide-level. If you want to normalize at protein-level instead, we suggest to perform peptide-to-protein rollup within the normalization function, apply normalization and then backport those scaling factors to the peptide-level matrix.
The requirements to make your custom normalization function work are:
-
The function must have a unique name that does not overlap with common R functions (bad name: ‘mean’). So pick unique names like ‘mynorm_quantile’ or ‘my_norm_v1’
-
It should at least have a parameter named
x_as_log2
. Furthermore, make sure to add a...
parameter as a ‘catch all’ for named parameters passed by the MS-DAP pipeline that you won’t make use of, and to be robust against future parameters that might be added.
List of parameters that are currently passed to custom normalization functions, add these to your function to work with the respective data:
x_as_log2
= log2 transformed peptide intensity data matrix, where each row is a peptide and each column a samplegroup_by_cols
= array of characters that describe the sample grouping (which columns belong to the same sample group)group_by_rows
= array of characters that describe the peptide grouping (which rows belong to the same protein)rollup_algorithm
= user’s preferred peptide-to-protein rollup algorithm
- return data has to be of the exact same format as x_as_log2 (but with normalized abundance values)
my_norm = function(x_as_log2, ...) {
cat("Example custom normalization implementation, my_norm(), scaling all samples by some quantile\n")
quantile_for_normalization = 0.95
# value at quantile x for each column/sample
scale_per_sample = matrixStats::colQuantiles(x_as_log2, probs = quantile_for_normalization, na.rm=T)
# instead of scaling samples such that quantile x is zero, adjust by mean shift so output values are of the same order as input
scale_per_sample = scale_per_sample - mean(scale_per_sample)
# apply scaling to each column (transpose, scale 'rows', then transpose again)
return(t(t(x_as_log2) - scale_per_sample))
}
now that we have implemented a normalization function, let’s run a very basic data matrix to test if it works;
mock_data = cbind(1:6, 2:7, 0:5)
print(mock_data)
#> [,1] [,2] [,3]
#> [1,] 1 2 0
#> [2,] 2 3 1
#> [3,] 3 4 2
#> [4,] 4 5 3
#> [5,] 5 6 4
#> [6,] 6 7 5
print(my_norm(mock_data))
#> Example custom normalization implementation, my_norm(), scaling all samples by some quantile
#> [,1] [,2] [,3]
#> [1,] 1 1 1
#> [2,] 2 2 2
#> [3,] 3 3 3
#> [4,] 4 4 4
#> [5,] 5 5 5
#> [6,] 6 6 6
In the last section of this document, we’ll use this new normalization function in MS-DAP
Below we have implemented a simplified DEA function that applies a
t.test()
on protein abundances.
-
The function must have a unique name that does not overlap with common R functions (bad name: ‘mean’). So pick unique names like ‘mydea_ttest’ or ‘my_ttest_v1’
-
Analogous to the custom normalization function presented above, multiple named parameters will be passed to your custom dea function. You can choose which to work with, but make sure to add a
...
parameter as a ‘catch all’ for named parameters passed by the MS-DAP pipeline that you won’t make use of, and to be robust against future parameters that might be added.
List of parameters that are currently passed to custom dea functions, add these to your function to work with the respective data:
peptides
= long-format tibble of peptide data. For simple use-cases, you might first consider the peptide- or protein-level ExpressionSet (these hold simple data matrices to operate on, see example code block below)samples
= wide-format tibble with sample metadata, take note of the columns ‘sample_id’ and ‘condition’ so you can classify samples into conditions while A/B testingeset_peptides
= ExpressionSet for peptides. Holds a peptide*sample log2 intensity matrix as well as row- and column-level sample dataeset_proteins
= ExpressionSet for proteins, analogous toeset_peptides
but peptide-to-protein rollup was already performed.input_intensities_are_log2
= boolean that tells you whether the intensities are log2 transformed (they are by default, but could change in future, so implement a check like below)random_variables
= array of column names in the samples table
The return data must be a tibble with these columns;
protein_id
= character column with the protein identifiers provided in the input data (peptides tibble or metadata from peptide/protein ExpressionSet). These must be unique valuespvalue
= numeric column with p-valuesqvalue
= numeric column with p-values adjusted for multiple testing correction (e.g. after applying FDR to your p-values)foldchange.log2
= numeric column with log2 foldchangesdea_algorithm
= name of your algorithm. must be a non-empty character string uniquely indicating the name/label of your method. DO NOT use names already used by other methods in this pipeline, like ‘ebayes’
The example code, together with code comments, should provide a code skeleton for the implementation of your custom DEA algorithm.
# note that we only add named parameters for data we're using in this example implementation (i.e. additional parameters like `random_variables` and `peptides` are sent to `...`)
my_dea_stats = function(eset_proteins, input_intensities_are_log2, ...) {
### 1) from provided protein-level data; extract the protein intensity matrix, to which condition each sample belongs and find the columns matching groups 1 and 2
x = Biobase::exprs(eset_proteins)
# transform to log2 if input data is non-log
if (!input_intensities_are_log2) {
x = log2(x)
}
# set non-finite values to NA
x[!is.finite(x)] = NA
# extract groups
x_groups = pData(eset_proteins)$condition
x_groups_unique = unique(x_groups)
# assume there are 2 groups
stopifnot(length(x_groups_unique) == 2)
cols_grp1 = x_groups == x_groups_unique[1]
cols_grp2 = x_groups == x_groups_unique[2]
### 2) calculate p-values for each protein
pval = rep(NA, nrow(x))
for(i in 1:nrow(x)) {
i_tt = t.test(x[i,cols_grp1], x[i,cols_grp2], alternative = "two.sided", paired = FALSE, var.equal = FALSE)
pval[i] = i_tt$p.value
}
# log2 fold-change
# in MS-DAP, we compute B/A foldchanges (as noted in msdap::setup_contrasts() )
log2FC = rowMeans(x[,cols_grp2,drop=F], na.rm=T) - rowMeans(x[,cols_grp1,drop=F], na.rm=T)
### 3) create a result tibble that contains all columns required for downstream compatability with this pipeline; protein_id, pvalue, qvalue, foldchange.log2, dea_algorithm
result = tibble(protein_id = rownames(x),
pvalue = pval,
qvalue = p.adjust(pval, method = "fdr"),
foldchange.log2 = log2FC,
# the name of your algorithm. must be a non-empty character string uniquely indicating the name/label of your method
# do NOT use names already used by other methods in this pipeline, like 'ebayes' !
dea_algorithm = "my_dea_stats")
cat("Example custom DEA implementation, my_dea_stats(), yielded", sum(is.finite(result$qvalue) & result$qvalue<=0.01), "hits at qvalue<=0.01\n")
return(result)
}
To use a custom function, we just write the name of the R function as a parameter. Do not confuse this with the ‘dea_algorithm’ name given in the result tibble of your dea function, that is merely the label used in output plots and tables. We remember the LFQbench dataset is a comparison of 2 conditions with 3 replicated measurements each, and apply feature selection rules accordingly.
dataset = analysis_quickstart(
dataset,
filter_min_detect = 2, # only peptides identified in at least 2 samples per sample group
filter_min_quant = 3, # only peptides quantified in at least 3 samples per sample group
norm_algorithm = "my_norm", # custom algorithm !
dea_algorithm = c("ebayes", "my_dea_stats"), # we use eBayes for reference, and also apply our custom dea algorithm independently !
dea_qvalue_threshold = 0.01, # adjusted p-value to cutoff 'significant' hits @ dataset$de_proteins column 'signif' (but note we don't use this output column in code below)
dea_log2foldchange_threshold = 0, # foldchange threshold for proteins to be 'significant' @ dataset$de_proteins column 'signif' (but note we don't use this output column in code below)
diffdetect_min_samples_observed = NA, # in this example, we disable the `differential detection` qualitative analysis
output_dir = NA # in this example, we disable the generation of all output files
)
#> info: no output files will be generated, output_dir was set to NA
#> progress: caching filter data took 2 seconds
#> Example custom normalization implementation, my_norm(), scaling all samples by some quantile
#> Example custom normalization implementation, my_norm(), scaling all samples by some quantile
#> info: filter dataset with settings: min_detect = 2; min_quant = 3; norm_algorithm = 'my_norm'; rollup_algorithm = 'maxlfq'
#> 12756/34263 peptides were retained after filtering over all groups
#> 21591/34263 peptides were retained after filtering within each group independently ("by group")
#> progress: peptide filtering and normalization took 3 seconds
#> info: differential expression analysis for contrast: A vs B # condition_variable: group
#> info: using data from peptide filter: global data filter
#> progress: eBayes took 1 seconds
#> warning: an error occurred in "contrast: A vs B # condition_variable: group" during the execution of DEA function "my_dea_stats"
#> warning: argument "input_intensities_are_log2" is missing, with no default
#> language alg_fun(peptides = peptides_for_contrast, samples = samples_for_contrast, eset_peptides = eset_peptides, eset_proteins = eset_proteins, model_matrix = contr$model_matrix, ...
#> warning: differential detection parameters set to NA, this analysis is cancelled
Print the number of significant proteins
print(dataset$de_proteins %>%
group_by(dea_algorithm) %>%
summarise(`1% FDR` = sum(qvalue <= 0.01),
`1% FDR AND foldchange threshold` = sum(qvalue <= 0.01 & signif)))
#> # A tibble: 1 × 3
#> dea_algorithm `1% FDR` `1% FDR AND foldchange threshold`
#> <chr> <int> <int>
#> 1 ebayes 1045 1045
A simple Venn diagram of proteins at 1% FDR for each method shows our custom function identifies a subset of the results from limma’s ebayes function, so we verified that the example implementation doesn’t yield random stuff.
gplots::venn(list(ebayes=dataset$de_proteins %>% filter(dea_algorithm=="ebayes" & qvalue <= 0.01) %>% pull(protein_id),
my_dea_stats=dataset$de_proteins %>% filter(dea_algorithm=="my_dea_stats" & qvalue <= 0.01) %>% pull(protein_id)))