The Generalised [1] and Projected [2] Covariance Measure tests (GCM, PCM) can be
used to test conditional independence between a real-valued response Y and
features X given additional features Z. The comets R package implements these
covariance measure tests (COMETs) with a user-friendly interface that lets the
user choose any sufficiently predictive supervised learning algorithm. By
default, all regressions use random forests as implemented in ranger. A Python
version of this package is also available.
Here, we showcase how to use comets with a simple example in which Y is not
conditionally independent of X given Z.
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
colnames(X) <- c("X1", "X2")
Z <- matrix(rnorm(2 * n), ncol = 2)
colnames(Z) <- c("Z1", "Z2")
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)
GCM <- gcm(Y, X, Z) # plot(GCM)
The output for the GCM test, which fails to reject the null hypothesis of
conditional independence in this example, is shown below. The residuals for the
regressions of Y on Z and X on Z can be inspected by calling plot(GCM) (not
shown here).
##
## Generalized covariance measure test
##
## data: gcm(Y = Y, X = X, Z = Z)
## X-squared = 3.6131, df = 2, p-value = 0.1642
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
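To make the idea behind the GCM concrete, the following sketch computes a residual-product statistic by hand on the same kind of simulated data. It is an illustration only: linear regressions via lm() stand in for the package's default random forests, and the names rY, rX, R, stat and pval are chosen here, so the numbers are not expected to match the gcm() output above.

```r
# Sketch of a GCM-style statistic (assumption: lm() replaces random forests)
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
Z <- matrix(rnorm(2 * n), ncol = 2)
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)

# Residuals from regressing Y on Z and each column of X on Z
rY <- residuals(lm(Y ~ Z))
rX <- residuals(lm(X ~ Z)) # n x 2 matrix, one column per column of X

# Elementwise products of residuals (rY is recycled down each column of rX)
R <- rX * rY

# Quadratic-form statistic, compared against a chi-squared distribution
# with as many degrees of freedom as X has columns
Rbar <- colMeans(R)
stat <- n * drop(t(Rbar) %*% solve(cov(R), Rbar))
pval <- pchisq(stat, df = ncol(X), lower.tail = FALSE)
```

In this example Y depends on X only through X1^2, which is uncorrelated with both X1 and X2 under the simulated Gaussian design. A statistic built from residual covariances therefore has little power here, which is in line with the GCM test not rejecting.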
The PCM test can be run likewise.
PCM <- pcm(Y, X, Z) # plot(PCM)
The output is shown below. The PCM test correctly rejects the null hypothesis of conditional independence in this example.
##
## Projected covariance measure test
##
## data: pcm(Y = Y, X = X, Z = Z)
## Z = 4.8589, p-value = 5.901e-07
## alternative hypothesis: true E[Y | X, Z] is not equal to E[Y | Z]
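As a quick sanity check on how the printed statistics relate to the printed p-values, the snippet below recomputes both with base R. It assumes, based only on the output shown above, that the GCM statistic is referred to a chi-squared distribution with 2 degrees of freedom and that the PCM statistic is referred to the upper tail of a standard normal distribution.

```r
# Upper-tail chi-squared probability for the GCM statistic (df = 2);
# compare with the printed p-value 0.1642
p_gcm <- pchisq(3.6131, df = 2, lower.tail = FALSE)

# One-sided upper-tail normal probability for the PCM statistic;
# compare with the printed p-value 5.901e-07
p_pcm <- pnorm(4.8589, lower.tail = FALSE)

print(c(gcm = p_gcm, pcm = p_pcm))
```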
The comets package contains an alternative formula-based interface, in which
the null hypothesis that Y is conditionally independent of X given Z is
specified as Y ~ X | Z with a corresponding data argument. This interface is
implemented in comet() and shown below.
dat <- data.frame(Y = Y, X, Z)
comet(Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
##
## Generalized covariance measure test
##
## data: comet(formula = Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
## X-squared = 3.5, df = 2, p-value = 0.1738
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
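The sketch below illustrates how a formula of the form Y ~ X1 + X2 | Z1 + Z2 can be split into a response, an X matrix and a Z matrix using base R only. The helper split_ci_formula() is hypothetical and written for this illustration; it is not part of comets and is not a description of how comet() parses formulas internally.

```r
# Hypothetical helper: split `Y ~ X... | Z...` into its three parts
split_ci_formula <- function(formula, data) {
  rhs <- formula[[3]] # the call `|`(X1 + X2, Z1 + Z2)
  stopifnot(identical(rhs[[1]], as.name("|")))
  list(
    Y = data[[all.vars(formula[[2]])]],
    X = as.matrix(data[all.vars(rhs[[2]])]),
    Z = as.matrix(data[all.vars(rhs[[3]])])
  )
}

# Small simulated data set mirroring the example above
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2, dimnames = list(NULL, c("X1", "X2")))
Z <- matrix(rnorm(2 * n), ncol = 2, dimnames = list(NULL, c("Z1", "Z2")))
dat <- data.frame(Y = X[, 1]^2 + Z[, 2] + rnorm(n), X, Z)

parts <- split_ci_formula(Y ~ X1 + X2 | Z1 + Z2, dat)
str(parts)
```

This works because `|` binds more tightly than `~` in R, so the right-hand side of the formula is a single call to `|` whose two arguments are the X and Z terms.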
Different regression methods can be supplied for both GCM and PCM tests using
the reg_* arguments (for instance, reg_YonZ in gcm() for the regression of Y on
Z). Pre-implemented regressions include "rf" for random forests and "lasso" for
cross-validated L1-penalised regression. Custom regression functions require a
residuals() (GCM and PCM) or predict() (PCM only) method and the following
structure:
my_regression <- function(y, x, ...) {
  ret <- <run the regression>
  class(ret) <- "my_regression"
  ret
}
predict.my_regression <- function(object, data, ...) {
  <run the prediction routine>
}
residuals.my_regression <- function(object, response, data, ...) {
  <run the routine for computing residuals>
}
The input y is a vector, while x and data are matrices. The output of
predict.my_regression() should be a vector of length NROW(data).
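As a concrete instance of the structure above, here is a minimal sketch of a custom regression based on ordinary least squares. The name lm_regression and the choice of lm.fit() are illustrative assumptions, not something shipped with comets; the last commented line shows how such a function would presumably be passed to gcm(), which is also an assumption and is not run here.

```r
# Hypothetical custom regression: ordinary least squares with an intercept
lm_regression <- function(y, x, ...) {
  fit <- lm.fit(x = cbind(1, as.matrix(x)), y = y)
  ret <- list(coefficients = fit$coefficients)
  class(ret) <- "lm_regression"
  ret
}

# Returns a vector of fitted values of length NROW(data)
predict.lm_regression <- function(object, data, ...) {
  drop(cbind(1, as.matrix(data)) %*% object$coefficients)
}

# Residuals are the response minus the predictions on `data`
residuals.lm_regression <- function(object, response, data, ...) {
  response - predict(object, data = data)
}

# Standalone check on simulated data
set.seed(1)
z <- matrix(rnorm(200), ncol = 2)
y <- 1 + 2 * z[, 1] + rnorm(100, sd = 0.1)
fit <- lm_regression(y, z)
res <- residuals(fit, response = y, data = z)

# Assumed usage with comets (not run):
# gcm(Y, X, Z, reg_YonZ = lm_regression, reg_XonZ = lm_regression)
```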
The development version of comets can be installed using:
# install.packages("remotes")
remotes::install_github("LucasKook/comets")
A stable version of comets can be installed from CRAN via:
install.packages("comets")
All results in [3] can be reproduced by running make all in ./inst after
downloading all required data from the Zenodo repository. The scripts for
reproducing the results manually can be found in ./inst/code/ for the CCLE data
(ccle.R), TCGA data (multiomics.R) and MIMIC data (mimic.R).
[1] Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538. doi:10.1214/19-aos1857
[2] Lundborg, A. R., Kim, I., Shah, R. D., & Samworth, R. J. (2024). The Projected Covariance Measure for assumption-lean variable significance testing. The Annals of Statistics, 52(6), 2851-2878. doi:10.1214/24-AOS2447
[3] Kook, L., & Lundborg, A. R. (2024). Algorithm-agnostic significance testing in supervised learning with multimodal data. Briefings in Bioinformatics, 25(6). doi:10.1093/bib/bbae475