Run shapr on HPC with a large size for x_explain #370
Hi! Hmm, how exactly did you set up the parallelization with future? What backend are you using, and what OS is this on? It seems some data gets lost on the way to some of the computers... Before doing anything else, however, I would recommend trying out the latest commit in branch LHBO:Lars/Improve_Gaussian in PR #366. This should speed up the copula method by orders of magnitude. We'll probably merge the PR into master tomorrow.
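(As a point of reference, a minimal sketch of a future-based setup on a Linux node follows; the multisession backend, the worker count, and the install call for the branch are assumptions, not confirmed details from this thread.)

```r
# Minimal sketch of a future-based parallel setup for shapr (assumptions:
# multisession backend, 4 workers; adapt to the cluster's scheduler).
library(future)
plan(multisession, workers = 4)

# Hypothetical install of the branch mentioned above, via the 'remotes'
# package and the LHBO fork of shapr:
remotes::install_github("LHBO/shapr", ref = "Lars/Improve_Gaussian")
```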
Dear Martin, many thanks for this quick feedback. I am new to R and shapr. The HPC runs in a Linux environment. Regarding PR #366, I do not understand the suggestions; would you please help me? Please find my code below:
I can also send you the dataset. Thank you in advance for your valuable time. Best,
Hi again. The mentioned PR is now merged, so simply reinstalling shapr from GitHub should give you the speedup.
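(The install step presumably looks something like the following; the exact command is an assumption, using the remotes package and the package's main GitHub repository.)

```r
# Hypothetical reinstall of shapr from the GitHub master branch now that
# PR #366 is merged (assumes the 'remotes' package is installed):
remotes::install_github("NorskRegnesentral/shapr")
```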
Hello Martin, I really appreciate your quick feedback. Best,
Yes, indeed. This is essentially what is done if you set n_batches = 2. Depending on how much preprocessing there is, calling explain() twice may take much longer or almost the same time.
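(A sketch of the equivalence being described; the model, approach, and prediction_zero arguments are placeholders.)

```r
# One explain() call with two batches ...
ex_all <- explain(model, x_explain = x_explain, x_train = x_train,
                  approach = "copula", prediction_zero = p0, n_batches = 2)

# ... does essentially the same computation as two calls on the two halves,
# except that the preprocessing is then done twice:
idx  <- seq_len(nrow(x_explain)) <= nrow(x_explain) / 2
ex_1 <- explain(model, x_explain = x_explain[idx, ], x_train = x_train,
                approach = "copula", prediction_zero = p0, n_batches = 1)
ex_2 <- explain(model, x_explain = x_explain[!idx, ], x_train = x_train,
                approach = "copula", prediction_zero = p0, n_batches = 1)
```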
Many thanks, Martin.
According to the vignette (https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html), it should be possible to split the computation into batches using the n_batches argument.
I will be honored if I can benefit from your guidance. Would you please have a glance at my code and let me know your suggestions? (My HPC has enough computational capacity.) Kind regards,
If you set n_batches = 15, that essentially corresponds to splitting your x_explain into 15 parts and calling explain() 15 times (with n_batches = 1). Since you have very large data which may take hours or days to explain properly, I would still split your x_explain into 15 parts (and use n_batches = 10 or so). The reason is that if something crashes, you lose everything, as there is no temporary saving to disk implemented as of now. If you do it one part at a time, you can save the intermediate results yourself. I would also recommend using the progress bar option to follow the progress. See the vignette for how to set that up. Hope this helps.
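(A sketch of that workflow follows; the chunk count, file names, and approach are illustrative, and it assumes shapr's progress reporting goes through the progressr package as the vignette describes.)

```r
# Explain x_explain in 15 chunks, saving each result to disk so that a
# crash only costs the current chunk (names and approach are placeholders).
library(progressr)
handlers(global = TRUE)  # enable progress updates

n      <- nrow(x_explain)
chunks <- split(seq_len(n), cut(seq_len(n), breaks = 15, labels = FALSE))

for (i in seq_along(chunks)) {
  ex_i <- explain(model,
                  x_explain = x_explain[chunks[[i]], ],
                  x_train = x_train,
                  approach = "copula",
                  prediction_zero = p0,
                  n_batches = 10)
  saveRDS(ex_i, sprintf("shapr_chunk_%02d.rds", i))
}
```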
Dear Martin, I would like to extend my sincere gratitude to you for this fruitful discussion. I wish you outstanding achievements in the development of the shapr package. Best regards,
Hi,
I am trying to use shapr in my project, but I have a problem. I set these parameters in my code and ran it on an HPC with 100 GB of memory and 4 workers:
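(Roughly along these lines, as a hypothetical sketch; the model object, approach, and prediction_zero are placeholders.)

```r
# Hypothetical sketch of the described setup: 4 future workers on one node.
library(shapr)
library(future)
plan(multisession, workers = 4)

explanation <- explain(model,
                       x_explain = x_explain,  # 300 observations, 16 features
                       x_train   = x_train,    # 13920 observations, 16 features
                       approach  = "copula",
                       prediction_zero = p0)
```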
In my main dataset, I have x_train = 16 x 13920 and x_test = 16 x 3480. I decided to only consider a subset of 300 observations for my explanation, so I constructed x_explain with a sample size of 300 and 16 features, but I see this error:
Surprisingly, when I used an x_explain with a sample size of 200 and 16 features, it completed successfully. Could you explain why this problem happens? In fact, my main question is how I can explain all 3480 observations. Is that possible? Please note that I have an HPC, so the computational cost is not really a concern.
Thanks in advance for your help and for the outstanding shapr package!