
Harmonize batch distribution ++ #359

Merged (65 commits) on Nov 20, 2023

Conversation

LHBO
Collaborator

@LHBO LHBO commented Aug 24, 2023

Done in this PR:

  • Breaking change: the input argument `approach` must now be of length one less than the number of features (if multiple approaches are combined). The last element was previously ignored in any case. Docs and vignette updated accordingly.

  • Introduce the internal helper arguments `n_approaches` and `n_unique_approaches`.

  • Fix a bug in how the default number of batches is set.

  • Robustify the procedure for distributing batches among multiple approaches.

  • Update all snapshot tests due to the changed output format of testthat 3.2.0.

  • Temporarily disable the oldrel-1 and oldrel-2 tests due to changes in how R outputs error messages from oldrel-1 (R 4.2) to release (R 4.3), which makes the output look different (with the new testthat package).


OLD text:

…er of batches (when not provided in the explain function call) to a number which then throws an error in the check_n_batches function. Logical error as get_default_n_batches could previously set n_batches to a larger value than n_combinations - 1. Subtract one as the check_n_batches function specifies that n_batches must be strictly less than n_combinations.

The bug occurred for example for:

# The other variables here come from the first code block in the vignette.
> explanation <- explain(
+   model = model,
+   x_explain = x_explain,
+   x_train = x_train,
+   approach = "gaussian",
+   prediction_zero = p0,
+   n_combinations = 8
+ )
Note: Feature classes extracted from the model contains NA.
Assuming feature classes from the data are correct.

Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.

Error in check_n_batches(internal) : 
  `n_batches` (10) must be smaller than the number of feature combinations/`n_combinations` (8)
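The fix described above can be sketched as follows. This is a minimal illustration, not the actual shapr source; `suggestion` stands in for whatever default the heuristic proposes:

```r
# Sketch of the bugfix: cap the default number of batches at
# n_combinations - 1, since check_n_batches() requires that
# n_batches is strictly less than n_combinations.
get_default_n_batches_sketch <- function(suggestion, n_combinations) {
  min(suggestion, n_combinations - 1)
}

# With the vignette example above: a suggested 10 batches and
# n_combinations = 8 now yields 7, which passes the check.
get_default_n_batches_sketch(10, 8)
```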

LHBO added 3 commits August 24, 2023 10:47
…g several approaches the old version only used the first approach. Verified this by adding a print statement in each prepare_data.approach() function and seeing that only the first approach in internal$parameters$approach was used.

We can maybe remove the code comments before the pull request is accepted. Maybe there is a better method to get the approach?

Also updated the roxygen2 documentation for the function, as it seemed to reflect the old version of shapr(?) due to arguments which are no longer present.

However, one then gets a warning when creating the roxygen2 documentation. I discuss some solutions in comments below. Discuss with Martin.
…ion `check_n_batches` threw an error for the vignette with gaussian approach with `n_combinations` = 8 and `n_batches = NULL`, as this function here then set `n_batches` = 10, which was too large. We subtract 1 as `check_n_batches` function specifies that `n_batches` must be strictly less than `n_combinations`.
@LHBO
Collaborator Author

LHBO commented Aug 24, 2023

The tests failed because I changed the printout. I added the word "of" as I thought that was more grammatically correct, but one can just remove it again. The intention of the sentence is clear either way.

The new printout is:
n_batches (32) must be smaller than the number of feature combinations/n_combinations (32)

LHBO added 3 commits August 24, 2023 12:44
…to 2^m", but all the tests only checked for "larger than". I.e., if the user specified n_combinations = 2^m in the call to shapr::explain, the function would not treat it as exact.
…t mode when `n_combinations = 2^m`, before the bugfix.
…`n_combinations >= 2^m`. Removed the large comment after discussing it with Martin.
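The exact-mode condition described in these commits can be sketched like this. The function and variable names are assumptions for illustration, not the actual shapr internals:

```r
# Sketch: exact mode should trigger when the user-specified n_combinations
# reaches 2^m (m = number of features), not only when it strictly exceeds
# it. The old tests only covered the "larger than" case.
is_exact <- function(n_combinations, m) {
  is.null(n_combinations) || n_combinations >= 2^m
}

is_exact(8, 3)  # TRUE: 2^3 = 8 was previously (wrongly) treated as non-exact
is_exact(7, 3)  # FALSE: fewer combinations than 2^m, so sampling is needed
```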
@@ -217,7 +217,7 @@ check_n_batches <- function(internal) {

if (n_batches >= actual_n_combinations) {
stop(paste0(
-    "`n_batches` (", n_batches, ") must be smaller than the number feature combinations/`n_combinations` (",
+    "`n_batches` (", n_batches, ") must be smaller than the number of feature combinations/`n_combinations` (",
Collaborator Author


This makes the previous tests fail, as now the error messages are different. Let Martin decide if we keep this alteration or not.

…est checking that we do not get an error when running the code after the bugfix has been applied.
@LHBO LHBO requested a review from martinju August 31, 2023 12:27
@LHBO
Collaborator Author

LHBO commented Sep 4, 2023

This should not be merged yet, as another bug occurred while I was working on combined approaches.

explain_numeric = explain(
  model = model_lm_numeric,
  x_explain = x_explain_numeric,
  x_train = x_train_numeric,
  approach = c("independence", "empirical", "gaussian", "copula", "empirical"),
  prediction_zero = p0,
  n_batches = NULL,
  timing = FALSE)
Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.


explain_numeric$internal$objects$S_batch
$`1`
[1]  2  3  4  5  6 32

$`2`
 [1]  7  8  9 10 11 12 13 14 15 16

$`3`
 [1] 17 18 19 20 21 22 23 24 25 26

$`4`
[1] 27 28 29 30 31

> explain_numeric$internal$parameters$n_batches
[1] 2
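The inconsistency shown above (four batches in `S_batch` while `n_batches` reports 2) boils down to a simple invariant that the bugfix restores. A hedged sketch, reusing the values from the output above:

```r
# Sketch of the invariant: the realized batch list must agree with the
# stored n_batches parameter. Values taken from the buggy output above.
S_batch <- list(
  `1` = c(2:6, 32),
  `2` = 7:16,
  `3` = 17:26,
  `4` = 27:31
)
n_batches <- 2

# FALSE before the fix: 4 realized batches versus n_batches = 2.
length(S_batch) == n_batches
```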

@LHBO
Collaborator Author

LHBO commented Sep 4, 2023

There are also some additional bugs with the combined approaches, e.g., setting the seed does not work.

LHBO added 13 commits September 4, 2023 12:25
… the number of approaches and the number of unique approaches.

This is for example useful to check that the provided `n_batches` is a valid value. (see next commits)
…the number of unique approaches. Before, the user could, e.g., set `n_batches = 2` but use 4 approaches, and shapr would then use 4 batches without updating `n_batches` and without giving a warning to the user.
… of unique approaches that are used. This was not done before and gave an inconsistency between the number shapr would recommend and the number it would use when `n_batches` was set to `NULL` by the user.
…ombined approaches.

Furthermore, added an if-test, because the previous version resulted in non-reproducible code: setting the seed to `NULL` undoes the seed we set in `explain()`.

Just consider this small example:
# Set seed to get same values twice
set.seed(123)
rnorm(1)

# Setting the same seed gives the same value
set.seed(123)
rnorm(1)

# If we also set the seed to NULL, the seed is removed and we do not get the same value
set.seed(123)
set.seed(NULL)
rnorm(1)

# Setting the seed to NULL actually gives a new "random" number each time
set.seed(123)
set.seed(NULL)
rnorm(1)
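The if-test mentioned above can be sketched as follows. This is an assumed form for illustration, not the exact shapr code; `maybe_set_seed()` is a hypothetical helper name:

```r
# Sketch: only call set.seed() when a seed is actually provided, so that a
# NULL seed does not wipe out the seed already set earlier in explain().
maybe_set_seed <- function(seed) {
  if (!is.null(seed)) set.seed(seed)
}

set.seed(123)
maybe_set_seed(NULL)  # no-op, so reproducibility is preserved
rnorm(1)              # same value as calling rnorm(1) right after set.seed(123)
```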
@LHBO
Collaborator Author

LHBO commented Sep 4, 2023

Looks like some tests fail. Will look at that tomorrow.

@LHBO
Collaborator Author

LHBO commented Sep 11, 2023

I looked at the tests here and almost all of them are OK (they just add two new objects), except for the test involving the independence approach, where something is slightly off, causing a small numerical change (around the 10th decimal place). It is probably fine, but it would be good to know what causes it. That will be easier to see once #356 is merged, as this PR contains all those changes as well, making it harder to spot what has actually changed here.

I have merged #356 into this PR now. Will try to look more into it.

The file test-setup.R gave only two failures, which were due to my adding the word "of" to the printout of an error message.

There are a lot of errors in test-output.R:

  1. Some of the failures are related to the combined-approach explainers where n_batches is set to 1, which is now invalid input, as n_batches must be equal to or larger than the number of unique approaches used. This was not the case before; the user could set n_batches = 1, but shapr would silently overwrite it. I recommend using n_batches = 10; then we test both that the random distribution of the batches between the approaches is the same each time and the generated MC samples.
  2. I have tried to look more into it, but I am not sure about the pipeline/workflow. E.g., when I call testthat::snapshot_review('output/') to review changes, I just get errors:
Error: embedded nul in string: '\037\x8b\b\0\0\0\0\0\0\003…' [binary snapshot content omitted]
  3. Talk with Martin and see if it works on his computer.
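The new requirement that n_batches be at least the number of unique approaches could be enforced roughly as follows. A sketch under assumed names, not the actual shapr check:

```r
# Sketch: n_batches must be at least the number of unique approaches,
# since each unique approach needs at least one batch of its own.
check_n_batches_vs_approaches <- function(n_batches, approach) {
  n_unique_approaches <- length(unique(approach))
  if (n_batches < n_unique_approaches) {
    stop(
      "`n_batches` (", n_batches, ") must be at least the number of ",
      "unique approaches (", n_unique_approaches, ")."
    )
  }
  invisible(TRUE)
}

# OK: 4 unique approaches, 10 batches
check_n_batches_vs_approaches(
  10, c("independence", "empirical", "gaussian", "copula", "empirical")
)
# check_n_batches_vs_approaches(1, c("independence", "empirical"))  # would error
```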

LHBO and others added 4 commits September 12, 2023 11:00
…us value (`n_batches = 1`) is not allowed anymore as it is lower than the number of unique used approaches.
@LHBO
Copy link
Collaborator Author

LHBO commented Nov 16, 2023

Added some comments

Collaborator Author

@LHBO LHBO left a comment


Added review comments.

}
min_checked <- max(c(this_min, suggestion))
ret <- min(c(this_max, min_checked, n_combinations - 1))
Collaborator Author


We subtract 1 as the `check_n_batches()` function specifies that `n_batches` must be strictly less than `n_combinations`.

# Error in check_n_batches(internal) :
# `n_batches` (10) must be smaller than the number feature combinations/`n_combinations` (8)
# The bug only occurs for "ctree", "gaussian", and "copula" as they are treated differently in
# `get_default_n_batches()`; I am not certain why. Ask Martin about the logic behind that.
Collaborator Author


(Hidden) question here... I should be better at making them more visible.

Comment on lines 667 to 686
# Ensures that the number of batches corresponds to `n_batches`
if (sum(batch_count_dt$n_batches_per_approach) != n_batches) {
  # Ensure that the number of batches is not larger than `n_batches`.
  # Remove one batch from the approach with the most batches.
  while (sum(batch_count_dt$n_batches_per_approach) > n_batches) {
    approach_to_subtract_batch <- which.max(batch_count_dt$n_batches_per_approach)
    batch_count_dt$n_batches_per_approach[approach_to_subtract_batch] <-
      batch_count_dt$n_batches_per_approach[approach_to_subtract_batch] - 1
  }

  # Ensure that the number of batches is not lower than `n_batches`.
  # Add one batch to the approach with the most coalitions per batch.
  while (sum(batch_count_dt$n_batches_per_approach) < n_batches) {
    approach_to_add_batch <- which.max(batch_count_dt$n_S_per_approach /
      batch_count_dt$n_batches_per_approach)
    batch_count_dt$n_batches_per_approach[approach_to_add_batch] <-
      batch_count_dt$n_batches_per_approach[approach_to_add_batch] + 1
  }
}

Member


@LHBO Not sure whether this extra complexity is really worth it to cover the edge cases where n_batches does not add up. An alternative is to provide a warning that the number of batches is changed (if specified by the user), or otherwise just change it. What do you think?

Collaborator Author


@martinju, I agree that it is not very elegant code, but it does not take long to run and is only run once.
The two main reasons I added this are:

  1. The first reason was to ensure that the number of batches adds up to n_batches.
  2. The second reason, and likely the most important, is that if the user has specified future::plan(multisession, workers = n_batches), then it would be nice if shapr actually used n_batches batches.

If this is not that important, then I am okay with just giving a warning.

@martinju
Member

The failing tests on R 4.2 and R 4.1 (oldrel-1 and oldrel-2) seem to be due to a change in how errors are reported to the terminal. In the previous versions, a missing input argument would not throw an error before one tried to access the missing argument, while in R 4.3 an error is thrown immediately. This now causes a failure due to an update in testthat (remember to update to v3.2.0 if you run locally), which shows which function is throwing the error, and this differs between the two versions. I will try to look into how to get around this. One possibility is to simply ignore it, as we know the reason for the error and it does not cause practical problems.

Collaborator Author

@LHBO LHBO left a comment


Propose some code simplification in get_extra_parameters().

Collaborator Author


We do not need the if-else anymore, so a simpler version of the code could be:

# Get the number of approaches, which is always either one or 
# one less than the number of features if a combination of approaches is used.
internal$parameters$n_approaches <- length(internal$parameters$approach)

# Get the number of unique approaches, as the same
# approach can be used for several feature combination sizes.
internal$parameters$n_unique_approaches <- length(unique(internal$parameters$approach))

@martinju martinju changed the title Fixed bug in get_default_n_batches function where shapr sets the numb… Harmonize batch distribution ++ Nov 20, 2023
Member

@martinju martinju left a comment


I think we are ready to merge this now (provided the checks pass)

@LHBO LHBO merged commit 90afa9a into NorskRegnesentral:master Nov 20, 2023
7 checks passed
@LHBO LHBO deleted the Lars/bugfix_get_default_n_batches branch November 20, 2023 15:50