Allow use of importance weights in the meta-model #159

mark-burdon opened this issue Oct 15, 2024 · 0 comments
Feature

Where we have class imbalance, one way of addressing it is to use importance weights. Not all parsnip models support them, but many do. However, when calibrating predictions with {probably} there's no option to pass importance weights, so the calibrated probabilities shift back towards the majority class. I know {betacal}, the package behind Beta calibration, doesn't take weights, but it should be possible for the other methods.
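
For what it's worth, the models that (as far as I can tell) back the logistic calibrator, stats::glm() and mgcv::gam(), both already accept prior weights, so plumbing case weights through looks feasible. A minimal sketch with made-up data:

# Toy data: a binary outcome, an uncalibrated probability, importance weights
d <- data.frame(y = rbinom(100, 1, 0.3),
                p = runif(100),
                w = runif(100, 0.1, 1))

# glm() exposes a `weights` argument; quasibinomial() sidesteps the
# "non-integer #successes" warning that fractional weights would trigger
stats::glm(y ~ p, family = quasibinomial(), weights = w, data = d)
# mgcv::gam(y ~ s(p), family = quasibinomial(), weights = w, data = d)  # likewise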

Here's an example illustrating the problem:


library(tidyverse)
library(tidymodels)
set.seed(100)

# Create imbalanced data, add weight column giving roughly equal overall weight
df <- caret::twoClassSim(n = 1000, intercept = -12) |>
  dplyr::mutate(weights = dplyr::if_else(Class == "Class1", 0.15, 0.85)) |>
  dplyr::mutate(weights = hardhat::importance_weights(weights),
                Class = factor(case_match(Class, "Class1" ~ "Majority",
                                          "Class2" ~ "Minority")))

# Create recipe and logistic regression specification
glm_recipe <- recipes::recipe(x = df, formula = Class ~ .)

glm_spec <- parsnip::logistic_reg(mode = "classification",
                                  engine = "glm")

# Combine into workflow
glm_wf <- workflows::workflow(preprocessor = glm_recipe,
                              spec = glm_spec) |>
  workflows::add_case_weights(col = weights)

# Create resamples for model fitting
resamples <- rsample::vfold_cv(data = df,
                               v = 5,
                               strata = Class)

# Fit the model
wf_fit <- tune::fit_resamples(object = glm_wf,
                              resamples = resamples,
                              control = tune::control_resamples(save_pred = TRUE,
                                                                save_workflow = TRUE))

# Collect the predictions
predictions <- tune::collect_predictions(wf_fit)

# Visualise the predictions
predictions |> pull(.pred_Minority) |> hist()

# Find median prediction
predictions |> summarise(median = median(.pred_Minority))

The median prediction pre-calibration is around 36%. We would expect calibration to shift this a little, not by an order of magnitude.

# Now we want to calibrate the probabilities
cal <- probably::cal_estimate_logistic(.data = predictions, truth = Class)
predictions_calibrated <- probably::cal_apply(.data = predictions, object = cal)

# Visualise the predictions
predictions_calibrated |> pull(.pred_Minority) |> hist()

# Find median prediction
predictions_calibrated |> summarise(median = median(.pred_Minority))

The median prediction after calibration is around 10%, much lower than pre-calibration: the unweighted calibration model has pulled the probabilities back towards the majority class.
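
Until something like this is supported, the only workaround I can see is redoing the calibration by hand. Here's a rough sketch, continuing the reprex above (this is not probably's actual internals; the join on .row and the as.numeric() conversion of the hardhat weights are assumptions specific to this example):

# Recover the raw weights for each prediction; .row indexes the rows of df
# (assuming as.numeric() strips the hardhat importance_weights class)
preds_w <- predictions |>
  dplyr::mutate(w = as.numeric(df$weights)[.row])

# Platt-style recalibration by hand: regress the true class on the logit of
# the predicted probability, passing the importance weights through to glm()
manual_cal <- stats::glm(I(Class == "Minority") ~ qlogis(.pred_Minority),
                         family = quasibinomial(),
                         weights = w,
                         data = preds_w)

# Apply the weighted calibrator and compare the median with the unweighted one
preds_w |>
  mutate(.pred_Minority_cal = predict(manual_cal, type = "response")) |>
  summarise(median = median(.pred_Minority_cal))

A weights argument on the cal_estimate_*() functions, something like cal_estimate_logistic(predictions, truth = Class, weights = w) (hypothetical signature), would make all of this unnecessary.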
