Allow use of importance weights in the meta-model #159

mark-burdon opened this issue Oct 15, 2024 · 0 comments
Feature

Where we have class imbalance, one way of addressing it is to use importance weights. Not all parsnip models support them, but many do. However, when calibrating predictions with {probably} there's no option to pass importance weights, so the calibrated probabilities shift back towards the majority class. I know {betacal}, the package behind Beta calibration, doesn't take weights, but it should be possible for the other methods.
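
For what it's worth, the models that (as far as I can tell) back the logistic calibrator, stats::glm() and mgcv::gam(), both already accept prior weights, so plumbing case weights through looks feasible. A minimal sketch with made-up data:

# Toy data: a binary outcome, an uncalibrated probability, importance weights
d <- data.frame(y = rbinom(100, 1, 0.3),
                p = runif(100),
                w = runif(100, 0.1, 1))

# glm() exposes a `weights` argument; quasibinomial() sidesteps the
# "non-integer #successes" warning that fractional weights would trigger
stats::glm(y ~ p, family = quasibinomial(), weights = w, data = d)
# mgcv::gam(y ~ s(p), family = quasibinomial(), weights = w, data = d)  # likewise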

Here's an example illustrating the problem:


library(tidyverse)
library(tidymodels)
set.seed(100)

# Create imbalanced data, add weight column giving roughly equal overall weight
df <- caret::twoClassSim(n = 1000, intercept = -12) |>
  dplyr::mutate(weights = dplyr::if_else(Class == "Class1", 0.15, 0.85)) |>
  dplyr::mutate(weights = hardhat::importance_weights(weights),
                Class = factor(case_match(Class, "Class1" ~ "Majority",
                                          "Class2" ~ "Minority")))

# Create recipe and logistic regression specification
glm_recipe <- recipes::recipe(x = df, formula = Class ~ .)

glm_spec <- parsnip::logistic_reg(mode = "classification",
                                  engine = "glm")

# Combine into workflow
glm_wf <- workflows::workflow(preprocessor = glm_recipe,
                              spec = glm_spec) |>
  workflows::add_case_weights(col = weights)

# Create resamples for model fitting
resamples <- rsample::vfold_cv(data = df,
                               v = 5,
                               strata = Class)

# Fit the model
wf_fit <- tune::fit_resamples(object = glm_wf,
                              resamples = resamples,
                              control = tune::control_resamples(save_pred = TRUE,
                                                                save_workflow = TRUE))

# Collect the predictions
predictions <- tune::collect_predictions(wf_fit)

# Visualise the predictions
predictions |> pull(.pred_Minority) |> hist()

# Find median prediction
predictions |> summarise(median = median(.pred_Minority))

The median prediction pre-calibration is around 36%. We would expect calibration to shift this a little, not by an order of magnitude.

# Now we want to calibrate the probabilities
cal <- probably::cal_estimate_logistic(.data = predictions, truth = Class)
predictions_calibrated <- probably::cal_apply(.data = predictions, object = cal)

# Visualise the predictions
predictions_calibrated |> pull(.pred_Minority) |> hist()

# Find median prediction
predictions_calibrated |> summarise(median = median(.pred_Minority))

The median prediction after calibration is around 10%, much lower than pre-calibration: the unweighted calibration model has pulled the probabilities back towards the majority class.
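
Until something like this is supported, the only workaround I can see is redoing the calibration by hand. Here's a rough sketch, continuing the reprex above (this is not probably's actual internals; the join on .row and the as.numeric() conversion of the hardhat weights are assumptions specific to this example):

# Recover the raw weights for each prediction; .row indexes the rows of df
# (assuming as.numeric() strips the hardhat importance_weights class)
preds_w <- predictions |>
  dplyr::mutate(w = as.numeric(df$weights)[.row])

# Platt-style recalibration by hand: regress the true class on the logit of
# the predicted probability, passing the importance weights through to glm()
manual_cal <- stats::glm(I(Class == "Minority") ~ qlogis(.pred_Minority),
                         family = quasibinomial(),
                         weights = w,
                         data = preds_w)

# Apply the weighted calibrator and compare the median with the unweighted one
preds_w |>
  mutate(.pred_Minority_cal = predict(manual_cal, type = "response")) |>
  summarise(median = median(.pred_Minority_cal))

A weights argument on the cal_estimate_*() functions, something like cal_estimate_logistic(predictions, truth = Class, weights = w) (hypothetical signature), would make all of this unnecessary.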
