Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

address fit() slowdown with sparse tibble and formula preprocessor #246

Open
simonpcouch opened this issue Sep 13, 2024 · 4 comments
Open

Comments

@simonpcouch
Copy link
Contributor

Related to #239—just a place to keep notes on the thought process for supporting sparse tibbles with formula preprocessors. In #245, we see:

library(tidymodels)
  
sparse_hotel_rates <- function() {
  # 99.2 sparsity
  hotel_rates <- modeldata::hotel_rates
  
  prefix_colnames <- function(x, prefix) {
    colnames(x) <- paste(colnames(x), prefix, sep = "_")
    x
  }
  
  dummies_country <- hardhat::fct_encode_one_hot(hotel_rates$country)
  dummies_company <- hardhat::fct_encode_one_hot(hotel_rates$company)
  dummies_agent <- hardhat::fct_encode_one_hot(hotel_rates$agent)
  
  res <- cbind(
    hotel_rates["avg_price_per_room"],
    prefix_colnames(dummies_country, "country"),
    prefix_colnames(dummies_company, "company"),
    prefix_colnames(dummies_agent, "agent")
  )
  
  res <- as.matrix(res)
  Matrix::Matrix(res, sparse = TRUE)
}

hotel_data <- sparse_hotel_rates()
hotel_data <- sparsevctrs::coerce_to_sparse_tibble(hotel_data)

spec <- boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost")

form <- avg_price_per_room ~ .
rec <- recipe(form, data = hotel_data)

wflow <- workflow(spec = spec)

system.time({fit(wflow %>% add_recipe(rec), hotel_data)})
#>    user  system elapsed 
#>   0.255   0.014   0.269
system.time({fit(wflow %>% add_formula(form), hotel_data)})
#>    user  system elapsed 
#>   3.847   0.039   3.905

Created on 2024-09-13 with reprex v2.1.1

In the formula preprocessor fit() evaluation, the data type conversions don't actually take a ton of time:

Screenshot 2024-09-13 at 10 16 32 AM

It's just that, with add_formula(), parsnip::xgb_train(x) is a matrix, whereas it's a dgCMatrix when passed with add_recipe(), and xgboost is much slower when data that ought to be sparse is dense.

@simonpcouch
Copy link
Contributor Author

Ultimately, recipes::juice() and hardhat::recompose() are what handles running recipes preprocessors via fit.action_recipe() -> mold() -> run_mold() -> mold_recipe_default_process(), and they can avoid model.matrix() while they do so because recipes don't allow for tricky formulae. Carving out a path in fit.action_formula() to detect sparse tibbles and then go through a different preprocessor code path (recipe or variables) would only be possible if the formula were to pass recipes:::inline_check().

@simonpcouch
Copy link
Contributor Author

simonpcouch commented Sep 13, 2024

In order of most to least preferred, my thoughts on approaches we could take for fit()ting workflows with formula preprocessors and sparse tibbles:

  1. Maybe overopinionated: Warn on add_formula() with a sparse tibble, recommending that people use add_recipe() instead to preserve sparsity. This is a relatively easy switch in most cases: least convenient when people need to translate some tricky formulae into recipe steps, which isn't too bad.
  2. Inconsistent, somewhat hacky: Carve out a path in fit.action_formula() to detect sparse tibbles and then go through a recipes code path if the formula passes recipes:::inline_check(), otherwise reverting to 1) or 3) or erroring.
  3. High-effort and bug-prone: Re-write a version of model.matrix() that can handle all the same kinds of tricky formulae, but that can do so with sparse vectors.

@EmilHvitfeldt
Copy link
Member

Quick note: if you install the latest dev version of {sparsevctrs} and run withr::local_options("sparsevctrs.verbose_materialize" = 3) you now get an error if a sparse vector is forced materialized by something other that {sparsevctrs} itself.

@simonpcouch
Copy link
Contributor Author

Notes from chatting with Emil on this:

parsnip experiences the same issue with its fit() interface: the sparse data is passed to model.matrix(), which is somewhat slow but, even more importantly, introduces a substantial slowdown with many engines when dense data would have been better represented as sparse. So, we need to address the analogous problem in parsnip.

Approach 1) (with 3) in the long term) is adaptable to both situations and is relatively future-proof. We would raise a condition if users pass sparse data to fit.model_spec() (as in, with a formula) to tell them to transition to fit_xy() if they'd like to make use of sparsity.

Users will only pass fit.workflow(data) or fit.model_spec(data) as sparse data if they do so intentionally; this is "opt-in," and tidymodels doesn't do so by itself. If recipes, at some point in the future, introduces default behavior to coerce to sparsity in smart situations, users are already in the happy path, because they're using a recipe preprocessor with a workflow, and workflows always uses parsnip's fit_xy() interface (unless users have a model formula, in which case... there's a tricky formula and model.matrix() will be needed!).

Emil and I chatted and are on the same page here!


I originally proposed in 1) that we would warn in that case. I think a more tidyverse-style-esque approach would be to error and then reference an option (or control_*() argument) that users can set to say "I know I opted in to use sparse data but am choosing an interface that will coerce to dense and see a slowdown. Let me do it!"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants