Using group_initial_split() with small group will fail even if adjusting the prop parameter? #534

Open
MatthieuStigler opened this issue Sep 8, 2024 · 3 comments

Comments

@MatthieuStigler

MatthieuStigler commented Sep 8, 2024

The problem

Summary: why does group_initial_split() often fail with a small-frequency group, even when prop is adjusted to match that group's frequency?

I'm using group_initial_split() with a small number (4) of groups. As I have one group with low frequency (10%), my intuition was that by setting prop=0.9, this group would be selected within the training sample. However, I very often (around 70% of the time) get error messages such as:

#> Error in group_mc_cv():
#> ! Some assessment sets contained zero rows
#> ℹ Consider using a non-grouped resampling method

How come this happens even though I adjusted prop? It fails even when I set prop to exactly 1 - freq(small_group). Am I misunderstanding the prop argument?

Thanks!

Reproducible example

library(rsample)
dat <- data.frame(group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size=1000),
                  x = rnorm(1000))
table(dat$group)
#> 
#>   A   B   C   D 
#> 340 270 298  92


set.seed(123)
dat_split <- group_initial_split(dat, group, prop=0.9)
#> Error in `group_mc_cv()`:
#> ! Some assessment sets contained zero rows
#> ℹ Consider using a non-grouped resampling method

# This fails about 80% of the time:
set.seed(1234)
mean(sapply(1:100, \(x) inherits(try(group_initial_split(dat, group, prop=0.9), silent = TRUE), "try-error")))
#> [1] 0.79

Created on 2024-09-08 with reprex v2.1.1

@hfrick
Member

hfrick commented Sep 12, 2024

Hi @MatthieuStigler

From the docs:

group_initial_split() creates splits of the data based on some grouping variable, so that all data in a "group" is assigned to the same split.

while you are

trying to get each group at least once in the test sample.

Since groups as a whole get allotted to training or testing, they can't be all represented in the test set, otherwise there would be no observations left for the training set.
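
To illustrate that whole groups land on one side of the split, here is a minimal sketch on hypothetical data: 20 equal-sized groups, chosen so that a grouped split with prop = 0.8 is feasible. Neither this data nor this prop value comes from the original example. It checks that no group shows up in both sets.

```r
library(rsample)

# Hypothetical data: 20 groups of 50 rows each (an assumption for illustration)
set.seed(1)
toy <- data.frame(
  group = rep(LETTERS[1:20], each = 50),
  x = rnorm(1000)
)

toy_split <- group_initial_split(toy, group, prop = 0.8)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Groups present in both sets; a grouped split should leave this empty
shared_groups <- intersect(unique(toy_train$group), unique(toy_test$group))
length(shared_groups)

# Groups that ended up in the test set
sort(unique(toy_test$group))
```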

Stratification (as opposed to grouped resampling) aims to ensure that the proportion of each group is the same in the training and testing set as it is in the full dataset. So if you have a small group and want a training and testing set which both contain all groups, including that small group, stratification is typically what you want to use. This can be done with the strata argument for initial_split(), see example below.

Does this help?

library(rsample)

set.seed(123)
dat <- data.frame(group = sample(LETTERS[1:4], prob = c(0.3, 0.3, 0.3, 0.1), replace = TRUE, size=1000),
                  x = rnorm(1000))
# proportion of each group in the data
table(dat$group) / nrow(dat)
#> 
#>     A     B     C     D 
#> 0.296 0.301 0.311 0.092

dat_split <- initial_split(dat, strata = "group", prop = 0.75)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)

# preserved proportions
table(dat_train$group) / nrow(dat_train)
#> 
#>          A          B          C          D 
#> 0.29906542 0.29773031 0.30841121 0.09479306
table(dat_test$group) / nrow(dat_test)
#> 
#>          A          B          C          D 
#> 0.28685259 0.31075697 0.31872510 0.08366534

# what the prop argument does
nrow(dat_train) / nrow(dat)
#> [1] 0.749

Created on 2024-09-12 with reprex v2.1.0

@MatthieuStigler
Author

Hi @hfrick

Thanks a lot for the answer. Sorry, that last statement was a bit misleading (I meant that when running K times, I want one group in the test sample each time), so I removed that part.

The main question remains: with one group at frequency 0.1, how come setting prop=0.9 fails consistently (instead of assigning the 10% group to the test sample)?

Thanks!

@hfrick
Member

hfrick commented Sep 13, 2024

Ah, I see. Thanks for clarifying!

I would say this could be loosely answered with "the error happens because we are sampling, not optimizing". In your example, we have 4 groups with one group about the size of the test set. So a grouped split with prop = 0.9 only works if we assign that smallest group, D, to the test set. But we have 4 to choose from, so it should fail in 3/4 of the attempts.

If you increase the number of attempts in your last illustration, you should be able to see it move towards 0.75.
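
A rough sketch of that check, using simulated data of the same shape as in the examples above. The helper name split_fails and the choice of 2000 attempts are illustrative assumptions, and the exact rate will vary with the seed and the simulated group sizes.

```r
library(rsample)

set.seed(123)
dat <- data.frame(
  group = sample(LETTERS[1:4], size = 1000, replace = TRUE,
                 prob = c(0.3, 0.3, 0.3, 0.1)),
  x = rnorm(1000)
)

# Hypothetical helper: TRUE when the grouped split raises an error
split_fails <- function(data) {
  res <- try(group_initial_split(data, group, prop = 0.9), silent = TRUE)
  inherits(res, "try-error")
}

# More attempts give a more stable estimate of the failure rate
n_attempts <- 2000
fails <- vapply(seq_len(n_attempts), function(i) split_fails(dat), logical(1))

# Observed share of failed attempts, to compare with the 3/4 argument
mean(fails)
```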
