Sorting of strata in training data from initial_split #484

Open
dicook opened this issue May 21, 2024 · 0 comments
Labels
feature: a feature request or enhancement

dicook commented May 21, 2024

I have just finished running a Kaggle challenge for my machine learning class, and I was surprised to find that the training data is sorted by the variable used for strata:

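# Here is code to reproduce
library(rsample)  # initial_split(), training(), testing()
library(tibble)   # tibble()

set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_split <- initial_split(d, strata = y)  # stratify on y
d_tr <- training(d_split)
d_tr$y
d_ts <- testing(d_split)
d_ts$y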

and we find that the training set comes back grouped by stratum:

> d_tr$y
 [1] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[13] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[25] "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n" "n"
[37] "n" "n" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[49] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[61] "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y" "y"
[73] "y" "y"

although the test set is not (luckily for me!):

> d_ts$y
 [1] "n" "n" "n" "n" "y" "y" "n" "y" "y" "y" "y" "y"
[13] "y" "y" "y" "n" "n" "n" "n" "n" "y" "y" "n" "y"
[25] "n" "n"

I'm not sure whether this is intentional, but it would be better if the training rows were returned in random order by default, with sorting available through an optional argument.
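
In the meantime, one possible workaround is to shuffle the training rows by hand after splitting. This is only a minimal sketch using base R indexing: d_tr_shuffled is a name chosen here for illustration, and the shuffle is not an rsample option. The rle() call is just a quick way to see whether a column is stored in contiguous blocks.

```r
library(rsample)
library(tibble)

set.seed(921)
d <- tibble(x = runif(100), y = sample(c("y", "n"), 100, replace = TRUE))
d_tr <- training(initial_split(d, strata = y))

# Workaround (an assumption, not an rsample feature): permute the row
# indices so the strata are no longer stored in contiguous blocks.
d_tr_shuffled <- d_tr[sample(nrow(d_tr)), ]

# Count runs of identical values in y. A column grouped by stratum has
# one run per level; a shuffled column will typically have many more.
length(rle(d_tr$y)$lengths)
length(rle(d_tr_shuffled$y)$lengths)
```

The shuffle only reorders rows, so the class proportions produced by the stratified split are unchanged.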

@EmilHvitfeldt added the "feature" (a feature request or enhancement) label on May 22, 2024