rep_col_shuffle()? #484

andrewpbray · 2023-03-13T18:20:28Z

This semester I've been seeing how far I can get in terms of simulation-based inference without using the main part of the infer package. rep_slice_sample() is all you need to do bootstrapping (and it's also very handy for simulation). I'm curious what y'all think about an analogous function like rep_col_shuffle() (or rep_shuffle_col())?
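For readers who haven't used it this way, here is a minimal sketch of bootstrapping with rep_slice_sample() and dplyr alone. It assumes the gss data that ships with infer; the choice of age as the variable and the name boot_means are just for illustration.

```r
library(infer)
library(dplyr)

set.seed(1)

# Resample all rows with replacement, 1000 times, then summarise each replicate.
boot_means <- gss %>%
  rep_slice_sample(prop = 1, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

# A percentile interval straight from the bootstrap distribution.
quantile(boot_means$mean_age, probs = c(0.025, 0.975))
```

No specify(), hypothesize(), or calculate() is involved; the statistic is computed with an ordinary grouped summarise().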

The motivation here is that the default API for infer is based around the formalism of an NHST. These two functions - rep_slice_sample() and rep_shuffle_col() - would allow users (and teachers) to get through the generate step without the formalism. This is helpful for creating a more porous boundary with other forms of simulation; there would be just two fairly generic, mechanistically named functions instead of five functions laser-focused on the NHST framework.

In terms of implementation, it looks like generate() takes two paths: rep_slice_sample() for bootstrapping and permute() > permute_once() > permute_col() > sample() for permutations. Seems like the easiest approach would be to just wrap permute().

Thoughts?

simonpcouch · 2023-03-16T12:57:16Z

I dig it! If folks would find this pedagogically useful, I think this is surely within scope and would have a low maintenance burden. :)

mine-cetinkaya-rundel · 2023-03-16T16:08:20Z

I think I can see the value, but I'm having a rough time picturing what procedures would look like based on @andrewpbray's description.

@andrewpbray -- Could you write up a couple of examples as though rep_shuffle_col() existed? Also, I think the name would need to be something else -- shuffle is to sample and slice is to col here (though obviously row would have been better).

rep_slice_sample() vs. rep_col_shuffle()
rep_slice_sample() vs. rep_mutate_shuffle() -- I don't love this at all, but seems more of a parity

andrewpbray · 2023-03-20T17:39:04Z

Here's an example of a permutation test using a difference in means, starting with the existing implementation from the full pipeline examples docs.

library(infer)

# existing implementation
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
  
# new approach (to get through the generate step)
gss %>%
  rep_col_shuffle(age, reps = 1000)

where the output of the second pipeline would be a data frame with nrow(gss) * reps rows and ncol(gss) + 1 columns, the new column being replicate. In that data frame, age within each replicate will now be sample(age).

The syntax would be the same for a permutation test for a difference in proportions, the coefficient of a linear model, etc.

If we did a close port of rep_slice_sample(), then that output data frame wouldn't have any of the metadata normally appended by specify() and hypothesize() that is used by calculate(), so the user would have to use dplyr to group_by(replicate) and calculate their statistics. I think that's ok.
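To make that concrete, here is a rough sketch of what the workflow could look like. The rep_col_shuffle() below is hypothetical: it is not part of infer, its name and signature are assumptions taken from this thread, and it is written in plain dplyr rather than by wrapping infer's internal permute().

```r
library(infer)  # used here only for the gss example data
library(dplyr)

# Hypothetical helper, NOT an infer function: stack `reps` copies of the data,
# label them with `replicate`, and shuffle one column within each replicate.
rep_col_shuffle <- function(.data, col, reps = 1) {
  n <- nrow(.data)
  .data %>%
    slice(rep(seq_len(n), times = reps)) %>%
    mutate(replicate = rep(seq_len(reps), each = n), .before = 1) %>%
    group_by(replicate) %>%
    # Index by a random permutation instead of calling sample(.x) directly,
    # which avoids sample()'s special case for length-one numeric input.
    mutate(across({{ col }}, ~ .x[sample.int(length(.x))]))
}

set.seed(1)

shuffled <- gss %>%
  rep_col_shuffle(age, reps = 1000)

# The user computes the statistic with dplyr; the data is already grouped
# by replicate, so summarise() returns one row per replicate.
null_dist <- shuffled %>%
  summarise(
    diff_in_means = mean(age[college == "degree"], na.rm = TRUE) -
      mean(age[college == "no degree"], na.rm = TRUE)
  )
```

Swapping the summarise() call is all it would take to move to a difference in proportions or a model coefficient.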

simonpcouch added the enhancement label Oct 31, 2023

simonpcouch added the feature (a feature request or enhancement) label and removed the enhancement label Mar 25, 2024
