Move part of README and vignette into Article

courtiol · Oct 27, 2023 · 25add0a
1 parent 4c6faf4

Showing 10 changed files with 200 additions and 495 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -11,3 +11,4 @@
 ^_pkgdown\.yml$
 ^docs$
 ^pkgdown$
+^vignettes$
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -31,6 +31,5 @@ Suggests:
     slider,
     testthat (>= 2.1.0),
     tidyr
-VignetteBuilder: knitr
 Roxygen: list(markdown = TRUE)
 RoxygenNote: 7.2.3
diff --git a/README.Rmd b/README.Rmd
@@ -40,14 +40,12 @@ There is hardly any code behind `lay()` (it can be coded in 3 lines), so this pa
 
 ### Installation
 
-You can install a development version of **{lay}** with:
+You can install the development version of **{lay}** with:
 
 ``` r
-# install.packages("remotes")
 remotes::install_github("courtiol/lay")
 ```
 
-
 ### Motivation
 
 Consider the following dataset, which contains information about the use of pain relievers for non medical purpose.
@@ -110,11 +108,11 @@ world_bank_pop |>
              ~ tibble(min = min(.x), mean = mean(.x), max = max(.x))), .after = indicator)
 ```
 
-Since the other backbone of `lay()` is [**{vctrs}**](https://vctrs.r-lib.org), the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (*min*, *mean* and *max*) where directly created.
+Since the other backbone of `lay()` is [**{vctrs}**](https://vctrs.r-lib.org), the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (*min*, *mean* and *max*) are directly created.
 
 **Important:** when using `lay()` the function you want to use for the rowwise job must output a scalar (vector of length 1), or a tibble or data frame with a single row.
 
-We can apply a function that returns a vector of length > 1 by turning such vector into a tibble using `as_tibble_row()` from [**{tibble}**](https://tibble.tidyverse.org/):
+We can apply a function that returns a vector of length > 1 by turning such a vector into a tibble using `as_tibble_row()` from [**{tibble}**](https://tibble.tidyverse.org/):
 
 ```{r worldbank2}
 world_bank_pop |>
@@ -123,202 +121,12 @@ world_bank_pop |>
              ~ as_tibble_row(quantile(.x, na.rm = TRUE))), .after = indicator)
 ```
 
-
-### Alternatives to `lay()`
-
-Of course, there are many alternatives to perform rowwise jobs.
-
-Let's now consider, in turns, these alternatives -- sticking to our example about drugs usage.
-
-
-#### Alternative 1: vectorized solution
-
-One solution is to simply do the following:
-```{r vector}
-drugs_full |>
-  mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)
-```
-It is certainly very efficient from a computational point of view, but coding this way presents two main limitations:
-
-  - you need to name all columns explicitly, which can be problematic when dealing with many columns
-  - you are stuck with expressing your task with logical and arithmetic operators, which is not always sufficient
-
-
-#### Alternative 2: 100% [**{dplyr}**](https://dplyr.tidyverse.org/)
-
-```{r dplyr}
-drugs |>
-  rowwise() |>
-  mutate(everused = any(c_across(-caseid))) |>
-  ungroup()
-```
-It is easy to use as `c_across()` turns its input into a vector and `rowwise()` implies that the
-vector only represents one row at a time. Yet, for now it remains quite slow on large datasets (see **Efficiency** below).
-
-
-#### Alternative 3: [**{tidyr}**](https://tidyr.tidyverse.org/)
-
-```{r, }
-library(tidyr)  ## requires to have installed {tidyr}
-
-drugs |>
-  pivot_longer(-caseid) |>
-  group_by(caseid) |>
-  mutate(everused = any(value)) |>
-  ungroup() |>
-  pivot_wider() |>
-  relocate(everused, .after = last_col())
-```
-Here the trick is to turn the rowwise problem into a column problem by pivoting the values and then
-pivoting the results back. Many find that this involves a little too much intellectual gymnastic. It
-is also not particularly efficient on large dataset both in terms of computation time and memory required
-to pivot the tables.
-
-
-#### Alternative 4: [**{purrr}**](https://purrr.tidyverse.org/)
-
-```{r purrr}
-library(purrr)  ## requires to have installed {purrr}
-
-drugs |>
-  mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))
-```
-This is a perfectly fine solution and actually part of what one implementation of `lay()` relies on
-(if `.method = "tidy"`), but from a user perspective it is a little too geeky-scary.
-
-
-#### Alternative 5: [**{slider}**](https://slider.r-lib.org/)
-
-```{r slider}
-library(slider)   ## requires to have installed {slider}
-
-drugs |>
-  mutate(everused = slide_vec(pick(-caseid), any))
-```
-The package [**{slider}**](https://slider.r-lib.org/) is a powerful package which provides several *sliding window* functions.
-It can be used to perform rowwise operations and is quite similar to **{lay}** in terms syntax.
-It is however not as efficient as **{lay}** and I am not sure it supports the automatic splicing demonstrated above.
-
-
-#### Alternative 6: [**{data.table}**](https://rdatatable.gitlab.io/data.table/)
-
-```{r data.table, message=FALSE}
-library(data.table)  ## requires to have installed {data.table}
-
-drugs_dt <- data.table(drugs)
-
-drugs_dt[, ..I := .I]
-drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]
-drugs_dt[, ..I := NULL]
-as_tibble(drugs_dt)
-```
-This is a solution for those using [**{data.table}**](https://rdatatable.gitlab.io/data.table/).
-It is not particularly efficient, nor particularly easy to remember for those who do not program frequently using [**{data.table}**](https://rdatatable.gitlab.io/data.table/).
-
-
-#### Alternative 7: `apply()`
-
-```{r apply}
-drugs |>
-  mutate(everused = apply(pick(-caseid), 1L, any))
-```
-This is the base R solution. Very efficient and actually part of the default method used in `lay()`.
-Our implementation of `lay()` strips the need of defining the margin (the `1L` above) and benefits from
-the automatic splicing and the lambda syntax as shown above.
-
-
-#### Alternative 8: `for (i in ...) {...}`
-
-```{r for}
-drugs$everused <- NA
-
-columns_in <- !colnames(drugs) %in% c("caseid", "everused")
-
-for (i in seq_len(nrow(drugs))) {
-  drugs$everused[i] <- any(drugs[i, columns_in])
-}
-
-drugs
-```
-This is another base R solution, which does not involve any external package. It is not very pretty,
-nor particularly efficient.
-
-
-#### Other alternatives?
-
-There are probably other ways. If you think of a nice one, please leave an issue and we will add it here!
-
-
-### Efficiency
-
-Here are the results of a benchmark comparing alternative implementations for our simple rowwise job on
-a larger dataset with `r ncol(drugs_full)` columns and `r nrow(drugs_full)` rows (see [benchmark](https://courtiol.github.io/lay/articles/benchmark.html) for details and more tests):
-
-```{r bench_run1, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, fig.width=8, fig.height=5}
-rm(drugs)
-
-drugs_full_dt <- data.table(drugs_full) ## coercion to data.table
-
-benchmark1 <- bench::mark(
-  vectorized = {
-    drugs_full |>
-      mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
-  lay = {
-    drugs_full |>
-      select(-caseid) |>
-      mutate(everused = lay(pick(everything()), any))},
-  c_across = {
-    drugs_full |>
-      rowwise() |>
-      mutate(everused = any(c_across(-caseid))) |>
-      ungroup()},
-  pivot_pivot = {
-    drugs_full |>
-      pivot_longer(-caseid) |>
-      group_by(caseid) |>
-      mutate(everused = any(value)) |>
-      ungroup() |>
-      pivot_wider()},
-  pmap = {
-    drugs_full |>
-      mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
-  slider = {
-    drugs_full |>
-      mutate(everused = slide_lgl(pick(-caseid), any))},
-  data.table = {
-    drugs_full_dt[, ..I := .I]
-    drugs_full_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
-  apply = {
-    drugs_full |>
-      mutate(everused = apply(pick(-caseid), 1, any))},
-  'for' = {
-    everused <- logical(nrow(drugs_full))
-    columns_in <- colnames(drugs_full) != "caseid"
-    for (i in seq_len(nrow(drugs_full))) everused[i] <- any(drugs_full[i, columns_in])},
-  iterations = 5,
-  time_unit = "ms",
-  check = FALSE
-  )
-benchmark1 |>
-  mutate(expression = forcats::fct_reorder(as.character(expression), median, .desc = TRUE)) |>
-  plot()
-```
-
-Note that the x-axis of the plot is on a logarithmic scale.
-
-As you can see, `lay()` is not just simple and powerful, it is also quite efficient!
-
-
 ### History
 
 <img src="https://github.com/courtiol/lay/raw/main/.github/pics/lay_history.png" alt="lay_history" align="right" width="400">
 
-The first draft of this package has been created by **@romainfrancois** as a reply to a tweet I posted under **@rdataberlin** in February 2020.
-At the time I was exploring different ways to perform rowwise jobs in R and I was experimenting with various ideas on how to exploit 
-the fact that the newly introduced function `across()` from [**{dplyr}**](https://dplyr.tidyverse.org/) creates tibbles on which on can easily apply a function.
-Romain came up with `lay()` as the better solution making good use of [**{rlang}**](https://rlang.r-lib.org/) & [**{vctrs}**](https://vctrs.r-lib.org/).
-
-The verb `lay()` never made it to be integrated within [**{dplyr}**](https://dplyr.tidyverse.org/) and, so far, I still find `lay()` superior than
-most alternatives, which is why I decided to maintain this package.
+The first draft of this package has been created by **@romainfrancois** as a reply to a tweet I (Alexandre Courtiol) posted under **@rdataberlin** in February 2020.
+At the time I was exploring different ways to perform rowwise jobs in R and I was experimenting with various ideas on how to exploit the fact that the newly introduced function `across()` from [**{dplyr}**](https://dplyr.tidyverse.org/) creates tibbles on which one can easily apply a function.
+Romain came up with `lay()` as the better solution, making good use of [**{rlang}**](https://rlang.r-lib.org/) & [**{vctrs}**](https://vctrs.r-lib.org/).
 
-In short, I deserve little credit and instead you should feel free to buy Romain a coffee [here](https://ko-fi.com/romain) or to sponsor his [github profile](https://github.com/romainfrancois).
+The verb `lay()` never made it to be integrated within [**{dplyr}**](https://dplyr.tidyverse.org/), but, so far, I still find `lay()` superior than most alternatives, which is why I decided to document and maintain this package.