Commit

Move part of README and vignette into Article
courtiol committed Oct 27, 2023
1 parent 4c6faf4 commit 25add0a
Showing 10 changed files with 200 additions and 495 deletions.
1 change: 1 addition & 0 deletions .Rbuildignore
@@ -11,3 +11,4 @@
^_pkgdown\.yml$
^docs$
^pkgdown$
^vignettes$
1 change: 0 additions & 1 deletion DESCRIPTION
@@ -31,6 +31,5 @@ Suggests:
slider,
testthat (>= 2.1.0),
tidyr
VignetteBuilder: knitr
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
206 changes: 7 additions & 199 deletions README.Rmd
@@ -40,14 +40,12 @@ There is hardly any code behind `lay()` (it can be coded in 3 lines), so this pa

### Installation

You can install a development version of **{lay}** with:
You can install the development version of **{lay}** with:

``` r
# install.packages("remotes")
remotes::install_github("courtiol/lay")
```


### Motivation

Consider the following dataset, which contains information about the use of pain relievers for non-medical purposes.
@@ -110,11 +108,11 @@ world_bank_pop |>
~ tibble(min = min(.x), mean = mean(.x), max = max(.x))), .after = indicator)
```

Since the other backbone of `lay()` is [**{vctrs}**](https://vctrs.r-lib.org), the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (*min*, *mean* and *max*) where directly created.
Since the other backbone of `lay()` is [**{vctrs}**](https://vctrs.r-lib.org), the splicing happens automatically (unless the output of the call is used to create a named column). This is why, in the last chunk of code, three different columns (*min*, *mean* and *max*) are directly created.

**Important:** when using `lay()` the function you want to use for the rowwise job must output a scalar (vector of length 1), or a tibble or data frame with a single row.

We can apply a function that returns a vector of length > 1 by turning such vector into a tibble using `as_tibble_row()` from [**{tibble}**](https://tibble.tidyverse.org/):
We can apply a function that returns a vector of length > 1 by turning such a vector into a tibble using `as_tibble_row()` from [**{tibble}**](https://tibble.tidyverse.org/):

```{r worldbank2}
world_bank_pop |>
@@ -123,202 +121,12 @@ world_bank_pop |>
~ as_tibble_row(quantile(.x, na.rm = TRUE))), .after = indicator)
```
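
To see why this conversion satisfies the single-row requirement, here is a minimal sketch, assuming only that {tibble} is installed. The input values are made up for illustration. `quantile()` returns a named vector of length 5, and `as_tibble_row()` turns it into a one-row tibble with one column per quantile:

```r
library(tibble)

# Hypothetical input values, for illustration only
x <- c(1, 2, 3, 4, NA)

# quantile() returns a named numeric vector of length 5
# (names: 0%, 25%, 50%, 75%, 100%)
q <- quantile(x, na.rm = TRUE)

# as_tibble_row() converts the named vector into a tibble with a single row
# and one column per element of the vector
q_row <- as_tibble_row(q)
q_row
```

Because the result has exactly one row, it is a valid output for the function passed to `lay()`.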


### Alternatives to `lay()`

Of course, there are many alternatives to perform rowwise jobs.

Let's now consider these alternatives in turn, sticking to our example about drug usage.


#### Alternative 1: vectorized solution

One solution is to simply do the following:
```{r vector}
drugs_full |>
mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)
```
It is certainly very efficient from a computational point of view, but coding this way presents two main limitations:

- you need to name all columns explicitly, which can be problematic when dealing with many columns
- you are stuck with expressing your task with logical and arithmetic operators, which is not always sufficient


#### Alternative 2: 100% [**{dplyr}**](https://dplyr.tidyverse.org/)

```{r dplyr}
drugs |>
rowwise() |>
mutate(everused = any(c_across(-caseid))) |>
ungroup()
```
It is easy to use as `c_across()` turns its input into a vector and `rowwise()` implies that the
vector only represents one row at a time. Yet, for now it remains quite slow on large datasets (see **Efficiency** below).


#### Alternative 3: [**{tidyr}**](https://tidyr.tidyverse.org/)

```{r, }
library(tidyr) ## requires {tidyr} to be installed
drugs |>
pivot_longer(-caseid) |>
group_by(caseid) |>
mutate(everused = any(value)) |>
ungroup() |>
pivot_wider() |>
relocate(everused, .after = last_col())
```
Here the trick is to turn the rowwise problem into a column problem by pivoting the values and then
pivoting the results back. Many find that this involves a little too much intellectual gymnastics. It
is also not particularly efficient on large datasets, both in terms of computation time and of the memory required
to pivot the tables.


#### Alternative 4: [**{purrr}**](https://purrr.tidyverse.org/)

```{r purrr}
library(purrr) ## requires {purrr} to be installed
drugs |>
mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))
```
This is a perfectly fine solution and actually part of what one implementation of `lay()` relies on
(if `.method = "tidy"`), but from a user perspective it is a little too geeky-scary.


#### Alternative 5: [**{slider}**](https://slider.r-lib.org/)

```{r slider}
library(slider) ## requires {slider} to be installed
drugs |>
mutate(everused = slide_vec(pick(-caseid), any))
```
The package [**{slider}**](https://slider.r-lib.org/) is a powerful package which provides several *sliding window* functions.
It can be used to perform rowwise operations and is quite similar to **{lay}** in terms of syntax.
It is, however, not as efficient as **{lay}**, and I am not sure it supports the automatic splicing demonstrated above.


#### Alternative 6: [**{data.table}**](https://rdatatable.gitlab.io/data.table/)

```{r data.table, message=FALSE}
library(data.table) ## requires {data.table} to be installed
drugs_dt <- data.table(drugs)
drugs_dt[, ..I := .I]
drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]
drugs_dt[, ..I := NULL]
as_tibble(drugs_dt)
```
This is a solution for those using [**{data.table}**](https://rdatatable.gitlab.io/data.table/).
It is not particularly efficient, nor particularly easy to remember for those who do not program frequently using [**{data.table}**](https://rdatatable.gitlab.io/data.table/).


#### Alternative 7: `apply()`

```{r apply}
drugs |>
mutate(everused = apply(pick(-caseid), 1L, any))
```
This is the base R solution. It is very efficient and is actually part of the default method used in `lay()`.
Our implementation of `lay()` removes the need to define the margin (the `1L` above) and benefits from
the automatic splicing and the lambda syntax as shown above.
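
As a rough sketch of that equivalence, assuming {dplyr} (version 1.1.0 or later, for `pick()`) and {lay} are installed, the two calls below are expected to produce the same column. The table `toy` is a made-up stand-in for `drugs` and is not part of the package:

```r
library(dplyr)
library(lay)  # assumes {lay} is installed

# Hypothetical toy data standing in for the `drugs` dataset
toy <- tibble(
  caseid  = c("a", "b", "c"),
  codeine = c(FALSE, TRUE, FALSE),
  morphin = c(FALSE, FALSE, FALSE)
)

# Base R: the margin (1L = rows) must be given explicitly
with_apply <- toy |>
  mutate(everused = apply(pick(-caseid), 1L, any))

# lay(): same rowwise job, with no margin to specify
with_lay <- toy |>
  mutate(everused = lay(pick(-caseid), any))

# Compare the two new columns
all(with_apply$everused == with_lay$everused)
```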


#### Alternative 8: `for (i in ...) {...}`

```{r for}
drugs$everused <- NA
columns_in <- !colnames(drugs) %in% c("caseid", "everused")
for (i in seq_len(nrow(drugs))) {
drugs$everused[i] <- any(drugs[i, columns_in])
}
drugs
```
This is another base R solution, which does not involve any external package. It is not very pretty,
nor particularly efficient.


#### Other alternatives?

There are probably other ways. If you think of a nice one, please open an issue and we will add it here!


### Efficiency

Here are the results of a benchmark comparing alternative implementations for our simple rowwise job on
a larger dataset with `r ncol(drugs_full)` columns and `r nrow(drugs_full)` rows (see [benchmark](https://courtiol.github.io/lay/articles/benchmark.html) for details and more tests):

```{r bench_run1, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE, fig.width=8, fig.height=5}
rm(drugs)
drugs_full_dt <- data.table(drugs_full) ## coercion to data.table
benchmark1 <- bench::mark(
vectorized = {
drugs_full |>
mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
lay = {
drugs_full |>
select(-caseid) |>
mutate(everused = lay(pick(everything()), any))},
c_across = {
drugs_full |>
rowwise() |>
mutate(everused = any(c_across(-caseid))) |>
ungroup()},
pivot_pivot = {
drugs_full |>
pivot_longer(-caseid) |>
group_by(caseid) |>
mutate(everused = any(value)) |>
ungroup() |>
pivot_wider()},
pmap = {
drugs_full |>
mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
slider = {
drugs_full |>
mutate(everused = slide_lgl(pick(-caseid), any))},
data.table = {
drugs_full_dt[, ..I := .I]
drugs_full_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
apply = {
drugs_full |>
mutate(everused = apply(pick(-caseid), 1, any))},
'for' = {
everused <- logical(nrow(drugs_full))
columns_in <- colnames(drugs_full) != "caseid"
for (i in seq_len(nrow(drugs_full))) everused[i] <- any(drugs_full[i, columns_in])},
iterations = 5,
time_unit = "ms",
check = FALSE
)
benchmark1 |>
mutate(expression = forcats::fct_reorder(as.character(expression), median, .desc = TRUE)) |>
plot()
```

Note that the x-axis of the plot is on a logarithmic scale.

As you can see, `lay()` is not just simple and powerful, it is also quite efficient!


### History

<img src="https://github.com/courtiol/lay/raw/main/.github/pics/lay_history.png" alt="lay_history" align="right" width="400">

The first draft of this package was created by **@romainfrancois** as a reply to a tweet I posted under **@rdataberlin** in February 2020.
At the time I was exploring different ways to perform rowwise jobs in R and I was experimenting with various ideas on how to exploit
the fact that the newly introduced function `across()` from [**{dplyr}**](https://dplyr.tidyverse.org/) creates tibbles on which on can easily apply a function.
Romain came up with `lay()` as the better solution making good use of [**{rlang}**](https://rlang.r-lib.org/) & [**{vctrs}**](https://vctrs.r-lib.org/).

The verb `lay()` never made it to be integrated within [**{dplyr}**](https://dplyr.tidyverse.org/) and, so far, I still find `lay()` superior to
most alternatives, which is why I decided to maintain this package.
The first draft of this package was created by **@romainfrancois** as a reply to a tweet I (Alexandre Courtiol) posted under **@rdataberlin** in February 2020.
At the time I was exploring different ways to perform rowwise jobs in R and I was experimenting with various ideas on how to exploit the fact that the newly introduced function `across()` from [**{dplyr}**](https://dplyr.tidyverse.org/) creates tibbles on which one can easily apply a function.
Romain came up with `lay()` as the better solution, making good use of [**{rlang}**](https://rlang.r-lib.org/) & [**{vctrs}**](https://vctrs.r-lib.org/).

In short, I deserve little credit and instead you should feel free to buy Romain a coffee [here](https://ko-fi.com/romain) or to sponsor his [github profile](https://github.com/romainfrancois).
The verb `lay()` never made it to be integrated within [**{dplyr}**](https://dplyr.tidyverse.org/), but, so far, I still find `lay()` superior to most alternatives, which is why I decided to document and maintain this package.