README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# gpmodels

## A Grammar of Prediction Models

This package provides a grammar for data preparation and evaluation of fixed-origin and rolling-origin prediction models using data collected at irregular intervals.

<!-- badges: start -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
<!-- badges: end -->

## Installation

You can install the GitHub version of gpmodels with:

```{r eval=FALSE}
remotes::install_github('ML4LHS/gpmodels')
```

## How to set up a time_frame()

Start by loading and package and defining your `time_frame()`. A `time_frame` is simply a list with the class `time_frame` and contains all the key information needed to describe both your fixed dataset (such as demographics, one row per patient) and your temporal dataset (one row per observation linked to a timestamp).

```{r}
library(gpmodels)
```

```{r}
library(magrittr)
library(lubridate)

future::plan('multisession')

unlink(file.path(tempdir(), 'gpmodels_dir', '*.*'))

tf = time_frame(fixed_data = sample_fixed_data,
               temporal_data = sample_temporal_data %>% dplyr::filter(id %in% 1:100),
               fixed_id = 'id',
               fixed_start = 'admit_time',
               fixed_end = 'dc_time',
               temporal_id = 'id',
               temporal_time = 'time',
               temporal_variable = 'variable',
               temporal_category = 'category',
               temporal_value = 'value',
               step = hours(6),
               max_length = days(7), # optional parameter to limit to first 7 days of hospitalization
               output_folder = file.path(tempdir(), 'gpmodels_dir'),
               create_folder = TRUE)

```

## Let's look at the automatically generated data dictionaries

```{r}
names(tf)

tf$step

tf$step_units

tf$fixed_data_dict

tf$temporal_data_dict
```

## Let's dummy code the temporal categorical variables


```{r}
tf = tf %>% 
  pre_dummy_code()
```


# This affects only the temporal data and not the fixed data.

```{r}
tf$fixed_data_dict

tf$temporal_data_dict
```

## Let's add some predictors and outcomes

The default method writes output to the folder defined in your `time_frame`. When you write your output to file, you are allowed to chain together `add_predictors()` and `add_outcomes()` functions. This is possble because these functions invisibly return a `time_frame`.

If, however, you set `output_file` to `FALSE`, then your actual output is returned (rather than the `time_frame`) so you cannot chain functions.

```{r}
tf %>%           
  add_rolling_predictors(variables = 'cr', # Note: You can supply a vector of variables
                         lookback = hours(12), 
                         window = hours(6), 
                         stats = c(mean = mean,
                                   min = min,
                                   max = max,
                                   median = median,
                                   length = length)) %>%
  add_baseline_predictors(variables = 'cr', # add baseline creatinine
                          lookback = days(90),
                          offset = hours(10),
                          stats = c(min = min)) %>%
  add_growing_predictors(variables = 'cr', # cumulative max creatinine since admission
                         stats = c(max = max)) %>%
  add_rolling_predictors(category = 'med', # Note: category is always a regular expression 
                         lookback = days(7),
                         stats = c(sum = sum)) %>% 
  add_rolling_outcomes(variables = 'cr',
                       lookahead = hours(24), 
                       stats = c(max = max))
```

## Let's combine our output into a single data frame

You can provide `combine_output()` with a set of data frames separated by commas. Or, you can provide a vector of file names using the `files` argument. If you leave `files` blank, it will automatically find all the `.csv` files from the `output_folder` of your `time_frame`.

This resulting frame is essentially ready for modeling (using `tidymodels`, for example). Make sure to keep individual patients in the same fold if you divide this dataset into multiple folds.

```{r}
model_data = combine_output(tf)

head(model_data)
```


## Testing time_frame without writing output to files

If you want to simply test `time_frame`, you may prefer not to write your output to file. You can accomplish this by setting `output_file` to `FALSE`.

```{r}
tf %>% 
  add_rolling_predictors(variables = 'cr',
                         lookback = hours(12), 
                         window = hours(6), 
                         stats = c(mean = mean,
                                   min = min,
                                   max = max,
                                   median = median,
                                   length = length),
                         output_file = FALSE) %>% 
  head()
```

## You can also supply a vector of variables

```{r}
tf %>% 
  add_rolling_predictors(variables = c('cr', 'med_aspirin'),
                         lookback = weeks(1), 
                         stats = c(length = length),
                         output_file = FALSE) %>% 
  head()
```

## Category accepts regular expressions

```{r}
tf %>% 
  add_rolling_predictors(category = 'lab|med',
                         lookback = hours(12), 
                         stats = c(length = length),
                         output_file = FALSE) %>% 
  head()
```

## Let's benchmark the performance on our package

### Running in parallel

```{r message=FALSE}
benchmark_results = list()

# future::plan('multisession')

benchmark_results[['multisession']] = 
  microbenchmark::microbenchmark(
    tf %>% 
      add_rolling_predictors(variable = 'cr',
                             lookback = hours(48), 
                             window = hours(6), 
                             stats = c(mean = mean,
                                       min = min,
                                       max = max,
                                       median = median,
                                       length = length)),
    times = 1
  )
```

### Running in parallel with a chunk_size of 20

```{r}

tf_with_chunks = tf
tf_with_chunks$chunk_size = 20

benchmark_results[['multisession with chunk_size 20']] = 
  microbenchmark::microbenchmark(
    tf_with_chunks %>% 
      add_rolling_predictors(variable = 'cr',
                             lookback = hours(48), 
                             window = hours(6), 
                             stats = c(mean = mean,
                                       min = min,
                                       max = max,
                                       median = median,
                                       length = length)),
    times = 1
  )
```

### Running in serial

```{r message=FALSE}
future::plan('sequential')

benchmark_results[['sequential']] = 
  microbenchmark::microbenchmark(
  tf %>% 
    add_rolling_predictors(variable = 'cr',
                           lookback = hours(48), 
                           window = hours(6), 
                           stats = c(mean = mean,
                                     min = min,
                                     max = max,
                                     median = median,
                                     length = length)),
  times = 1
  )

```

## Benchmark results

```{r}
benchmark_results
```

```{r include=FALSE}
unlink(file.path(tempdir(), 'gpmodels_dir', '*.*'))
```