-
Notifications
You must be signed in to change notification settings - Fork 26
/
Copy pathREADME.Rmd
273 lines (201 loc) · 9.65 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
---
title: "pipelearner"
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
pipelearner makes it easy to create machine learning pipelines in R.
## Installation and background
pipelearner is currently available from github as a development package only. It can be installed by running:
```{r, eval = F}
# install.packages("devtools")
devtools::install_github("drsimonj/pipelearner")
```
pipelearner is built on top of [tidyverse](https://github.com/tidyverse/tidyverse) packages like [modelr](https://github.com/hadley/modelr). To harness the full power of pipelearner, it will help to possess some technical knowledge of tidyverse tools such as:
- `%>%` the pipe operator from [magrittr](https://github.com/tidyverse/magrittr) package.
- tibbles from [tibble](https://github.com/tidyverse/tibble) package.
- `map()` and other iteration functions from [purrr](https://github.com/hadley/purrr) package.
- `resample` objects from [modelr](https://github.com/hadley/modelr) package.
An excellent resource to get started with these is [R for Data Science](http://r4ds.had.co.nz/), by Garrett Grolemund and Hadley Wickham.
## API
Similar to the way ggplot2 elements are layered with `+`, you initialize and customize a pipelearner object, which is a list, with functions that can be piped into eachother with `%>%`. Rather than plotting, however, a pipelearner then learns.
**Initialize** a pipelearner object with:
- `pipelearner()`
**Customize** a pipelearner with:
- `learn_cvpairs()` to customize the cross-validation pairs.
- `learn_curves()` to customize the learning curves using incremental proportions of training data.
- `learn_models()` to add new learning models.
**Learn** (fit) everything and obtain a tibble of results with:
- `learn()`
### Initialization
The following initializes a pipelearner object that will use the `iris` data set and linear regression (`lm`) to learn how to predict `Sepal.Length` with all other available variables (`Sepal.Length ~ .`).
```{r}
library(pipelearner)
pl <- pipelearner(iris, lm, Sepal.Length ~ .)
```
Print a pipelearner object to expose the list elements.
```{r}
pl
```
##### Defaults to note
- `data` is split into a single cross-validation pair of resample objects (under `cv_pairs`) referencing 80% of the data for training and 20% for testing.
- Learning is done on the entire proportion of the training data (`train_ps == 1`).
### Learning
Once a pipelearner is setup, use `learn()` to fit all models to every combination of training proportions (`train_ps`) and set of training data in the cross-validation pairs (`cv_pairs`), and return a tibble of the results.
```{r}
pl %>% learn()
```
##### Quick notes
- `fit` contains the fitted models.
- `params` contains all model parameters including the formula.
- `train` contains a resample object referencing the data that each model was fitted to.
- `test` contains a resample object referencing test data that models were *not* fitted to (for later use).
### Cross-validation pairs
Cross-validation pairs can be customized with `learn_cvpairs()`. The following implements k-fold cross validation with `k = 4`.
```{r}
pl %>%
learn_cvpairs(crossv_kfold, k = 4) %>%
learn()
```
Notice the five rows where the model has been fitted to training data for each fold, represented by `cv_pairs.id`. The precise training data sets are also stored under `train`.
### Learning curves
Learning curves can be customized wth `learn_curves()`. The following will fit the model to three proportions of the training data (.5, .75, and 1):
```{r}
pl %>%
learn_curves(.5, .75, 1) %>%
learn()
```
Notice the three rows where the model has been fitted to the three proportions of the training data, represented by `train_p`. Again, `train` contains references to the precise data used in each case.
### More models
Add more models with `learn_models()`. For example, the following adds a decision tree to be fitted:
```{r}
pl %>%
learn_models(rpart::rpart, Sepal.Length ~ .) %>%
learn()
```
Notice two rows where the regression and decision tree models have been fit to the training data, represented by `models.id`. The different model calls also appear under `model`.
Things to know about `learn_models()`:
- Unlike the other `learn_*()` functions, it can be called multiple times within the pipeline.
- It is called implicitly by `pipelearner()` when arguments beyond a data frame are supplied. For example, `pipelearner(d, l, f, ...)` is equivalent to `pipelearner(d) %>% learn_models(l, f, ...)`.
- Its arguments can all be vectors, which will be expanded to all combinations. This makes it easy to do things like compare many models with the same formulas, compare many different formulas, or do grid-search.
For example, the following fits two models with three formulas:
```{r}
pipelearner(iris) %>%
learn_models(c(lm, rpart::rpart),
c(Sepal.Length ~ Sepal.Width,
Sepal.Length ~ Sepal.Width + Petal.Length,
Sepal.Length ~ Sepal.Width + Petal.Length + Species)) %>%
learn()
```
The following fits a regression model and grid-searches hyperparameters of a decision tree:
```{r}
pipelearner(iris) %>%
learn_models(lm, Sepal.Length ~ .) %>%
learn_models(rpart::rpart, Sepal.Length ~ .,
minsplit = c(2, 20), cp = c(0.01, 0.1)) %>%
learn()
```
Remember that these additional parameters (including different formulas) are contained under `params`.
## Bringing it all together
After initialization, pipelearner functions can be combined in a single pipeline. For example, the following will:
- Initialize a blank pipelearner object with the `iris` data set.
- Create 50 cross-validation pairs (holding out random 20% of data by default in each)...
- to each be fitted in sample size proportions of .5 to 1 in increments of .1.
- With a regression modelling all interactions...
- and a decision tree modelling all features.
- Fit all models and return the results.
```{r}
iris %>%
pipelearner() %>%
learn_cvpairs(crossv_mc, n = 50) %>%
learn_curves(seq(.5, 1, by = .1)) %>%
learn_models(lm, Sepal.Width ~ .*.) %>%
learn_models(rpart::rpart, Sepal.Width ~ .) %>%
learn()
```
## Beyond learning
As you can see, pipelearner makes it easy to fit many models. The next step is to extract performance metrics from the tibble of results. This is where prior familiarity working with tidyverse tools becomes useful if not essential.
At present, pipelearner doesn't provide functions to extract any further information. This is because the information to be extracted can vary considerably between the models fitted to the data.
The following will demonstrate an example of visualising learning curves by extracting performance information from regression models.
`r_square()` is setup to extract an R-squared value. It is based on `modelr::rsquare`, but adjusted to handle new data sets (I've submitted [an issue](https://github.com/hadley/modelr/issues/37) to incorporate into `modelr`).
```{r}
# R-Squared scoring (because modelr rsquare doen't work right now)
response_var <- function(model) {
formula(model)[[2L]]
}
response <- function(model, data) {
eval(response_var(model), as.data.frame(data))
}
r_square <- function(model, data) {
actual <- response(model, data)
residuals <- predict(model, data) - actual
1 - (var(residuals, na.rm = TRUE) / var(actual, na.rm = TRUE))
}
```
Using a subset of the `weather` data from the `nycflights13` package, fit a single regression model to 50 cross-validation pairs, holding out 15% of the data for testing in each case, in iterative training proportions. Note heavy use of tidyverse functions.
```{r, message = F}
library(tidyverse)
# Create the data set
library(nycflights13)
d <- weather %>%
select(visib, humid, precip, wind_dir) %>%
drop_na() %>%
sample_n(2000)
results <- d %>%
pipelearner() %>%
learn_cvpairs(crossv_mc, n = 50, test = .15) %>%
learn_curves(seq(.1, 1, by = .1)) %>%
learn_models(lm, visib ~ .) %>%
learn()
results
```
New columns are added with `dplyr::mutate` containing the rsquared values for each set of training and test data by using `purrr` functions.
```{r, message = F}
results <- results %>%
mutate(
rsquare_train = map2_dbl(fit, train, r_square),
rsquare_test = map2_dbl(fit, test, r_square)
)
results %>% select(cv_pairs.id, train_p, contains("rsquare"))
```
We can visualize these learning curves as follows:
```{r eg_curve}
results %>%
select(train_p, contains("rsquare")) %>%
gather(source, rsquare, contains("rsquare")) %>%
ggplot(aes(train_p, rsquare, color = source)) +
geom_jitter(width = .03, alpha = .3) +
stat_summary(geom = "line", fun.y = mean) +
stat_summary(geom = "point", fun.y = mean, size = 4) +
labs(x = "Proportion of training data used",
y = "R Squared")
```
The example below fits a decision tree and random forest to 20 folds of a subset of the data.
```{r}
results <- d %>%
pipelearner() %>%
learn_cvpairs(crossv_kfold, k = 20) %>%
learn_models(c(rpart::rpart, randomForest::randomForest),
visib ~ .) %>%
learn()
results
```
Then compute R-Square statistics and visualize the results:
```{r eg_2models, message = F}
results %>%
mutate(rsquare_train = map2_dbl(fit, train, r_square),
rsquare_test = map2_dbl(fit, test, r_square)) %>%
select(model, contains("rsquare")) %>%
gather(source, rsquare, contains("rsquare")) %>%
ggplot(aes(model, rsquare, color = source)) +
geom_jitter(width = .05, alpha = .3) +
stat_summary(geom = "point", fun.y = mean, size = 4) +
labs(x = "Learning model",
y = "R Squared")
```