# forecastxgb-r-package
The `forecastxgb` package provides time series modelling and forecasting functions that combine the machine learning approach of Chen, He and Benesty's [`xgboost`](https://CRAN.R-project.org/package=xgboost) with the convenient handling of time series and familiar API of Rob Hyndman's [`forecast`](http://github.com/robjhyndman/forecast). It applies to time series the extreme gradient boosting approach that grew out of [*Greedy Function Approximation: A Gradient Boosting Machine*, by Jerome Friedman in 2001](http://www.jstor.org/stable/2699986). xgboost has become an important machine learning algorithm, and is nicely explained in [this accessible documentation](http://xgboost.readthedocs.io/en/latest/model.html).
[![Travis-CI Build Status](https://travis-ci.org/ellisp/forecastxgb-r-package.svg?branch=master)](https://travis-ci.org/ellisp/forecastxgb-r-package)
[![CRAN version](http://www.r-pkg.org/badges/version/forecastxgb)](http://www.r-pkg.org/pkg/forecastxgb)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/forecastxgb)](http://www.r-pkg.org/pkg/forecastxgb)
**Warning: this package is under active development and is some way off a CRAN release (realistically, not until some time in 2017). Currently the forecasting results with the default settings are, frankly, pretty rubbish, but there is hope of finding better settings. The API and default values of arguments should be expected to continue to change.**
## Installation
Only on GitHub for now; a CRAN release will follow once the package is more stable. Comments and suggestions welcomed.
This implementation uses as explanatory features:
* lagged values of the response variable
* dummy variables for seasons
* current and lagged values of any external regressors supplied as `xreg`
```{r echo = FALSE}
set.seed(123)
library(knitr)
knit_hooks$set(mypar = function(before, options, envir) {
  if (before) par(bty = "l", family = "serif")
})
opts_chunk$set(comment=NA, fig.width=7, fig.height=5, cache = FALSE, mypar = TRUE)
```
```{r eval = FALSE}
devtools::install_github("ellisp/forecastxgb-r-package/pkg")
```
## Usage
### Basic usage
The workhorse function is `xgbar`. This fits a model to a time series. Under the hood, it creates a matrix of explanatory variables based on lagged versions of the response time series, and (optionally) dummy variables of some sort for seasons. That matrix is then fed as the feature set for `xgboost` to do its stuff.
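The idea can be illustrated with a minimal base R sketch (not the package's internal code; the function name and `maxlag` argument here are made up for this example):

```{r}
# Illustrative sketch of turning a series into a lagged feature matrix;
# this is NOT xgbar's actual internal code.
build_lag_matrix <- function(y, maxlag = 3) {
  n <- length(y)
  # column l holds the value of y at lag l for each target observation
  X <- sapply(seq_len(maxlag), function(l) y[(maxlag - l + 1):(n - l)])
  colnames(X) <- paste0("lag", seq_len(maxlag))
  list(y = y[(maxlag + 1):n], X = X)
}
```

Each row of `X` then lines up with one target value in `y`, and the pair can be handed to `xgboost` as an ordinary supervised learning problem.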
### Univariate
Usage with default values is straightforward. Here it is fit to Australian monthly gas production 1956-1995, an example dataset provided in `forecast`:
```{r message = FALSE}
library(forecastxgb)
model <- xgbar(gas)
```
(Note: the "Stopping. Best iteration..." message printed to the screen is produced by `xgboost::xgb.cv`, which uses `cat()` rather than `message()` to report on its progress.)
By default, `xgbar` uses row-wise cross-validation to determine the best number of rounds of iterations for the boosting algorithm without overfitting. A final model is then fit on the full available dataset. The relative importance of the various features in the model can be inspected by `importance_xgb()` or, more conveniently, the `summary` method for objects of class `xgbar`.
```{r}
summary(model)
```
We see in the case of the gas data that the most important feature in explaining gas production is the production 12 months previously; and then other features decrease in importance from there but still have an impact.
Forecasting is the main purpose of this package, and a `forecast` method is supplied. The resulting objects are of class `forecast` and familiar generic functions work with them.
```{r}
fc <- forecast(model, h = 12)
plot(fc)
```
Note that prediction intervals are not currently available.
See the vignette for more extended examples.
### With external regressors
External regressors can be added by using the `xreg` argument familiar from other forecast functions like `auto.arima` and `nnetar`. `xreg` can be a vector or `ts` object but is easiest to integrate into the analysis if it is a matrix (even a matrix with one column) with well-chosen column names; that way feature names persist meaningfully.
The example below, with data taken from the `fpp` package supporting Athanasopoulos and Hyndman's [Forecasting Principles and Practice](https://www.otexts.org/fpp) book, shows income being used to explain consumption. In the same way that the response variable `y` is expanded into lagged versions of itself, each column in `xreg` is expanded into lagged versions, which are then treated as individual features for `xgboost`.
```{r message = FALSE}
library(fpp)
consumption <- usconsumption[ ,1]
income <- matrix(usconsumption[ ,2], dimnames = list(NULL, "Income"))
consumption_model <- xgbar(y = consumption, xreg = income)
summary(consumption_model)
```
We see that the two most important features explaining consumption are the values of consumption in the two previous quarters, followed by income in the current quarter, and so on.
The challenge of using external regressors in a forecasting environment is that to forecast, you need values of the future external regressors. One way this is sometimes done is by first forecasting the individual regressors. In the example below we do this, making sure the data structure is the same as the original `xreg`. When the new value of `xreg` is given to `forecast`, it forecasts forward the number of rows of the new `xreg`.
```{r}
income_future <- matrix(forecast(xgbar(usconsumption[,2]), h = 10)$mean,
                        dimnames = list(NULL, "Income"))
plot(forecast(consumption_model, xreg = income_future))
```
## Options
### Seasonality
Currently there are three methods of treating seasonality.
- The current default method is to throw dummy variables for each season into the mix of features for `xgboost` to work with.
- An alternative is to perform classic multiplicative seasonal adjustment on the series before feeding it to `xgboost`. This seems to work better.
- A third option is to create a set of pairs of Fourier transform variables and use them as x regressors.
```{r echo = FALSE}
model1 <- xgbar(co2, seas_method = "dummies")
model2 <- xgbar(co2, seas_method = "decompose")
model3 <- xgbar(co2, seas_method = "fourier")
plot(forecast(model1), main = "Dummy variables for seasonality")
plot(forecast(model2), main = "Decomposition seasonal adjustment for seasonality")
plot(forecast(model3), main = "Fourier transform pairs as x regressors")
```
All methods perform quite poorly at the moment, suffering from the difficulty the default settings have in dealing with non-stationary data (see below).
### Transformations
The data can be transformed by a modulus power transformation (as per John and Draper, 1980) before feeding to `xgboost`. This transformation is similar to a Box-Cox transformation, but works with negative data. Leaving the `lambda` parameter as 1 will effectively switch off this transformation.
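For reference, the modulus transformation can be sketched in a few lines (an illustrative stand-alone version; `xgbar` applies its own internal implementation):

```{r}
# John and Draper (1980) modulus power transformation: behaves like
# Box-Cox but is also defined for negative values.
modulus_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) {
    sign(y) * log(abs(y) + 1)
  } else {
    sign(y) * ((abs(y) + 1) ^ lambda - 1) / lambda
  }
}
```

With `lambda = 1` this returns `y` unchanged, which is why that value effectively switches the transformation off.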
```{r echo = FALSE}
model1 <- xgbar(co2, seas_method = "decompose", lambda = 1)
model2 <- xgbar(co2, seas_method = "decompose", lambda = BoxCox.lambda(co2))
plot(forecast(model1), main = "No transformation")
plot(forecast(model2), main = "With transformation")
```
Version 0.0.9 of `forecastxgb` gave `lambda` the default value of `BoxCox.lambda(abs(y))`. This returned spectacularly bad forecasting results. Forcing the value to be between 0 and 1 helped a little, but still gave very bad results. So far there isn't evidence (but neither has there been enough investigation) that a Box-Cox transformation helps `xgbar` with its model fitting at all.
### Non-stationarity
From experiments so far, it seems the basic idea of `xgboost` struggles in this context with extrapolation beyond the range of values seen in the training set. This suggests better results might be obtained by transforming the series into a stationary one before modelling - a similar approach to that taken by `forecast::auto.arima`. This option is available via `trend_method = "differencing"` and seems to perform well - certainly better than without - and it will probably be made a default setting once more experience is available.
```{r}
model <- xgbar(AirPassengers, trend_method = "differencing", seas_method = "fourier")
plot(forecast(model, 24))
```
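The intuition behind the differencing option can be shown in a couple of lines of base R (illustrative only; `xgbar` handles the differencing and back-transformation internally):

```{r}
# A trending series drifts outside its historical range, but its first
# differences stay roughly within theirs - which suits tree-based methods.
y  <- AirPassengers
dy <- diff(y)                      # (approximately) stationary series to model
# a forecast made on the differenced scale is undone with a cumulative sum:
y_rebuilt <- cumsum(c(y[1], dy))   # recovers the original series exactly
```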
## Future developments
Future work might include:
* additional automated time-dependent features (e.g. dummy variables for trading days, Easter, etc.)
* ability to include `xreg` values that don't get lagged
* some kind of automated multiple-variable forecasting, similar to a vector autoregression
* better choices of defaults for values such as `lambda` (for power transformations), `K` (for Fourier transforms) and, most likely to be effective, `maxlag`