
Average predictive comparisons


This came from compiling a ".rmd" file.

One reason for building models is to understand how some response $y$ (or its expectation) varies with an input $x_i$, holding the other inputs constant. For a linear model like

$$\mathbb{E}(y) = \sum \beta_i x_i,$$

the $\beta_i$ answer this question: a 1-unit increase in $x_i$ corresponds to a $\beta_i$-unit increase in $\mathbb{E}(y)$.

But any number of important models may be much harder to understand:

  • generalized linear models
  • hierarchical models
  • models that make use of each input multiple times as multiple different features (such as $x_i$, $x_i^2$, etc.)
  • linear models with interactions
  • random forests
  • etc.

Two methods that address this question for more complicated models are:

  • partial plots (for example, the partialPlot function in the randomForest R package);
  • average predictive comparisons (Gelman and Pardoe).

Both methods agree with the $\beta_i$ for the linear model above, but both methods are also much more general.

This post will:

  1. explain partial plots;
  2. illustrate partial plots with a simple artificial example where they work well;
  3. illustrate my concerns about partial plots with a simple artificial example where they don't work well;
  4. explain average predictive comparisons.

randomForest's partialPlot

If we want to know the effect of changing $u$ while holding the other inputs (collectively, $v$) constant, the key question is: at what values should we hold everything else constant? The partialPlot function answers this by creating one copy of the data set for each potential value of $u$, leaving all other features the same as in the original data set. This generates a range of predictions for each value of $u$, and the partialPlot function plots their average:

$$\tilde{f}(u) = \frac{1}{n}\sum_{i=1}^n f(u,v_i)$$

where $f(u,v)$ is our estimate for $y$ at inputs $u$ and $v$.
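
To make the averaging concrete, here is a minimal R sketch of the same computation (a sketch only, not the partialPlot function itself; `partial_plot_curve` and its arguments are hypothetical names):

```r
## Average the model's predictions over the observed values of v,
## substituting each candidate value of u in turn.
##   f      -- a function f(u, v) returning predictions
##   v      -- the observed values of the other input(s)
##   u_grid -- the candidate values of u to evaluate
partial_plot_curve <- function(f, v, u_grid) {
  sapply(u_grid, function(u) mean(f(u, v)))
}
```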

First I'll illustrate with a toy example with two input variables $u$ and $v$ drawn from independent standard normal distributions. In the toy example, my model for $y$ is:

$$\mathbb{E}[y|u,v] = 2 u + u v.$$

In the plot, the colored points show the copy of the data set used at each value of $u$, with different values of $v$ indicated by color. We can see the different slopes of the model output at different values of $v$, along with a thick red line for the mean model output. That red line is the partial plot: the average of $\mathbb{E}[y|u,v]$ over the observed values of $v$.

(My example plots don't use the partialPlot function itself because rather than using a random forest, I'm pretending we magically know $\mathbb{E}[y|u,v]$, and because I wanted richer plots showing the individual points as well as the average.)
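
For instance, a sketch of this toy example using the hypothetical `partial_plot_curve` above, with the "magically known" $\mathbb{E}[y|u,v] = 2u + uv$ standing in for a fitted model:

```r
set.seed(1)
v <- rnorm(1000)                      # observed values of v (standard normal)
f <- function(u, v) 2 * u + u * v     # the known conditional expectation E[y | u, v]

u_grid <- seq(-2, 2, by = 0.1)
avg_curve <- partial_plot_curve(f, v, u_grid)

## Because v averages to roughly 0, the averaged curve is close to 2u.
plot(u_grid, avg_curve, type = "l", xlab = "u", ylab = "average prediction")
```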

(Plot: model output against $u$ for each value of $v$, with the partial-plot average as a thick red line.)

This plot also shows the value in looking at the whole range of predictions rather than just the average. We can see not only the average effect of $u$ but also the variation.

Correlated Inputs

When inputs are correlated, a potential problem with partialPlot is that substituting in one value for one feature while holding the rest the same may generate very unlikely combinations of inputs, causing the results to be misleading. In the previous example, $u$ and $v$ were uncorrelated so this problem couldn't arise. Now I'll describe an example where this is more of a problem.

For simplicity, my example will have only a small number of possible combinations of $u$, $v$, and $y$, so we can express the entire density function $P(u,v,y)$ in a small table:

| $P(\cdot)$ | $u$ | $v$ | $y$ |
|------------|-----|-----|-----|
| $.4$       | $0$ | $0$ | $0$ |
| $.4$       | $1$ | $1$ | $1$ |
| $.2$       | $1$ | $0$ | $1$ |

In this case we can't even make the partial plot: $\mathbb{E}(y|u=0, v=1)$ doesn't exist (the combination $u=0, v=1$ is impossible), and we'd need it to build the partial plot. If we use the partialPlot function with a randomForest in a situation like this, the random forest will of course output some predictions, but they won't be based on reality!

So let's instead suppose that there is enough data with $u=0, v=1$ to form reasonable inferences about $\mathbb{E}(y|u=0, v=1)$, but still the density function has very very little mass on that combination relative to the others:

| $P(\cdot)$ | $u$ | $v$ | $y$  |
|------------|-----|-----|------|
| ~$0.4$     | $0$ | $0$ | $0$  |
| ~$0.4$     | $1$ | $1$ | $1$  |
| ~$0$       | $0$ | $1$ | $10$ |
| ~$0.2$     | $1$ | $0$ | $1$  |


To determine what the partial plot would look like, let's add a column to the table giving $y$ when $u$ is switched to $0$, and another column for when $u$ is switched to $1$. This gives us two copies of the data, one for $u=0$ and one for $u=1$ (each holding $v$ at its original values).

| label | $P(\cdot)$ | $u$ | $v$ | $y$  | $y$ (if we switch $u$ to 0) | $y$ (if we switch $u$ to 1) |
|-------|------------|-----|-----|------|-----------------------------|-----------------------------|
| a     | ~$.4$      | $0$ | $0$ | $0$  | $0$                         | $1$                         |
| b     | ~$.4$      | $1$ | $1$ | $1$  | $10$                        | $1$                         |
| c     | ~$0$       | $0$ | $1$ | $10$ | $10$                        | $1$                         |
| d     | ~$.2$      | $1$ | $0$ | $1$  | $0$                         | $1$                         |

So weighting by probability mass, the partial plot for $u$ takes $y$ from $4$ at $u=0$ (that's $.4 \cdot 0 + .4 \cdot 10 + .2 \cdot 0$) to $1$ at $u=1$. This is weird! The problem is that the very low probability situation (c) supplies the prediction that gets substituted into the high-probability row (b), so it ends up having a large effect on the result.
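
As a quick check of that arithmetic, a small sketch encoding rows (a)–(d), treating the ~$0$ probability as exactly $0$ (hypothetical variable names):

```r
p       <- c(a = .4, b = .4, c = 0, d = .2)   # probability mass of each row
y_if_u0 <- c(a = 0,  b = 10, c = 10, d = 0)   # y if we switch u to 0
y_if_u1 <- c(a = 1,  b = 1,  c = 1,  d = 1)   # y if we switch u to 1

sum(p * y_if_u0)   # partial plot value at u = 0: 4
sum(p * y_if_u1)   # partial plot value at u = 1: 1
```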

Here's that in the form of a graph:

(Plot: the partial plot for $u$, dropping from $4$ at $u=0$ to $1$ at $u=1$.)

To summarize, the problems here are both practical and philosophical:

  • Practical: Given the correlations in the data, the partial plot is very sensitive to the estimated $\mathbb{E}(y)$ in a very small portion of the data. This means it may be difficult or impossible to correctly estimate the partial plot.
  • Philosophical: The possibility of this extreme dependence on a small portion of the data should be a clue that we're looking at the wrong thing. When we're trying to understand the influence of $u$, should we really let ourselves be influenced by very unlikely combinations of features? I'd say no.

Before discussing another approach, I'll briefly mention a solution that doesn't work: modeling the distribution of $v$ given $u$, and using those values of $v$ rather than the overall distribution of $v$ in making the partial plot. We'd just end up with a way of estimating $\mathbb{E}(y | u)$, which would be how $y$ varies with $u$, but not holding other stuff constant.

Average Predictive Comparison

For a given jump in $u$ from $u^{(1)}$ to $u^{(2)}$ and at a particular value of $v$, Gelman and Pardoe define the corresponding predictive comparison as the change in $\mathbb{E}(y)$ per unit change in $u$ for those particular values of $u$, $v$:

$$\delta_u(u^{(1)} \rightarrow u^{(2)}, v) = \frac{\mathbb{E}(y|u^{(2)},v) - \mathbb{E}(y|u^{(1)},v)}{u^{(2)} - u^{(1)}}.$$

(I'm omitting the posterior parameters $\theta$ since for my toy examples I'm pretending there's no uncertainty.)

The average predictive comparison (APC) $\Delta_u$ is an appropriate weighted average of these $\delta_u$'s, where the average is taken with respect to $p(v) \cdot p(u^{(1)}|v) \cdot p(u^{(2)}|v)$. This means that pairings between $v$ (the other inputs) and $u^{(1)}, u^{(2)}$ that are more likely receive more weight.
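
Here is a minimal R sketch of that weighted average for a fully discrete case, following the description above rather than the paper's estimator; `apc_discrete`, `Ey`, and `p_uv` are hypothetical names, and the joint distribution of $(u, v)$ is assumed to be given as a table of probabilities:

```r
## Weighted average of delta_u over pairs (u1, u2) and values of v,
## with weights p(v) * p(u1 | v) * p(u2 | v).
##   Ey     -- function returning E(y | u, v)
##   u_vals -- possible values of u
##   v_vals -- possible values of v
##   p_uv   -- matrix of P(u, v): rows indexed by u_vals, columns by v_vals
apc_discrete <- function(Ey, u_vals, v_vals, p_uv) {
  p_v <- colSums(p_uv)                      # marginal p(v)
  num <- 0
  den <- 0
  for (k in seq_along(v_vals)) {
    if (p_v[k] == 0) next
    p_u_given_v <- p_uv[, k] / p_v[k]       # p(u | v)
    for (i in seq_along(u_vals)) {
      for (j in seq_along(u_vals)) {
        if (u_vals[i] == u_vals[j]) next    # delta_u undefined when u1 == u2
        w <- p_v[k] * p_u_given_v[i] * p_u_given_v[j]
        if (w == 0) next                    # zero-weight comparisons contribute nothing
        delta <- (Ey(u_vals[j], v_vals[k]) - Ey(u_vals[i], v_vals[k])) /
                 (u_vals[j] - u_vals[i])
        num <- num + w * delta
        den <- den + w
      }
    }
  }
  num / den
}
```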

The paper has a nice discussion about how to estimate this, but in my toy example (where the joint distribution of $u,v,y$ is very simple and where there's no uncertainty in estimating it):

| $P(\cdot)$ | $u$ | $v$ | $y$ |
|------------|-----|-----|-----|
| $.4$       | $0$ | $0$ | $0$ |
| $.4$       | $1$ | $1$ | $1$ |
| $.2$       | $1$ | $0$ | $1$ |

In this case we have only one predictive comparison that receives any weight at all, so

$$\Delta_u = \delta_u(0 \rightarrow 1, 0) = 1.$$
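
Running the hypothetical `apc_discrete` sketch from above on this distribution, with $\mathbb{E}(y|u,v)$ read straight off the table, reproduces that:

```r
Ey <- function(u, v) {
  if (u == 0 && v == 0) return(0)
  if (u == 1 && v == 0) return(1)
  if (u == 1 && v == 1) return(1)
  NA   # u = 0, v = 1 never receives any weight here
}
## P(u, v): rows are u = 0, 1; columns are v = 0, 1
p_uv <- matrix(c(.4, .2,    # v = 0: P(u = 0, v = 0), P(u = 1, v = 0)
                 0,  .4),   # v = 1: P(u = 0, v = 1), P(u = 1, v = 1)
               nrow = 2)

apc_discrete(Ey, u_vals = c(0, 1), v_vals = c(0, 1), p_uv = p_uv)   # 1
```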

Unlike with the partial plot approach, I believe that's the appropriate answer (in this case) to "How does $\mathbb{E}(y)$ change with a 1-unit change in $u$, holding $v$ constant?"

Please have a look at Gelman and Pardoe's paper for more discussion of:

  • the motivation for APC
  • defining APC for other kinds of inputs, like unordered categorical variables
  • estimating the APC
  • subtleties involved in interpreting the APC
  • an example that's both interesting and useful (unlike mine)

Different effect in different input ranges?

Something I've been wondering (which is not really addressed in the paper) is how we can understand not just the overall APC but also the different effects of an input in different ranges of that input. This is important when the model is non-linear. If the effect of $u$ depends a lot on what range $u$ is in, we'd like to know that. This is one thing the partial plot does nicely, despite its flaws.

One approach would be to plot the predictive comparisons according to the location of the jump $u^{(1)} \rightarrow u^{(2)}$, using either endpoint, or the midpoint, or maybe better, using all jumps that pass through each particular value of $u$.

Another approach would be to restrict to jumps in a given range. But then we're missing the information in larger jumps not contained in any of these ranges. Gelman and Pardoe are explicit that they aren't looking for any kind of limit as jump sizes get small, and this may be part of why. Still, looking at the impact of small jumps in each range could make sense.

Estimation / Implementation

As discussed on p. 37, estimating the APC is all about estimating the density $p(u^{(2)}|v)$. How well does the proposal in the paper work in practice? I'd like to understand when the proposed estimate will work well and when it won't. Maybe it would make sense to try it with some simulated examples where the "right" answer is known.

Last I heard, Andrew Gelman and a student were working on an implementation, but I don't know anything else about that.