Average predictive comparisons
This came from compiling a ".rmd" file
One reason for building models is to understand how some response $y$ is influenced by the various inputs. For a simple linear model this is easy: the coefficients tell us directly how the expected response changes as each input changes.
But any number of important models may be much harder to understand:
- generalized linear models
- hierarchical models
- models that make use of each input multiple times as multiple different features (such as $x_i$, $x_i^2$, etc.)
- linear models with interactions
- random forests
- etc.
Two methods that address this question for more complicated models are:
- partial plots, as implemented in R's randomForest package; and
- average predictive comparisons, as described by Andrew Gelman and Iain Pardoe.
Both methods agree with the usual coefficient-based interpretation in the case of a simple linear model.
This post will:
- explain partial plots;
- illustrate partial plots with a simple artificial example where they work well;
- illustrate my concerns about partial plots with a simple artificial example where they don't work well;
- explain average predictive comparisons.
If we want to know the effect of changing one input $u$ while the other inputs $v$ are held fixed, the partial plot approach is: for each value of $u$, substitute that value into every row of the data, apply the model, and average the resulting predictions. In symbols, the partial plot at $u$ is $\frac{1}{N}\sum_{i=1}^{N} f(u, v_i)$, where $f$ is the model's prediction function and $v_i$ are the other inputs in row $i$ of the data.
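For concreteness, here's a minimal R sketch of that computation, assuming a fitted model `fit` with a `predict` method and a data frame `df` whose column `u` is the input of interest (these names are just for illustration):

```r
# Partial plot of u "by hand": for each candidate value of u, substitute it
# into every row of the data, predict, and average the predictions.
partial_plot_values <- function(fit, df, u_values) {
  sapply(u_values, function(u0) {
    df_sub <- df
    df_sub$u <- u0                  # hold the other inputs at their observed values
    mean(predict(fit, newdata = df_sub))
  })
}

# Usage sketch with a regression random forest (roughly what
# randomForest::partialPlot computes in the regression case):
# library(randomForest)
# fit <- randomForest(y ~ u + v, data = df)
# u_grid <- seq(min(df$u), max(df$u), length.out = 50)
# plot(u_grid, partial_plot_values(fit, df, u_grid), type = "l",
#      xlab = "u", ylab = "average prediction")
```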
First I'll illustrate with a toy example with two input variables, $u$ and $v$.
In the plot, the colored points show the copy of the data set used at each value of $u$.
(My example plots don't use the partialPlot function itself because, rather than using a random forest, I'm pretending we magically know $\mathbb{E}(y)$ as a function of the inputs.)
This plot also shows the value in looking at the whole range of predictions rather than just the average: we can see not only the average effect of $u$ but also how much that effect varies across the rest of the data.
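Along those lines, here's a small sketch (reusing the hypothetical `fit` and `df` from the block above) that keeps every row's prediction at each value of `u`, so we can look at the whole range rather than just the average:

```r
# For each value of u, keep the full set of predictions (one per row of data)
# rather than only their average, so we can see how the effect of u varies.
prediction_curves <- function(fit, df, u_values) {
  sapply(u_values, function(u0) {
    df_sub <- df
    df_sub$u <- u0
    predict(fit, newdata = df_sub)   # one prediction per original row
  })                                  # rows: data rows; columns: values of u
}

# One curve per row of the original data:
# matplot(u_grid, t(prediction_curves(fit, df, u_grid)), type = "l",
#         xlab = "u", ylab = "prediction")
```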
When inputs are correlated, a potential problem with partialPlot is that substituting in one value for one feature while holding the rest the same may generate very unlikely combinations of inputs, causing the results to be misleading. In the previous example this wasn't a problem, but my next example shows how it can be.
For simplicity, my example will have only a small number of possible combinations of $u$ and $v$.
In this case we can't even make the partial plot, because there's no such thing as $\mathbb{E}(y)$ at a combination of $u$ and $v$ that never occurs in the data.
So let's instead suppose that there is enough data with the rare combination to estimate $\mathbb{E}(y)$ there, even though that combination is very unlikely:
| probability | $u$ | $v$ | $\mathbb{E}(y)$ |
|---|---|---|---|
| ~$0.4$ | | | |
| ~$0.4$ | | | |
| ~$0$ | | | |
| ~$0.2$ | | | |
To determine what the partial plot would look like, let's label the rows and add to the table what we'd get for each one after substituting in the value of $u$ we're examining:
| label | probability | | | | | |
|---|---|---|---|---|---|---|
| a | ~$.4$ | | | | | |
| b | ~$.4$ | | | | | |
| c | ~$0$ | | | | | |
| d | ~$.2$ | | | | | |
So weighting by probability mass, the partial plot for $u$ places substantial weight on the estimated $\mathbb{E}(y)$ at the nearly nonexistent combination (row c), even though that combination almost never occurs in the data.
Here's that in the form of a graph:
To summarize, the problems here are both practical and philosophical:
- Practical: Given the correlations in the data, the partial plot is very sensitive to the estimated $\mathbb{E}(y)$ in a very small portion of the data. This means it may be difficult or impossible to correctly estimate the partial plot.
- Philosophical: The possibility of this extreme dependence on a small portion of the data should be a clue that we're looking at the wrong thing. When we're trying to understand the influence of $u$, should we really let ourselves be influenced by very unlikely combinations of features? I'd say no.
Before discussing another approach, I'll briefly mention a solution that doesn't work: modeling the distribution of the inputs and using it to re-weight the partial plot.
For a given jump in $u$ from $u^{(1)}$ to $u^{(2)}$, holding the other inputs $v$ fixed, the predictive comparison is the change in the expected response per unit change in $u$: $\frac{\mathbb{E}(y \mid u^{(2)}, v) - \mathbb{E}(y \mid u^{(1)}, v)}{u^{(2)} - u^{(1)}}$.
(I'm omitting the posterior parameters $\theta$ that appear in the paper's notation, since they aren't important for this discussion.)
The average predictive comparison (APC) is a weighted average of these predictive comparisons, where the starting point $(u^{(1)}, v)$ is drawn from the data and each jump is weighted by how likely the new value $u^{(2)}$ is given $v$.
The paper has a nice discussion about how to estimate this, but in my toy example (where the joint distribution of $u$ and $v$ is known exactly) we can just compute it directly.
In this case we have only one predictive comparison that receives any weight at all, so the APC is simply equal to that comparison.
Unlike with the partial plot approach, I believe that's the appropriate answer (in this case) to the question of how $y$ changes as we vary $u$.
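To make the weighting idea concrete, here's a rough R sketch of an APC estimate in the spirit of the paper's weighted-pairs formula. It is a simplification, not the paper's exact estimator: `predict_y(u, v)` stands in for the model's expected response, `u` is a numeric vector, `v` is a numeric matrix of the other inputs (one row per observation), and the pair weights use a Gaussian kernel on the distance between rows of `v` as a stand-in for how likely the new value of $u$ is given $v$. All of these names and choices are assumptions for illustration.

```r
# Rough sketch of an APC estimate for input u, in the spirit of the
# weighted-pairs formula in Gelman & Pardoe. The weight on the transition
# from (u[i], v[i]) to (u[j], v[i]) is a Gaussian kernel on the distance
# between v[i] and v[j].
apc_estimate <- function(predict_y, u, v, bandwidth = 1) {
  n <- length(u)
  num <- 0
  den <- 0
  for (i in 1:n) {
    for (j in 1:n) {
      if (i == j) next
      w <- exp(-sum((v[i, ] - v[j, ])^2) / (2 * bandwidth^2))  # transition weight
      s <- sign(u[j] - u[i])
      # change in expected y when u jumps from u[i] to u[j], v held at v[i]
      num <- num + w * s * (predict_y(u[j], v[i, ]) - predict_y(u[i], v[i, ]))
      den <- den + w * s * (u[j] - u[i])
    }
  }
  num / den  # expected change in y per unit change in u
}
```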
Please have a look at Gelman and Pardoe's paper for more discussion of:
- the motivation for APC
- defining APC for other kinds of inputs, like unordered categorical variables
- estimating the APC
- subtleties involved in interpreting the APC
- an example that's both interesting and useful (unlike mine)
Something I've been wondering (which is not really addressed in the paper) is how we can understand not just the APC (overall) but the different effect of an input at different ranges for that input. This is important when the model is non-linear. If the effect of $u$ is different in different ranges of $u$, a single overall APC won't reveal that.
One approach would be to plot the individual predictive comparisons according to the location of the jump (for example, the starting value of each jump).
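As a sketch of that idea (reusing the hypothetical `predict_y`, `u`, `v`, and kernel weights from the block above), one could compute each pair's predictive comparison and plot it against the start of its jump, sizing points by weight:

```r
# Per-pair predictive comparisons, for plotting against the location of the jump.
pair_comparisons <- function(predict_y, u, v, bandwidth = 1) {
  out <- NULL
  n <- length(u)
  for (i in 1:n) {
    for (j in 1:n) {
      if (i == j || u[i] == u[j]) next
      w  <- exp(-sum((v[i, ] - v[j, ])^2) / (2 * bandwidth^2))
      pc <- (predict_y(u[j], v[i, ]) - predict_y(u[i], v[i, ])) / (u[j] - u[i])
      out <- rbind(out, data.frame(u_start = u[i], u_end = u[j],
                                   weight = w, comparison = pc))
    }
  }
  out
}

# e.g. plot each comparison at the starting value of its jump, with point
# size reflecting its weight:
# pc <- pair_comparisons(predict_y, u, v)
# plot(pc$u_start, pc$comparison, cex = 2 * pc$weight / max(pc$weight))
```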
Another approach would be to restrict to jumps in a given range. But then we're missing the information in larger jumps not contained in any of these ranges. Gelman and Pardoe are explicit that they aren't looking for any kind of limit as jump sizes get small, and this may be part of why. Still, looking at the impact of small jumps in each range could make sense.
As discussed on p. 37, estimating the APC is all about estimating the density of $u$ conditional on the other inputs $v$, since that is what determines the weights on the possible jumps.
Last I heard, Andrew Gelman and a student were working on an implementation, but I don't know anything else about that.