
Add option for training fold "blocks" to avoid over-fitting #6

Open
dtpc opened this issue Jun 5, 2019 · 5 comments

Comments

@dtpc
Collaborator

dtpc commented Jun 5, 2019

Models can overfit when training samples are spatially adjacent.

A way to mitigate this is to select a pixel block size when extracting training folds, so that all pixels in the same local block are assigned to the same fold.

During cross-validation/model selection, the model will then be encouraged to predict well outside the areas local to the training data.
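A minimal sketch of the block-to-fold assignment in numpy (the function and argument names here are just illustrative, not anything in landshark):

```python
import numpy as np

def block_folds(coords, block_size, n_folds, seed=0):
    """Assign each training point a fold id based on its pixel block.

    Points whose (row, col) pixel coordinates fall in the same
    block_size x block_size block always share a fold, so spatially
    adjacent samples never straddle a train/test split.
    """
    blocks = np.floor_divide(coords, block_size)  # (N, 2) block indices
    uniq, inverse = np.unique(blocks, axis=0, return_inverse=True)
    # Randomly map each block (not each point) to a fold.
    rng = np.random.default_rng(seed)
    fold_of_block = rng.integers(0, n_folds, size=len(uniq))
    return fold_of_block[inverse]

# Example: three points, the first two sharing one 10x10 block.
coords = np.array([[3, 4], [7, 9], [55, 60]])
folds = block_folds(coords, block_size=10, n_folds=5)
assert folds[0] == folds[1]  # same block => same fold
```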

@dtpc
Collaborator Author

dtpc commented Jun 13, 2019

I've implemented this here: https://github.com/dtpc/landshark/tree/feature/6-fold-blocks

It does not account for the distribution of training points over the area, so it may (and likely will) result in folds of unequal size.

Another approach I think would be useful is grouping based on some other training-point property (e.g. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data); see the sketch below. Implementing this would require some more structural changes to the code, though: currently the target HDF5 file contains only y and coord data.
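For illustration, the scikit-learn pattern would look something like this (toy data; in landshark the groups array would have to come from a per-point property stored alongside the y and coord data, which is the structural change mentioned above):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((100, 5))                 # covariates
y = rng.random(100)                      # targets
groups = rng.integers(0, 10, size=100)   # e.g. a survey/site id per training point

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # GroupKFold guarantees no group spans the train/test boundary.
    assert not set(groups[train_idx]) & set(groups[test_idx])
```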

@dsteinberg
Contributor

Oh yeah? Do you mean that when we select data randomly for our train/test folds, we can get an underestimate of the true error if our test points are often close to the training points?

Or by doing this are we testing if our model generalizes well away from the training data?

@dtpc
Collaborator Author

dtpc commented Jun 17, 2019

The latter, although I think "away from the training data" may not be that far in some cases.

Typically the training data is heavily biased: sparse overall but often locally dense. I think this can lead to learning very localised models, especially if the targets are highly correlated spatially. In the extreme case, if neighbouring pixels (and target values) are more or less identical, then the model could potentially just memorise the inputs (this is even more of an issue if we have training points located within the same pixel). That would be an accurate model, but probably not a very useful one to generate a predictive map from.

So, I think there is a need for different ways of splitting train/test data to encourage a more general model during model selection.

@dsteinberg
Copy link
Contributor

Yeah agreed - a few more splitting methods would be useful.
This problem in general, though, is very hard -- it's really hard to know how a model will behave "away" from the training data. The exception is maybe a Gaussian process with a prior distribution over kernel parameters: these sorts of models "revert" to their prior away from data, and you can specify that prior (Gaussian processes where we "learn" the prior don't necessarily have this behaviour). There are also models that can learn what the training data looks like, and so detect when they are being queried with data unlike it.

@dtpc
Collaborator Author

dtpc commented Jun 20, 2019

Yes, this is definitely not intended as a solution for covariate shift; I guess it's just about providing more flexibility around model selection/evaluation.
