reduce check time; prepare cran submission
HannaMeyer committed Apr 13, 2024
1 parent ac12cfa commit e39623d
Showing 9 changed files with 49 additions and 58 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: CAST
Type: Package
Title: 'caret' Applications for Spatial-Temporal Models
-Version: 1.0.0
+Version: 1.0.1
Authors@R: c(person("Hanna", "Meyer", email = "[email protected]", role = c("cre", "aut")),
person("Carles", "Milà", role = c("aut")),
person("Marvin", "Ludwig", role = c("aut")),
@@ -13,7 +13,7 @@ Authors@R: c(person("Hanna", "Meyer", email = "[email protected]", rol
person("Edzer", "Pebesma", role = c("ctb")))
Author: Hanna Meyer [cre, aut], Carles Milà [aut], Marvin Ludwig [aut], Jan Linnenbrink [aut], Fabian Schumacher [aut], Philipp Otto [ctb], Chris Reudenbach [ctb], Thomas Nauss [ctb], Edzer Pebesma [ctb]
Maintainer: Hanna Meyer <[email protected]>
-Description: Supporting functionality to run 'caret' with spatial or spatial-temporal data. 'caret' is a frequently used package for model training and prediction using machine learning. CAST includes functions to improve spatial or spatial-temporal modelling tasks using 'caret'. It includes the newly suggested 'Nearest neighbor distance matching' cross-validation to estimate the performance of spatial prediction models and allows for spatial variable selection to selects suitable predictor variables in view to their contribution to the spatial model performance. CAST further includes functionality to estimate the (spatial) area of applicability of prediction models. Methods are described in Meyer et al. (2018) <doi:10.1016/j.envsoft.2017.12.001>; Meyer et al. (2019) <doi:10.1016/j.ecolmodel.2019.108815>; Meyer and Pebesma (2021) <doi:10.1111/2041-210X.13650>; Milà et al. (2022) <doi:10.1111/2041-210X.13851>; Meyer and Pebesma (2022) <doi:10.1038/s41467-022-29838-9>; Linnenbrink et al. (2023) <doi:10.5194/egusphere-2023-1308>.
+Description: Supporting functionality to run 'caret' with spatial or spatial-temporal data. 'caret' is a frequently used package for model training and prediction using machine learning. CAST includes functions to improve spatial or spatial-temporal modelling tasks using 'caret'. It includes the newly suggested 'Nearest neighbor distance matching' cross-validation to estimate the performance of spatial prediction models and allows for spatial variable selection to selects suitable predictor variables in view to their contribution to the spatial model performance. CAST further includes functionality to estimate the (spatial) area of applicability of prediction models. Methods are described in Meyer et al. (2018) <doi:10.1016/j.envsoft.2017.12.001>; Meyer et al. (2019) <doi:10.1016/j.ecolmodel.2019.108815>; Meyer and Pebesma (2021) <doi:10.1111/2041-210X.13650>; Milà et al. (2022) <doi:10.1111/2041-210X.13851>; Meyer and Pebesma (2022) <doi:10.1038/s41467-022-29838-9>; Linnenbrink et al. (2023) <doi:10.5194/egusphere-2023-1308>. The package is described in detail in Meyer et al. (2024) <doi:10.48550/arXiv.2404.06978>.
License: GPL (>= 2)
URL: https://github.com/HannaMeyer/CAST,
https://hannameyer.github.io/CAST/
3 changes: 3 additions & 0 deletions NEWS.md
@@ -1,3 +1,6 @@
+# `CAST` 1.0.1
+* bug fix: fix failed tests in global_validation
+
# `CAST` 1.0.0
* new features:
* calculate local point density within AOA
2 changes: 2 additions & 0 deletions R/CAST-package.R
@@ -8,11 +8,13 @@
#' CAST further includes functionality to estimate the (spatial) area of applicability of prediction models
#' by analysing the similarity between new data and training data.
#' Methods are described in Meyer et al. (2018); Meyer et al. (2019); Meyer and Pebesma (2021); Milà et al. (2022); Meyer and Pebesma (2022); Linnenbrink et al. (2023).
+#' The package is described in detail in Meyer et al. (2024).
#' @name CAST
#' @title 'caret' Applications for Spatial-Temporal Models
#' @author Hanna Meyer, Carles Milà, Marvin Ludwig, Jan Linnenbrink, Fabian Schumacher
#' @references
#' \itemize{
+#' \item Meyer, H., Ludwig, L., Milà, C., Linnenbrink, J., Schumacher, F. (2024): The CAST package for training and assessment of spatial prediction models in R. arXiv, https://doi.org/10.48550/arXiv.2404.06978.
#' \item Linnenbrink, J., Milà, C., Ludwig, M., and Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023.
#' \item Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1– 13.
#' \item Meyer, H., Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications. 13.
4 changes: 4 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ https://hannameyer.github.io/CAST/

## Tutorials

+* [The CAST package for training and assessment of spatial prediction models in R](https://arxiv.org/abs/2404.06978)
+
* [Introduction to CAST](https://hannameyer.github.io/CAST/articles/cast01-CAST-intro.html)

* [Visualization of nearest neighbor distance distributions](https://hannameyer.github.io/CAST/articles/cast02-plotgeodist.html)
@@ -29,6 +31,8 @@ https://www.youtube.com/watch?v=mkHlmYEzsVQ.

## Scientific documentation of the methods

+* Meyer, H., Ludwig, L., Milà, C., Linnenbrink, J., Schumacher, F. (2024): The CAST package for training and assessment of spatial prediction models in R. arXiv, https://doi.org/10.48550/arXiv.2404.06978.
+
### Spatial cross-validation
* Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1– 13.
https://doi.org/10.1111/2041-210X.13851
2 changes: 2 additions & 0 deletions man/CAST.Rd

Generated file; diff not rendered by default.

52 changes: 28 additions & 24 deletions tests/testthat/test-fss.R
@@ -3,14 +3,15 @@ test_that("ffs works with default arguments and the splotopen dataset (numerical
data("splotdata")
splotdata = splotdata |> sf::st_drop_geometry()
set.seed(1)
-selection = ffs(predictors = splotdata[,6:16],
+selection = ffs(predictors = splotdata[,6:12],
response = splotdata$Species_richness,
seed = 1,
verbose = FALSE,
-ntree = 5)
+ntree = 5,
+tuneLength = 1)


-expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_15"))
+expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_5", "bio_4"))
expect_identical(selection$metric, "RMSE")
expect_identical(selection$maximize, FALSE)

@@ -23,13 +24,14 @@ test_that("ffs works with default arguments and the splotopen dataset (include c
data("splotdata")
splotdata = splotdata |> sf::st_drop_geometry()
set.seed(1)
-selection = ffs(predictors = splotdata[,c(4,6:16)],
+selection = ffs(predictors = splotdata[,c(4,6:12)],
response = splotdata$Species_richness,
verbose = FALSE,
seed = 1,
-ntree = 5)
+ntree = 5,
+tuneLength = 1)

-expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_15"))
+expect_identical(selection$selectedvars, c("bio_6", "bio_12", "Biome","bio_1" , "bio_5"))
expect_identical(selection$metric, "RMSE")
expect_identical(selection$maximize, FALSE)
})
@@ -40,37 +42,39 @@ test_that("ffs works for classification with default arguments",{
splotdata = splotdata |> sf::st_drop_geometry()
splotdata$Biome = droplevels(splotdata$Biome)
set.seed(1)
-selection = ffs(predictors = splotdata[,c(6:16)],
+selection = ffs(predictors = splotdata[,c(6:12)],
response = splotdata$Biome,
verbose = FALSE,
seed = 1,
-ntree = 5)
+ntree = 5,
+tuneLength = 1)

-expect_identical(selection$selectedvars, c("bio_4", "bio_6", "bio_13",
-"bio_14", "bio_12", "bio_8", "elev"))
+expect_identical(selection$selectedvars, c("bio_4", "bio_8", "bio_12",
+"bio_9"))
expect_identical(selection$metric, "Accuracy")
expect_identical(selection$maximize, TRUE)

})


-test_that("ffs works for withinSE = TRUE",{
-data("splotdata")
-splotdata = splotdata |> sf::st_drop_geometry()
-splotdata$Biome = droplevels(splotdata$Biome)
-set.seed(1)
-selection = ffs(predictors = splotdata[,c(6:16)],
-response = splotdata$Biome,
-seed = 1,
-verbose = FALSE,
-ntree = 5,
-withinSE = TRUE)
+#test_that("ffs works for withinSE = TRUE",{
+# data("splotdata")
+# splotdata = splotdata |> sf::st_drop_geometry()
+# splotdata$Biome = droplevels(splotdata$Biome)
+# set.seed(1)
+# selection = ffs(predictors = splotdata[,c(6:16)],
+# response = splotdata$Biome,
+# seed = 1,
+# verbose = FALSE,
+# ntree = 5,
+# withinSE = TRUE,
+# tuneLength = 1)

-expect_identical(selection$selectedvars, c("bio_4", "bio_6", "bio_12",
-"bio_14", "bio_8"))
+# expect_identical(selection$selectedvars, c("bio_4", "bio_8", "bio_12",
+# "bio_13","bio_14", "bio_5"))


-})
+#})



4 changes: 2 additions & 2 deletions vignettes/cast01-CAST-intro.Rmd
@@ -221,7 +221,7 @@ ffsmodel <- ffs(st_drop_geometry(splotdata)[,predictors],
method="rf",
tuneGrid=data.frame("mtry"=2),
verbose=FALSE,
-ntree=50,
+ntree=25, #make it faster for this tutorial
trControl=trainControl(method="cv",
index = indices_knndm$indx_train,
savePredictions = "final"))
@@ -240,7 +240,7 @@ By plotting the results of ffs, we can visualize how the performance of the mode
plot(ffsmodel)
```

-See that the best model using all combinations of two variables. Based on the best performing twi variables, using any third variable could slightly increase the R². Any further variable could not improve the LLO performance.
+See that the best model using all combinations of two variables. Based on the best performing two variables, using a third variable could slightly increase the R², same applies to a fourth variable. Any further variables could not improve the LLO performance.
Note that the R² features a high standard deviation regardless of the variables being used. This is due to the small dataset that was used which cannot lead to robust results here.

What effect does the new model has on the spatial representation of species richness?
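The selection strategy that the vignette text above describes (evaluate all two-variable combinations, then add one variable at a time while performance improves) can be sketched in base R. This is only a minimal illustration under stated assumptions, not CAST's `ffs()` implementation: it uses `lm` on the built-in `mtcars` data, plain random 4-fold cross-validation instead of the vignette's kNNDM folds, and RMSE as the metric. The names `cv_rmse`, `candidates` and `selected` are hypothetical.

```r
# Illustrative forward feature selection with base R only.
# Assumptions: lm() on mtcars, random 4-fold CV, RMSE (lower is better).
set.seed(1)
data <- mtcars
response <- "mpg"
candidates <- c("cyl", "disp", "hp", "wt", "qsec")
folds <- sample(rep(1:4, length.out = nrow(data)))

# Cross-validated RMSE of a linear model that uses the given variables.
cv_rmse <- function(vars) {
  errs <- sapply(1:4, function(k) {
    fit <- lm(reformulate(vars, response), data = data[folds != k, ])
    pred <- predict(fit, newdata = data[folds == k, ])
    sqrt(mean((data[[response]][folds == k] - pred)^2))
  })
  mean(errs)
}

# Step 1: evaluate every pair of variables and keep the best pair.
pairs <- combn(candidates, 2, simplify = FALSE)
pair_scores <- sapply(pairs, cv_rmse)
selected <- pairs[[which.min(pair_scores)]]
best <- min(pair_scores)

# Step 2: add one variable at a time while the CV error keeps improving.
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  scores <- sapply(remaining, function(v) cv_rmse(c(selected, v)))
  if (min(scores) >= best) break
  selected <- c(selected, remaining[which.min(scores)])
  best <- min(scores)
}

selected  # variables kept by the forward selection
```

With few observations per fold, the per-fold errors vary strongly, which is the same effect the vignette notes for the standard deviation of R² on its small dataset.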
34 changes: 5 additions & 29 deletions vignettes/cast02-plotgeodist.Rmd
@@ -41,7 +41,7 @@ Here we can define some parameters to run the example with different settings

```{r, message = FALSE, warning=FALSE}
seed <- 10 # random realization
-samplesize <- 300 # how many samples will be used?
+samplesize <- 250 # how many samples will be used?
nparents <- 20 #For clustered samples: How many clusters?
radius <- 500000 # For clustered samples: What is the radius of a cluster?
@@ -200,14 +200,15 @@ We see that the nearest neighbor distances during cross-validation don't match t

#### Nearest Neighbour Distance Matching CV

-A good way to approximate the geographical prediction distances during the CV is to use Nearest Neighbour Distance Matching (NNDM) CV (see [Milà et al., 2022](https://doi.org/10.1111/2041-210X.13851) for more details). NNDM CV is a variation of LOO CV in which the empirical distribution function of nearest neighbour distances found during prediction is matched during the CV process.
+A good way to approximate the geographical prediction distances during the CV is to use Nearest Neighbour Distance Matching (NNDM) CV (see [Milà et al., 2022](https://doi.org/10.1111/2041-210X.13851) for more details). NNDM CV is a variation of LOO CV in which the empirical distribution function of nearest neighbour distances found during prediction is matched during the CV process. Since NNDM CV is highly time consuming, the k-fold version may provide a good trade-off.
+See (see [Linnenbrink et al., 2023](https://doi.org/10.5194/egusphere-2023-1308) for more details on knndm)



```{r,message = FALSE, warning=FALSE, results='hide'}
-nndmfolds_clstr <- nndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
+nndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
dist_clstr <- geodist(pts_clustered,co.ee,
sampling = "Fibonacci",
cvfolds = nndmfolds_clstr$indx_test,
@@ -220,7 +221,7 @@ The NNDM CV-distance distribution matches the sample-to-prediction distribution

```{r,message = FALSE, warning=FALSE, results='hide'}
-nndmfolds_rand <- nndm(pts_random_co, modeldomain=co.ee, samplesize = 2000)
+nndmfolds_rand <- knndm(pts_random_co, modeldomain=co.ee, samplesize = 2000)
dist_rand <- geodist(pts_random_co,co.ee,
sampling = "Fibonacci",
cvfolds = nndmfolds_rand$indx_test,
@@ -232,31 +233,6 @@ plot(dist_rand, unit = "km")+scale_x_log10(labels=round)
The NNDM CV-distance still matches the sample-to-prediction distance function.


-#### k-fold Nearest Neighbour Distance Matching CV
-Since NNDM CV is highly time consuming, the k-fold version may provide a good trade-off.
-See (see [Linnenbrink et al., 2023](https://doi.org/10.5194/egusphere-2023-1308) for more details)
-
-```{r,message = FALSE, warning=FALSE, results='hide'}
-knndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
-pts_clustered$knndmCV <- as.character(knndmfolds_clstr$clusters)
-ggplot() + geom_sf(data = co.ee, fill="#00BFC4",col="#00BFC4") +
-geom_sf(data = pts_clustered, aes(color=knndmCV),size=0.5, shape=3) +
-scale_color_manual(values=rainbow(length(unique(pts_clustered$knndmCV))))+
-guides(fill = FALSE, col = FALSE) +
-labs(x = NULL, y = NULL)+ ggtitle("spatial fold membership by color")
-dist_clstr <- geodist(pts_clustered,co.ee,
-sampling = "Fibonacci",
-cvfolds = knndmfolds_clstr$indx_test,
-cvtrain = knndmfolds_clstr$indx_train)
-plot(dist_clstr, unit = "km")+scale_x_log10(labels=round)
-```



## Distances in feature space

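The distance-matching idea behind the `nndm` to `knndm` switch in the cast02-plotgeodist.Rmd hunk above can be illustrated without CAST. The following base-R sketch, on simulated coordinates rather than the vignette's data, computes the two nearest neighbour distance distributions that NNDM-type methods try to bring into agreement: prediction-to-sample distances and the distances a plain leave-one-out CV would see. The helper `nn_dist` and all object names are hypothetical, and the sketch only shows the mismatch, not the matching procedure itself.

```r
# Illustrative only: simulated points in the unit square, no CAST functions.
set.seed(10)

# Clustered training samples: 5 cluster centres with 20 points each.
centres <- matrix(runif(10), ncol = 2)
train <- centres[rep(1:5, each = 20), ] +
  matrix(rnorm(200, sd = 0.02), ncol = 2)

# Prediction locations spread over the whole unit square.
pred <- matrix(runif(2000), ncol = 2)

# Distance from each row of `a` to its nearest row in `b`.
nn_dist <- function(a, b) {
  apply(a, 1, function(p) min(sqrt((b[, 1] - p[1])^2 + (b[, 2] - p[2])^2)))
}

# Prediction-to-sample distances: what the model faces when mapping.
d_pred <- nn_dist(pred, train)

# Sample-to-sample distances: what leave-one-out CV faces (self excluded).
d_all <- as.matrix(dist(train))
diag(d_all) <- Inf
d_loo <- apply(d_all, 1, min)

# Compare the two distributions; ecdf() gives the empirical distribution
# functions whose mismatch distance-matching CV aims to reduce.
c(median_loo = median(d_loo), median_pred = median(d_pred))
plot(ecdf(d_pred), main = "Nearest neighbour distances", xlab = "distance")
lines(ecdf(d_loo), col = "red")
```

With clustered samples, the leave-one-out distances are much shorter than the prediction distances, so an unadjusted CV tests the model on easier cases than the map requires.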
2 changes: 1 addition & 1 deletion vignettes/cast04-AOA-tutorial.Rmd
@@ -258,7 +258,7 @@ The AOA is then calculated (for comparison) using the model validated by random
```{r,message = FALSE, warning=FALSE}
AOA_spatial <- aoa(predictors, model, LPD = TRUE, verbose = FALSE)
-AOA_random <- aoa(predictors, model_random, LPD = TRUE, verbose = FALSE)
+AOA_random <- aoa(predictors, model_random, LPD = FALSE, verbose = FALSE)
```

