reduce check time; prepare cran submission

HannaMeyer · Apr 13, 2024 · e39623d
1 parent ac12cfa
commit e39623d

Showing 9 changed files with 49 additions and 58 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: CAST
 Type: Package
 Title: 'caret' Applications for Spatial-Temporal Models
-Version: 1.0.0
+Version: 1.0.1
 Authors@R: c(person("Hanna", "Meyer", email = "[email protected]", role = c("cre", "aut")),
              person("Carles", "Milà", role = c("aut")),
              person("Marvin", "Ludwig", role = c("aut")),
@@ -13,7 +13,7 @@ Authors@R: c(person("Hanna", "Meyer", email = "[email protected]", rol
              person("Edzer", "Pebesma", role = c("ctb")))
 Author: Hanna Meyer [cre, aut], Carles Milà [aut], Marvin Ludwig [aut], Jan Linnenbrink [aut], Fabian Schumacher [aut], Philipp Otto [ctb], Chris Reudenbach [ctb], Thomas Nauss [ctb], Edzer Pebesma [ctb]
 Maintainer: Hanna Meyer <[email protected]>
-Description: Supporting functionality to run 'caret' with spatial or spatial-temporal data. 'caret' is a frequently used package for model training and prediction using machine learning. CAST includes functions to improve spatial or spatial-temporal modelling tasks using 'caret'. It includes the newly suggested 'Nearest neighbor distance matching' cross-validation to estimate the performance of spatial prediction models and allows for spatial variable selection to selects suitable predictor variables in view to their contribution to the spatial model performance. CAST further includes functionality to estimate the (spatial) area of applicability of prediction models. Methods are described in Meyer et al. (2018) <doi:10.1016/j.envsoft.2017.12.001>; Meyer et al. (2019) <doi:10.1016/j.ecolmodel.2019.108815>; Meyer and Pebesma (2021) <doi:10.1111/2041-210X.13650>; Milà et al. (2022) <doi:10.1111/2041-210X.13851>; Meyer and Pebesma (2022) <doi:10.1038/s41467-022-29838-9>; Linnenbrink et al. (2023) <doi:10.5194/egusphere-2023-1308>.
+Description: Supporting functionality to run 'caret' with spatial or spatial-temporal data. 'caret' is a frequently used package for model training and prediction using machine learning. CAST includes functions to improve spatial or spatial-temporal modelling tasks using 'caret'. It includes the newly suggested 'Nearest neighbor distance matching' cross-validation to estimate the performance of spatial prediction models and allows for spatial variable selection to selects suitable predictor variables in view to their contribution to the spatial model performance. CAST further includes functionality to estimate the (spatial) area of applicability of prediction models. Methods are described in Meyer et al. (2018) <doi:10.1016/j.envsoft.2017.12.001>; Meyer et al. (2019) <doi:10.1016/j.ecolmodel.2019.108815>; Meyer and Pebesma (2021) <doi:10.1111/2041-210X.13650>; Milà et al. (2022) <doi:10.1111/2041-210X.13851>; Meyer and Pebesma (2022) <doi:10.1038/s41467-022-29838-9>; Linnenbrink et al. (2023) <doi:10.5194/egusphere-2023-1308>. The package is described in detail in Meyer et al. (2024) <doi:10.48550/arXiv.2404.06978>.
 License: GPL (>= 2)
 URL: https://github.com/HannaMeyer/CAST,
     https://hannameyer.github.io/CAST/

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,6 @@
+# `CAST` 1.0.1
+* bug fix: fix failed tests in global_validation
+
 # `CAST` 1.0.0
 * new features:
   * calculate local point density within AOA

diff --git a/R/CAST-package.R b/R/CAST-package.R
@@ -8,11 +8,13 @@
 #' CAST further includes functionality to estimate the (spatial) area of applicability of prediction models
 #' by analysing the similarity between new data and training data.
 #' Methods are described in Meyer et al. (2018); Meyer et al. (2019); Meyer and Pebesma (2021); Milà et al. (2022); Meyer and Pebesma (2022); Linnenbrink et al. (2023).
+#' The package is described in detail in Meyer et al. (2024).
 #' @name CAST
 #' @title 'caret' Applications for Spatial-Temporal Models
 #' @author Hanna Meyer, Carles Milà, Marvin Ludwig, Jan Linnenbrink, Fabian Schumacher
 #' @references
 #' \itemize{
+#' \item Meyer, H., Ludwig, M., Milà, C., Linnenbrink, J., Schumacher, F. (2024): The CAST package for training and assessment of spatial prediction models in R. arXiv, https://doi.org/10.48550/arXiv.2404.06978.
 #' \item Linnenbrink, J., Milà, C., Ludwig, M., and Meyer, H.: kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation, EGUsphere [preprint], https://doi.org/10.5194/egusphere-2023-1308, 2023.
 #' \item Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1– 13.
 #' \item Meyer, H., Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications. 13.

diff --git a/README.md b/README.md
@@ -9,6 +9,8 @@ https://hannameyer.github.io/CAST/
 
 ## Tutorials
 
+* [The CAST package for training and assessment of spatial prediction models in R](https://arxiv.org/abs/2404.06978)
+
 * [Introduction to CAST](https://hannameyer.github.io/CAST/articles/cast01-CAST-intro.html)
 
 * [Visualization of nearest neighbor distance distributions](https://hannameyer.github.io/CAST/articles/cast02-plotgeodist.html)
@@ -29,6 +31,8 @@ https://www.youtube.com/watch?v=mkHlmYEzsVQ.
 
 ## Scientific documentation of the methods
 
+* Meyer, H., Ludwig, M., Milà, C., Linnenbrink, J., Schumacher, F. (2024): The CAST package for training and assessment of spatial prediction models in R. arXiv, https://doi.org/10.48550/arXiv.2404.06978.
+
 ### Spatial cross-validation
 * Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1– 13.
 https://doi.org/10.1111/2041-210X.13851

diff --git a/man/CAST.Rd b/man/CAST.Rd
diff --git a/tests/testthat/test-fss.R b/tests/testthat/test-fss.R
@@ -3,14 +3,15 @@ test_that("ffs works with default arguments and the splotopen dataset (numerical
   data("splotdata")
   splotdata = splotdata |> sf::st_drop_geometry()
   set.seed(1)
-  selection = ffs(predictors = splotdata[,6:16],
+  selection = ffs(predictors = splotdata[,6:12],
                   response = splotdata$Species_richness,
                   seed = 1,
                   verbose = FALSE,
-                  ntree = 5)
+                  ntree = 5,
+                  tuneLength = 1)
 
 
-  expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_15"))
+  expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_5", "bio_4"))
   expect_identical(selection$metric, "RMSE")
   expect_identical(selection$maximize, FALSE)
 
@@ -23,13 +24,14 @@ test_that("ffs works with default arguments and the splotopen dataset (include c
   data("splotdata")
   splotdata = splotdata |> sf::st_drop_geometry()
   set.seed(1)
-  selection = ffs(predictors = splotdata[,c(4,6:16)],
+  selection = ffs(predictors = splotdata[,c(4,6:12)],
                   response = splotdata$Species_richness,
                   verbose = FALSE,
                   seed = 1,
-                  ntree = 5)
+                  ntree = 5,
+                  tuneLength = 1)
 
-  expect_identical(selection$selectedvars, c("bio_6", "bio_12", "bio_15"))
+  expect_identical(selection$selectedvars, c("bio_6", "bio_12", "Biome", "bio_1", "bio_5"))
   expect_identical(selection$metric, "RMSE")
   expect_identical(selection$maximize, FALSE)
 })
@@ -40,37 +42,39 @@ test_that("ffs works for classification with default arguments",{
   splotdata = splotdata |> sf::st_drop_geometry()
   splotdata$Biome = droplevels(splotdata$Biome)
   set.seed(1)
-  selection = ffs(predictors = splotdata[,c(6:16)],
+  selection = ffs(predictors = splotdata[,c(6:12)],
                   response = splotdata$Biome,
                   verbose = FALSE,
                   seed = 1,
-                  ntree = 5)
+                  ntree = 5,
+                  tuneLength = 1)
 
-  expect_identical(selection$selectedvars, c("bio_4", "bio_6",  "bio_13",
-                                             "bio_14", "bio_12", "bio_8", "elev"))
+  expect_identical(selection$selectedvars, c("bio_4", "bio_8",  "bio_12",
+                                             "bio_9"))
   expect_identical(selection$metric, "Accuracy")
   expect_identical(selection$maximize, TRUE)
 
 })
 
 
-test_that("ffs works for withinSE = TRUE",{
-  data("splotdata")
-  splotdata = splotdata |> sf::st_drop_geometry()
-  splotdata$Biome = droplevels(splotdata$Biome)
-  set.seed(1)
-  selection = ffs(predictors = splotdata[,c(6:16)],
-                  response = splotdata$Biome,
-                  seed = 1,
-                  verbose = FALSE,
-                  ntree = 5,
-                  withinSE = TRUE)
+#test_that("ffs works for withinSE = TRUE",{
+#  data("splotdata")
+#  splotdata = splotdata |> sf::st_drop_geometry()
+#  splotdata$Biome = droplevels(splotdata$Biome)
+#  set.seed(1)
+#  selection = ffs(predictors = splotdata[,c(6:16)],
+#                  response = splotdata$Biome,
+#                  seed = 1,
+#                  verbose = FALSE,
+#                  ntree = 5,
+#                  withinSE = TRUE,
+#                  tuneLength = 1)
 
-  expect_identical(selection$selectedvars, c("bio_4", "bio_6",  "bio_12",
-                                             "bio_14", "bio_8"))
+#  expect_identical(selection$selectedvars, c("bio_4", "bio_8",  "bio_12",
+#                                             "bio_13","bio_14", "bio_5"))
 
 
-})
+#})
 
 
 

diff --git a/vignettes/cast01-CAST-intro.Rmd b/vignettes/cast01-CAST-intro.Rmd
@@ -221,7 +221,7 @@ ffsmodel <- ffs(st_drop_geometry(splotdata)[,predictors],
                     method="rf", 
                     tuneGrid=data.frame("mtry"=2),
                     verbose=FALSE,
-                    ntree=50,
+                    ntree=25, #make it faster for this tutorial
                     trControl=trainControl(method="cv",
                                            index = indices_knndm$indx_train,
                                            savePredictions = "final"))
@@ -240,7 +240,7 @@ By plotting the results of ffs, we can visualize how the performance of the mode
 plot(ffsmodel)
 ```
 
-See that the best model using all combinations of two variables. Based on the best performing twi variables, using any third variable could slightly increase the R². Any further variable could not improve the LLO performance.
+The plot first shows the performance of models using all combinations of two variables. Starting from the best-performing pair, adding a third variable slightly increases the R², and the same applies to a fourth variable. Any further variables could not improve the LLO performance.
 Note that the R² features a high standard deviation regardless of the variables being used. This is due to the small dataset that was used which cannot lead to robust results here.
 
 What effect does the new model has on the spatial representation of species richness?

diff --git a/vignettes/cast02-plotgeodist.Rmd b/vignettes/cast02-plotgeodist.Rmd
@@ -41,7 +41,7 @@ Here we can define some parameters to run the example with different settings
 
 ```{r, message = FALSE, warning=FALSE}
 seed <- 10 # random realization
-samplesize <- 300 # how many samples will be used?
+samplesize <- 250 # how many samples will be used?
 nparents <- 20 #For clustered samples: How many clusters? 
 radius <- 500000 # For clustered samples: What is the radius of a cluster?
 
@@ -200,14 +200,15 @@ We see that the nearest neighbor distances during cross-validation don't match t
 
 #### Nearest Neighbour Distance Matching CV
 
-A good way to approximate the geographical prediction distances during the CV is to use Nearest Neighbour Distance Matching (NNDM) CV (see [Milà et al., 2022](https://doi.org/10.1111/2041-210X.13851) for more details). NNDM CV is a variation of LOO CV in which the empirical distribution function of nearest neighbour distances found during prediction is matched during the CV process.
+A good way to approximate the geographical prediction distances during the CV is to use Nearest Neighbour Distance Matching (NNDM) CV (see [Milà et al., 2022](https://doi.org/10.1111/2041-210X.13851) for more details). NNDM CV is a variation of LOO CV in which the empirical distribution function of nearest neighbour distances found during prediction is matched during the CV process. Since NNDM CV is highly time-consuming, the k-fold version (kNNDM) may provide a good trade-off.
+See [Linnenbrink et al., 2023](https://doi.org/10.5194/egusphere-2023-1308) for more details on kNNDM.
 
 
 
 ```{r,message = FALSE, warning=FALSE, results='hide'}
 
 
-nndmfolds_clstr <- nndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
+nndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
 dist_clstr <- geodist(pts_clustered,co.ee,
                            sampling = "Fibonacci",
                            cvfolds = nndmfolds_clstr$indx_test, 
@@ -220,7 +221,7 @@ The NNDM CV-distance distribution matches the sample-to-prediction distribution
 
 ```{r,message = FALSE, warning=FALSE, results='hide'}
 
-nndmfolds_rand <- nndm(pts_random_co,  modeldomain=co.ee, samplesize = 2000)
+nndmfolds_rand <- knndm(pts_random_co,  modeldomain=co.ee, samplesize = 2000)
 dist_rand <- geodist(pts_random_co,co.ee,
                           sampling = "Fibonacci",
                           cvfolds = nndmfolds_rand$indx_test, 
@@ -232,31 +233,6 @@ plot(dist_rand, unit = "km")+scale_x_log10(labels=round)
 The NNDM CV-distance still matches the sample-to-prediction distance function.
 
 
-#### k-fold Nearest Neighbour Distance Matching CV
-Since NNDM CV is highly time consuming, the k-fold version may provide a good trade-off.
-See (see [Linnenbrink et al., 2023](https://doi.org/10.5194/egusphere-2023-1308) for more details)
-
-```{r,message = FALSE, warning=FALSE, results='hide'}
-
-knndmfolds_clstr <- knndm(pts_clustered, modeldomain=co.ee, samplesize = 2000)
-pts_clustered$knndmCV <- as.character(knndmfolds_clstr$clusters)
-
-ggplot() + geom_sf(data = co.ee, fill="#00BFC4",col="#00BFC4") +
-  geom_sf(data = pts_clustered, aes(color=knndmCV),size=0.5, shape=3) +
-  scale_color_manual(values=rainbow(length(unique(pts_clustered$knndmCV))))+
-  guides(fill = FALSE, col = FALSE) +
-  labs(x = NULL, y = NULL)+ ggtitle("spatial fold membership by color")
-
-
-dist_clstr <- geodist(pts_clustered,co.ee,
-                           sampling = "Fibonacci",
-                           cvfolds = knndmfolds_clstr$indx_test, 
-                           cvtrain = knndmfolds_clstr$indx_train)
-plot(dist_clstr, unit = "km")+scale_x_log10(labels=round)
-
-```
-
-
 
 ## Distances in feature space
 

diff --git a/vignettes/cast04-AOA-tutorial.Rmd b/vignettes/cast04-AOA-tutorial.Rmd
@@ -258,7 +258,7 @@ The AOA is then calculated (for comparison) using the model validated by random
 ```{r,message = FALSE, warning=FALSE}
 AOA_spatial <- aoa(predictors, model, LPD = TRUE, verbose = FALSE)
 
-AOA_random <- aoa(predictors, model_random, LPD = TRUE, verbose = FALSE)
+AOA_random <- aoa(predictors, model_random, LPD = FALSE, verbose = FALSE)
 ```
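
Note on the test changes in tests/testthat/test-fss.R: the predictor set shrinks from `splotdata[,6:16]` (11 columns) to `splotdata[,6:12]` (7 columns), and `tuneLength = 1` is added. As a rough sketch of why this reduces check time, the snippet below counts the worst-case number of candidate models in a forward feature selection that starts from all pairs of predictors and then adds one variable at a time. The helper `ffs_max_models()` and the grid sizes are hypothetical illustrations, not part of CAST. The count assumes the selection never stops early and ignores the number of CV resamples per model. The "before" grid size of 3 assumes caret's default `tuneLength`, which may produce fewer candidates when there are few predictors.

```r
# Hypothetical helper (not part of CAST): upper bound on the number of
# candidate models in a forward feature selection that starts from all
# combinations of `min_var` predictors and never stops early.
ffs_max_models <- function(n_predictors, min_var = 2) {
  stopifnot(n_predictors >= min_var)
  # initial step: every combination of `min_var` predictors
  initial <- choose(n_predictors, min_var)
  # later steps: with k variables selected, n - k candidates remain
  k <- seq_len(n_predictors)
  k <- k[k >= min_var & k < n_predictors]
  initial + sum(n_predictors - k)
}

before <- ffs_max_models(11)  # predictors = splotdata[, 6:16]
after  <- ffs_max_models(7)   # predictors = splotdata[, 6:12]

# Assumed tuning grid sizes: 3 before, 1 with tuneLength = 1
grid_before <- 3
grid_after  <- 1

cat("candidate models before:", before, "\n")  # 100
cat("candidate models after: ", after, "\n")   # 36
cat("tuning candidates before:", before * grid_before, "\n")  # 300
cat("tuning candidates after: ", after * grid_after, "\n")    # 36
```

Under these assumptions the worst case drops from 300 to 36 tuning candidates. The real saving depends on how early the selection stops and on the resampling scheme.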