Skip to content

Latest commit

 

History

History
256 lines (173 loc) · 18.4 KB

Ecological-Niche-Modelling.en.adoc

File metadata and controls

256 lines (173 loc) · 18.4 KB

Ecological Niche Models

By the end of the module, you should be able to understand what it is you are modelling and the different approaches available for developing your model, what kind of data you will need to model ecological niches, and how you can assess the quality of the models you produce. You will finish with an introduction on how to you can project your models into novel environments.

ENM australia
Carlos-Júnior LA, Barbosa NP, Moulton TP, Creed JC. Ecological Niche Model used to examine the distribution of an invasive, non-indigenous coral. Mar Environ Res. 2015 Feb;103:115-24. doi: 10.1016/j.marenvres.2014.10.004.

What is an ecological niche model?

An ecological niche model is an equation or set of equations that describe the ecological niche of a species. Niche modeling essentially does quantitatively, what natural historians do qualitatively, predicting suitable habitat for a species in geographic space. Models identify the characteristics of where a species has been found or observed and use that information to predict where else might be suitable habitat. Modelling approaches allow for the inference of a species without the need for experimental manipualtion that may be too costly, logigistically impossible and/or unethical. These approaches also allow us to use existing data that we may have collected or observed ourselves, harvested from literature or obtained from data aggregators such as GBIF.

Note
In this video (07:14), you will review the different concepts of a niche.

Uses

Beyond defining the limits of a species niche, ecological niche models have a number of practical uses including:

  • The designation of protected areas - comparisons of protected area coverage and niche models of interest species can identify gaps in protected areas coverage that merit increased protection. By stacking niche models for different species, we can also identify areas where biodiversity is concentrated and maximise the protection of as many species as possible e.g. Leathwick et al., 2008

  • Invasive species management - We can build niche models of known occurrences of an invasive species within its native range and project this information into regions of potential invasion to determine the potential invasive threat of the species e.g. Carlos-Júnior et al., 2015.

  • Climate change mitigation - using models to define the characteristics of a species' niche, we can then predict the availability of suitable habitat under different climate scenarios and thus infer the impacts of climate change e.g. Aguilar et al. 2015

Commonly used algorithms

Algorithms to develop ecological niche models can be divided into three categories: presence-only, presence-background and presence-absence.

Presence-only algorithms focus solely on the environmental values linked to each occurrence record for calibration and create environmental envelopes, which are ellipsoids, squares, or convex-hull that surround the occurrences in an environmental space. These models are not very predictive Calibration of these modes is insensitive to changes in the extent of the study area. Commonly used algorithms include Bioclim and NicheA.

bioclim

Presence-absence algorithms need a set of localities where the organism occurs (i.e., presence) and a set of localities where the organisms does not occur (i.e., absence). Presence-absence models are calibrated by comparing environmental conditions where the organism is present vs. where it is absent and are generally useful to reconstruct the distribution at fine scale and short periods, resulting in the need of accurate localities and high-resolution environmental variables. These models, however, have limited capacities to be projected to different areas or periods, instead, their signals are space and time specific. Many algorithms are available including regression (e.g., Generalized Linear Models and Generalized Additive Models) and classification (e.g., Boosted Regression Trees, Random Forest, and Support Vector Machines) algorithms, with protocols described in detail elsewhere

Absence data is limited in availability and is of questionable quality, as it is difficult to confirm the absence of a species from a particular area. To overcome the issues, presence-background models simulate absence data by generating random points (i.e., fake absence data) across the study area to be able to use presence-absence algorithms. Because the background corresponds to the study area, calibration of these algorithms is highly sensitive to variations in the extent of the study area extent selected. A popular ecological niche modeling algorithm using this approach is Maxent, and wll will focus on this algorithm throughout this course.

maxent
Elith, J., Phillips, S.J., Hastie, T., Dudík, M., Chee, Y.E. and Yates, C.J. (2011), A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17: 43-57. https://doi.org/10.1111/j.1472-4642.2010.00725.x

Environmental variables

Environmental variables, also known as environmental data, explanatory variables, bioclimatic data or covariates are anything that can be summarized by a raster (gridded dataset). These variables are used to characterize the niche of a species. The data can be either continuous or categorical (i.e. data expressed as vectors), direct measurements or derived products, static or dynamic or terrestrial, aquatic or atmospheric.

Environmental data

The tables below give examples of how these data can be classified.

Continuous Categorical

Elevation, bathymetry

Geology, Ecosystem

Direct Measurement Derived Product

Remotely sensed data (raw), weather station data

climatology data, GCMs, derived remotely sensed data

Static Dynamic

Altitude, bathymetry, slope, aspect, soil charecteristics

temperature, precipitation, sea surface height

Terrestrial Aquatic Freshwater Atmospheric

Climate, terrain, vegetation/land cover, soil

Sea surface temperature, bathymetry, pH, salinity

Flow rates, accumulation, temperature

Wind (UV), radiation

Common sources of data

  • WorldClim (Terrestrial)

  • EarthEnv (Terrestrial and Freshwater)

  • Bio-Oracle (Marine)

  • National Geophysical Data Center (Terrestrial and Marine)

  • National Snow and Ice Data Center (Terrestrial and Marine

  • World Ocean Atlas (Marine)

  • Raw GCM outputs (ALL)

WorldClim is the most commonly-used climate data consisting of 19 derived bioclimatic variables (“BioClim”). These are typically divided into “quarters” (warmest quarter, driest quarter) and are related to seasonality. WorldClim also produces past and future modeled climate * Past: HCO, LGM, LIG * Future: to 2100 AD

But there are other sources e.g. http://ecoclimate.org/ that stretch back farther. These are often not just climate models but also models of land position/amount. These past and future models differ in that past models are parameterized and testable using direct evidence, whereas future models are based on forcing variables (e.g. CO2)

Selecting covariates (or environmental variables)

More environmental data isn’t always better. You want to balance to achieve a balance between the number of data points and the number of environmental variables so that you do not overfit you model. When selecting variables we want to be sure that:

  • our variables are biologically relevant - they should reflect the species of study’s biology e.g. solar radiation my not be a relevant environmental variable for soil dwelling species

  • our variables are not highly correlated - for instance, if we take the two variables: elevation and temperature. Temperature is not independent of elevation so we may want to remove one of these variables. In this instance, elevation would be preferably removed as it is more accurately measured.

  • we do not use all 19 Bioclim variables

Importantly, spatio-temporal resolution and covariate data extent should align with:

  • the limitations of other input data (e.g., available usable occurrence data)

  • the scope of the base question(s)/hypotheses

For example, if your environmental data have a spatial resolution of 10 Arc Minutes and a temporal resolution between 1955 and 2006, then the temporal and spatial resolution of the GBIF-meadited data you are going to use should correspond to those resolutions.

Training regions

Training regions (or study areas) are the areas from which model algorithms sample the background for model inference. In the case of presence-background models such as Maxent, this will be the area from which the model will randomly pick pseudoabsences that are use for calibrating the model. The training area can be thought of as the areas where the species could potentially experience envinronmental conditions. The species may not actually occur there, but it is possible that the species can reach those areas. Points to consider when delimiting your training regions are:

  • Where did the species originate?

  • How far can the species diserse?

  • Are there any biogeographic barriers that would prevent the dispersal of the species?

  • it should not be a rectangle

  • it should not correspond to political boundaries

  • it should not be a coarse range delimitation (e.g. range map)

  • bigger is not better

Training region

In the above example, the isthmus of Panama acts as effective barrier to the isolation of the Panamic porkfish to the Pacific and the Porkfish to the Caribbean. Training regions for each species would not contain areas on the opposite side of the Isthmus from where the species was found.

Interpretation and Post-Processing of Niche Models

You are now ready to build your model and this means deciding on the level of complexity of your model. This is done through two key factors: feature classes and the regularization multiplier. Feature classes determine the shape of available modeled relationships in environmental space and the more feature classes chosen, the higher the potential for model complexity. The regularization multiplier penalizes complexity to a greater degree, with higher values leading to simpler models with fewer variables. For these reasons, evaluating model performance and estimating optimal model complexity constitute important elements of a niche/distributional modeling for examples simultaneously varying the feature classes allowed and the regularization multiplers applied to each of them. Phillips, S.J., & Dudík, M. (2008). Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography. 31: 161-175.

Model Evaluation

You will have to assess the model’s precision and significance — that is, whether the model can correctly predict independent presence (or absence) data and whether the model prediction is better than null expectations. Outputs for your model will include variable response curves and a number of statistics that can be used for assessing the performance of your model.

Variable Response Curves

Variable response curves are model outputs that describe how well your model has characterised how the species responds to the variable. Approximately normal curves may indicate better estimates of the fundamental niche of the species e.g.

variableresponse

Curves that deviate from normal distributions or are flat, may indicate that the variable may not be a good estimator of a species’s fundamental nicehe. However, some variables such as ice concentrations, the lower curve in the diagram above, do not work like that - very few species can live enclosed in ice!

Statistics

In the ideal modeling scenario…​ You would seek to identify the ideal model calibration for your data and modeling intent, by comparing:

  • multiple calibration scenarios for an individual algorithm and

  • the best model calibration scenario across multiple algorithms

In the use cases, where you will be dipping your toes into the major theoretical concepts underpinning ENM/SDM, you’ll be looking at only 1 algorithm.

Many options exist for evaluating model calibration scenarios.

Common and accepted approaches are:

  • Akaike Information Criterion (AIC) - AIC is a log likelihood based evaluation metric, commonly used within regression methods. It compares and identifies the best model calibration scenario for an individual statistical algorithm. It balances model fit with model complexity but can NOT be used to compare between different algorithms. We can evaluate the performance of a model i.e. “which model performed better” by choosing the model with the lowest AIC. However, when AICs are only within 2 points of each other, these do not differ significantly and you will need to look at other factors (e.g., variable contribution through variable response curves) that may suggest which (if any) of the equivalent models is more ideal

  • Omission Rate (OR) - compares model performance across algorithms. It is a method of evaluating a model’s ability to accurately predict to test data (typically after applying a threshold). When OR = 0, then no presences were predicted as absent and the model has performed well.

Thresholding a Niche Model

Thresholding is the process by which we convert the continuous (raw) output, or continuous suitability surface, from a statistical model to a binary output. The binary output is generally interpreted as areas that are suitable/not suitable for the species. Models are rarely perfect and it is likely that they will predict species as being present where they are not actually present (commission errors) and, conversely, absent where they actually occur (omission errors). When we threshold out model we want to decide on a threshold at which we are minimising both commission and omission errors. If we have threshold value of 100 then all areas are suitable for the species and we will have a high number of commission errors and the number of omission errors will approach 0.

Species is present Species is absent

Model predicts species as present

Accurate

Type 1 Error (commission)

Model predicts species as absent

Type 2 Error (omission)

Accurate

We choose the “threshold” value that determines a presence versus an absence of the species using the: - Minimum Training Presence (MTP) - this threshold assumes that the least suitable habitat at which the species is known to occur is the minimum suitability value for the species - MTP + user-selected error rate (e.g., E=5%, E=10%) - a user-selected threshold that omits all regions with habitat suitability lower than the suitability values for the lowest 5% or 10% of occurrence records. It assumes that the percentage of occurrence records in the least suitable habitat do not occurr in regions that are representative of the species overall habitat, and thus should be omitted. This threshold omits a greater region than the MTP.

threshold

Precise method by which you do this depends on the quality of the data that you used to build the model.

Projecting a Niche Model

You project a niche model when you map your model onto the training region to find additional suitable habitat. You can also map your model into the past or the future or into novel environments. You are asking, where can the species persist?

Projecting to your training region is the most common and simplest form. However, you can also project into different contemporaneous geographies too, for example:

  • target sampling in undersurveyed regions for rare organisms e.g. de Siqueira et al. 2009

  • predicting the existence of sister species e.g. Owens et al. 2013

  • predicting the invasive potential of introduced species.

We can also project into the past and the future, for example: * to hindcast distributions in the case of determining paleodistributions of modern taxa for identifying refugia e.g. Peterson and Nyári, 2007 * to forecast species distributions to identify range shifts due to cliamte change e.g. Wang et al., 2016.

The Big Caveat

Models are built using a specific set of occurrence data and environmental data and we do not know how our model will behave in new environments. Transferring a model across space and/or time may lead to extrapolation if the projected environments are novel relative to training environments. Model algorithms have three strategies for dealing with extrapolation of response curves into environmental conditions different than those existing in the region of model calibration, they can:

Truncate - designate all conditions outside of the calibration data range as unsuitable and thus not project beyond the training region Clamp - use the marginal values in the calibration area as the prediction for more extreme conditions in transfer areas thus potentially under predicting the full extent of the projected niche Extrapolate - extend the response curve based on trends obtained from calibration conditions or assumptions about the niche

It is left to the user whether they want their model to clamp or not.

Projection Uncertainty

MESS: Multivariate Environmental Suitability Surface is a measure of the similarity between the new environments and those in the training sample. They allow modelers to identify areas of model extrapolation in novel environments. It measures the similarity of any given point to a reference set of points, with respect to the chosen predictor variables. It reports the closeness of the point to the distribution of reference points, gives negative values for dissimilar points and maps these values across the whole prediction region. The map below is an example of a MESS with areas in red on the map highlighting areas of model extrapolation where into potentially unsuitable environments for the species.

mess