Few of the respondents to the online survey (Chapman 2006) reported that they randomize data as opposed to generalizing it. Reasons for not randomizing included the extra work and computation involved, the increased chance of mistakes being made, and the reduced reliability that users could place in the data.
Some respondents to the survey stated that they were comfortable with displaying presence/absence of sensitive data within large polygons or grid squares, because it still reflected the real data, but were aghast at the idea of deliberately ‘faking’ point coordinates so that locations appear to be precise representations but are randomly offset from the real data, i.e. the deliberate introduction of error. Whereas generalization creates and retains ‘true’ data, randomization creates deliberately ‘false’ data.
In the fields of data mining and the protection of privacy (including individual privacy in census data), bottom-up generalization is generally regarded as far more practical and scientifically defensible than randomization (see, for example, Wang et al. 2004, Dalvi and Keole 2015).
Other advantages of generalization are that it:
- scales up, allowing the use of a consistent methodology at different scales
- can be set to give different people different resolutions, depending on set roles, etc.
- can simply provide different scales of generalization for different categories of sensitivity
- is easier to implement for those with low technological expertise
Tip: Randomization is a methodology that I strongly recommend not be used.
One of the most common requirements for generalizing sensitive biodiversity information is to generalize the spatial locality or geographic coordinates (Chapman and Wieczorek 2020). Traditionally this has been done in many ways, and there has been little consistency in methodologies and very little documentation as to what has been done in each case. This has considerably reduced the value of the data for analysis, and often users are unaware that the data has even been modified.
Generalization (at least in a spatial sense) is usually of one of two types, namely:
- generalization to a grid (metric or geographic)
- generalization to a polygon (socio-political region, country, biogeographic region, watershed)
Many respondents to the survey (Chapman 2006) argued for the simplicity of generalization to a grid. The reasons given included:
- the simplicity of being able to vary the scale for different categories of sensitivity,
- the ease of maintenance and training, and
- the simplicity of creating suitable documentation.
Generalizing to a grid, while protecting the exact locations of sensitive taxa, also provides data in a format that is still usable by a majority of users, especially where a standard grid is used.
Figure 1 shows two methods that are commonly used to grid data. The first (recommended here) is a geographic grid (e.g. cartographic grids such as those based on geographic coordinates); the second is a metric grid (e.g. the Europe Equal Area Grid 2001, EPSG:19986). For how each of these should be handled with respect to determining location and uncertainty, see Chapman and Wieczorek (2020). We recommend the use of a geographic grid, as discussed below, because of the ease of preparation and documentation, and because biological data shared via the Darwin Core standard (Wieczorek et al. 2012) would need to be converted from a metric grid to a geographic grid before publication, with a consequent loss of precision. However, as noted in Chapman and Wieczorek (2020),
using the southwest corner as the coordinate for a point-radius georeference is wasteful, since the geographic radial would be from there to the farthest corner, which would be twice as far as it would be if the center of the grid cell was used instead. In any case, the characteristics of the grid should be recorded with the locality information.
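To illustrate the arithmetic behind that recommendation, here is a minimal Python sketch that derives a point-radius georeference for a geographic grid cell from its centre, with the radial running from the centre to the farthest corner. The function name is ours, and the metres-per-degree constant is a flat-earth approximation; a production georeferencing tool would use a proper geodesic calculation.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate; varies slightly with latitude

def grid_cell_point_radius(sw_lat, sw_lon, cell_deg):
    """Given a cell's south-west corner and size in decimal degrees, return
    the cell-centre coordinates and an approximate geographic radial in
    metres (the distance from the centre to the farthest corner)."""
    center_lat = sw_lat + cell_deg / 2
    center_lon = sw_lon + cell_deg / 2
    # Half the cell's height and width in metres (width shrinks with latitude).
    half_lat_m = (cell_deg / 2) * METERS_PER_DEGREE_LAT
    half_lon_m = half_lat_m * math.cos(math.radians(center_lat))
    return center_lat, center_lon, math.hypot(half_lat_m, half_lon_m)

# A 0.1-degree cell at ~35°S: the radial is roughly 7 km from centre to
# corner; anchoring on the south-west corner would roughly double it.
print(grid_cell_point_radius(-35.3, 149.1, 0.1))
```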
Where data is generalized to a geographic or biogeographic region (a polygon), the data has less usability for many analyses, but this was seen by many as a more secure way of ‘hiding’ sensitive data locations. Currently, GBIF.org has only limited ability to incorporate polygon data. This method has some parallels with the reporting of census results in many countries, where summaries are reported using Statistical Local Areas or census tracts to restrict possible identification of individuals (Australian Bureau of Statistics 2006). A key difference is that census results are summarized over many individuals within a region, whereas with biological data we want to hide the location of a single entity within an area. The method does de facto produce a summary, but that is not its primary intent. One problem with this method is that there is no guarantee that political (or even biogeographic) boundaries will remain constant over time, which further reduces the value of the data for many purposes. This has already been found to be a problem when comparing census data over time where census districts or tracts have altered (Noble et al. 2011).
If georeferences are given for data that has been generalized to a biogeographic or political region, the result can be quite misleading: a coastal species, for example, may end up with a georeference hundreds of kilometres inland, reducing its usefulness for analysis or data cleaning. Making such data available without suitable documentation can lead to quite disastrous results for users. In some cases it is better not to supply a georeference at all, but if one is supplied, then the documentation should be clear, with either a large Uncertainty Radius or a well-documented Spatial Fit (see discussion under Spatial fit below).
Good practice dictates that whatever you do to generalize the data, you document it so that users of the data know what reliance they can place on it. Note that when documenting what has been done, it is essential that both the coordinates and the coordinate uncertainty in meters be recorded. How this is best done can be seen in the Georeferencing Quick Reference Guide (Zermoglio et al. 2020).
I recommend that data providers who generalize their data do so using a standard methodology (see below) and document it accordingly. As most biodiversity data is currently made available using decimal degrees (Chapman and Wieczorek 2020), the recommended method means that protocols such as Darwin Core (Wieczorek et al. 2012) do not need modification, other than to allow for suitable metadata documentation.
The method recommended below allows for several levels of generalization that conform to Categories 1–4 described in Table 1.
The recommended method for generalization is:
| Category | Sensitivity | Georeference |
| --- | --- | --- |
| Category 1 | Extreme | Geographic coordinates not released, or data may be released by watershed/bioregion/county, rounded to 1 degree, etc. |
| Category 2 | High | Geographic coordinates rounded to 0.1 degree |
| Category 3 | Medium | Geographic coordinates rounded to 0.01 degree |
| Category 4 | Low | Geographic coordinates rounded to 0.001 degree |
| Not sensitive | Not sensitive | Geographic coordinates unrestricted |
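As a rough illustration of how these categories might be applied, the following Python sketch rounds decimal coordinates according to the table above. The function and dictionary names are hypothetical, and a real implementation would also record the documentation metadata discussed below.

```python
# Decimal places to retain for each sensitivity category (per the table above).
CATEGORY_PRECISION = {
    "Category 2": 1,  # High: round to 0.1 degree
    "Category 3": 2,  # Medium: round to 0.01 degree
    "Category 4": 3,  # Low: round to 0.001 degree
}

def generalize_coordinates(lat, lon, category):
    """Return generalized (lat, lon), or None where coordinates are withheld."""
    if category == "Category 1":      # Extreme: do not release point coordinates
        return None
    if category == "Not sensitive":   # unrestricted: release as held
        return (lat, lon)
    places = CATEGORY_PRECISION[category]
    return (round(lat, places), round(lon, places))

print(generalize_coordinates(-35.27603, 149.12736, "Category 3"))  # (-35.28, 149.13)
```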
The South African National Biodiversity Institute, in largely implementing the criteria as laid out in the previous document (Chapman and Grafton 2008), has implemented only two categories (SANBI 2010, SANBI 2016): the original, non-generalized data, and data generalized to one-quarter of a degree (0.25 degrees, a quarter-degree square or QDS). We believe that this is a more difficult generalization to implement and, unless fully and clearly documented, it could lead to a misleading level of precision. It also reduces the flexibility provided by the four-category method above.
It is important to document the method and level of generalization so that users are aware of what has been done to the data and what reliability they may place in it. Currently, neither Darwin Core nor the ABCD protocols provide fields for the recommended metadata. It has been proposed that these protocols be modified to accept such metadata (see Afterword), but in the meantime it is recommended that the information be recorded using existing Darwin Core terms at the record level (e.g. dwc:informationWithheld, dwc:dataGeneralizations or any of the 'Remarks' fields).
As far as the generalization of georeferences is concerned, it is important to record that the data has been generalized using a ‘decimal geographic grid’ and to record both of the following (illustrated in the sketch after this list):
- the precision of the data provided (e.g. 0.1 degree, 0.001 degree, etc.)
- the precision of the data stored or held (e.g. 0.0001 degree, 0.1 minute, 1 second, 100 m square, etc.)
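As an illustration of how this documentation might look using the existing Darwin Core terms mentioned above, the following Python sketch writes a single generalized record with dataGeneralizations and informationWithheld populated. The wording of the free-text values and the example figures are ours, not prescribed text.

```python
import csv, io

record = {
    "decimalLatitude": "-35.28",
    "decimalLongitude": "149.13",
    # Approximate radial from the centre of a 0.01 degree cell at this latitude.
    "coordinateUncertaintyInMeters": "720",
    "dataGeneralizations": (
        "Coordinates generalized to a 0.01 degree decimal geographic grid; "
        "original data stored to a precision of 0.0001 degree."
    ),
    "informationWithheld": "Precise coordinates withheld; available on request.",
}

# Write the record as a one-row Darwin Core style CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```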
The recommendations for metadata for inclusion in the Darwin Core Location Class (TDWG 2018) are set out in the Afterword. Once they (or similar) have been adopted, it is recommended that the appropriate fields be recorded and distributed with the data.
With plants especially, and with other taxa (such as insects), collectors often gather multiple specimens (duplicates, or parts of sets), usually on the order of four to six, though examples of more than 80 have been cited (Paul Morris 2007, personal communication, April). These duplicates or parts of sets are often sent to many institutions around the world. One problem that arises is that originating institutions may lose control of what happens to the information (including locality information) distributed to collections by those secondary institutions, remembering that the duplicates may have been distributed before the taxon was identified as sensitive.
In most cases this exchange of information is not a problem, but with sensitive taxa, it often is. The secondary institution may not know what are regarded as ‘sensitive taxa’ in the jurisdiction of the originating institution or may not have flagged that information. Sensitivity is not always information that can be distributed along with the collections, as it may not be known until much later that the species is endangered and/or sensitive. This issue is a difficult one, as simply labelling a taxon as sensitive may not be the answer: a taxon may be endangered in its native area (and thus sensitive) and may be a weed or pest in other areas, with locality information important for its control in both instances.
Identifying duplicates across institutions is not easy, as it is often difficult, especially for historic and legacy collections, to determine which specimens are duplicates. Some institutions, such as the Centro de Referência em Informação Ambiental (CRIA) in Brazil in its speciesLink project and the Atlas of Living Australia, use matching across a number of fields such as collector number, date and locality, while GBIF is developing a data-clustering algorithm. Currently, however, there is no universal global system available. The use of unique, persistent and resolvable Globally Unique Identifiers (GUIDs) (Page 2009, Richards 2010, Richards et al. 2011) will aid these processes in the longer term, but the implementation of specimen-level GUIDs still seems some way off. A recent paper by Nelson et al. (2018) makes a number of recommendations on minting, managing and sharing GUIDs for herbarium specimens, but until such techniques are more widely adopted, identifying duplicates across institutions will remain an issue.
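As a simplified illustration of the field-based matching used by systems such as speciesLink and the Atlas of Living Australia (the real systems are considerably more sophisticated), the following Python sketch builds a normalized key from collector, collector number, date and locality; records sharing a key are candidate duplicates. All names and example values here are ours.

```python
import re

def duplicate_key(record):
    """Build a normalized key from collector, collector number, date and
    locality; records sharing a key are candidate duplicates."""
    def norm(value):
        # Lower-case and collapse punctuation/whitespace to single spaces.
        return re.sub(r"[^a-z0-9]+", " ", (value or "").lower()).strip()
    return (
        norm(record.get("recordedBy")),
        norm(record.get("recordNumber")),
        norm(record.get("eventDate")),
        norm(record.get("locality")),
    )

a = {"recordedBy": "R. Brown", "recordNumber": "3412",
     "eventDate": "1803-05-10", "locality": "Port Jackson"}
b = {"recordedBy": "r brown", "recordNumber": "3412.",
     "eventDate": "1803-05-10", "locality": "PORT JACKSON"}
print(duplicate_key(a) == duplicate_key(b))  # True: candidate duplicates
```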