
Evolving bdchecks: a biodiversity data quality checks framework

Martynas Jočys edited this page Apr 12, 2021 · 1 revision

Background

The heterogeneous nature of biodiversity data and the exponential growth of data volume result in problems with data completeness, consistency, and reliability. During the last two decades, various tools have been developed to improve data quality, yet most data users still struggle to assess data quality and to compare quality issues between datasets. bdchecks is a set of two R packages that serve as a holistic system for performing, developing, and managing various biodiversity data checks. bdchecks offers features for different types of R users: (i) an interactive and user-friendly Shiny app for inexperienced R users; (ii) full command line functionality for more experienced R users; and (iii) an Admin app for advanced R users to easily edit, test, add, and manage their own collection of data checks. We successfully implemented about 50 core checks being developed by TDWG’s Biodiversity Data Quality Task Group. The objects of bdchecks were designed to match this standardized test structure.

Related work

We view bdchecks as a “factory” of data checks. To this end, its architecture enables the user to construct, document, test, and manage hundreds of different data checks, coupled with several user interfaces. To the best of our knowledge, this is uncharted territory in R, which means that the entire infrastructure had to be built from the ground up.

bdchecks is a core component of the bdverse, a family of R packages that form a general framework for facilitating biodiversity data science. One might even call it the beating heart of the bdverse. The more robust bdchecks is, the more powerful other bdverse features can become. This project focuses on the software engineering aspect of bdchecks and the development of new features and capabilities.

The bdverse (biodiversity data universe) is a biodiversity data quality toolkit constructed as a family of R packages (https://bdverse.org/). It allows users with and without programming capabilities to conveniently and coherently employ R for data exploration, quality assessment, data cleaning, and standardization. bdverse is a hierarchical package system. Its architecture comprises six core functionality units, of which four are already operative and two are under construction. These units are:

  • bdDwC: a Darwin Core field name standardizer, which facilitates data inclusiveness from any biodiversity data aggregator.
  • bdchecks: a biodiversity data quality checks system for performing, filtering, developing, and managing various biodiversity data checks.
  • bdclean: a user-friendly data cleaning workflow system, composed of questionnaires (to collect the user's specific needs) and data cleaning reports.
  • bdvis: an interactive biodiversity data visualization and dashboard system (under construction).
  • bdtools: an agile and modular tool framework for biodiversity data exploration (under construction).
  • bdverse: the main installation package (one package to rule them all), which also stores the Shiny apps launcher.

Currently, bdverse contains five Shiny apps and eleven R packages.

Details of your coding project

A bdchecks object (i.e., a data check) is built as an S4 object. S4 is one of R's object-oriented programming systems and is well-suited for building large systems that can evolve over time (Wickham 2019). Each data check is created from two files: a YAML file that holds all the necessary metadata, and an R function file. The R documentation (roxygen2 comments) is generated automatically from the metadata. In addition, a data test YAML file stores all the testing scenarios of each check, and each testing scenario is automatically converted to a unit test. A testing report, summarizing the expected result of each scenario versus its observed result, can be generated using the function perform_test_dc(). In addition to the bdchecks package, we developed two Shiny apps located in the bdchecks.app R package: a user Shiny app and a bdchecks Admin app. The Admin app provides a convenient user interface for editing and managing numerous data checks.
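The design described above can be illustrated with a minimal sketch. It is not the bdchecks implementation: the class DataCheckSketch, its slot names, the latitude_in_range check, and the helper functions are all hypothetical. The metadata and test scenarios are written as plain R lists standing in for parsed YAML files. Only decimalLatitude is taken from outside the sketch; it is a Darwin Core term.

```r
library(methods)

# Hypothetical S4 class; the slot names are illustrative assumptions,
# not the actual bdchecks class definition.
setClass(
  "DataCheckSketch",
  slots = c(
    name        = "character",
    description = "character",
    target      = "character",  # column the check reads
    fun         = "function"    # returns TRUE (pass) or FALSE (fail) per value
  )
)

# Metadata as it might look after parsing a metadata YAML file (assumed layout).
check_meta <- list(
  name        = "latitude_in_range",
  description = "Latitude is present and lies between -90 and 90",
  target      = "decimalLatitude"
)

# Combine metadata and the check function into one object.
new_check <- function(meta, fun) {
  new("DataCheckSketch",
      name = meta$name, description = meta$description,
      target = meta$target, fun = fun)
}

latitude_check <- new_check(
  check_meta,
  function(x) !is.na(x) & x >= -90 & x <= 90
)

# Derive roxygen2-style comment lines from the metadata.
roxygen_lines <- function(check) {
  c(paste0("#' @title ", check@name),
    paste0("#' @description ", check@description),
    paste0("#' @param x Values of the `", check@target, "` column"))
}

# Apply a check to the target column of a data frame.
run_check <- function(check, data) check@fun(data[[check@target]])

# Test scenarios as they might look after parsing a data test YAML file.
scenarios <- list(
  list(input = 45.2,     expected = TRUE),
  list(input = -91,      expected = FALSE),
  list(input = NA_real_, expected = FALSE)
)

# Run every scenario and tabulate expected versus observed results.
test_report <- function(check, scenarios) {
  observed <- vapply(scenarios, function(s) {
    record <- data.frame(s$input)
    names(record) <- check@target
    run_check(check, record)
  }, logical(1))
  expected <- vapply(scenarios, function(s) s$expected, logical(1))
  data.frame(expected = expected, observed = observed,
             pass = expected == observed)
}

report <- test_report(latitude_check, scenarios)
print(report)
```

Keeping the metadata separate from the function is what lets the documentation and the unit tests be derived mechanically. In this sketch, adding a check means adding one metadata list, one function, and a list of scenarios.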

The coding project key tasks are:

Expected impact

bdchecks may centralize available data checks, facilitate further development of novel data checks, improve user experience, and engage domain experts.

Skills required

R software engineering; object-oriented programming; R package development; test-driven development (TDD); Shiny app development. Major advantage: experience in working with biodiversity data.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

  • Povilas Gibas is bdchecks papa and a Ph.D. candidate in biomedical data analysis expected to graduate in 2021. Povilas joined bdverse in 2018 as a Google Summer of Code student and has since been perfecting the data-analysis workflows of bdchecks and bdDwC. He is interested in statistical computing, computational data-analysis workflows, and data visualization. When he is not working, you can find him on Stack Overflow, where he is learning and helping others to learn new things about R.
  • Tomer Gueta is the founding director of the bdverse project. He is a postdoctoral fellow at the Faculty of Civil and Environmental Engineering at the Technion, working with Prof. Yohay Carmel. His research deals with developing tools and methodologies for data-intensive biodiversity research. During the last three years, Tomer has served as a GSoC mentor with the R project organization.
  • Vijay Barve is the author and maintainer of bdvis and a key member of the bdverse development team. Vijay is a biodiversity data scientist and has been a GSoC student and mentor since 2012 with the R project organization. Vijay has contributed to several packages on CRAN.