This corpus is intended to meet the need for gold-standard data for developing and evaluating natural language processing tools for biodiversity literature. A distinctive feature of the corpus is that it contains both clean (re-keyed) and dirty (OCR) versions of the same text.
The primary contents are selected texts from four volumes of the Biologia Centrali-Americana (BCA). Each folder contains 50 pages from one volume, with clean and dirty text versions of each page and their supporting annotation files.
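Because each page exists in both a clean and a dirty version, the two can be compared directly to gauge OCR quality. The following is a minimal sketch using Python's standard `difflib` module. The function names and the sample strings are hypothetical and invented for illustration; they are not taken from the corpus, and the folder and file layout is not assumed here.

```python
import difflib
from typing import List, Tuple


def similarity(clean: str, dirty: str) -> float:
    """Return a 0..1 similarity ratio between re-keyed and OCR text."""
    matcher = difflib.SequenceMatcher(None, clean, dirty, autojunk=False)
    return matcher.ratio()


def differing_spans(clean: str, dirty: str) -> List[Tuple[str, str, str]]:
    """List (operation, clean fragment, dirty fragment) for each mismatch."""
    matcher = difflib.SequenceMatcher(None, clean, dirty, autojunk=False)
    return [
        (tag, clean[i1:i2], dirty[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample strings standing in for one clean/dirty page pair.
clean_text = "Turdus grayi"
dirty_text = "Turdus gravi"
print(similarity(clean_text, dirty_text))
print(differing_spans(clean_text, dirty_text))
```

In practice the two strings would be read from the clean and dirty files for the same page. Note that `SequenceMatcher` is quadratic in the worst case, so it is best applied page by page rather than to a whole volume at once.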
Note: this corpus uses the brat standoff format for annotation.
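In the brat standoff format, annotations are kept in a separate `.ann` file and refer to the text by character offsets. A text-bound annotation is a tab-separated line holding an ID beginning with `T`, then a type followed by start and end offsets (with `;` separating the fragments of a discontinuous span), then the covered text. The sketch below reads only these text-bound lines and skips other line types such as relations and notes. The `TextBound` class, the function name, and the `Taxon` and `Location` labels in the sample are hypothetical; the label set actually used in this corpus is not assumed.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # annotation type
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str                     # text covered by the annotation


def parse_text_bound(ann_source: str) -> List[TextBound]:
    """Extract text-bound (T) annotations from the contents of a .ann file."""
    annotations = []
    for line in ann_source.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes, blanks
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line; ignored in this sketch
        ann_id, type_and_offsets, covered = parts[0], parts[1], parts[2]
        label, _, offsets = type_and_offsets.partition(" ")
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        annotations.append(TextBound(ann_id, label, spans, covered))
    return annotations


# Hypothetical sample content, not taken from the corpus.
sample = (
    "T1\tTaxon 0 12\tTurdus grayi\n"
    "R1\tFoundIn Arg1:T1 Arg2:T2\n"
    "T2\tLocation 20 25;30 36\tNorth Mexico\n"
)
for annotation in parse_text_bound(sample):
    print(annotation)
```

Because the offsets index into the accompanying text file, the same parser can be run on the annotation files for either version of a page, provided each annotation file is paired with the text it was created against.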
Additional resources include:
- another volume from the BCA, the index to the volume of illustrations of Birds; for this volume the clean and dirty versions are presented in separate folders, and
- a text from Pensoft, automatically annotated on publication; as a born-digital publication, this text has no OCR equivalent.
The corpus can be downloaded from ViBRANT's git repository as an anonymous user with the following command:
$ git clone https://git.scratchpads.eu/git/vibrantcorpus.git
As with all content produced by the ViBRANT project, the corpus is released under the Creative Commons CC0 licence.
This corpus was developed as part of the ViBRANT project.
ViBRANT was funded by the European Union 7th Framework Programme within the Research Infrastructures group.
Contract no. RI-261532. Period: Dec. 2010 to Nov. 2013.
Coordinator: Dr Vince Smith.
E-mail: [email protected]
Thanks also to Anna Weitzman and Chris Lyal of the INOTAXA project for making their project's re-keyed texts of the Biologia Centrali-Americana available for our research.
Thanks to Pensoft, and especially Lyubomir Penev, for developing a publishing process that makes articles available in a machine-readable format, and for being passionately committed to open data.