<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>ReadMe</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
body {
color: navy;
font-family: Cambria, serif;
margin: 0 auto;
max-width: 960px;
}
blockquote {
border-left:.5em solid #eee;
padding: 0 2em;
margin-left:0;
max-width: 800px;
}
</style>
</head>
<body>
<h1>Brief Notes for ViBRANT Corpus</h1>
<h2>Introduction</h2>
<p>The corpus is intended to meet the need for gold standard data to assist in the development and evaluation of natural language processing tools for biodiversity literature. A particular feature of this corpus is the presence of clean (re-keyed) and dirty (OCR) versions of the same text. </p>
<h2>Contents</h2>
<p>The primary contents are selected texts from four volumes of the Biologia Centrali-Americana (BCA). Each folder contains 50 pages from one volume, in both clean and dirty text versions, together with their supporting annotation files.</p>
<p>Note: this corpus uses the <a href="http://brat.nlplab.org/standoff.html">brat standoff</a> format for annotation.</p>
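<p>As a rough illustration of how the standoff annotations can be read, the sketch below parses the text-bound ("T") lines of a brat <code>.ann</code> file and checks that their character offsets point back into the accompanying text. The sample sentence, the <code>Taxon</code> label and the helper name <code>parse_text_bound</code> are assumptions made for this example and are not taken from the corpus.</p>
<pre>
```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    """One text-bound ("T") annotation from a brat .ann file."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets, end exclusive
    text: str


def parse_text_bound(ann_source: str) -> List[TextBound]:
    """Collect the text-bound annotations; other annotation kinds are skipped."""
    result = []
    for line in ann_source.splitlines():
        if not line.startswith("T"):
            continue  # relations (R), events (E), attributes (A/M), notes (#), ...
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        for fragment in offsets.split(";"):  # ";" separates discontinuous fragments
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        result.append(TextBound(ann_id, label, spans, text))
    return result


# Hypothetical sample; the "Taxon" label is an assumption, not a corpus label.
sample_text = "Turdus grayi is common."
sample_ann = "T1\tTaxon 0 12\tTurdus grayi\n"

annotations = parse_text_bound(sample_ann)
for ann in annotations:
    start, end = ann.spans[0]
    # For a single-fragment annotation the offsets index straight into the text.
    assert sample_text[start:end] == ann.text
```
</pre>
<p>In practice the two strings would be read from a text file and its matching <code>.ann</code> file; the file names used in the corpus folders are not assumed here.</p>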
<p>Additional resources include:</p>
<ul>
<li>a further volume from the BCA (the index to the volume of illustrations of Birds), with the clean and dirty versions presented in separate folders; and</li>
<li>a text from Pensoft, automatically annotated on publication; as a born-digital publication, it has no OCR equivalent.
</li>
</ul>
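<p>Because most texts are supplied in both a clean and a dirty version, one simple use of the corpus is to measure how far the OCR output drifts from the re-keyed text. The sketch below does this with Python's standard <code>difflib</code> module. The two strings and the helper names <code>similarity</code> and <code>differing_spans</code> are hypothetical, and a real evaluation would more likely use a dedicated character or word error rate.</p>
<pre>
```python
from difflib import SequenceMatcher


def similarity(clean: str, dirty: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the strings are identical."""
    # autojunk=False stops difflib from discarding frequent characters on long inputs.
    return SequenceMatcher(None, clean, dirty, autojunk=False).ratio()


def differing_spans(clean: str, dirty: str):
    """Yield (operation, clean_fragment, dirty_fragment) for each mismatch."""
    matcher = SequenceMatcher(None, clean, dirty, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield tag, clean[i1:i2], dirty[j1:j2]


# Hypothetical strings, not taken from the corpus.
clean_line = "Biologia Centrali-Americana"
dirty_line = "Biologia Centrali-Amerioana"

score = similarity(clean_line, dirty_line)
mismatches = list(differing_spans(clean_line, dirty_line))
```
</pre>
<p>Here <code>mismatches</code> lists each stretch where the two versions disagree, which gives a quick view of typical OCR confusions.</p>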
<h2>Availability</h2>
<p>The corpus can be downloaded from <a href="https://git.scratchpads.eu/v">ViBRANT's git repository</a> as an anonymous user with the following command:<br />
<code>$ git clone https://git.scratchpads.eu/git/vibrantcorpus.git</code></p>
<h2>Licence</h2>
<p>As with all content produced by the ViBRANT project, the corpus is released under the <a href="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons CC0 licence</a>.</p>
<h2>Acknowledgements</h2>
<p>This corpus was developed as part of the <a href="http://vbrant.eu">ViBRANT project</a>.<br />
ViBRANT was funded by the European Union 7th Framework Programme within the Research Infrastructures group.<br />
Contract no. RI-261532. Period: Dec. 2010 to Nov. 2013.<br />
Coordinator: <a href="http://vsmith.info">Dr Vince Smith</a>.<br />
E-mail: <a href="mailto:enquiries@vbrant.eu">enquiries@vbrant.eu</a></p>
<p>Thanks also to Anna Weitzman and Chris Lyal of the <a href="http://www.inotaxa.org">INOTAXA project</a> for making their project's re-keyed texts of the Biologia Centrali-Americana available for our research.</p>
<p>Thanks to <a href="http://www.pensoft.net/">Pensoft</a>, and especially Lyubomir Penev, for developing a publishing process that makes articles available in a machine-readable format, and for being passionately committed to open data.</p>
</body>
</html>
<!-- This document was created with MarkdownPad, the Markdown editor for Windows (http://markdownpad.com) -->