Monotextor generates the final monolingual corpora in multiple formats. These files are placed in the permanentDir folder and follow the naming convention {lang}.{prefix}.gz, where {prefix} is a descriptor of the format and/or granularity of the data.
The file that is always generated (regardless of configuration) is {lang}.raw.gz. This file comes in Moses format, i.e. tab-separated columns.
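As a rough illustration of the naming convention and the tab-separated layout, the following sketch builds an output path and reads a compressed file line by line. The helper names `output_path` and `read_rows` are hypothetical and not part of Monotextor, and UTF-8 encoding is an assumption.

```python
import gzip
from pathlib import Path


def output_path(permanent_dir, lang, descriptor):
    """Build the path {lang}.{descriptor}.gz inside permanent_dir.

    Hypothetical helper: it only mirrors the naming convention described above.
    """
    return Path(permanent_dir) / f"{lang}.{descriptor}.gz"


def read_rows(path):
    """Yield each line of a gzip-compressed, tab-separated file as a list of fields."""
    # UTF-8 is an assumption; newline="\n" splits records on line feeds only.
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")
```

For example, iterating over `read_rows(output_path("permanent", "en", "raw"))` would yield one list of column values per line of en.raw.gz, assuming that file exists in a folder named permanent.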
- {lang}.raw.gz: monolingual corpus that contains every sentence or paragraph. The file has deduplication and the content is not filtered. This file contains columns added by different optional modules/features: paragraph identification, deferred, Monofixer and Monocleaner. If any of these is not enabled, the corresponding columns are omitted. The possible fields that may appear in this file are (in this order):
  - url text: default columns
    - url is the source document of the text
    - text is the content in {lang}
  - paragraph_id: paragraph identification data (initial position of the sentence in the paragraph, and initial position of the paragraph in the document)
  - deferred_hash: deferred hash of the text; may be used to reconstruct the original corpus using Deferred crawling reconstructor
  - monofixer_hash monofixer_score: Monofixer output
    - monofixer_hash tags duplicate or near-duplicate text
    - monofixer_score rates quality of duplicate or near-duplicate text
  - monocleaner_lang_id monocleaner_score: Monocleaner classifier output
    - monocleaner_lang_id is the language that Monocleaner detects using FastSpell
    - monocleaner_score is the fluency score of Monocleaner for the text
  This file comes accompanied by the corresponding statistics file {lang}.stats.raw, which provides information about the size of the corpus in MB and in number of tokens.
- {lang}.sent.gz: monolingual corpus with a granularity of sentences, which is generated if skipSentenceSplitting: false is set or the option is not provided. The content of the file is the same as that of {lang}.raw.gz.
- {lang}.raw.paragraphs.gz: monolingual corpus with a granularity of paragraphs, which is generated if skipSentenceSplitting: true is set. The content of the file is the same as that of {lang}.raw.gz.
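Because the columns of {lang}.raw.gz depend on which optional modules are enabled, a consumer needs to know the expected header before parsing. The sketch below derives the column order listed above from four boolean flags and maps a line onto it; it also picks the granularity-specific file name from skipSentenceSplitting. The function names and flag parameters are hypothetical (they are not Monotextor configuration keys), and all values are kept as strings since the exact value formats are not specified here.

```python
def raw_columns(paragraph_id=False, deferred=False, monofixer=False, monocleaner=False):
    """Return the column names of {lang}.raw.gz, in order, for the enabled features."""
    columns = ["url", "text"]  # default columns, always present
    if paragraph_id:
        columns.append("paragraph_id")
    if deferred:
        columns.append("deferred_hash")
    if monofixer:
        columns += ["monofixer_hash", "monofixer_score"]
    if monocleaner:
        columns += ["monocleaner_lang_id", "monocleaner_score"]
    return columns


def parse_raw_line(line, columns):
    """Map one tab-separated line onto the given column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} columns, found {len(fields)}")
    return dict(zip(columns, fields))


def granular_file(lang, skip_sentence_splitting=False):
    """Name of the granularity-specific output for the given setting."""
    if skip_sentence_splitting:
        return f"{lang}.raw.paragraphs.gz"
    return f"{lang}.sent.gz"
```

Raising an error on a column-count mismatch is a design choice of this sketch: it makes a wrong assumption about the enabled modules visible immediately, rather than silently shifting values into the wrong fields.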