KorAP/KorAP-XML-Krill

NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents

SYNOPSIS

$ korapxml2krill [archive|extract] --input <directory|archive> [options]

DESCRIPTION

KorAP::XML::Krill is a library to convert KorAP-XML documents to files compatible with the Krill indexer. The korapxml2krill command line tool is a simple wrapper of this library.

INSTALLATION

The preferred way to install KorAP::XML::Krill is to use cpanm.

$ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the korapxml2krill tool will be available on your command line immediately. The minimum requirement for KorAP::XML::Krill is Perl 5.32. Optionally installing Archive::Tar::Builder speeds up archive building. If Sys::Info is installed, it is used to calculate the number of available cores. In addition, to work with zip archives, the unzip tool needs to be present.

ARGUMENTS

$ korapxml2krill -z --input <directory> --output <filename>

Without arguments, korapxml2krill converts a directory of a single KorAP-XML document. It expects the input to point to the text level folder.

archive
$ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory (pointing to the corpus level folder) or one or more zip files as input.

extract
$ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

serial
$ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated as they are (so they may be premerged or globs). The --output directory is treated as the base directory where subdirectories are created based on the archive name. In case the --to-tar flag is given, the output will be a tar file.

slimlog
$ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters out all useless (i.e. successful) information from logs to simplify log checks. Expects no further options.

OPTIONS

--input|-i <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, korapxml2krill expects a folder of a single KorAP-XML document, while archive expects a KorAP-XML corpus folder or a zip file to batch process multiple files. extract expects zip files only.

archive supports multiple input zip files with the constraint that the first archive listed contains all primary data files and all meta data files.

-i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

-i 'file/news*.zip'

The extended input array will be sorted in length order, so the shortest path needs to contain all primary data files and all meta data files.
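The ordering rule above can be sketched in Python. This is an illustration only, not the tool's actual code: the helper name order_inputs is hypothetical, and keeping the input order for paths of equal length is an assumption of this sketch.

```python
import glob


def order_inputs(patterns):
    """Expand glob patterns and sort the resulting paths by length.

    Hypothetical helper. It mirrors the documented rule that the shortest
    path is expected to hold all primary data and meta data files.
    """
    paths = []
    for pattern in patterns:
        matches = glob.glob(pattern)
        # Keep the literal value if nothing matches (e.g. a plain file name).
        paths.extend(matches if matches else [pattern])
    # sorted() is stable, so equally long paths keep their relative order
    # (an assumption made for this sketch).
    return sorted(paths, key=len)


if __name__ == "__main__":
    print(order_inputs(["file/news.malt.zip", "file/news.tt.zip", "file/news.zip"]))
```

With the example archives from above, file/news.zip is the shortest path and would therefore be expected to contain the primary data and meta data.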

(The directory structure follows the base directory format that may include a . root folder. In this case further archives lacking a . root folder need to be passed with a hash sign in front of the archive's name. This may require quoting the parameter.)

To support zip files, a version of unzip needs to be installed that is compatible with the archive file.

The root folder switch using the hash sign is experimental and may vanish in future versions.

--input-base|-ib <directory>

The base directory for inputs.

--output|-o <directory|file>

Output folder for archive processing, or the document name for single output (optional). Writes to STDOUT by default, unless an output is mandatory due to further options.

--overwrite|-w

Overwrite files that already exist.

--token|-t <foundry>#<file>

Define the default tokenization by specifying the name of the foundry and optionally the name of the layer-file. Defaults to OpenNLP#tokens. This will directly take the file instead of running the layer implementation!

--base-sentences|-bs <foundry>#<layer>

Define the layer for base sentences. If given, this will be used instead of using Base#Sentences. Currently DeReKo#Structure and DGD#Structure are the only additional layers supported.

Defaults to unset.

--base-paragraphs|-bp <foundry>#<layer>

Define the layer for base paragraphs. If given, this will be used instead of using Base#Paragraphs. Currently DeReKo#Structure and DGD#Structure are the only additional layers supported.

Defaults to unset.

--base-pagebreaks|-bpb <foundry>#<layer>

Define the layer for base pagebreaks. Currently DeReKo#Structure is the only layer supported.

Defaults to unset.

--skip|-s <foundry>[#<layer>]

Skip specific annotations by specifying the foundry (and optionally the layer with a #-prefix), e.g. Mate or Mate#Morpho. Alternatively you can skip #ALL. Can be set multiple times.

--anno|-a <foundry>#<layer>

Convert specific annotations by specifying the foundry (and optionally the layer with a #-prefix), e.g. Mate or Mate#Morpho. Can be set multiple times.

--non-word-tokens|-nwt

Tokenize non-word tokens like word tokens (defined as matching /[\d\w]/). Useful to treat punctuations as tokens.
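The word-token criterion can be illustrated with a small Python sketch. The function name is_word_token is hypothetical, and Python's \w may differ in detail from Perl's for some characters, so this only approximates the pattern quoted above.

```python
import re

# The pattern quoted in the option description: a token counts as a
# word token if it contains at least one digit or word character.
WORD_CHAR = re.compile(r"[\d\w]")


def is_word_token(token):
    """Return True if the token matches the word-token pattern (sketch)."""
    return WORD_CHAR.search(token) is not None


if __name__ == "__main__":
    for token in ["Haus", "2024", ",", "..."]:
        print(repr(token), is_word_token(token))
```

Tokens such as "," or "..." do not match the pattern; these are the kind of tokens that --non-word-tokens additionally treats like word tokens.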

Defaults to unset.

--non-verbal-tokens|-nvt

Tokenize non-verbal tokens, marked in the primary data with the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

Defaults to unset.

--jobs|-j

Define the number of spawned forks for concurrent jobs of archive processing. Defaults to 0 (everything runs in a single process).

If sequential-extraction is not set to true, this will also apply to extraction.

Pass -1 to set the value automatically to 5 times the number of available cores, provided Sys::Info is available and can read the CPU count (see --job-count). Be aware that the report of available cores may not work under certain conditions. Benchmarking the processing speed for different numbers of jobs may be valuable.

This is experimental.
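As a rough illustration of the automatic setting, the following Python sketch computes 5 times the number of cores. The function name automatic_job_count is hypothetical, os.cpu_count() stands in for Sys::Info, and falling back to 0 (single process) when the core count cannot be read is an assumption of this sketch.

```python
import os


def automatic_job_count(cores=None):
    """Return 5 times the number of cores, as described for --jobs -1.

    Hypothetical helper. Falls back to 0 (single process) when the core
    count is unknown, which is an assumption and not documented behaviour.
    """
    if cores is None:
        cores = os.cpu_count()  # may return None on some platforms
    if not cores:
        return 0
    return 5 * cores


if __name__ == "__main__":
    print(automatic_job_count())
```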

--job-count|-jc

Print job and core information that would be used if -1 was passed to --jobs.

--koral|-k

Version of the output format. Supported versions are: 0 for legacy serialization, 0.03 for serialization with metadata fields as key-values on the root object, 0.4 for serialization with metadata fields as a list of "@type":"koral:field" objects.

Currently defaults to 0.03.

--sequential-extraction|-se

Flag to indicate that extraction should run sequentially, so that the --jobs value does not apply to extraction. Some systems may have problems with extracting multiple archives to the same folder at the same time. Can be flagged using --no-sequential-extraction as well. Defaults to false.

--meta|-m

Define the metadata parser to use. Defaults to I5. Metadata parsers can be defined in the KorAP::XML::Meta namespace. This is experimental.

--gzip|-z

Compress the output. Expects a defined output file in single processing.

--cache|-c

File to mmap a cache (using Cache::FastMmap). Defaults to korapxml2krill.cache in the calling directory.

--cache-size|-cs

Size of the cache. Defaults to 50m.

--cache-init|-ci

Initialize cache file. Can be flagged using --no-cache-init as well. Defaults to true.

--cache-delete|-cd

Delete cache file after processing. Can be flagged using --no-cache-delete as well. Defaults to true.

--config|-cfg

Configure the parameters of your call in a file of key-value pairs separated by whitespace:

overwrite 1
token     DeReKo#Structure
...

Supported parameters are: overwrite, gzip, jobs, input-base, token, log, cache, cache-size, cache-init, cache-delete, meta, output, koral, temporary-extract, sequential-extraction, base-sentences, base-paragraphs, base-pagebreaks, skip (semicolon separated), sigle (semicolon separated), anno (semicolon separated).

Configuration parameters will always be overwritten by passed parameters.
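The configuration format and the precedence rule can be sketched in Python as follows. This is not the tool's parser: the names parse_config and merge_options are hypothetical, and details such as comment handling are not covered by the description above and are therefore left out.

```python
# Parameters documented as semicolon separated lists.
LIST_KEYS = {"skip", "sigle", "anno"}


def parse_config(text):
    """Parse whitespace separated key-value pairs into a dict (sketch)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key in LIST_KEYS:
            config[key] = [v.strip() for v in value.split(";") if v.strip()]
        else:
            config[key] = value
    return config


def merge_options(config, cli_options):
    """Passed parameters always overwrite configuration parameters."""
    merged = dict(config)
    merged.update(cli_options)
    return merged


if __name__ == "__main__":
    cfg = parse_config("overwrite 1\ntoken     DeReKo#Structure\nskip Mate;CoreNLP#Morpho\n")
    print(merge_options(cfg, {"token": "OpenNLP#tokens"}))
```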

--temporary-extract|-te

Only valid for the archive and serial commands.

This will first extract all files into a directory and then will archive. If the directory is given as :temp:, a temporary directory is used. This is especially useful to avoid massive unzipping and potential network latency.

--to-tar

Only valid for the archive command.

Writes the output into a tar archive.

--sigle|-sg

Extract the given texts. Can be set multiple times. Currently only supported on extract. Sigles have the structure Corpus/Document/Text. In case the Text path is omitted, the whole document will be extracted. On the document level, the postfix wildcard * is supported.

--lang

Preferred language for metadata fields. In case multiple titles are given (on any level) with different xml:lang attributes, the language given is preferred. Because titles may have different sources and different priorities, non-specific language titles may still be preferred in case the title source has a higher priority.

--log|-l

The Log::Any log level, defaults to ERROR.

--quiet

Silence all information (non-log) outputs.

--help|-h

Print help information.

--version|-v

Print version information.

PERFORMANCE

There are some ways to improve performance for large tasks:

First unpack

Using the archive or serial command on one or multiple zip files can be very slow, as it needs to unpack small portions every time. It's better to use --temporary-extract to unpack the whole archive first into a temporary directory and then read the extracted files. This is especially important for remote archives.

Limit annotations

Per default, all supported annotation layers are sought. This can be limited by adding --skip '#ALL' and only listing the expected annotations with --anno.
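A call that limits the annotations could be assembled as in the following Python sketch. It only uses options documented above; the helper name build_archive_command and the input and output paths are hypothetical.

```python
import shlex


def build_archive_command(input_path, output_path, annotations):
    """Build an archive call that skips everything except the given layers."""
    command = [
        "korapxml2krill", "archive",
        "--input", input_path,
        "--output", output_path,
        "--skip", "#ALL",
    ]
    for anno in annotations:
        # --anno can be set multiple times, once per foundry#layer pair.
        command += ["--anno", anno]
    return command


if __name__ == "__main__":
    # Hypothetical paths; shlex.join quotes values containing '#'.
    cmd = build_archive_command("corpus.zip", "out", ["Mate#Morpho", "OpenNLP#Sentences"])
    print(shlex.join(cmd))
```

Building the call as a list avoids shell quoting problems with the hash sign, which a shell would otherwise interpret as the start of a comment when unquoted at the start of a word.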

Checking the parallel job count

By providing the number of parallel jobs using --jobs, the execution can be tailored to specific hardware environments.

ANNOTATION SUPPORT

KorAP::XML::Krill has built-in importers for some annotation foundries and layers developed in the KorAP project that are part of the KorAP preprocessing pipeline. The base foundry, with paragraphs, sentences, and the text element, is mandatory for Krill.

Base
  #Paragraphs
  #Sentences

Connexor
  #Morpho
  #Phrase
  #Sentences
  #Syntax

CoreNLP
  #Constituency
  #Morpho
  #NamedEntities
  #Sentences

CorpusExplorer
  #Morpho

CMC
  #Morpho

DeReKo
  #Structure

DGD
  #Morpho
  #Structure

DRuKoLa
  #Morpho

Glemm
  #Morpho

Gingko
  #Morpho

HNC
  #Morpho

LWC
  #Dependency

Malt
  #Dependency

MarMoT
  #Morpho

Mate
  #Dependency
  #Morpho

MDParser
  #Dependency

NKJP
  #Morpho
  #NamedEntities

OpenNLP
  #Morpho
  #Sentences

RWK
  #Morpho
  #Structure

Sgbr
  #Lemma
  #Morpho

Spacy
  #Morpho

Talismane
  #Dependency
  #Morpho

TreeTagger
  #Morpho
  #Sentences

UDPipe
  #Dependency
  #Morpho

XIP
  #Constituency
  #Morpho
  #Sentences

More importers are in preparation. New annotation importers can be defined in the KorAP::XML::Annotation namespace. See the built-in annotation importers as examples.

METADATA SUPPORT

KorAP::XML::Krill has built-in importers for some meta data variants that are part of the KorAP preprocessing pipeline.

I5

Meta data for all I5 files

Sgbr

Meta data from the Schreibgebrauch project

Gingko

Meta data from the Gingko project in addition to I5

ICC

Meta data for the ICC in addition to I5

NKJP

Meta data for the NKJP corpora

New meta data importers can be defined in the KorAP::XML::Meta namespace. See the built-in meta data importers as examples.

The I5 metadata definition is based on TEI-P5 and supports <xenoData> with <meta> elements like

<meta type="..." name="..." project="..." desc="...">...</meta>

that are directly translated to Krill objects. The supported values are:

type

  string
    String meta data value

  keyword
    String meta data value that can be given multiple times

  text
    String meta data value that is tokenized and can be searched as token sequences

  date
    Date meta data value (as "yyyy/mm/dd" with optional granularity)

  integer
    Numerical meta data value

  attachment
    Non-indexed meta data value (only retrievable)

  uri
    Non-indexed attached URI, takes the desc as the title for links

name

The key of the meta object that may be prefixed by corpus or doc, in case the <xenoData> information is located on these levels. The text level introduces no prefixes.

project (optional)

A prefixed namespace of the key

desc (optional)

A description of the key

text content

The value of the meta object

About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP data model (Bański et al. 2013), where text data are stored physically separated from their interpretations (i.e. annotations). A text document in KorAP-XML therefore consists of several files containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

- data.xml
- header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The data.xml contains the primary data, the header.xml contains the metadata, and the annotation layers are stored in subfolders like base, struct or corenlp (so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5 (Lüngen and Sperberg-McQueen 2012). See the documentation in KorAP::XML::Meta::I5 for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures (TEI Consortium; Lee et al. 2004). Annotation feature structures refer to character sequences of the primary text inside the text element of the data.xml. A single annotation containing the lemma of a token can have the following structure:

<span from="0" to="3">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex">
      <fs>
        <f name="lemma">zum</f>
      </fs>
    </f>
  </fs>
</span>

The from and to attributes refer to the character span in the primary text. Depending on the kind of annotation (e.g. token-based, span-based, relation-based), the structure may vary. See KorAP::XML::Annotation::* for various annotation preprocessors.
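To make the relation between an annotation and the primary text concrete, the following Python sketch reads the span shown above and looks up the annotated characters. It is not how KorAP::XML::Krill processes annotations: the function name read_lemma and the primary text "Zum Beispiel" are hypothetical, and treating to as an exclusive end offset is an assumption that fits the three-character lemma in the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the TEI feature structure elements in the example.
TEI = "{http://www.tei-c.org/ns/1.0}"

SPAN_XML = """
<span from="0" to="3">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex">
      <fs>
        <f name="lemma">zum</f>
      </fs>
    </f>
  </fs>
</span>
"""


def read_lemma(span_xml, primary_text):
    """Return the annotated surface string and its lemma (sketch)."""
    span = ET.fromstring(span_xml.strip())
    start, end = int(span.get("from")), int(span.get("to"))
    lemma = None
    for feature in span.iter(TEI + "f"):
        if feature.get("name") == "lemma":
            lemma = feature.text
    # Assumption: 'to' is an exclusive character offset.
    return primary_text[start:end], lemma


if __name__ == "__main__":
    print(read_lemma(SPAN_XML, "Zum Beispiel"))
```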

Multiple KorAP-XML documents are organized on three levels following the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012): corpus > document > text. Metadata can be stored on each level, and korapxml2krill merges it into a single metadata object per text. A corpus is therefore structured as follows:

+ <corpus>
  - header.xml
  + <document>
    - header.xml
    + <text>
      - data.xml
      - header.xml
      - ...
  - ...

A single text can be identified by the concatenation of the corpus identifier, the document identifier and the text identifier. This identifier is called the text sigle (e.g. a text with the identifier 18486 in the document 060 in the corpus WPD17 has the text sigle WPD17/060/18486, see --sigle).
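The composition of a text sigle can be sketched in Python. The names Sigle, parse_sigle and build_sigle are hypothetical and only illustrate the Corpus/Document/Text structure, including the document-level form without a text part that --sigle accepts.

```python
from typing import NamedTuple, Optional


class Sigle(NamedTuple):
    corpus: str
    document: str
    text: Optional[str] = None  # omitted for document-level sigles


def build_sigle(corpus, document, text):
    """Concatenate the three identifiers into a text sigle."""
    return "/".join((corpus, document, text))


def parse_sigle(sigle):
    """Split a sigle of the form Corpus/Document[/Text] (sketch)."""
    parts = sigle.split("/")
    if len(parts) == 2:
        return Sigle(parts[0], parts[1])
    if len(parts) == 3:
        return Sigle(parts[0], parts[1], parts[2])
    raise ValueError("expected Corpus/Document or Corpus/Document/Text: " + sigle)


if __name__ == "__main__":
    print(build_sigle("WPD17", "060", "18486"))
    print(parse_sigle("WPD17/060"))
```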

These corpora are often stored in zip files, which korapxml2krill can process directly. Corpora may also be split into multiple zip archives (e.g. one zip file per foundry), which is also supported (see --input).

Examples of KorAP-XML files are included in KorAP::XML::Krill in the form of a test suite. The resulting JSON format merges all annotation layers based on a single token stream.

References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011): KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012): "The New IDS Corpus Analysis Platform: Challenges and Prospects", Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). PDF

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013): "Robust corpus architecture: a new look at virtual collections and data access", Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25. PDF

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck, Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004): "Towards an international standard on feature structure representation", Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004), pp. 373-376. PDF

Harald Lüngen and C. M. Sperberg-McQueen (2012): "A TEI P5 Document Grammar for the IDS Text Model", Journal of the Text Encoding Initiative, Issue 3 | November 2012. PDF

TEI Consortium, eds: "Feature Structures", Guidelines for Electronic Text Encoding and Interchange. html

AVAILABILITY

https://github.com/KorAP/KorAP-XML-Krill

COPYRIGHT AND LICENSE

Copyright (C) 2015-2024, IDS Mannheim

Author: Nils Diewald

Contributor: Eliza Margaretha, Marc Kupietz

KorAP::XML::Krill is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS), member of the Leibniz-Gemeinschaft.

This program is free software published under the BSD-2 License.
