Project Status: Inactive. The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

Colloquery

Colloquery is a web application for searching bilingual phrase translation tables for phrase translations (collocations) as well as synonyms.

It is developed for Van Dale by the Centre for Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU Affero General Public License.

Colloquery screenshot

Installation

First, clone this repository and edit settings.py.

Colloquery is not trivial to set up and train, as it relies on numerous external dependencies.

On Debian/Ubuntu systems, these can be installed using sudo apt-get install python3 mongodb python3-mongoengine python3-django.

For the data generation step, the following additional dependencies are required:

To create phrase translation-tables in the first place, use the Moses training pipeline, which in turn invokes GIZA++:

Data Generation

  • Prepare your parallel corpus files. A parallel corpus consists of two plain-text, UTF-8 encoded files: one for the source language (corpus.fr in our example) and one for the target language (corpus.en). Make sure they are tokenised, lower-cased and contain one sentence per line (you can use ucto for this). Sentences on the same line number in the two files are considered translations of each other.

  • Train a phrase translation table using Moses:

    $ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir .  --parallel --corpus corpus --f fr --e en  --first-step 1 --last-step 8
    
  • Invoke the data generation pipeline of Colloquery, adjusting the thresholds as needed (see ./manage.py generatedata --help). This assumes a running and properly configured MongoDB instance:

    ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --sourcecorpus corpus.fr --targetcorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4
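
The corpus requirements from the first step (both files have the same number of lines, and every line is lower-cased) can be checked before starting the lengthy training run. The following is a minimal sketch, not part of Colloquery itself; the function names `check_parallel_corpus` and `check_files` are hypothetical, and the empty-line check is an added assumption about what "one sentence per line" implies. Tokenisation is not verified here.

```python
from itertools import zip_longest


def check_parallel_corpus(source_lines, target_lines):
    """Return (line_number, problem) tuples for a sentence-aligned corpus.

    Hypothetical helper: it only checks the properties named in the text
    (equal length, lower-cased) plus non-empty lines.
    """
    problems = []
    pairs = zip_longest(source_lines, target_lines)
    for number, (source, target) in enumerate(pairs, start=1):
        if source is None or target is None:
            # One file ran out of lines before the other.
            problems.append((number, "files differ in length"))
            break
        for side, sentence in (("source", source), ("target", target)):
            text = sentence.rstrip("\n")
            if not text.strip():
                problems.append((number, f"empty {side} sentence"))
            elif text != text.lower():
                problems.append((number, f"{side} sentence is not lower-cased"))
    return problems


def check_files(source_path, target_path):
    """Run the check on two UTF-8 encoded files, e.g. corpus.fr and corpus.en."""
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, encoding="utf-8") as target:
        return check_parallel_corpus(source, target)
```

Calling `check_files("corpus.fr", "corpus.en")` would return an empty list if none of these checks fail; any reported line numbers point at sentences to fix before running Moses.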
    

The Moses and data generation pipelines may take considerable time and system resources (most notably memory). Set sufficiently strict thresholds to keep the data from becoming unmanageably large.
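
To get a feel for how probability thresholds shrink a phrase table, the sketch below filters lines in the standard Moses phrase-table format (`source ||| target ||| scores ||| ...`). It is an illustration only and does not reproduce Colloquery's own filtering: it assumes that `--pst` and `--pts` bound p(source|target) and p(target|source) respectively, which should be verified against `./manage.py generatedata --help`. All function and parameter names are hypothetical.

```python
def parse_phrase_pair(line):
    """Split one Moses phrase-table line into (source, target, scores)."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(value) for value in fields[2].split()]
    return source, target, scores


def keep_phrase_pair(scores, min_p_source_given_target=0.2,
                     min_p_target_given_source=0.2):
    # Standard Moses score order: phi(f|e), lex(f|e), phi(e|f), lex(e|f),
    # where f is the source phrase and e the target phrase.
    # Assumption: these two probabilities are what --pst and --pts threshold.
    return (scores[0] >= min_p_source_given_target
            and scores[2] >= min_p_target_given_source)


def filter_phrase_table(lines, **thresholds):
    """Yield only the phrase pairs that pass both probability thresholds."""
    for line in lines:
        source, target, scores = parse_phrase_pair(line)
        if keep_phrase_pair(scores, **thresholds):
            yield source, target, scores
```

Counting how many pairs survive for a few candidate threshold values on a sample of the phrase table gives a rough idea of the resulting data size before committing to a full run.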