Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

Colloquery

Colloquery is a web application for searching bilingual phrase translation tables for phrase translations (collocations) as well as synonyms.

It was developed for Van Dale by the Centre for Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU Affero General Public License.

Installation

First, clone this repository and edit settings.py.

Colloquery is not trivial to set up and train, as it relies on several external dependencies:

Python 3
MongoDB
mongoengine
Django

On Debian/Ubuntu systems, these can be installed using:

$ sudo apt-get install python3 mongodb python3-mongoengine python3-django

For the data generation step, the following additional dependencies are required:

colibri-core (shipped as part of LaMachine)
colibri-mt

To create phrase translation tables in the first place, use the Moses training pipeline, which in turn invokes GIZA++:

Moses
GIZA++

Data Generation

Prepare your parallel corpus files. A parallel corpus consists of two plain-text, UTF-8 encoded files: one for the source language (corpus.fr in our example) and one for the target language (corpus.en). Make sure both are tokenised, lower-cased, and contain one sentence per line (you can use ucto for this). Sentences on the same line number in the two files are considered translations of each other.
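
Because a misaligned corpus silently produces a poor phrase table, it can be worth checking the two files before training. The following is a minimal sketch, not part of Colloquery: the function name check_parallel_corpus is hypothetical, and the checks only cover the requirements stated above (equal line counts, no empty lines, lower-cased text). Tokenisation is not verified.

```python
from itertools import zip_longest


def check_parallel_corpus(source_path, target_path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found.

    Hypothetical helper for illustration only. Opening the files as UTF-8
    raises UnicodeDecodeError if either file is not valid UTF-8.
    """
    problems = []
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for lineno, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None
            if s is None or t is None:
                problems.append((lineno, "files differ in length"))
                break
            for name, line in (("source", s), ("target", t)):
                text = line.strip()
                if not text:
                    problems.append((lineno, f"{name} line is empty"))
                elif text != text.lower():
                    problems.append((lineno, f"{name} line is not lower-cased"))
    return problems
```

Calling check_parallel_corpus("corpus.fr", "corpus.en") and printing the returned list gives a quick overview of the lines that need attention before the Moses step.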

Train a phrase translation table using Moses:

$ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir . --parallel --corpus corpus --f fr --e en --first-step 1 --last-step 8

Invoke the data generation pipeline of Colloquery, adjusting the thresholds as needed (see ./manage.py generatedata --help). This assumes a running and properly configured MongoDB instance:

$ ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --sourcecorpus corpus.fr --targetcorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4

The Moses and data generation pipeline may take considerable time and system resources (most notably memory). Set sane thresholds to prevent the data from becoming unmanageably large.
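
To get a rough idea of how strongly a pair of probability thresholds prunes the data before running the full pipeline, one can count the phrase pairs that would survive. The sketch below is an illustration, not Colloquery's own filtering code. It assumes the standard Moses phrase-table layout (fields separated by |||, with the third field holding the four scores p(source|target), lex(source|target), p(target|source), lex(target|source)), and it assumes that --pst and --pts are minimum values for p(source|target) and p(target|source) respectively. The function names are hypothetical.

```python
def parse_phrasetable_line(line):
    """Split one Moses phrase-table line into (source, target, p(s|t), p(t|s))."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(x) for x in fields[2].split()]
    # Assumed score order: p(s|t), lex(s|t), p(t|s), lex(t|s)
    return source, target, scores[0], scores[2]


def filter_phrasetable(lines, pst=0.2, pts=0.2):
    """Yield the phrase pairs whose probabilities meet both thresholds.

    pst and pts are assumed to be lower bounds on p(source|target) and
    p(target|source); check ./manage.py generatedata --help for the
    authoritative meaning.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        source, target, p_st, p_ts = parse_phrasetable_line(line)
        if p_st >= pst and p_ts >= pts:
            yield source, target, p_st, p_ts
```

Passing an open phrase-table file (or gzip.open(path, "rt", encoding="utf-8") for a compressed one) as lines and counting the yielded pairs for a few threshold values shows how quickly the table shrinks as the thresholds rise.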
