Colloquery is a web application for searching bilingual phrase translation tables for phrase translations (collocations) as well as synonyms.
It is developed for Van Dale by the Centre for Language and Speech Technology, Radboud University Nijmegen, and is licensed under the GNU Affero General Public License.
First, clone this repository and edit settings.py.
Colloquery is not trivial to set up and train, as it relies on several external dependencies:
- Python 3
- MongoDB
- mongoengine
- Django
On Debian/Ubuntu systems, these can be installed with:
$ sudo apt-get install python3 mongodb python3-mongoengine python3-django
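After installation, a quick check can confirm that the Python packages are visible to the interpreter you intend to use. The following is a minimal sketch; the helper name missing_modules is hypothetical, and the module names are assumed to match the packages listed above. It does not check MongoDB itself or package versions.

```python
import importlib.util

# Top-level module names assumed to correspond to the packages listed above.
REQUIRED_MODULES = ["mongoengine", "django"]


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules were found.")
```

Run it with python3 so that it inspects the same interpreter that will run manage.py.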
For the data generation step, the following additional dependencies are required:
- colibri-core (shipped as part of LaMachine)
- colibri-mt
To create phrase translation tables in the first place, use the Moses training pipeline, which in turn invokes GIZA++:
Prepare your parallel corpus files. A parallel corpus consists of two plain-text, UTF-8 encoded files: one for the source language (corpus.fr in our example) and one for the target language (corpus.en). Make sure they are tokenised, lower-cased and contain one sentence per line (you can use ucto for this). Sentences on the same line number in the two files are considered translations of each other.
Train a phrase translation table using Moses:
$ /path/to/moses/scripts/training/train-model.perl -external-bin-dir /path/to/moses/bin -root-dir . --parallel --corpus corpus --f fr --e en --first-step 1 --last-step 8
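Before starting the training run above, it may be worth sanity-checking the parallel corpus, since a mismatch in line counts silently misaligns every following sentence pair. The sketch below is not part of Colloquery or Moses; check_parallel_corpus and max_reports are hypothetical names. It only checks the properties named above (equal length, non-empty lines, lower-casing) and does not verify tokenisation.

```python
from itertools import zip_longest


def check_parallel_corpus(source_path, target_path, max_reports=10):
    """Return a list of human-readable problems found in a parallel corpus."""
    problems = []
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        # Read both files in lockstep; zip_longest pads the shorter one with None.
        for number, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                problems.append(f"line {number}: files differ in length")
                break
            for name, line in (("source", s), ("target", t)):
                text = line.strip()
                if not text:
                    problems.append(f"line {number}: empty {name} sentence")
                elif text != text.lower():
                    problems.append(f"line {number}: {name} sentence is not lower-cased")
            # Stop early once roughly max_reports problems have been collected.
            if len(problems) >= max_reports:
                break
    return problems


if __name__ == "__main__":
    import sys

    for problem in check_parallel_corpus(sys.argv[1], sys.argv[2]):
        print(problem)
```

Saved under a name of your choice, it takes the two corpus files as arguments, for example corpus.fr and corpus.en; no output means no problems were found.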
Invoke the data generation pipeline of Colloquery, adjusting the thresholds as needed (see ./manage.py generatedata --help). This assumes a running and properly configured MongoDB:
$ ./manage.py generatedata --title "YourCorpus" --phrasetable corpus.fr-en.phrasetable --sourcelang fr --targetlang en --sourcecorpus corpus.fr --targetcorpus corpus.en --pst 0.2 --pts 0.2 --divergencethreshold 0.1 --freqthreshold 4
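Because the data generation step needs a running MongoDB, a quick reachability check before a long run can save time. The sketch below only tests whether something accepts TCP connections on the given host and port; it assumes MongoDB listens on localhost at its default port 27017, which may differ from what your settings.py specifies, and it says nothing about authentication or database configuration. The function name mongodb_port_open is hypothetical.

```python
import socket


def mongodb_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    if mongodb_port_open():
        print("A service is accepting connections on localhost:27017.")
    else:
        print("Nothing is reachable on localhost:27017; is MongoDB running?")
```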
The Moses and data generation pipelines may take considerable time and system resources (most notably memory). Set sane thresholds to prevent the data from becoming unmanageably large.
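To get a feel for how thresholds affect data size, you can estimate how many phrase pairs would survive before running the full pipeline. The sketch below relies on the standard Moses phrase-table layout (fields separated by "|||", with the third field holding the scores, of which the first is P(source|target) and the third is P(target|source)). It is an assumption, not something stated here, that --pst and --pts are minimum values for those two probabilities; the function names are hypothetical and the actual filtering in Colloquery may differ.

```python
def passes_thresholds(line, pst=0.2, pts=0.2):
    """Return True if a Moses phrase-table line meets both probability thresholds."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        return False
    scores = [float(value) for value in fields[2].split()]
    if len(scores) < 3:
        return False
    p_source_given_target = scores[0]  # inverse phrase translation probability
    p_target_given_source = scores[2]  # direct phrase translation probability
    # Assumed interpretation: both probabilities must reach their threshold.
    return p_source_given_target >= pst and p_target_given_source >= pts


def count_surviving(lines, pst=0.2, pts=0.2):
    """Return (kept, total) over all non-empty phrase-table lines."""
    kept = total = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if passes_thresholds(line, pst, pts):
            kept += 1
    return kept, total
```

For example, passing an open file object for the phrase table to count_surviving gives a rough ratio of retained entries. If the table is gzip-compressed, as Moses output often is, open it with gzip.open(path, "rt", encoding="utf-8") instead.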