Commit dd85e68: initial commit
rsennrich committed Apr 25, 2016
Showing 22 changed files with 24,423 additions and 0 deletions.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 University of Edinburgh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
52 changes: 52 additions & 0 deletions README.md
@@ -0,0 +1,52 @@
Scripts for Edinburgh Neural MT systems for WMT 16
==================================================

This repository contains scripts and an example config used for the Edinburgh Neural MT submission (UEDIN-NMT)
for the shared translation task at the 2016 Workshops on Statistical Machine Translation (http://www.statmt.org/wmt16/).

The scripts facilitate the reproduction of our results, and serve as additional documentation (along with the system description paper).


OVERVIEW
--------

- We built translation models with Nematus (https://github.com/rsennrich/nematus)
- We used BPE subword segmentation to achieve open-vocabulary translation (https://github.com/rsennrich/subword-nmt); a toy sketch of the segmentation idea follows this list
- We automatically back-translated in-domain monolingual data into the source language to create additional training data. The data is publicly available here: http://statmt.org/rsennrich/wmt16_backtranslations/
- More details about our system will appear in the (upcoming) system description paper
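To give a rough idea of how BPE segmentation works, here is a toy sketch in Python (the merge rules below are invented for illustration; the real implementation, including learning the merges from data, lives in the subword-nmt repository above):

```python
# Toy BPE application: greedily apply merge operations, in the order they
# were learned, to the character sequence of a word.
def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

# Invented merge rules, for illustration only.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))  # ['low', 'er']
```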

SCRIPTS
-------

- preprocess : preprocessing scripts for Romanian that we found helpful for translation quality.
  We used the Moses tokenizer and truecaser for all language pairs.

- sample : sample scripts that we used for preprocessing, training and decoding. We used mostly the same settings for all translation directions,
with small differences in vocabulary size. Dropout was enabled for EN<->RO, but disabled otherwise.


- r2l : scripts for reranking the output of the (default) left-to-right decoder with a model that decodes from right-to-left.


LICENSE
-------

The scripts are available under the MIT License.

PUBLICATIONS
------------

The Edinburgh Neural MT submission to WMT 2016 is described in:

TBD

It is based on work described in the following publications:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015):
Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR).

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Neural Machine Translation of Rare Words with Subword Units. arXiv preprint.

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Improving Neural Machine Translation Models with Monolingual Data. arXiv preprint.
15 changes: 15 additions & 0 deletions preprocess/normalise-romanian.py
@@ -0,0 +1,15 @@
#!/usr/bin/env python3

#
# Normalise Romanian s-comma and t-comma
#
# author: Barry Haddow

import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    # Map cedilla variants to the correct comma-below forms:
    # U+015E/U+015F (S/s with cedilla) -> U+0218/U+0219 (S/s with comma below)
    line = line.replace("\u015e", "\u0218").replace("\u015f", "\u0219")
    # U+0162/U+0163 (T/t with cedilla) -> U+021A/U+021B (T/t with comma below)
    line = line.replace("\u0162", "\u021a").replace("\u0163", "\u021b")
    print(line, end="")
18 changes: 18 additions & 0 deletions preprocess/remove-diacritics.py
@@ -0,0 +1,18 @@
#!/usr/bin/env python3

#
# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised
#
# author: Barry Haddow

import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s")  # S/s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t")  # T/t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a")  # A/a-breve
    line = line.replace("\u00C2", "A").replace("\u00E2", "a")  # A/a-circumflex
    line = line.replace("\u00CE", "I").replace("\u00EE", "i")  # I/i-circumflex
    print(line, end="")
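For a quick sanity check, here is the combined effect of the two scripts on a sample sentence (the replacements are copied from the scripts above; the sentence itself is just an invented example):

```python
#!/usr/bin/env python3
# Sample sentence containing cedilla variants and diacritics.
sample = "Şi aceşti paşi ţin de fiecare ţară."

# Step 1: normalise cedilla forms to comma-below forms (normalise-romanian.py).
normalised = (sample.replace("\u015e", "\u0218").replace("\u015f", "\u0219")
                    .replace("\u0162", "\u021a").replace("\u0163", "\u021b"))
print(normalised)  # Și acești pași țin de fiecare țară.

# Step 2: strip the diacritics entirely (remove-diacritics.py).
stripped = (normalised.replace("\u0218", "S").replace("\u0219", "s")
                      .replace("\u021a", "T").replace("\u021b", "t")
                      .replace("\u0102", "A").replace("\u0103", "a")
                      .replace("\u00c2", "A").replace("\u00e2", "a")
                      .replace("\u00ce", "I").replace("\u00ee", "i"))
print(stripped)    # Si acesti pasi tin de fiecare tara.
```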
22 changes: 22 additions & 0 deletions r2l/README.md
@@ -0,0 +1,22 @@
RERANKING WITH RIGHT-TO-LEFT MODELS (R2L)
-----------------------------------------

For English<->German and English->Czech, we trained separate models with reversed target order, and used those models for reranking.

To use reranking with reversed (right-to-left) models, do the following:

1. Use reverse.py to reverse the word order on the target side of the training/dev sets.
   Use the same vocabulary, and apply reverse.py *after* truecasing/BPE, to simplify the mapping from l2r to r2l and back.

2. Train a separate model (or ensemble) with the reversed target side; we call this the r2l model.

3. At test time, produce an n-best list with the l2r model(s):

time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/translate.py -m model.npz -i test.bpe.de -o test.output.50best -k 50 -n -p 1 --n-best

4. Reverse the outputs in the n-best list, and re-score with the r2l model(s); a toy illustration of the reranking arithmetic follows these commands.

python reverse_nbest.py < test.output.50best > test.output.50best.reversed

time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/rescore.py -m /path/to/r2l_model/model.npz -s test.bpe.de -i test.output.50best.reversed -o test.output.50best.rescored -b 80 -n
python rerank.py < test.output.50best.rescored | python reverse.py > test.output.reranked
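For clarity, here is a toy illustration of what rerank.py computes on such a rescored n-best list (all data below is fake; in the real pipeline the hypotheses at this point are still in reversed order, which is why the output is piped through reverse.py):

```python
# Fake rescored n-best entries: sentence number ||| hypothesis ||| scores.
# Each hypothesis carries two negative log-probability scores (l2r + r2l),
# so lower sums are better; note the hypotheses are still reversed here.
nbest = [
    "0 ||| Haus ein ||| 1.2 0.9",
    "0 ||| Haus das ||| 0.8 0.7",
]
best = min(nbest, key=lambda l: sum(map(float, l.split(' ||| ')[2].split())))
print(best.split(' ||| ')[1])  # Haus das  (becomes "das Haus" after reverse.py)
```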
44 changes: 44 additions & 0 deletions r2l/rerank.py
@@ -0,0 +1,44 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich

import sys

if __name__ == '__main__':

    # optional argument: only consider the k best hypotheses per sentence
    if len(sys.argv) > 1:
        k = int(sys.argv[1])
    else:
        k = float('inf')

    cur = 0
    best_score = float('inf')
    best_sent = ''
    idx = 0
    for line in sys.stdin:
        # n-best format: sentence number ||| hypothesis ||| score(s)
        num, sent, scores = line.split(' ||| ')

        # new input sentence: print best translation of previous sentence, and reset stats
        if int(num) > cur:
            print(best_sent)
            cur = int(num)
            best_score = float('inf')
            best_sent = ''
            idx = 0

        # only consider the k best hypotheses
        if idx >= k:
            continue

        # scores are negative log-probabilities: lower is better
        score = sum(map(float, scores.split()))
        if score < best_score:
            best_score = score
            best_sent = sent.strip()

        idx += 1

    # end of file: print best translation of last sentence
    print(best_sent)
7 changes: 7 additions & 0 deletions r2l/reverse.py
@@ -0,0 +1,7 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    # reverse the token order of each line, preserving the trailing newline
    sys.stdout.write(' '.join(reversed(line.split())) + '\n')
9 changes: 9 additions & 0 deletions r2l/reverse_nbest.py
@@ -0,0 +1,9 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    # n-best format: sentence number ||| hypothesis ||| score(s)
    linesplit = line.split(' ||| ')
    # reverse only the hypothesis field
    linesplit[1] = ' '.join(reversed(linesplit[1].split()))
    sys.stdout.write(' ||| '.join(linesplit))
27 changes: 27 additions & 0 deletions sample/README.md
@@ -0,0 +1,27 @@
This directory contains some sample files and configuration scripts for training a simple neural MT model.


INSTRUCTIONS
------------

All scripts contain variables that you will need to set before running them.
For processing the sample data, only the paths to the different toolkits need to be set.
For processing new data, more changes will be necessary.

As a first step, preprocess the training data:

./preprocess.sh

Then, start training: on normal-size data sets, this will take about 1-2 weeks to converge.
Models are saved regularly, and you may want to interrupt this process without waiting for it to finish.

./train.sh

Given a model, preprocessed text can be translated as follows:

./translate.sh

Finally, you may want to post-process the translation output, namely merging BPE segments,
detruecasing and detokenizing; a minimal sketch of the BPE-merging step follows the command.

./postprocess-test.sh < data/newsdev2016.output > data/newsdev2016.postprocessed
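The BPE-merging part of the post-processing is simply the removal of the "@@ " continuation markers that subword-nmt inserts (the German sample line below is invented):

```python
# subword-nmt marks non-final subword units with "@@ "; deleting the marker
# restores full words (detruecasing and detokenization not shown).
line = "die Gesund@@ heits@@ reform wurde verab@@ schiedet"
print(line.replace("@@ ", ""))  # die Gesundheitsreform wurde verabschiedet
```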
45 changes: 45 additions & 0 deletions sample/config.py
@@ -0,0 +1,45 @@
import numpy
import os
import sys

sys.path.append('/path/to/nematus/nmt')

from nmt import train

VOCAB_SIZE = 90000
SRC = "ro"
TGT = "en"
DATA_DIR = "data/"



if __name__ == '__main__':
    validerr = train(saveto='model/model.npz',
                     reload_=True,
                     dim_word=500,
                     dim=1024,
                     n_words=VOCAB_SIZE,
                     n_words_src=VOCAB_SIZE,
                     decay_c=0.,
                     clip_c=1.,
                     lrate=0.0001,
                     optimizer='adadelta',
                     maxlen=50,
                     batch_size=80,
                     valid_batch_size=80,
                     datasets=[DATA_DIR + '/corpus.bpe.' + SRC, DATA_DIR + '/corpus.bpe.' + TGT],
                     valid_datasets=[DATA_DIR + '/newsdev2016.bpe.' + SRC, DATA_DIR + '/newsdev2016.bpe.' + TGT],
                     dictionaries=[DATA_DIR + '/corpus.bpe.' + SRC + '.json', DATA_DIR + '/corpus.bpe.' + TGT + '.json'],
                     validFreq=10000,
                     dispFreq=1000,
                     saveFreq=30000,
                     sampleFreq=10000,
                     use_dropout=False,  # the dropout_* settings below only take effect if True
                     dropout_embedding=0.2,  # dropout for input embeddings (0: no dropout)
                     dropout_hidden=0.2,  # dropout for hidden layers (0: no dropout)
                     dropout_source=0.1,  # dropout source words (0: no dropout)
                     dropout_target=0.1,  # dropout target words (0: no dropout)
                     overwrite=False,
                     external_validation_script='validate.sh')
    print(validerr)
3 changes: 3 additions & 0 deletions sample/data/.gitignore
@@ -0,0 +1,3 @@
*.tok.*
*.tc.*
*.bpe.*