Commit dd85e68: initial commit
rsennrich committed Apr 25, 2016
Showing 22 changed files with 24,423 additions and 0 deletions.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 University of Edinburgh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
52 changes: 52 additions & 0 deletions README.md
@@ -0,0 +1,52 @@
Scripts for Edinburgh Neural MT systems for WMT 16
==================================================

This repository contains scripts and an example config used for the Edinburgh Neural MT submission (UEDIN-NMT)
for the shared translation task at the 2016 Workshops on Statistical Machine Translation (http://www.statmt.org/wmt16/).

The scripts facilitate the reproduction of our results, and serve as additional documentation (along with the system description paper).


OVERVIEW
--------

- We built translation models with Nematus (https://github.com/rsennrich/nematus)
- We used BPE subword segmentation to achieve open-vocabulary translation (https://github.com/rsennrich/subword-nmt); a toy sketch of the segmentation idea follows this list
- We automatically back-translated in-domain monolingual data into the source language to create additional training data. The data is publicly available here: http://statmt.org/rsennrich/wmt16_backtranslations/
- More details about our system will appear in the (upcoming) system description paper
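To give a rough idea of how BPE segmentation works, here is a toy sketch in Python (the merge rules below are invented for illustration; the real implementation, including learning the merges from data, lives in the subword-nmt repository above):

```python
# Toy BPE application: greedily apply merge operations, in the order they
# were learned, to the character sequence of a word.
def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

# Invented merge rules, for illustration only.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))  # ['low', 'er']
```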

SCRIPTS
-------

- preprocess : preprocessing scripts for Romanian that we found helpful for translation quality.
  We used the Moses tokenizer and truecaser for all language pairs.

- sample : sample scripts that we used for preprocessing, training and decoding. We used mostly the same settings for all translation directions,
with small differences in vocabulary size. Dropout was enabled for EN<->RO, but disabled otherwise.


- r2l : scripts for reranking the output of the (default) left-to-right decoder with a model that decodes from right-to-left.


LICENSE
-------

The scripts are available under the MIT License.

PUBLICATIONS
------------

The Edinburgh Neural MT submission to WMT 2016 is described in:

TBD

It is based on work described in the following publications:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015):
Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR).

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Neural Machine Translation of Rare Words with Subword Units. arXiv preprint.

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Improving Neural Machine Translation Models with Monolingual Data. arXiv preprint.
15 changes: 15 additions & 0 deletions preprocess/normalise-romanian.py
@@ -0,0 +1,15 @@
#!/usr/bin/env python3

#
# Normalise Romanian s-comma and t-comma
#
# author: Barry Haddow

import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    # Map cedilla variants to the correct comma-below forms:
    # U+015E/U+015F (S/s with cedilla) -> U+0218/U+0219 (S/s with comma below)
    line = line.replace("\u015e", "\u0218").replace("\u015f", "\u0219")
    # U+0162/U+0163 (T/t with cedilla) -> U+021A/U+021B (T/t with comma below)
    line = line.replace("\u0162", "\u021a").replace("\u0163", "\u021b")
    print(line, end="")
18 changes: 18 additions & 0 deletions preprocess/remove-diacritics.py
@@ -0,0 +1,18 @@
#!/usr/bin/env python3

#
# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised
#
# author: Barry Haddow

import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s")  # S/s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t")  # T/t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a")  # A/a-breve
    line = line.replace("\u00C2", "A").replace("\u00E2", "a")  # A/a-circumflex
    line = line.replace("\u00CE", "I").replace("\u00EE", "i")  # I/i-circumflex
    print(line, end="")
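For a quick sanity check, here is the combined effect of the two scripts on a sample sentence (the replacements are copied from the scripts above; the sentence itself is just an invented example):

```python
#!/usr/bin/env python3
# Sample sentence containing cedilla variants and diacritics.
sample = "Şi aceşti paşi ţin de fiecare ţară."

# Step 1: normalise cedilla forms to comma-below forms (normalise-romanian.py).
normalised = (sample.replace("\u015e", "\u0218").replace("\u015f", "\u0219")
                    .replace("\u0162", "\u021a").replace("\u0163", "\u021b"))
print(normalised)  # Și acești pași țin de fiecare țară.

# Step 2: strip the diacritics entirely (remove-diacritics.py).
stripped = (normalised.replace("\u0218", "S").replace("\u0219", "s")
                      .replace("\u021a", "T").replace("\u021b", "t")
                      .replace("\u0102", "A").replace("\u0103", "a")
                      .replace("\u00c2", "A").replace("\u00e2", "a")
                      .replace("\u00ce", "I").replace("\u00ee", "i"))
print(stripped)    # Si acesti pasi tin de fiecare tara.
```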
22 changes: 22 additions & 0 deletions r2l/README.md
@@ -0,0 +1,22 @@
RERANKING WITH RIGHT-TO-LEFT MODELS (R2L)
-----------------------------------------

For English<->German and English->Czech, we trained separate models with reversed target order, and used those models for reranking.

To use reranking with reversed (right-to-left) models, do the following:

1. Use reverse.py to reverse the word order on the target side of the training/dev sets.
   Use the same vocabulary, and apply reverse.py *after* truecasing/BPE, to simplify the mapping from l2r to r2l and back.

2. Train a separate model (or ensemble) with the reversed target side; we call this the r2l model.

3. At test time, produce an n-best list with the l2r model(s):

time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/translate.py -m model.npz -i test.bpe.de -o test.output.50best -k 50 -n -p 1 --n-best

4. Reverse the outputs in the n-best list, and re-score with the r2l model(s); a toy illustration of the reranking arithmetic follows these commands.

python reverse_nbest.py < test.output.50best > test.output.50best.reversed

time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/rescore.py -m /path/to/r2l_model/model.npz -s test.bpe.de -i test.output.50best.reversed -o test.output.50best.rescored -b 80 -n
python rerank.py < test.output.50best.rescored | python reverse.py > test.output.reranked
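For clarity, here is a toy illustration of what rerank.py computes on such a rescored n-best list (all data below is fake; in the real pipeline the hypotheses at this point are still in reversed order, which is why the output is piped through reverse.py):

```python
# Fake rescored n-best entries: sentence number ||| hypothesis ||| scores.
# Each hypothesis carries two negative log-probability scores (l2r + r2l),
# so lower sums are better; note the hypotheses are still reversed here.
nbest = [
    "0 ||| Haus ein ||| 1.2 0.9",
    "0 ||| Haus das ||| 0.8 0.7",
]
best = min(nbest, key=lambda l: sum(map(float, l.split(' ||| ')[2].split())))
print(best.split(' ||| ')[1])  # Haus das  (becomes "das Haus" after reverse.py)
```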
44 changes: 44 additions & 0 deletions r2l/rerank.py
@@ -0,0 +1,44 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich

import sys

if __name__ == '__main__':

    # optional argument: only consider the k best hypotheses per sentence
    if len(sys.argv) > 1:
        k = int(sys.argv[1])
    else:
        k = float('inf')

    cur = 0
    best_score = float('inf')
    best_sent = ''
    idx = 0
    for line in sys.stdin:
        # n-best format: sentence number ||| hypothesis ||| score(s)
        num, sent, scores = line.split(' ||| ')

        # new input sentence: print best translation of previous sentence, and reset stats
        if int(num) > cur:
            print(best_sent)
            cur = int(num)
            best_score = float('inf')
            best_sent = ''
            idx = 0

        # only consider the k best hypotheses
        if idx >= k:
            continue

        # scores are negative log-probabilities: lower is better
        score = sum(map(float, scores.split()))
        if score < best_score:
            best_score = score
            best_sent = sent.strip()

        idx += 1

    # end of file: print best translation of last sentence
    print(best_sent)
7 changes: 7 additions & 0 deletions r2l/reverse.py
@@ -0,0 +1,7 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    # reverse the token order of each line, preserving the trailing newline
    sys.stdout.write(' '.join(reversed(line.split())) + '\n')
9 changes: 9 additions & 0 deletions r2l/reverse_nbest.py
@@ -0,0 +1,9 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

for line in sys.stdin:
    # n-best format: sentence number ||| hypothesis ||| score(s)
    linesplit = line.split(' ||| ')
    # reverse only the hypothesis field
    linesplit[1] = ' '.join(reversed(linesplit[1].split()))
    sys.stdout.write(' ||| '.join(linesplit))
27 changes: 27 additions & 0 deletions sample/README.md
@@ -0,0 +1,27 @@
This directory contains some sample files and configuration scripts for training a simple neural MT model.


INSTRUCTIONS
------------

All scripts contain variables that you will need to set before running them.
For processing the sample data, only the paths to the different toolkits need to be set.
For processing new data, more changes will be necessary.

As a first step, preprocess the training data:

./preprocess.sh

Then, start training: on normal-size data sets, this will take about 1-2 weeks to converge.
Models are saved regularly, and you may want to interrupt this process without waiting for it to finish.

./train.sh

Given a model, preprocessed text can be translated as follows:

./translate.sh

Finally, you may want to post-process the translation output, namely merging BPE segments,
detruecasing and detokenizing; a minimal sketch of the BPE-merging step follows the command.

./postprocess-test.sh < data/newsdev2016.output > data/newsdev2016.postprocessed
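The BPE-merging part of the post-processing is simply the removal of the "@@ " continuation markers that subword-nmt inserts (the German sample line below is invented):

```python
# subword-nmt marks non-final subword units with "@@ "; deleting the marker
# restores full words (detruecasing and detokenization not shown).
line = "die Gesund@@ heits@@ reform wurde verab@@ schiedet"
print(line.replace("@@ ", ""))  # die Gesundheitsreform wurde verabschiedet
```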
45 changes: 45 additions & 0 deletions sample/config.py
@@ -0,0 +1,45 @@
import numpy
import os
import sys

sys.path.append('/path/to/nematus/nmt')

from nmt import train

VOCAB_SIZE = 90000
SRC = "ro"
TGT = "en"
DATA_DIR = "data/"



if __name__ == '__main__':
    validerr = train(saveto='model/model.npz',
                     reload_=True,
                     dim_word=500,
                     dim=1024,
                     n_words=VOCAB_SIZE,
                     n_words_src=VOCAB_SIZE,
                     decay_c=0.,
                     clip_c=1.,
                     lrate=0.0001,
                     optimizer='adadelta',
                     maxlen=50,
                     batch_size=80,
                     valid_batch_size=80,
                     datasets=[DATA_DIR + '/corpus.bpe.' + SRC, DATA_DIR + '/corpus.bpe.' + TGT],
                     valid_datasets=[DATA_DIR + '/newsdev2016.bpe.' + SRC, DATA_DIR + '/newsdev2016.bpe.' + TGT],
                     dictionaries=[DATA_DIR + '/corpus.bpe.' + SRC + '.json', DATA_DIR + '/corpus.bpe.' + TGT + '.json'],
                     validFreq=10000,
                     dispFreq=1000,
                     saveFreq=30000,
                     sampleFreq=10000,
                     use_dropout=False,  # the dropout_* settings below only take effect if True
                     dropout_embedding=0.2,  # dropout for input embeddings (0: no dropout)
                     dropout_hidden=0.2,  # dropout for hidden layers (0: no dropout)
                     dropout_source=0.1,  # dropout source words (0: no dropout)
                     dropout_target=0.1,  # dropout target words (0: no dropout)
                     overwrite=False,
                     external_validation_script='validate.sh')
    print(validerr)
3 changes: 3 additions & 0 deletions sample/data/.gitignore
@@ -0,0 +1,3 @@
*.tok.*
*.tc.*
*.bpe.*