Commit dd85e68 (0 parents): 22 changed files with 24,423 additions and 0 deletions.
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 University of Edinburgh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -0,0 +1,52 @@
Scripts for Edinburgh Neural MT systems for WMT 16
==================================================

This repository contains scripts and an example config used for the Edinburgh Neural MT submission (UEDIN-NMT)
for the shared translation task at the 2016 Workshops on Statistical Machine Translation (http://www.statmt.org/wmt16/).

The scripts will facilitate the reproduction of our results, and serve as additional documentation (along with the system description paper).

OVERVIEW
--------

- We built translation models with Nematus ( https://www.github.com/rsennrich/nematus )
- We used BPE for subword segmentation to achieve open-vocabulary translation ( https://github.com/rsennrich/subword-nmt ); a short illustration follows this list
- We automatically back-translated in-domain monolingual data into the source language to create additional training data. The data is publicly available here: http://statmt.org/rsennrich/wmt16_backtranslations/
- More details about our system will appear in the (upcoming) system description paper
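
As a small illustration of the BPE step (the word and its segmentation below are invented for illustration, not taken from our data): subword-nmt splits rare words into more frequent subword units and marks non-final units with the default "@@" separator, so a word such as "thermoelectric" might appear in the training data as "thermo@@ electric". Translation is carried out over these units, and the separators are merged away again during post-processing (see the sample directory).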

SCRIPTS
-------

- preprocessing : preprocessing scripts for Romanian that we found helpful for translation quality.
  We used the Moses tokenizer and truecaser for all language pairs.

- sample : sample scripts that we used for preprocessing, training and decoding. We used mostly the same settings for all translation directions,
  with small differences in vocabulary size. Dropout was enabled for EN<->RO, but disabled otherwise.

- r2l : scripts for reranking the output of the (default) left-to-right decoder with a model that decodes from right to left.

LICENSE
-------

The scripts are available under the MIT License.

PUBLICATIONS
------------

The Edinburgh Neural MT submission to WMT 2016 is described in:

TBD

It is based on work described in the following publications:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015):
Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the International Conference on Learning Representations (ICLR).

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Neural Machine Translation of Rare Words with Subword Units. arXiv preprint.

Rico Sennrich, Barry Haddow, Alexandra Birch (2015):
Improving Neural Machine Translation Models with Monolingual Data. arXiv preprint.

@@ -0,0 +1,15 @@
#!/usr/bin/env python3

#
# Normalise Romanian s-comma and t-comma
#
# author: Barry Haddow

import io
import sys

istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    # S/s with cedilla -> S/s with comma below
    line = line.replace("\u015e", "\u0218").replace("\u015f", "\u0219")
    # T/t with cedilla -> T/t with comma below
    line = line.replace("\u0162", "\u021a").replace("\u0163", "\u021b")
    print(line, end="")

@@ -0,0 +1,18 @@
#!/usr/bin/env python3

#
# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised
#
# author: Barry Haddow

import io
import sys

istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')

for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s") # s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t") # t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a") # a-breve
    line = line.replace("\u00C2", "A").replace("\u00E2", "a") # a-circumflex
    line = line.replace("\u00CE", "I").replace("\u00EE", "i") # i-circumflex
    print(line, end="")

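A typical way to chain the two Romanian scripts above (a sketch only: the file names are illustrative, and we assume the scripts are saved as normalise-romanian.py and remove-diacritics.py; both read UTF-8 text on stdin and write to stdout):

  ./normalise-romanian.py < corpus.ro | ./remove-diacritics.py > corpus.nodiac.ro
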
@@ -0,0 +1,22 @@
RERANKING WITH RIGHT-TO-LEFT MODELS (R2L)
-----------------------------------------

For English<->German and English->Czech, we trained separate models with reversed target order, and used those models for reranking.

To use reranking with reversed (right-to-left) models, do the following:

1. Use reverse.py to reverse the word order on the target side of the training/dev set.
   Use the same vocabulary, and apply reverse.py *after* truecasing/BPE, to simplify the mapping from l2r to r2l and back.

2. Train a separate model (or ensemble) with the reversed target side. We will call this the r2l model.

3. At test time, produce an n-best list with the l2r model(s):

   time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/translate.py -m model.npz -i test.bpe.de -o test.output.50best -k 50 -n -p 1 --n-best

4. Reverse the outputs in the n-best list, re-score with the r2l model(s), then pick the best hypothesis and restore the original left-to-right word order:

   python reverse_nbest.py < test.output.50best > test.output.50best.reversed

   time THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /path/to/nematus/nmt/rescore.py -m /path/to/r2l_model/model.npz -s test.bpe.de -i test.output.50best.reversed -o test.output.50best.rescored -b 80 -n

   python rerank.py < test.output.50best.rescored | python reverse.py > test.output.reranked

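For reference, reverse_nbest.py and rerank.py below operate on Moses-style n-best lines with ' ||| '-separated fields: sentence number, hypothesis, and one or more model scores. The two lines below are an invented illustration of the format only; the tokens and scores are made up:

  0 ||| das ist ein kleines Haus ||| 4.21
  0 ||| dies ist ein kleines Haus ||| 4.56
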
@@ -0,0 +1,44 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Author: Rico Sennrich

import sys
from collections import defaultdict

if __name__ == '__main__':

    if len(sys.argv) > 1:
        k = int(sys.argv[1])
    else:
        k = float('inf')

    cur = 0
    best_score = float('inf')
    best_sent = ''
    idx = 0
    for line in sys.stdin:
        num, sent, scores = line.split(' ||| ')

        # new input sentence: print best translation of previous sentence, and reset stats
        if int(num) > cur:
            print best_sent
            #print best_score
            cur = int(num)
            best_score = float('inf')
            best_sent = ''
            idx = 0

        # only consider k-best hypotheses
        if idx >= k:
            continue

        score = sum(map(float, scores.split()))
        if score < best_score:
            best_score = score
            best_sent = sent.strip()

        idx += 1

    # end of file; print best translation of last sentence
    print best_sent
    # print best_score

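Note on usage: rerank.py reads the rescored n-best list from stdin and writes one hypothesis per input sentence, keeping the hypothesis with the lowest summed score (i.e. scores are treated as costs). The optional integer argument restricts reranking to the first k hypotheses per sentence, e.g. (file name is illustrative):

  python rerank.py 20 < test.output.50best.rescored
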
@@ -0,0 +1,7 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

# reverse the token order of each input line
for line in sys.stdin:
    sys.stdout.write(' '.join(reversed(line.split())) + '\n')

@@ -0,0 +1,9 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

# reverse the token order of the hypothesis field in a Moses-style n-best line
for line in sys.stdin:
    linesplit = line.split(' ||| ')
    linesplit[1] = ' '.join(reversed(linesplit[1].split()))
    sys.stdout.write(' ||| '.join(linesplit))

@@ -0,0 +1,27 @@
This directory contains some sample files and configuration scripts for training a simple neural MT model.

INSTRUCTIONS
------------

All scripts contain variables that you will need to set before running them.
For processing the sample data, only the paths to the different toolkits need to be set.
For processing new data, more changes will be necessary.

As a first step, preprocess the training data:

  ./preprocess.sh

Then, start training: on normal-size data sets, this will take about 1-2 weeks to converge.
Models are saved regularly, and you may want to interrupt this process without waiting for it to finish.

  ./train.sh

Given a model, preprocessed text can be translated as follows:

  ./translate.sh

Finally, you may want to post-process the translation output, namely merge BPE segments,
detruecase and detokenize:

  ./postprocess-test.sh < data/newsdev2016.output > data/newsdev2016.postprocessed

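As a rough sketch of what such post-processing involves (this is not the contents of postprocess-test.sh, only an illustration assuming a standard Moses checkout at /path/to/mosesdecoder): merging BPE segments amounts to deleting the "@@ " separators, after which the Moses detruecaser and detokenizer can be applied:

  sed 's/@@ //g' < data/newsdev2016.output \
    | /path/to/mosesdecoder/scripts/recaser/detruecase.perl \
    | /path/to/mosesdecoder/scripts/tokenizer/detokenizer.perl -l en \
    > data/newsdev2016.postprocessed
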
@@ -0,0 +1,45 @@
import numpy
import os
import sys

sys.path.append('/path/to/nematus/nmt')

from nmt import train

VOCAB_SIZE = 90000
SRC = "ro"
TGT = "en"
DATA_DIR = "data/"

if __name__ == '__main__':
    validerr = train(saveto='model/model.npz',
                     reload_=True,
                     dim_word=500,
                     dim=1024,
                     n_words=VOCAB_SIZE,
                     n_words_src=VOCAB_SIZE,
                     decay_c=0.,
                     clip_c=1.,
                     lrate=0.0001,
                     optimizer='adadelta',
                     maxlen=50,
                     batch_size=80,
                     valid_batch_size=80,
                     datasets=[DATA_DIR + '/corpus.bpe.' + SRC, DATA_DIR + '/corpus.bpe.' + TGT],
                     valid_datasets=[DATA_DIR + '/newsdev2016.bpe.' + SRC, DATA_DIR + '/newsdev2016.bpe.' + TGT],
                     dictionaries=[DATA_DIR + '/corpus.bpe.' + SRC + '.json', DATA_DIR + '/corpus.bpe.' + TGT + '.json'],
                     validFreq=10000,
                     dispFreq=1000,
                     saveFreq=30000,
                     sampleFreq=10000,
                     use_dropout=False,
                     dropout_embedding=0.2, # dropout for input embeddings (0: no dropout)
                     dropout_hidden=0.2, # dropout for hidden layers (0: no dropout)
                     dropout_source=0.1, # dropout source words (0: no dropout)
                     dropout_target=0.1, # dropout target words (0: no dropout)
                     overwrite=False,
                     external_validation_script='validate.sh')
    print validerr

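Note on the dropout settings above: the top-level README states that dropout was enabled for EN<->RO but disabled otherwise, while this sample config sets use_dropout=False; to reproduce the EN<->RO setting one would presumably set use_dropout=True, since the dropout_* rates appear to take effect only when dropout is enabled.
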
@@ -0,0 +1,3 @@
*.tok.*
*.tc.*
*.bpe.*