Skip to content

[PlanTL/medicine/neural machine translation/translation models] Files needed to use the Neural Machine Translation system for the Biomedical Domain.

License

Notifications You must be signed in to change notification settings

PlanTL-GOB-ES/Medical-Translator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 

Repository files navigation

Neural Machine Translation for the Biomedical Domain

Introduction

This package contains the files needed to use the Neural Machine Translation (NMT) system for the Biomedical Domain (WMT18 shared task). For the WMT19, see: https://github.com/PlanTL-SANIDAD/Medical-Translator-WMT19.

The available language directions for translation are:

  • English to Spanish
  • Spanish to English
  • English to Portuguese
  • Portuguese to English
  • Spanish to Portuguese
  • Portuguese to Spanish

Translation models must be downloaded from the Zenodo repository: https://doi.org/10.5281/zenodo.2204995

We used BPE (https://github.com/rsennrich/subword-nmt) and LSTM-based seq2seq architectures for all models.

Prerequisites

This package uses the Torch implementation of the OpenNMT system (http://opennmt.net/). Instructions to install the system are found in: http://opennmt.net/OpenNMT/installation/

Directory structure

Tokenize.sh 	 - Utility script to tokenize the input file using BPE (needed for translation)

In Zenodo you will find the following files:

enes_pt.bpe32000 	 - BPE encoding where source language are either EN/ES and target is PT
enpt_es.bpe32000 	 - BPE encoding where source language are either EN/PT and target is ES
espt_en.bpe32000 	 - BPE encoding where source language are either ES/PT and target is EN
onmt_enes_pt-4-1000-600_epoch11_62.74_release.t7 - OpenNMT model in release format (EN/ES) -> PT
onmt_enpt_es-4-1000-600_epoch11_60.38_release.t7 - OpenNMT model in release format (EN/PT) -> ES
onmt_espt_en-4-1000-600_epoch6_51.52_release.t7  - OpenNMT model in release format (ES/PT) -> EN
onmt_enes_pt-4-1000-600_epoch11_62.74.t7  - OpenNMT model in original format (EN/ES) -> PT
onmt_enpt_es-4-1000-600_epoch11_60.38_.t7 - OpenNMT model in original format (EN/PT) -> ES
onmt_espt_en-4-1000-600_epoch6_51.52.t7   - OpenNMT model in original format (ES/PT) -> EN

Usage

Tokenize.sh [options]

Options:

-d : Data directory where the BPE models and translation models are stored
-s : Language of the source file. Valid options are: en, es, or pt
-t : Language of the target file. Valid options are: en, es, or pt
-f : Path to the file that will be translated
-n : Number of parallel threads for tokenization
-o : Path to the OpenNMT installation directory

OpenNMT models:

  • Release format: Can be translated using GPU or CPU, cannot be modified or retrained
  • Original format: Can be translated only with GPU, can be modified or retrained using Torch OpenNMT

Examples

$ .Tokenize.sh -d /home/user/data -s en -t es -f /home/user/text.txt -n 4 -o /home/user/OpenNMT
$ cd /home/user/OpenNMT
$ th translate.lua -model /home/user/data/onmt_enpt_es-4-1000-600_epoch11_60.38_release.t7 -gpuid 1 -src /home/user/text.txt.tok -replace_unk true -detokenize_output true -output /home/user/text.translated

Contact

Felipe Soares ([email protected])

License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2018 Secretaría de Estado para el Avance Digital (SEAD)

About

[PlanTL/medicine/neural machine translation/translation models] Files needed to use the Neural Machine Translation system for the Biomedical Domain.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages