The Tatoeba Challenge - Development Notes

Prerequisites

Required software:

Optional software:

  • terashuf: efficiently shuffle massive data sets
  • pigz: multithreaded gzip
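
A quick way to see which of the optional tools are available is to look them up on the PATH. The snippet below is a minimal sketch, assuming the executables are installed under the names terashuf and pigz; check_tool is a helper name introduced here for illustration and is not part of the repository.

```shell
# check_tool is a hypothetical helper: report whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

# Assumes the optional tools are installed under these executable names.
for tool in terashuf pigz; do
  check_tool "$tool"
done
```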

Data:

  • local copy of all OPUS data (set OPUS_HOME in the Makefile)
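
Before starting a long build it can help to confirm that OPUS_HOME points at an existing directory. The following is a sketch only: /path/to/OPUS is a placeholder, opus_home_ok is a hypothetical helper, and overriding the variable on the make command line assumes the Makefile does not assign OPUS_HOME with the override directive.

```shell
# Placeholder default; replace /path/to/OPUS with the real location.
OPUS_HOME="${OPUS_HOME:-/path/to/OPUS}"

# Hypothetical helper: succeeds only if the given path is an existing directory.
opus_home_ok() {
  [ -d "$1" ]
}

if opus_home_ok "$OPUS_HOME"; then
  echo "using OPUS data in $OPUS_HOME"
  # Assumption: the Makefile assigns OPUS_HOME without 'override', so a
  # value given on the command line takes precedence:
  #   make OPUS_HOME="$OPUS_HOME" all
else
  echo "OPUS_HOME is not a directory: $OPUS_HOME" >&2
fi
```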

Compiling the corpus

  • make sure that the scripts in scripts/ work as they should and that all software is properly installed
  • run make all to compile the entire corpus and the readme files (or, better, run it in parallel, for example with four jobs: make -j 4 all)
  • upload the data to ObjectStorage using a-tools at CSC:
module load allas
allas-conf
make upload

The data set can also be compiled in separate steps, for example the test/dev sets and the training data sets separately:

make -j testdata
make -j traindata
make subsets
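
The separate steps can also be chained so that a failure in one step stops the rest. The sketch below assumes the three targets named above; run_steps and MAKE_CMD are names introduced here for illustration, and MAKE_CMD exists only so the sequence can be dry-run without invoking make.

```shell
# run_steps is a hypothetical wrapper: build each target in order and
# stop at the first failure. MAKE_CMD defaults to plain 'make'.
run_steps() {
  for target in "$@"; do
    # Left unquoted on purpose so a value like "echo make" splits into words.
    ${MAKE_CMD:-make} "$target" || {
      echo "step failed: $target" >&2
      return 1
    }
  done
}

# Dry run that only prints the commands:
#   MAKE_CMD="echo make" run_steps testdata traindata subsets
# Real run with four parallel jobs per step:
#   MAKE_CMD="make -j 4" run_steps testdata traindata subsets
```

In GNU make, -j without a number places no limit on the number of parallel jobs, so giving an explicit job count as in the example may be gentler on shared machines.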

TODO