Add Recipe to train a cometkiwi-like encoder model (which can be used to score sentence pairs) (#53)

* add cometkiwi recipe
vince62s authored Jul 4, 2024
1 parent b770072 commit e6ef35e
Showing 7 changed files with 1,316 additions and 0 deletions.
recipes/cometkiwi/README.md (88 additions, 0 deletions)

# CometKiwi

---
**NOTE**

This is NOT an exact replication of the Unbabel CometKiwi.

What is common:
* We use the same base models: the xlm-roberta-xl or xlm-roberta-xxl encoders.
* We use the same dataset (1720-da.mlqe).

You can get the dataset used to train the model below here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

What is different:
* wmt23-cometkiwi-da-xl (or xxl) uses layerwise attention, which adds complexity without significantly better accuracy, so we do not use it.
* We use GELU instead of Tanh in the Estimator.

Our scores for XL and XXL are in the same range, whereas they differ markedly between Unbabel/wmt23-cometkiwi-XL and XXL.


To make your life easier, run these commands from the recipe directory (here `recipes/cometkiwi`).
---

## Retrieve and convert model

### Set environment variables

```
export EOLE_MODEL_DIR=<where_to_store_models>
```

### Download and convert the base model

```
eole convert HF --model_dir facebook/xlm-roberta-xxl --output $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
```
**NOTE**
The original Facebook model is stored in FP32, but we convert it to FP16 at conversion time.

XXL is a 10.7B-parameter model, hence it will save a 21.4GB file on disk (safetensors format, 2 bytes per FP16 parameter).
XL is a 3.5B-parameter model, hence it will save a 7.0GB file on disk.

After conversion, the output directory (`$EOLE_MODEL_DIR/xlm-roberta-xxl-eole` above) will contain the following files:
* config.json
* model.00.safetensors
* sentencepiece.bpe.model
* vocab.json
* vocab.txt
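
To sanity-check the conversion, you can list the output directory and check the size of the safetensors file (a minimal sketch; the path simply mirrors the `--output` value used above):

```
ls $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
# roughly 21.4GB for XXL (7.0GB for XL) in FP16
du -h $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/model.00.safetensors
```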

The vocab.txt file contains 250000 entries (from SentencePiece), but the model was trained with extra tokens.
You need to concatenate the `added_vocab.txt` file to the `vocab.txt` file, resulting in 250880 tokens, as sketched below.
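
A minimal sketch of the concatenation, assuming `added_vocab.txt` is available in the recipe directory (or from the HuggingFace repository above) and that the converted model lives in `$EOLE_MODEL_DIR/xlm-roberta-xxl-eole`:

```
# append the extra tokens to the converted vocab (paths are assumptions, adjust to your setup)
cat added_vocab.txt >> $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
# if vocab.txt holds one token per line, this should now report 250880
wc -l $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
```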


## Training CometKiwi

Training happens in two steps (see the `cometkiwi-xxl-eole.yaml` file).
FYI, the trained model can be downloaded here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

STEP 1:
We train from the converted xlm-roberta-xxl model, but we keep everything frozen.
We just add an Estimator layer, which is trained for 4000 steps.
To make sure we do not lose anything, we then rename the pre-trained checkpoint subfolder to `step_4000_pretrain` (see the sketch below).
In this step we do NOT use LoRA, but we do use 4-bit quantization to make things easier and fit in smaller cards.
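
A minimal sketch of that rename, assuming the step-1 run writes its checkpoint to `./cometkiwi-xxl-eole/step_4000` (check the model path in your yaml for the actual location):

```
# keep the estimator-only checkpoint under a separate name before starting step 2
mv cometkiwi-xxl-eole/step_4000 cometkiwi-xxl-eole/step_4000_pretrain
```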

STEP 2:
We release the encoder to make its weights trainable; hence we need to use LoRA, since the model is big and full gradients would not fit in consumer-grade cards.
We train for 40000 steps.

For the two steps above, the training command is the same, but make sure the yaml file is modified according to the instructions for each step:
`eole train --config cometkiwi-xxl-eole.yaml`

After step 2, we need to merge the LoRA weights into the original model with the following command:

`eole model lora --base_model "./cometkiwi-xxl-eole/step_4000_pretrain/" --lora_weights "./cometkiwi-xxl-eole/" --output "./cometkiwi-xxl-eole/merged"`


## CometKiwi Inference

Format the source / target files you want to score into the Comet format:

`./combine.sh newstest2014.en newstest2014.de newstest2014.comb`

Score the .comb file:

`eole predict --config cometkiwi-xxl-inference.yaml --src newstest2014.comb --output newstest2014.scores --with_score`

For now, the scores are in the third column, so you can `cut -f3` the output file.
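
For example (illustrative file names, assuming tab-separated output):

```
# extract the segment-level scores from the third column
cut -f3 newstest2014.scores > newstest2014.seg_scores
# average them to get a single system-level score
awk '{ s += $1 } END { print s / NR }' newstest2014.seg_scores
```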
