Add Recipe to train a cometkiwi-like encoder model (which can be used to score sentence pairs) (#53)

* add cometkiwi recipe
vince62s authored Jul 4, 2024
1 parent b770072 commit e6ef35e
Showing 7 changed files with 1,316 additions and 0 deletions.
recipes/cometkiwi/README.md (88 additions, 0 deletions)

# CometKiwi

---
**NOTE**

This is NOT an exact replication of the Unbabel CometKiwi.

What is common:
* We use the same base models: the xlm-roberta-xl or xlm-roberta-xxl encoders.
* We use the same dataset (1720-da.mlqe).

You can get the dataset used to train the model below here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

What is different:
* wmt23-cometkiwi-da-xl (or xxl) uses layerwise attention, which adds complexity without significantly better accuracy, so we do not use it.
* We use GELU instead of Tanh in the Estimator.

Our scores for XL and XXL are in the same range, whereas they differ markedly between Unbabel/wmt23-cometkiwi-XL and XXL.


To make your life easier, run these commands from the recipe directory (here `recipes/cometkiwi`).
---

## Retrieve and convert model

### Set environment variables

```
export EOLE_MODEL_DIR=<where_to_store_models>
```

### Download and convert the base model

```
eole convert HF --model_dir facebook/xlm-roberta-xxl --output $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
```
**NOTE**
The original Facebook model is stored in FP32, but we convert it to FP16 at conversion time.

XXL is a 10.7B-parameter model, hence it will save a 21.4GB file on disk (safetensors format, 2 bytes per FP16 parameter).
XL is a 3.5B-parameter model, hence it will save a 7.0GB file on disk.

After conversion, the output directory (`$EOLE_MODEL_DIR/xlm-roberta-xxl-eole` above) will contain the following files:
* config.json
* model.00.safetensors
* sentencepiece.bpe.model
* vocab.json
* vocab.txt
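
To sanity-check the conversion, you can list the output directory and check the size of the safetensors file (a minimal sketch; the path simply mirrors the `--output` value used above):

```
ls $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
# roughly 21.4GB for XXL (7.0GB for XL) in FP16
du -h $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/model.00.safetensors
```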

The vocab.txt file contains 250000 entries (from SentencePiece), but the model was trained with extra tokens.
You need to concatenate the `added_vocab.txt` file to the `vocab.txt` file, resulting in 250880 tokens, as sketched below.
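
A minimal sketch of the concatenation, assuming `added_vocab.txt` is available in the recipe directory (or from the HuggingFace repository above) and that the converted model lives in `$EOLE_MODEL_DIR/xlm-roberta-xxl-eole`:

```
# append the extra tokens to the converted vocab (paths are assumptions, adjust to your setup)
cat added_vocab.txt >> $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
# if vocab.txt holds one token per line, this should now report 250880
wc -l $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
```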


## Training CometKiwi

Training happens in two steps (see the `cometkiwi-xxl-eole.yaml` file).
FYI, the trained model can be downloaded here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

STEP 1:
We train from the converted xlm-roberta-xxl model, but we keep everything frozen.
We just add an Estimator layer, which is trained for 4000 steps.
To make sure we do not lose anything, we then rename the pre-trained checkpoint subfolder to `step_4000_pretrain` (see the sketch below).
In this step we do NOT use LoRA, but we do use 4-bit quantization to make things easier and fit in smaller cards.
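
A minimal sketch of that rename, assuming the step-1 run writes its checkpoint to `./cometkiwi-xxl-eole/step_4000` (check the model path in your yaml for the actual location):

```
# keep the estimator-only checkpoint under a separate name before starting step 2
mv cometkiwi-xxl-eole/step_4000 cometkiwi-xxl-eole/step_4000_pretrain
```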

STEP 2:
We release the encoder to make its weights trainable; hence we need to use LoRA, since the model is big and full gradients would not fit in consumer-grade cards.
We train for 40000 steps.

For the two steps above, the training command is the same, but make sure the yaml file is modified according to the instructions for each step:
`eole train --config cometkiwi-xxl-eole.yaml`

After step 2, we need to merge the LoRA weights into the original model with the following command:

`eole model lora --base_model "./cometkiwi-xxl-eole/step_4000_pretrain/" --lora_weights "./cometkiwi-xxl-eole/" --output "./cometkiwi-xxl-eole/merged"`


## CometKiwi Inference

Format the source / target files you want to score into the Comet format:

`./combine.sh newstest2014.en newstest2014.de newstest2014.comb`

Score the .comb file:

`eole predict --config cometkiwi-xxl-inference.yaml --src newstest2014.comb --output newstest2014.scores --with_score`

For now, the scores are in the third column, so you can `cut -f3` the output file.
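
For example (illustrative file names, assuming tab-separated output):

```
# extract the segment-level scores from the third column
cut -f3 newstest2014.scores > newstest2014.seg_scores
# average them to get a single system-level score
awk '{ s += $1 } END { print s / NR }' newstest2014.seg_scores
```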
