Add Recipe to train a cometkiwi-like encoder model (which can be used to score sentence pairs) (#53)

# CometKiwi

---
**NOTE**

This is NOT an exact replication of Unbabel's CometKiwi.

What is common:
* We use the same base model: the xlm-roberta-xl or xlm-roberta-xxl encoder.
* We use the same dataset (1720-da.mlqe).

You can get the dataset used to train the model below here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

What is different:
* wmt23-cometkiwi-da-xl (and xxl) use layerwise attention, which adds complexity without significantly better accuracy.
* We use GELU instead of Tanh in the Estimator.

Our scores for XL and XXL are in the same range, whereas they differ widely between Unbabel/wmt23-cometkiwi-XL and XXL.

To make your life easier, run these commands from the recipe directory (here `recipes/cometkiwi`).
---

## Retrieve and convert model

### Set environment variables

```
export EOLE_MODEL_DIR=<where_to_store_models>
```

### Download and convert the base model

```
eole convert HF --model_dir facebook/xlm-roberta-xxl --output $EOLE_MODEL_DIR/xlm-roberta-xxl-eole
```

**NOTE**
The original Facebook model is stored in FP32, but we convert it to FP16 at conversion time.

XXL is a 10.7B-parameter model, hence it will save a 21.4GB file on disk (safetensors format).
XL is a 3.5B-parameter model, hence it will save a 7.0GB file on disk.

After conversion, you will get the following files in the output directory under `$EOLE_MODEL_DIR`:
* config.json
* model.00.safetensors
* sentencepiece.bpe.model
* vocab.json
* vocab.txt

The vocab.txt file contains 250,000 entries (from sentencepiece), but the model was trained with extra tokens.
You need to concatenate the `added_vocab.txt` file to the `vocab.txt` file, resulting in 250,880 tokens.
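
For example, assuming `added_vocab.txt` sits in your working directory and the converted model lives in `$EOLE_MODEL_DIR/xlm-roberta-xxl-eole` (adjust both paths to your setup), the concatenation can be done with:

```
# Append the extra tokens to the end of the converted model's vocab file
cat added_vocab.txt >> $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt
```

As a quick sanity check, `wc -l $EOLE_MODEL_DIR/xlm-roberta-xxl-eole/vocab.txt` should now report 250880 lines (one entry per line).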

## Training cometkiwi

Training happens in two steps (see the cometkiwi-xxl-eole.yaml file).
FYI, the trained model can be downloaded here: https://huggingface.co/eole-nlp/cometkiwi-xxl-eole/

STEP 1:
We train from the converted xlm-roberta-xxl model, but we keep everything frozen.
We just add an Estimator layer that will be trained for 4000 steps.
To make sure we do not lose anything, we then rename the pre-trained subfolder to `step_4000_pretrain` (see the sketch below).
In this step we do NOT use LoRA, but we use 4-bit quantization to make things easier and fit on smaller cards.
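
A minimal sketch of that rename, assuming the step-1 run writes its checkpoint to a `step_4000` subfolder under the model directory configured in the yaml (adjust to your actual output path):

```
# Keep the step-1 checkpoint around under a separate name before starting step 2
mv ./cometkiwi-xxl-eole/step_4000 ./cometkiwi-xxl-eole/step_4000_pretrain
```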

STEP 2:
We release the encoder to make its weights trainable; hence we need to use LoRA, since the model is big and the gradients would not fit on consumer-grade cards.
We train for 40000 steps.

For the two steps above, the training command is the same, but make sure the yaml file is modified according to the instructions:
`eole train --config cometkiwi-xxl-eole.yaml`

After this step, we need to merge the LoRA weights into the original model with the following command:

`eole model lora --base_model "./cometkiwi-xxl-eole/step_4000_pretrain/" --lora_weights "./cometkiwi-xxl-eole/" --output "./cometkiwi-xxl-eole/merged"`

## Cometkiwi Inference

Format the source / target files you want to score into the Comet format:

`./combine.sh newstest2014.en newstest2014.de newstest2014.comb`
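
If you are curious what such a combining step looks like, here is a purely illustrative sketch that pairs the two files line by line; the recipe's own `combine.sh` defines the actual Comet input format, so use it rather than this stand-in:

```
# Hypothetical stand-in for combine.sh: one tab-separated source/target pair per line
# (paste uses a tab delimiter by default). The real combine.sh is authoritative.
paste newstest2014.en newstest2014.de > newstest2014.comb
```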

Score the .comb file:

`eole predict --config cometkiwi-xxl-inference.yaml --src newstest2014.comb --output newstest2014.scores --with_score`

For now the scores are in the third column, so you can `cut -f3` the output file.
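
For example, to pull the per-segment scores into their own file (the output filename `newstest2014.kiwi` is just an arbitrary choice):

```
# Keep only the third (score) column of the tab-separated prediction output
cut -f3 newstest2014.scores > newstest2014.kiwi
```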