We use Flair to fine-tune NER models on the AjMC dataset from the HIPE-2022 Shared Task.
All models are fine-tuned on A10 (24GB) instances from Lambda Cloud using Flair:

```bash
$ git clone -b support_byt5 https://github.com/flairNLP/flair.git && cd flair && pip3 install -e .
$ cd
```
Clone this repo for fine-tuning NER models:

```bash
$ git clone https://github.com/stefan-it/hmByT5.git
$ cd hmByT5/bench
```
We use a config-driven hyper-parameter search. The `flair-fine-tuner.py` script can be used to fine-tune NER models from our Model Zoo. All configurations can be found under the `configs/ajmc` folder in this repository.
Example command for a hyper-parameter search with the hmByT5 model on the English part of the AjMC corpus:

```bash
$ python3 flair-fine-tuner.py ./configs-preliminary/ajmc/en/hmbyt5-small-flax-en.json
```
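
For illustration, here is a rough, hedged sketch of what a single fine-tuning run does with Flair (the actual `flair-fine-tuner.py` reads these values from the JSON config; the model name, output path, and some argument names below are illustrative and may differ across Flair versions):

```python
# One hyper-parameter configuration: bs4, 10 epochs, lr 0.00016, "first" pooling.
from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = NER_HIPE_2022(dataset_name="ajmc", language="en")
label_dict = corpus.make_label_dictionary(label_type="ner")

embeddings = TransformerWordEmbeddings(
    model="hmbyt5-preliminary/byt5-small-english",  # example Model Zoo checkpoint
    subtoken_pooling="first",  # the "poolingfirst" part of the configuration name
    fine_tune=True,
)
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)
ModelTrainer(tagger, corpus).fine_tune(
    "resources/taggers/ajmc-en-hmbyt5",  # example output directory
    learning_rate=0.00016,  # "lr0.00016"
    mini_batch_size=4,      # "bs4"
    max_epochs=10,          # "e10"
)
```
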
To get a nice overview of the results (including the best hyper-parameter configuration), just run the log-parsing script:

```bash
$ python3 flair-log-parser.py "hipe2022-flert-fine-tune-ajmc-first-pooling/en-hmbyt5-preliminary/byt5-small-english-bs*"
```
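
The configuration names that appear in the logs and in the result tables below encode the searched hyper-parameters: `bs` is the mini-batch size, `e` the number of epochs, `lr` the learning rate, and `pooling` the subtoken pooling strategy; here we treat the leading `ws` part as a boolean flag of the fine-tuning script. A small hypothetical helper (not part of this repo) to decode them:

```python
import re

# Decode configuration names such as "wsFalse-bs4-e10-lr0.00016-poolingfirst".
PATTERN = re.compile(
    r"ws(?P<ws>True|False)-bs(?P<bs>\d+)-e(?P<epochs>\d+)"
    r"-lr(?P<lr>[\d.]+)-pooling(?P<pooling>\w+)"
)

def parse_config(name: str) -> dict:
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized configuration name: {name}")
    groups = match.groupdict()
    return {
        "ws": groups["ws"] == "True",
        "batch_size": int(groups["bs"]),
        "epochs": int(groups["epochs"]),
        "learning_rate": float(groups["lr"]),
        "pooling": groups["pooling"],
    }

print(parse_config("wsFalse-bs4-e10-lr0.00016-poolingfirst"))
# {'ws': False, 'batch_size': 4, 'epochs': 10, 'learning_rate': 0.00016, 'pooling': 'first'}
```
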
We test our pretrained language models on various datasets from HIPE-2020, HIPE-2022, and Europeana. The tables below give an overview of the datasets used and the performance (averaged F1-score on the development set over 5 runs) for all models.
Detailed results
We evaluated the hmByT5 model that was pretrained on the English corpus for 200k steps:

| Hyper-param Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.80 | 84.78 | 83.74 | 83.35 | 84.37 | 84.01 ± 0.50 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.67 | 82.69 | 83.92 | 84.53 | 82.90 | 83.74 ± 0.82 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.12 | 83.82 | 83.37 | 83.00 | 83.70 | 83.20 ± 0.61 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 83.45 | 82.83 | 84.15 | 81.76 | 83.78 | 83.19 ± 0.84 |

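Throughout these tables, the `Avg.` column reports the mean ± standard deviation over the five runs. A quick check with NumPy (population standard deviation, the `np.std` default) reproduces the first row above:

```python
import numpy as np

# Scores of the five runs for wsFalse-bs4-e10-lr0.00016-poolingfirst
runs = np.array([83.80, 84.78, 83.74, 83.35, 84.37])
print(f"{runs.mean():.2f} ± {runs.std():.2f}")  # 84.01 ± 0.50
```
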
It turns out that the results are not on par with the current SOTA on the English AjMC corpus (see a comparison here). Thus, we continue our experiments with the Hugging Face Transformers JAX/FLAX implementation to pretrain ByT5 models on TPU.
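
As a hedged sketch of the JAX/FLAX side (the actual pretraining uses the span-corruption objective, e.g. via the `run_t5_mlm_flax.py` example script in Transformers; this only shows the model and tokenizer plumbing):

```python
# Loading a ByT5 checkpoint with the FLAX classes from Hugging Face
# Transformers. ByT5 operates on raw UTF-8 bytes, so no learned vocabulary
# is involved; the "tokenizer" merely shifts byte values past the special ids.
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = FlaxT5ForConditionalGeneration.from_pretrained("google/byt5-small")

batch = tokenizer("Ein historischer Text.", return_tensors="np")
print(batch.input_ids.shape)  # one id per byte (plus </s>)
```
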
Results with the Hugging Face Transformers JAX/FLAX implementation are very promising. We evaluated an hmByT5 model that was pretrained on the English corpus for one epoch:

| Hyper-param Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.35 | 84.51 | 85.21 | 87.01 | 87.17 | 85.65 ± 1.21 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.78 | 85.03 | 86.40 | 85.48 | 84.47 | 85.43 ± 0.66 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.70 | 85.41 | 85.85 | 82.94 | 83.64 | 84.51 ± 1.08 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 83.23 | 85.95 | 85.41 | 83.02 | 84.16 | 84.35 ± 1.16 |

We use the previous `hmbyt5-preliminary/byt5-small-english` model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the German corpus for one epoch; a sketch of the learning-rate continuation follows the table:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.37 | 85.75 | 86.7 | 86.26 | 84.62 | 85.74 ± 0.72 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.97 | 85.31 | 85.58 | 84.33 | 85.27 | 85.09 ± 0.43 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.64 | 84.62 | 85.04 | 83.92 | 85.24 | 84.69 ± 0.45 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.5 | 84.62 | 83.41 | 85.21 | 83.55 | 84.26 ± 0.68 |

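A hedged sketch (not the actual pretraining script) of what "last learning rate and no warm-up steps" means for a continuation stage: instead of restarting warm-up plus decay from scratch, the new stage can use a flat schedule starting at the value the previous stage ended with, e.g. in optax:

```python
import optax

last_lr = 1e-3  # illustrative: the learning rate the previous stage ended at
schedule = optax.constant_schedule(last_lr)  # no warm-up for the continuation
optimizer = optax.adamw(learning_rate=schedule)
```
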
We use the previous English+German model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the French corpus for one epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 86.19 | 86.8 | 84.58 | 86.12 | 84.36 | 85.61 ± 0.96 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.12 | 85.75 | 83.9 | 85.17 | 85.11 | 85.21 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.37 | 85.24 | 84.9 | 84.56 | 84.81 | 84.98 ± 0.29 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.17 | 84 | 84.06 | 84.49 | 85.95 | 84.73 ± 0.74 |

We use the previous English+German+French model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the Finnish corpus for one epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.61 | 87.35 | 84.06 | 84.87 | 85.61 | 85.30 ± 1.14 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.4 | 84.51 | 83.86 | 84.66 | 85.48 | 84.98 ± 0.88 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.73 | 84.83 | 84.66 | 84.5 | 85.44 | 84.83 ± 0.32 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.66 | 85.37 | 85.2 | 82.96 | 84.91 | 84.62 ± 0.86 |

We use the previous English+German+French+Finnish model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the Swedish corpus for one epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.21 | 86.02 | 85.85 | 86.19 | 84.75 | 85.4 ± 0.78 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.95 | 84.94 | 84.7 | 85.41 | 84.59 | 85.12 ± 0.5 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.19 | 85.03 | 84.25 | 86.22 | 85.31 | 85.0 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.65 | 84.26 | 83.92 | 85.61 | 83.96 | 84.48 ± 0.62 |

We use the previous English+German+French+Finnish+Swedish model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the Dutch corpus for one epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.35 | 85.34 | 85.71 | 86.46 | 85.68 | 85.51 ± 0.68 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.02 | 86.12 | 84.16 | 84.62 | 86.23 | 85.43 ± 0.86 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.36 | 84.93 | 84.8 | 84.46 | 85.55 | 84.82 ± 0.42 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.67 | 84.59 | 84.56 | 84.63 | 85 | 84.69 ± 0.16 |

Results with the JAX/FLAX implementation on the multilingual model (4GB of text per language) for one epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.66 | 84.10 | 81.79 | 83.45 | 83.47 | 83.49 ± 0.96 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 83.99 | 82.85 | 82.44 | 84.57 | 83.49 | 83.47 ± 0.76 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 81.96 | 82.05 | 82.52 | 82.13 | 83.08 | 82.35 ± 0.41 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 83.10 | 81.73 | 82.46 | 81.44 | 82.44 | 82.23 ± 0.59 |

We use the previous 4GB model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the same corpus for an additional epoch. Pretraining was still in progress when these results were obtained:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.19 | 83.62 | 84.88 | 83.47 | 83.16 | 83.86 ± 0.61 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.57 | 83.59 | 82.37 | 85.58 | 81.73 | 83.37 ± 1.32 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.53 | 82.18 | 81.88 | 84.14 | 82.64 | 82.67 ± 0.78 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.44 | 79.4 | 83.04 | 81.91 | 82.49 | 81.86 ± 1.28 |

We use the previous 4GB model as the initial checkpoint (including its last learning rate and no warm-up steps) and continue pretraining on the same corpus for an additional epoch:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.07 | 81.92 | 84.63 | 84.04 | 82.77 | 83.49 ± 0.99 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.27 | 84.06 | 84.71 | 82.11 | 80.85 | 83.2 ± 1.47 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.16 | 81.3 | 81.73 | 84.63 | 81.75 | 82.31 ± 1.19 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 80.92 | 81.54 | 83.35 | 82.94 | 82.09 | 82.17 ± 0.89 |

Results with the JAX/FLAX implementation on the multilingual model for 560k steps (0.5 epochs):

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.04 | 80.71 | 82.57 | 82.92 | 85.17 | 83.28 ± 1.67 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.7 | 82.12 | 83.12 | 84.16 | 83.1 | 83.24 ± 0.69 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 83.83 | 83.76 | 82.64 | 84.1 | 81.49 | 83.16 ± 0.97 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 78.65 | 83.69 | 80.97 | 84.02 | 82.21 | 81.91 ± 1.96 |

Results with the JAX/FLAX implementation on the multilingual model for 850k steps (1 epoch):

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.63 | 83.89 | 85.31 | 84.37 | 86.36 | 84.91 ± 0.86 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.21 | 85.48 | 85.14 | 83.45 | 83.7 | 84.4 ± 0.79 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 83.18 | 85.82 | 82.24 | 82.41 | 83.35 | 83.4 ± 1.28 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.51 | 82.64 | 83.73 | 82.53 | 82.52 | 82.79 ± 0.47 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.91 | 87.26 | 86.57 | 87.98 | 87.62 | 87.27 ± 0.5 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.08 | 86.47 | 86.02 | 86.77 | 87.43 | 86.75 ± 0.49 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.34 | 86.12 | 85.37 | 86.56 | 85.99 | 85.88 ± 0.47 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.58 | 85.04 | 85.44 | 84.36 | 85.92 | 85.27 ± 0.53 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.29 | 88.01 | 87.17 | 86.43 | 88.33 | 87.45 ± 0.67 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 88.12 | 87.58 | 87.59 | 86.98 | 86.94 | 87.44 ± 0.44 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.78 | 86.63 | 85.85 | 86.64 | 85.82 | 86.34 ± 0.42 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.78 | 85.89 | 85.58 | 85.82 | 85.75 | 85.76 ± 0.1 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.6 | 86.84 | 88.06 | 88.25 | 86.47 | 87.24 ± 0.76 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 86.13 | 86.54 | 87.98 | 86.67 | 87.53 | 86.97 ± 0.68 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.98 | 86.09 | 87.02 | 85.99 | 86.47 | 86.51 ± 0.43 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.75 | 86.33 | 84.53 | 85.78 | 85.85 | 85.85 ± 0.75 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.71 | 86.98 | 87.08 | 86.85 | 88.25 | 87.37 ± 0.53 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 87.19 | 86.78 | 87.52 | 87.25 | 87.15 | 87.18 ± 0.24 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.19 | 86.23 | 86.47 | 87.15 | 86.91 | 86.59 ± 0.38 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.51 | 86.29 | 86.05 | 85.27 | 86.8 | 85.98 ± 0.55 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.12 | 86.88 | 87.43 | 87.17 | 86.98 | 87.12 ± 0.19 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 88.35 | 86.33 | 87.25 | 86.33 | 86.78 | 87.01 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.74 | 86.84 | 86.67 | 86.98 | 86.26 | 86.7 ± 0.24 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.85 | 85.47 | 86.54 | 85.85 | 84.1 | 85.56 ± 0.81 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.66 | 88.19 | 87.56 | 86.95 | 87.53 | 87.58 ± 0.39 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 87.66 | 86.84 | 87.5 | 86.53 | 87.23 | 87.15 ± 0.42 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.33 | 87.19 | 87.85 | 86.9 | 86.81 | 87.02 ± 0.5 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.34 | 84.71 | 85.75 | 85.89 | 85.85 | 85.71 ± 0.54 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 88 | 88.7 | 87.45 | 87.09 | 86.99 | 87.65 ± 0.63 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.68 | 87.74 | 87.02 | 87.72 | 87.14 | 87.46 ± 0.31 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.54 | 86.57 | 86.64 | 86.5 | 86.53 | 86.56 ± 0.05 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.06 | 86.23 | 85.44 | 86.63 | 86.67 | 86.21 ± 0.45 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.72 | 87.61 | 87.56 | 87.66 | 87.17 | 87.54 ± 0.19 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.8 | 86.33 | 87.29 | 88.01 | 86.47 | 86.98 ± 0.61 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.44 | 86.95 | 85.92 | 85.51 | 86.27 | 86.02 ± 0.55 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.24 | 86.55 | 85.65 | 86.67 | 85.65 | 85.95 ± 0.56 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 87.92 | 87.48 | 87.93 | 86.95 | 86.6 | 87.38 ± 0.53 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.15 | 86.81 | 87.25 | 87.74 | 87.02 | 87.19 ± 0.31 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.89 | 86.67 | 85.58 | 86.53 | 85.61 | 86.06 ± 0.46 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.02 | 85.61 | 85.99 | 86.16 | 85.75 | 85.91 ± 0.2 |

Results with the JAX/FLAX implementation on the multilingual model for 560k steps (0.5 epochs):

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.3 | 88.16 | 86.63 | 86.74 | 86.06 | 86.98 ± 0.71 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.92 | 86.78 | 85.88 | 87.22 | 86.4 | 86.44 ± 0.51 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.31 | 85.04 | 84.83 | 85.68 | 84.9 | 85.15 ± 0.31 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.71 | 84.86 | 80.23 | 85.89 | 85.68 | 84.47 ± 2.15 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 88.48 | 87.53 | 87.77 | 88.33 | 88 | 88.02 ± 0.35 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 87.95 | 86.71 | 89.26 | 87.39 | 88.78 | 88.02 ± 0.92 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.61 | 86.3 | 86.9 | 86.6 | 85.95 | 86.27 ± 0.46 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.58 | 85.68 | 84.96 | 85.78 | 85.95 | 85.59 ± 0.34 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.39 | 84.58 | 85.11 | 83.21 | 83.92 | 84.44 ± 0.79 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.54 | 83.71 | 83.19 | 83.67 | 83.52 | 83.73 ± 0.45 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 83.15 | 83.02 | 83.27 | 83.58 | 82.84 | 83.17 ± 0.25 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.88 | 82.9 | 82.63 | 83.59 | 83.38 | 83.08 ± 0.35 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.92 | 84.22 | 84.94 | 83.84 | 83.23 | 84.23 ± 0.65 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.01 | 82.77 | 83.6 | 84.04 | 83.99 | 83.68 ± 0.48 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.42 | 82.89 | 82.9 | 83.48 | 82.34 | 83.21 ± 0.71 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.51 | 83.65 | 81.94 | 83.23 | 84.25 | 83.12 ± 0.82 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.33 | 84.82 | 84.85 | 83.08 | 84.86 | 84.39 ± 0.68 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.62 | 84.42 | 84.01 | 83.67 | 83.98 | 84.14 ± 0.34 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.5 | 83.46 | 82.85 | 81.35 | 83.35 | 83.1 ± 1.03 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 83.69 | 83.23 | 82.61 | 82.69 | 82.99 | 83.04 ± 0.39 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.2 | 84.58 | 84.55 | 83.56 | 83.69 | 84.12 ± 0.42 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.16 | 84.01 | 83.81 | 83.21 | 83.56 | 83.75 ± 0.34 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 81.94 | 82.52 | 84.3 | 84.32 | 84.03 | 83.42 ± 1.0 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 83.1 | 83.5 | 83.71 | 81.74 | 83.35 | 83.08 ± 0.7 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.42 | 84.92 | 84.57 | 84.21 | 83.91 | 84.41 ± 0.34 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.07 | 84.22 | 83.62 | 83.91 | 83.13 | 83.99 ± 0.65 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 81.7 | 84.28 | 82.94 | 83.71 | 83.58 | 83.24 ± 0.88 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.29 | 83.56 | 83.25 | 83.44 | 83.07 | 83.12 ± 0.45 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.53 | 83.83 | 85.18 | 83.31 | 84.11 | 84.39 ± 0.83 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.31 | 84.71 | 84.07 | 84.03 | 83.05 | 83.83 ± 0.59 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.28 | 84.07 | 83.25 | 82.29 | 82.85 | 83.35 ± 0.74 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.72 | 83.95 | 81.59 | 84.2 | 82.16 | 82.92 ± 1.01 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.25 | 83.98 | 84.63 | 82.56 | 84.39 | 84.16 ± 0.9 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 83.63 | 83.56 | 82.4 | 82.57 | 84.44 | 83.32 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.37 | 83.12 | 81.35 | 83.33 | 81.69 | 82.37 ± 0.77 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 81.94 | 82.5 | 81.47 | 82.88 | 81.29 | 82.02 ± 0.6 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.63 | 84.04 | 84.75 | 84.42 | 83.62 | 84.29 ± 0.41 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 83.17 | 82.04 | 83.6 | 84.09 | 83.12 | 83.2 ± 0.68 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.24 | 83.29 | 81.74 | 82.62 | 82.34 | 82.45 ± 0.51 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 80.75 | 82.26 | 82.25 | 81.75 | 81.89 | 81.78 ± 0.55 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.11 | 83.98 | 85.32 | 83.98 | 84.11 | 84.3 ± 0.51 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.47 | 83.15 | 84.17 | 83.73 | 82.85 | 83.67 ± 0.61 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 83.62 | 81.91 | 81.48 | 82.84 | 82.12 | 82.39 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 83.04 | 81.24 | 81.64 | 82.43 | 82.19 | 82.11 ± 0.62 |

Results with the JAX/FLAX implementation on the multilingual model for 560k steps (0.5 epochs):

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 84.11 | 83.08 | 84.96 | 83.48 | 81.8 | 83.49 ± 1.06 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 83.31 | 81.64 | 82.12 | 82.91 | 83.42 | 82.68 ± 0.69 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 83.46 | 81.55 | 81.84 | 81.67 | 82.53 | 82.21 ± 0.71 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 81.44 | 82.53 | 80.64 | 83.29 | 82.27 | 82.03 ± 0.91 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.07 | 85.14 | 83.7 | 85.82 | 84.17 | 84.78 ± 0.75 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.31 | 82.78 | 83.94 | 85.61 | 83.02 | 83.73 ± 1.02 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 82.99 | 84.89 | 83.46 | 82.41 | 83.92 | 83.53 ± 0.84 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 82.4 | 83.69 | 83.39 | 82.84 | 83.33 | 83.13 ± 0.46 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 77.73 | 74.68 | 79.05 | 77.71 | 75.63 | 76.96 ± 1.58 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 78.19 | 76.76 | 75.85 | 77.61 | 74.95 | 76.67 ± 1.17 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 74.47 | 66.95 | 71.34 | 53.76 | 65.81 | 66.47 ± 7.07 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 46.39 | 64.97 | 66.53 | 60.81 | 71.22 | 61.98 ± 8.48 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 77.02 | 77.22 | 75.54 | 81.06 | 78.02 | 77.77 ± 1.83 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 78.87 | 77.87 | 75.32 | 78.35 | 76.69 | 77.42 ± 1.27 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 65.82 | 64.64 | 78.35 | 72.93 | 71.91 | 70.73 ± 5.01 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 73.19 | 75.26 | 62.34 | 55.19 | 62.2 | 65.64 ± 7.5 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 57.61 | 55.51 | 52.95 | 57.72 | 53.53 | 55.46 ± 1.99 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 55.95 | 54.99 | 46.34 | 53.25 | 56.73 | 53.45 ± 3.74 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 46.06 | 40.57 | 46.37 | 45.74 | 48.05 | 45.36 ± 2.52 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 44.68 | 42.19 | 42.15 | 41.84 | 44.21 | 43.01 ± 1.18 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 78.82 | 78.12 | 78.18 | 76.59 | 82.29 | 78.8 ± 1.89 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 79.2 | 78.91 | 78.74 | 77.48 | 73.7 | 77.61 ± 2.04 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 74.46 | 64.86 | 71.25 | 63.91 | 65.23 | 67.94 ± 4.16 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 61.07 | 62.72 | 67.03 | 63.57 | 68.47 | 64.57 ± 2.75 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 80.59 | 80.29 | 78.83 | 80.15 | 79.85 | 79.94 ± 0.6 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 77.31 | 76.76 | 79.56 | 79.85 | 79.42 | 78.58 ± 1.28 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 66.43 | 65.58 | 75.09 | 65.7 | 64.39 | 67.44 ± 3.88 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 65.57 | 73.51 | 65.59 | 64.29 | 66.67 | 67.13 ± 3.28 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 73.19 | 70 | 75.09 | 76.95 | 71.68 | 73.38 ± 2.45 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 74.5 | 69.98 | 73.68 | 68.94 | 75.14 | 72.45 ± 2.5 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 61.96 | 61.34 | 60.54 | 59.71 | 56.58 | 60.03 ± 1.88 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 61.26 | 57.35 | 61.93 | 61.2 | 57.91 | 59.93 ± 1.9 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.01 | 87.32 | 86.41 | 86.71 | 86.91 | 86.47 ± 0.79 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 85.87 | 85.79 | 85.96 | 85.34 | 87.29 | 86.05 ± 0.66 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.41 | 85.91 | 87.2 | 85.14 | 86.46 | 86.02 ± 0.74 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 86.88 | 86.94 | 85.77 | 85.48 | 84.73 | 85.96 ± 0.85 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 87.11 | 88.4 | 86.86 | 85.86 | 86.04 | 86.85 ± 0.91 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 87.77 | 86.12 | 86.76 | 85.58 | 87.03 | 86.65 ± 0.75 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.57 | 85.16 | 86.84 | 86.98 | 86.05 | 86.32 ± 0.66 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.99 | 86.67 | 86.12 | 84.72 | 87.97 | 86.29 ± 1.05 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 84.59 | 84.89 | 85.58 | 84.7 | 84.25 | 84.8 ± 0.44 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.41 | 84.63 | 83.83 | 84.89 | 84.75 | 84.5 ± 0.37 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 84.14 | 82.28 | 85.45 | 85.87 | 84.2 | 84.39 ± 1.25 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 83.77 | 82.37 | 84.31 | 85.34 | 83.18 | 83.79 ± 1.01 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 77.6 | 77.77 | 76.48 | 77.95 | 77.34 | 77.43 ± 0.51 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 75.74 | 77.68 | 77.7 | 78.59 | 76.81 | 77.3 ± 0.96 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 77.55 | 77.32 | 76.77 | 76.67 | 76.68 | 77.0 ± 0.37 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 77.29 | 76.78 | 76.85 | 77.71 | 76.27 | 76.98 ± 0.49 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 77.47 | 78.04 | 77.02 | 78.02 | 76.68 | 77.45 ± 0.54 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 77.31 | 77.03 | 76.26 | 77.69 | 77.19 | 77.1 ± 0.47 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 76.87 | 76.87 | 77.18 | 77.9 | 76.6 | 77.08 ± 0.45 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 76.67 | 76.64 | 77.15 | 77.37 | 76.97 | 76.96 ± 0.28 |


| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 75.34 | 76.69 | 75.84 | 75.44 | 76.52 | 75.97 ± 0.55 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 76.38 | 75.37 | 74.72 | 75.54 | 75.15 | 75.43 ± 0.55 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 74.35 | 76.02 | 75.92 | 75.33 | 74.85 | 75.29 ± 0.63 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 74.82 | 74.93 | 74.89 | 75.82 | 74.02 | 74.9 ± 0.57 |

The previously pretrained hmByT5 models "accidentally" use a mean noise span length of 3, because this is the default value for T5. However, the ByT5 paper notes that a span length of 3 makes the pretraining task too easy for byte-level models and recommends a value of 20.
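
To see why this matters for a byte-level model, here is a small illustrative calculation following the T5 span-corruption scheme (as implemented in `run_t5_mlm_flax.py`; the 15% noise density is its default):

```python
# With ~15% of tokens corrupted, the mean noise span length controls how many
# spans are masked and how long each one is. For bytes, a span of 3 is often
# shorter than a single word, while a span of 20 covers several words.
def span_corruption_stats(length: int, noise_density: float = 0.15,
                          mean_noise_span_length: float = 3.0):
    num_noise_tokens = int(round(length * noise_density))
    num_noise_spans = max(1, int(round(num_noise_tokens / mean_noise_span_length)))
    return num_noise_tokens, num_noise_spans

print(span_corruption_stats(1024, mean_noise_span_length=3.0))   # (154, 51) -> many tiny gaps
print(span_corruption_stats(1024, mean_noise_span_length=20.0))  # (154, 8)  -> few long gaps
```
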
We pretrained an English model with `mean_noise_span_length=20` and fine-tuned it on the English AjMC dataset:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 85.48 | 84.6 | 85.65 | 86.83 | 86.53 | 85.82 ± 0.79 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.35 | 84.5 | 86.05 | 85.1 | 85.18 | 85.24 ± 0.5 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 84.14 | 83.45 | 84.4 | 84.9 | 85.82 | 84.54 ± 0.79 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 85.27 | 85.3 | 83.33 | 85.25 | 81.7 | 84.17 ± 1.45 |

For comparison, the model using a span length of 3 achieved 85.65 ± 1.21, so we also see performance improvements when using `mean_noise_span_length=20`. The model can be found under `hmbyt5/byt5-small-historic-english-span20`.
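
The checkpoint can be loaded like any other model on the Hugging Face Hub (assumption: if only FLAX weights are published for a checkpoint, pass `from_flax=True`):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "hmbyt5/byt5-small-historic-english-span20"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)  # add from_flax=True if needed
```
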
We also pretrained a monolingual model for Dutch on the Delpher corpus, with both `mean_noise_span_length=3` and `mean_noise_span_length=20`.

Results with `mean_noise_span_length=3` on ICDAR Europeana:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 88.02 | 88.71 | 87.17 | 87 | 88.62 | 87.9 ± 0.71 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 87.1 | 86.72 | 87.15 | 88.29 | 87.35 | 87.32 ± 0.53 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 87.23 | 87.19 | 87.11 | 87.62 | 87.11 | 87.25 ± 0.19 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 85.98 | 87.5 | 84.22 | 87.08 | 86.48 | 86.25 ± 1.14 |

Results with `mean_noise_span_length=20` on ICDAR Europeana:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.61 | 85.88 | 87.65 | 87.93 | 88.01 | 87.22 ± 0.83 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 87.88 | 87.56 | 85.62 | 86.52 | 87.03 | 86.92 ± 0.8 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 86.17 | 85.87 | 87.77 | 86.58 | 87.96 | 86.87 ± 0.85 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 87.67 | 86.02 | 85.66 | 87 | 85.99 | 86.47 ± 0.75 |

For the Dutch model on ICDAR Europeana, we do not see performance improvements when using `mean_noise_span_length=20`.

The Dutch monolingual models can be found under:
In the previous experiments, we only trained models with ByT5 Small as the initial checkpoint. For English and Dutch, we also trained models with the Base architecture and compare the results.
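
For a sense of scale, the two architectures differ mainly in parameter count (roughly 300M for ByT5 Small vs. 582M for ByT5 Base, per the ByT5 paper); this can be checked directly:

```python
from transformers import T5ForConditionalGeneration

for name in ["google/byt5-small", "google/byt5-base"]:
    model = T5ForConditionalGeneration.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```
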
On English AjMC, the following results can be achieved when using `mean_noise_span_length=3`:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.78 | 87.46 | 85.75 | 88.41 | 86.6 | 87.0 ± 0.89 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.79 | 86.29 | 86.67 | 87.14 | 85.82 | 86.54 ± 0.45 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 87.04 | 87.34 | 86.63 | 84.09 | 87.04 | 86.43 ± 1.19 |
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.87 | 86.43 | 86.88 | 85.15 | 85.25 | 86.12 ± 0.77 |

With `mean_noise_span_length=20`:

| Configuration | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---------------|-------|-------|-------|-------|-------|------|
| `wsFalse-bs8-e10-lr0.00015-poolingfirst` | 86.51 | 87.2 | 86.22 | 85.78 | 86.46 | 86.43 ± 0.46 |
| `wsFalse-bs4-e10-lr0.00016-poolingfirst` | 86.12 | 87.04 | 87.01 | 85.25 | 86.74 | 86.43 ± 0.68 |
| `wsFalse-bs8-e10-lr0.00016-poolingfirst` | 86.49 | 85.27 | 86.12 | 86.29 | 85.61 | 85.96 ± 0.45 |
| `wsFalse-bs4-e10-lr0.00015-poolingfirst` | 86.33 | 86.05 | 84.48 | 85.68 | 86.16 | 85.74 ± 0.67 |

The English monolingual models can be found under: