Release new GliNER models (#42)
Reference: #40
ljvmiranda921 authored Aug 9, 2024
1 parent 6d9924b commit 7c31f98
Showing 9 changed files with 554 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@ reproduction of results, and guides on usage.
> a citrus fruit native to the Philippines and used in traditional Filipino cuisine.
## 📰 News
- [2024-08-01] Released new NER-only models based on [GLiNER](https://github.com/urchade/GLiNER)! You can find the models in [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87). Span-Marker and calamanCy models are still superior, but GLiNER offers a lot of extensibility on unseen entity labels. You can find the training pipeline [here](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0-gliner).
- [2023-12-05] We released the paper [**calamanCy: A Tagalog Natural Language Processing Toolkit**](https://aclanthology.org/2023.nlposs-1.1/), which will be presented at the NLP-OSS workshop at EMNLP 2023! Feel free to check out the [Tagalog NLP collection in HuggingFace](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87).
- [2023-11-01] The named entity recognition (NER) dataset used to train the NER component of calamanCy now has a corresponding paper: [**Developing a Named Entity Recognition Dataset for Tagalog**](https://aclanthology.org/2023.sealp-1.2/)! It will be presented at the SEALP workshop at IJCNLP-AACL 2023! The dataset is also available [in HuggingFace](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner).

1 change: 1 addition & 0 deletions experiments/refresh_evals_0924/project.yml
@@ -0,0 +1 @@
title: "Benchmarking new models on TLUnfied-NER data"
3 changes: 3 additions & 0 deletions experiments/refresh_evals_0924/requirements.txt
@@ -0,0 +1,3 @@
spacy
spacy-llm==0.7.2
datasets
1 change: 1 addition & 0 deletions models/v0.1.0-gliner/.gitignore
@@ -0,0 +1 @@
metrics
120 changes: 120 additions & 0 deletions models/v0.1.0-gliner/README.md
@@ -0,0 +1,120 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.1.0-gliner

This is a spaCy project that trains and evaluates new v0.1.0-gliner models.
[GliNER](https://github.com/urchade/GLiNER) (Generalist and Lightweight Model for Named Entity Recognition) is a powerful model capable of identifying any entity type using a BERT-like encoder.
In this project, we finetune the GliNER model using the TLUnified-NER dataset.
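
For illustration, here is a minimal zero-shot sketch using the base [GliNER](https://github.com/urchade/GLiNER) library; the checkpoint name and example sentence are only indicative and are not part of this project's training or evaluation pipeline:

```python
# Minimal zero-shot sketch with the `gliner` library (illustrative only).
# The checkpoint name below is an assumption; see the v2.5 collection linked further down.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.5")
text = "Si Jose Rizal ay ipinanganak sa Calamba, Laguna."
labels = ["person", "location", "organization"]  # supplied at inference time, not fixed at training

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```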

To replicate training, first install the required dependencies:

```sh
pip install -r requirements.txt
```

## Training

To train a GliNER model, run the `finetune-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run finetune-gliner . --vars.size small
```

The models are currently based on the [v2.5 version of GliNER](https://huggingface.co/collections/urchade/gliner-v25-66743e64ab975c859119d1eb).

## Evaluation

To perform evals, run the `eval-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run eval-gliner . --vars.size small
```

This will evaluate on TLUnified-NER's test set ([Miranda, 2023](https://aclanthology.org/2023.sealp-1.2.pdf)) and the Tagalog subsets of
Universal NER ([Mayhew et al., 2024](https://aclanthology.org/2024.naacl-long.243/)).
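
For reference, the snippet below is a minimal sketch of how the TLUnified-NER test split can be loaded with the `datasets` library (the dataset ID, split, and field names mirror those used by `evaluate.py` further below):

```python
# Sketch: load the TLUnified-NER test split used for evaluation.
# Requires the `datasets` package and downloads the data from the HuggingFace Hub.
from datasets import load_dataset

test_ds = load_dataset("ljvmiranda921/tlunified-ner", split="test", trust_remote_code=True)
print(test_ds.features["ner_tags"].feature.names)  # BIO label inventory (PER / ORG / LOC)
print(test_ds[0]["tokens"][:10])                   # first few tokens of one example
```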

The evaluation results for TLUnified-NER are shown in the table below (reported numbers are F1-scores):

| | PER | ORG | LOC | Overall |
|------------------|-------|-------|-------|---------|
| [tl_gliner_small](https://huggingface.co/ljvmiranda921/tl_gliner_small) | 86.76 | 78.72 | 86.78 | 84.83 |
| [tl_gliner_medium](https://huggingface.co/ljvmiranda921/tl_gliner_medium) | 87.46 | 79.71 | 86.75 | 85.40 |
| [tl_gliner_large](https://huggingface.co/ljvmiranda921/tl_gliner_large) | 86.75 | 80.20 | 86.76 | 85.72 |
| [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) | 91.95 | **84.84** | 88.92 | 88.03 |
| [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified) | **92.57** | 82.04 | **90.56** | **89.62** |

In general, GliNER gets decent scores, but nothing beats regular finetuning on BERT-based models as seen in [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) and [span_marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified).
The performance on Universal NER is generally worse (the highest is around 50%) compared to the results reported in the Universal NER paper (we finetuned on RoBERTa as well).
One possible reason is that the annotation guidelines for TLUnified-NER are looser: we consider some entities that Universal NER ignores.
At the same time, the text distributions of the two datasets are widely different.

Nevertheless, I'm still releasing these GliNER models as they are very extensible to other entity types (and it's also nice to have a finetuned version of GliNER for Tagalog!).
I haven't done any extensive hyperparameter tuning here, so it would be nice if someone could contribute better config parameters to bump up these scores.
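
As a rough sketch (mirroring the `gliner_spacy` configuration used in `evaluate.py` below), you can plug a released model into a spaCy pipeline and query labels it was never finetuned on; the extra labels and the example sentence here are purely illustrative:

```python
# Sketch only: use a released Tagalog GliNER model with custom entity labels
# via the gliner-spacy plugin. The "date" and "event" labels are illustrative.
import spacy

nlp = spacy.blank("tl")
nlp.add_pipe(
    "gliner_spacy",
    config={
        "gliner_model": "ljvmiranda921/tl_gliner_small",
        "labels": ["person", "organization", "location", "date", "event"],
        "threshold": 0.5,
        "style": "ent",
    },
)

doc = nlp("Pumunta si Juan dela Cruz sa Maynila noong Lunes para sa pulong ng DOST.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```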

## Citation

Please cite the following papers when using these models:

```
@misc{zaratiana2023gliner,
    title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
    author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
    year={2023},
    eprint={2311.08526},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

```
@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.1",
    pages = "1--7",
}
```

If you're using the NER dataset:

```
@inproceedings{miranda-2023-developing,
    title = "Developing a Named Entity Recognition Dataset for {T}agalog",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sealp-1.2",
    doi = "10.18653/v1/2023.sealp-1.2",
    pages = "13--20",
}
```


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `finetune-gliner` | Finetune the GliNER model using TLUnified-NER |
| `eval-gliner` | Evaluate trained GliNER models on the TLUnified-NER and Universal NER test sets |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
127 changes: 127 additions & 0 deletions models/v0.1.0-gliner/evaluate.py
@@ -0,0 +1,127 @@
from pathlib import Path
from typing import Dict, Iterable, Optional
from copy import deepcopy

import spacy
import torch
import typer
import srsly
from datasets import Dataset, load_dataset
from spacy.scorer import Scorer
from spacy.tokens import Doc, Span
from spacy.training import Example
from wasabi import msg


def main(
    # fmt: off
    output_path: Path = typer.Argument(..., help="Path to store the metrics in JSON format."),
    model_name: str = typer.Option("ljvmiranda921/tl_gliner_small", show_default=True, help="GliNER model to use for evaluation."),
    dataset: str = typer.Option("ljvmiranda921/tlunified-ner", help="Dataset to evaluate upon."),
    threshold: float = typer.Option(0.5, help="The threshold of the GliNER model (controls the degree to which a hit is considered an entity)."),
    dataset_config: Optional[str] = typer.Option(None, help="Configuration for loading the dataset."),
    chunk_size: int = typer.Option(250, help="Size of the text chunk to be processed at once."),
    label_map: str = typer.Option("person::PER,organization::ORG,location::LOC", help="Mapping between GliNER labels and the dataset's actual labels (separated by a double-colon '::')."),
    # fmt: on
):
    label_map: Dict[str, str] = process_labels(label_map)
    msg.text(f"Using label map: {label_map}")

    msg.info("Processing test dataset")
    ds = load_dataset(dataset, dataset_config, split="test", trust_remote_code=True)
    ref_docs = convert_hf_to_spacy_docs(ds)

    msg.info("Loading GliNER model")
    nlp = spacy.blank("tl")
    nlp.add_pipe(
        "gliner_spacy",
        config={
            "gliner_model": model_name,
            "chunk_size": chunk_size,
            "labels": list(label_map.keys()),
            "threshold": threshold,
            "style": "ent",
            "map_location": "cuda" if torch.cuda.is_available() else "cpu",
        },
    )
    msg.text("Getting predictions")
    docs = deepcopy(ref_docs)
    pred_docs = list(nlp.pipe(docs))
    pred_docs = [update_entity_labels(doc, label_map) for doc in pred_docs]

    # Get the scores
    examples = [
        Example(reference=ref, predicted=pred) for ref, pred in zip(ref_docs, pred_docs)
    ]
    scores = Scorer.score_spans(examples, "ents")

    msg.info(f"Results for {dataset} ({model_name})")
    msg.text(scores)
    srsly.write_json(output_path, data=scores, indent=2)
    msg.good(f"Saving outputs to {output_path}")


def process_labels(label_map: str) -> Dict[str, str]:
    """Parse the label-map string into a dictionary.

    For example, "person::PER,organization::ORG" becomes
    {"person": "PER", "organization": "ORG"}.
    """
    return {m.split("::")[0]: m.split("::")[1] for m in label_map.split(",")}


def convert_hf_to_spacy_docs(dataset: "Dataset") -> Iterable[Doc]:
    nlp = spacy.blank("tl")
    examples = dataset.to_list()
    # Map each ner_tags index to its entity type (e.g., "B-PER" -> "PER"),
    # skipping the "O" tag.
    entity_types = {
        idx: feature.split("-")[1]
        for idx, feature in enumerate(dataset.features["ner_tags"].feature.names)
        if feature != "O"  # don't include empty
    }
    msg.text(f"Using entity types: {entity_types}")

    docs = []
    for example in examples:
        tokens = example["tokens"]
        ner_tags = example["ner_tags"]
        doc = Doc(nlp.vocab, words=tokens)

        # Convert the token-level tags into entity Spans.
        entities = []
        start_idx = None
        entity_type = None

        for idx, tag in enumerate(ner_tags):
            if tag in entity_types:
                if start_idx is None:
                    # Start a new entity span.
                    start_idx = idx
                    entity_type = entity_types[tag]
                elif entity_type != entity_types.get(tag, None):
                    # The entity type changed: close the current span and start a new one.
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = idx
                    entity_type = entity_types[tag]
            else:
                if start_idx is not None:
                    # An "O" tag ends the current entity span.
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = None

        if start_idx is not None:
            # Close an entity that runs until the end of the sentence.
            entities.append(Span(doc, start_idx, len(tokens), label=entity_type))
        doc.ents = entities
        docs.append(doc)

    return docs


def update_entity_labels(doc: Doc, label_mapping: Dict[str, str]) -> Doc:
    updated_ents = []
    for ent in doc.ents:
        new_label = label_mapping.get(ent.label_.lower(), ent.label_)
        updated_span = Span(doc, ent.start, ent.end, label=new_label)
        updated_ents.append(updated_span)

    new_doc = Doc(
        doc.vocab,
        words=[token.text for token in doc],
        spaces=[token.whitespace_ for token in doc],
    )
    new_doc.ents = updated_ents
    return new_doc


if __name__ == "__main__":
    typer.run(main)