Release new GliNER models (#42)
Reference: #40
ljvmiranda921 authored Aug 9, 2024
1 parent 6d9924b commit 7c31f98
Showing 9 changed files with 554 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@ reproduction of results, and guides on usage.
> a citrus fruit native to the Philippines and used in traditional Filipino cuisine.
## 📰 News
- [2024-08-01] Released new NER-only models based on [GLiNER](https://github.com/urchade/GLiNER)! You can find the models in [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87). Span-Marker and calamanCy models are still superior, but GLiNER offers a lot of extensibility on unseen entity labels. You can find the training pipeline [here](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0-gliner).
- [2023-12-05] We released the paper [**calamanCy: A Tagalog Natural Language Processing Toolkit**](https://aclanthology.org/2023.nlposs-1.1/), which will be presented at the NLP-OSS workshop at EMNLP 2023! Feel free to check out the [Tagalog NLP collection in HuggingFace](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87).
- [2023-11-01] The named entity recognition (NER) dataset used to train the NER component of calamanCy now has a corresponding paper: [**Developing a Named Entity Recognition Dataset for Tagalog**](https://aclanthology.org/2023.sealp-1.2/)! It will be presented at the SEALP workshop at IJCNLP-AACL 2023! The dataset is also available [in HuggingFace](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner).

1 change: 1 addition & 0 deletions experiments/refresh_evals_0924/project.yml
@@ -0,0 +1 @@
title: "Benchmarking new models on TLUnfied-NER data"
3 changes: 3 additions & 0 deletions experiments/refresh_evals_0924/requirements.txt
@@ -0,0 +1,3 @@
spacy
spacy-llm==0.7.2
datasets
1 change: 1 addition & 0 deletions models/v0.1.0-gliner/.gitignore
@@ -0,0 +1 @@
metrics
120 changes: 120 additions & 0 deletions models/v0.1.0-gliner/README.md
@@ -0,0 +1,120 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.1.0-gliner

This is a spaCy project that trains and evaluates new v0.1.0-gliner models.
[GliNER](https://github.com/urchade/GLiNER) (Generalist and Lightweight Model for Named Entity Recognition) is a powerful model capable of identifying any entity type using a BERT-like encoder.
In this project, we finetune the GliNER model using the TLUnified-NER dataset.
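
For illustration, here is a minimal zero-shot sketch using the base [GliNER](https://github.com/urchade/GLiNER) library; the checkpoint name and example sentence are only indicative and are not part of this project's training or evaluation pipeline:

```python
# Minimal zero-shot sketch with the `gliner` library (illustrative only).
# The checkpoint name below is an assumption; see the v2.5 collection linked further down.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.5")
text = "Si Jose Rizal ay ipinanganak sa Calamba, Laguna."
labels = ["person", "location", "organization"]  # supplied at inference time, not fixed at training

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```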

To replicate training, first install the required dependencies:

```sh
pip install -r requirements.txt
```

## Training

To train a GliNER model, run the `finetune-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run finetune-gliner . --vars.size small
```

The models are currently based on the [v2.5 version of GliNER](https://huggingface.co/collections/urchade/gliner-v25-66743e64ab975c859119d1eb).

## Evaluation

To perform evals, run the `eval-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run eval-gliner . --vars.size small
```

This will evaluate on TLUnified-NER's test set ([Miranda, 2023](https://aclanthology.org/2023.sealp-1.2.pdf)) and the Tagalog subsets of
Universal NER ([Mayhew et al., 2024](https://aclanthology.org/2024.naacl-long.243/)).
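
For reference, the snippet below is a minimal sketch of how the TLUnified-NER test split can be loaded with the `datasets` library (the dataset ID, split, and field names mirror those used by `evaluate.py` further below):

```python
# Sketch: load the TLUnified-NER test split used for evaluation.
# Requires the `datasets` package and downloads the data from the HuggingFace Hub.
from datasets import load_dataset

test_ds = load_dataset("ljvmiranda921/tlunified-ner", split="test", trust_remote_code=True)
print(test_ds.features["ner_tags"].feature.names)  # BIO label inventory (PER / ORG / LOC)
print(test_ds[0]["tokens"][:10])                   # first few tokens of one example
```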

The evaluation results for TLUnified-NER are shown in the table below (reported numbers are F1-scores):

| | PER | ORG | LOC | Overall |
|------------------|-------|-------|-------|---------|
| [tl_gliner_small](https://huggingface.co/ljvmiranda921/tl_gliner_small) | 86.76 | 78.72 | 86.78 | 84.83 |
| [tl_gliner_medium](https://huggingface.co/ljvmiranda921/tl_gliner_medium) | 87.46 | 79.71 | 86.75 | 85.40 |
| [tl_gliner_large](https://huggingface.co/ljvmiranda921/tl_gliner_large) | 86.75 | 80.20 | 86.76 | 85.72 |
| [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) | 91.95 | **84.84** | 88.92 | 88.03 |
| [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified) | **92.57** | 82.04 | **90.56** | **89.62** |

In general, GliNER gets decent scores, but nothing beats regular finetuning on BERT-based models as seen in [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) and [span_marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified).
The performance on Universal NER is generally worse (the highest is around 50%) compared to the results reported in the Universal NER paper (we finetuned on RoBERTa as well).
One possible reason is that the annotation guidelines for TLUnified-NER are looser: we consider some entities that Universal NER ignores.
At the same time, the text distributions of the two datasets are widely different.

Nevertheless, I'm still releasing these GliNER models as they are very extensible to other entity types (and it's also nice to have a finetuned version of GliNER for Tagalog!).
I haven't done any extensive hyperparameter tuning here, so it would be nice if someone could contribute better config parameters to bump up these scores.
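
As a rough sketch (mirroring the `gliner_spacy` configuration used in `evaluate.py` below), you can plug a released model into a spaCy pipeline and query labels it was never finetuned on; the extra labels and the example sentence here are purely illustrative:

```python
# Sketch only: use a released Tagalog GliNER model with custom entity labels
# via the gliner-spacy plugin. The "date" and "event" labels are illustrative.
import spacy

nlp = spacy.blank("tl")
nlp.add_pipe(
    "gliner_spacy",
    config={
        "gliner_model": "ljvmiranda921/tl_gliner_small",
        "labels": ["person", "organization", "location", "date", "event"],
        "threshold": 0.5,
        "style": "ent",
    },
)

doc = nlp("Pumunta si Juan dela Cruz sa Maynila noong Lunes para sa pulong ng DOST.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```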

## Citation

Please cite the following papers when using these models:

```
@misc{zaratiana2023gliner,
    title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
    author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
    year={2023},
    eprint={2311.08526},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

```
@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.1",
    pages = "1--7",
}
```

If you're using the NER dataset:

```
@inproceedings{miranda-2023-developing,
    title = "Developing a Named Entity Recognition Dataset for {T}agalog",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sealp-1.2",
    doi = "10.18653/v1/2023.sealp-1.2",
    pages = "13--20",
}
```


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `finetune-gliner` | Finetune the GliNER model using TLUnified-NER |
| `eval-gliner` | Evaluate trained GliNER models on the TLUnified-NER and Universal NER test sets |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
127 changes: 127 additions & 0 deletions models/v0.1.0-gliner/evaluate.py
@@ -0,0 +1,127 @@
from pathlib import Path
from typing import Dict, Iterable, Optional
from copy import deepcopy

import spacy
import torch
import typer
import srsly
from datasets import Dataset, load_dataset
from spacy.scorer import Scorer
from spacy.tokens import Doc, Span
from spacy.training import Example
from wasabi import msg


def main(
    # fmt: off
    output_path: Path = typer.Argument(..., help="Path to store the metrics in JSON format."),
    model_name: str = typer.Option("ljvmiranda921/tl_gliner_small", show_default=True, help="GliNER model to use for evaluation."),
    dataset: str = typer.Option("ljvmiranda921/tlunified-ner", help="Dataset to evaluate upon."),
    threshold: float = typer.Option(0.5, help="The threshold of the GliNER model (controls the degree to which a hit is considered an entity)."),
    dataset_config: Optional[str] = typer.Option(None, help="Configuration for loading the dataset."),
    chunk_size: int = typer.Option(250, help="Size of the text chunk to be processed at once."),
    label_map: str = typer.Option("person::PER,organization::ORG,location::LOC", help="Mapping between GliNER labels and the dataset's actual labels (separated by a double-colon '::')."),
    # fmt: on
):
    label_map: Dict[str, str] = process_labels(label_map)
    msg.text(f"Using label map: {label_map}")

    msg.info("Processing test dataset")
    ds = load_dataset(dataset, dataset_config, split="test", trust_remote_code=True)
    ref_docs = convert_hf_to_spacy_docs(ds)

    msg.info("Loading GliNER model")
    nlp = spacy.blank("tl")
    nlp.add_pipe(
        "gliner_spacy",
        config={
            "gliner_model": model_name,
            "chunk_size": chunk_size,
            "labels": list(label_map.keys()),
            "threshold": threshold,
            "style": "ent",
            "map_location": "cuda" if torch.cuda.is_available() else "cpu",
        },
    )
    msg.text("Getting predictions")
    docs = deepcopy(ref_docs)
    pred_docs = list(nlp.pipe(docs))
    pred_docs = [update_entity_labels(doc, label_map) for doc in pred_docs]

    # Get the scores
    examples = [
        Example(reference=ref, predicted=pred) for ref, pred in zip(ref_docs, pred_docs)
    ]
    scores = Scorer.score_spans(examples, "ents")

    msg.info(f"Results for {dataset} ({model_name})")
    msg.text(scores)
    srsly.write_json(output_path, data=scores, indent=2)
    msg.good(f"Saving outputs to {output_path}")


def process_labels(label_map: str) -> Dict[str, str]:
    """Parse the label-map string into a dictionary.

    For example, "person::PER,organization::ORG" becomes
    {"person": "PER", "organization": "ORG"}.
    """
    return {m.split("::")[0]: m.split("::")[1] for m in label_map.split(",")}


def convert_hf_to_spacy_docs(dataset: "Dataset") -> Iterable[Doc]:
    nlp = spacy.blank("tl")
    examples = dataset.to_list()
    # Map each ner_tags index to its entity type (e.g., "B-PER" -> "PER"),
    # skipping the "O" tag.
    entity_types = {
        idx: feature.split("-")[1]
        for idx, feature in enumerate(dataset.features["ner_tags"].feature.names)
        if feature != "O"  # don't include empty
    }
    msg.text(f"Using entity types: {entity_types}")

    docs = []
    for example in examples:
        tokens = example["tokens"]
        ner_tags = example["ner_tags"]
        doc = Doc(nlp.vocab, words=tokens)

        # Convert the token-level tags into entity Spans.
        entities = []
        start_idx = None
        entity_type = None

        for idx, tag in enumerate(ner_tags):
            if tag in entity_types:
                if start_idx is None:
                    # Start a new entity span.
                    start_idx = idx
                    entity_type = entity_types[tag]
                elif entity_type != entity_types.get(tag, None):
                    # The entity type changed: close the current span and start a new one.
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = idx
                    entity_type = entity_types[tag]
            else:
                if start_idx is not None:
                    # An "O" tag ends the current entity span.
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = None

        if start_idx is not None:
            # Close an entity that runs until the end of the sentence.
            entities.append(Span(doc, start_idx, len(tokens), label=entity_type))
        doc.ents = entities
        docs.append(doc)

    return docs


def update_entity_labels(doc: Doc, label_mapping: Dict[str, str]) -> Doc:
    updated_ents = []
    for ent in doc.ents:
        new_label = label_mapping.get(ent.label_.lower(), ent.label_)
        updated_span = Span(doc, ent.start, ent.end, label=new_label)
        updated_ents.append(updated_span)

    new_doc = Doc(
        doc.vocab,
        words=[token.text for token in doc],
        spaces=[token.whitespace_ for token in doc],
    )
    new_doc.ents = updated_ents
    return new_doc


if __name__ == "__main__":
    typer.run(main)