Showing 9 changed files with 554 additions and 0 deletions.
@@ -0,0 +1 @@
title: "Benchmarking new models on TLUnified-NER data"
@@ -0,0 +1,3 @@
spacy
spacy-llm==0.7.2
datasets
@@ -0,0 +1 @@
metrics
@@ -0,0 +1,120 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.1.0-gliner

This is a spaCy project that trains and evaluates new v0.1.0-gliner models.
[GLiNER](https://github.com/urchade/GLiNER) (Generalist and Lightweight Model for Named Entity Recognition) is a powerful model capable of identifying any entity type using a BERT-like encoder.
In this project, we finetune the GLiNER model on the TLUnified-NER dataset.

To replicate training, first install the required dependencies:

```sh
pip install -r requirements.txt
```
## Training

To train a GLiNER model, run the `finetune-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run finetune-gliner . --vars.size small
```

The models are currently based on the [v2.5 version of GLiNER](https://huggingface.co/collections/urchade/gliner-v25-66743e64ab975c859119d1eb).
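
Since the finetuned checkpoints are published on Hugging Face, they can also be loaded directly with the [`gliner`](https://github.com/urchade/GLiNER) library, independently of the spaCy workflows here. A minimal sketch (the example sentence and threshold are illustrative, and `gliner` must be installed separately):

```python
# Minimal sketch: zero-shot-style NER with a released checkpoint via the
# `gliner` library (pip install gliner). The sentence is illustrative;
# the prompt labels match the defaults used by the evaluation script.
from gliner import GLiNER

model = GLiNER.from_pretrained("ljvmiranda921/tl_gliner_small")

text = "Si Juan dela Cruz ay bumisita sa Unibersidad ng Pilipinas sa Quezon City."
labels = ["person", "organization", "location"]

# predict_entities returns dicts with 'text', 'label', 'start', 'end', 'score'
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "=>", ent["label"])
```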
## Evaluation

To evaluate the trained models, run the `eval-gliner` workflow while passing the size:

```sh
# Available options: 'small', 'medium', 'large'
python -m spacy project run eval-gliner . --vars.size small
```

This evaluates on TLUnified-NER's test set ([Miranda, 2023](https://aclanthology.org/2023.sealp-1.2.pdf)) and on the Tagalog subsets of Universal NER ([Mayhew et al., 2024](https://aclanthology.org/2024.naacl-long.243/)).
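
Both evaluation sets are pulled from the Hugging Face Hub. As a rough sketch of the data-loading step: the TLUnified-NER dataset ID below matches the default in the evaluation script included in this commit, but the Universal NER repo ID and its Tagalog config names (`tl_trg`, `tl_ugnayan`) are assumptions worth verifying on the Hub before relying on them:

```python
# Sketch of loading the test splits from the Hugging Face Hub.
# "ljvmiranda921/tlunified-ner" is the default in the eval script below;
# the Universal NER repo ID and config names are assumptions - check the Hub.
from datasets import load_dataset

tlunified_test = load_dataset("ljvmiranda921/tlunified-ner", split="test")

for config in ("tl_trg", "tl_ugnayan"):  # assumed Tagalog subsets
    uner_test = load_dataset("universalner/universal_ner", config, split="test")
    print(config, len(uner_test))
```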
The evaluation results for TLUnified-NER are shown in the table below (reported numbers are F1-scores):

| | PER | ORG | LOC | Overall |
|------------------|-------|-------|-------|---------|
| [tl_gliner_small](https://huggingface.co/ljvmiranda921/tl_gliner_small) | 86.76 | 78.72 | 86.78 | 84.83 |
| [tl_gliner_medium](https://huggingface.co/ljvmiranda921/tl_gliner_medium) | 87.46 | 79.71 | 86.75 | 85.40 |
| [tl_gliner_large](https://huggingface.co/ljvmiranda921/tl_gliner_large) | 86.75 | 80.20 | 86.76 | 85.72 |
| [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) | 91.95 | **84.84** | 88.92 | 88.03 |
| [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified) | **92.57** | 82.04 | **90.56** | **89.62** |
In general, GLiNER gets decent scores, but nothing beats regular finetuning of BERT-based models, as seen with [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) and [span_marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified).
Performance on Universal NER is generally worse (the highest F1-score is around 50%) compared to the results reported in the Universal NER paper (we finetuned RoBERTa as well).
One possible reason is that the annotation guidelines for TLUnified-NER are looser, since we consider some entities that Universal NER ignores.
At the same time, the text distributions of the two datasets are widely different.

Nevertheless, I'm still releasing these GLiNER models, as they are easily extensible to other entity types (and it's also nice to have a finetuned version of GLiNER for Tagalog!).
I haven't done any extensive hyperparameter tuning here, so it would be nice if someone could contribute better config parameters to bump up these scores.
## Citation

Please cite the following papers when using these models:

```
@misc{zaratiana2023gliner,
      title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
      author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
      year={2023},
      eprint={2311.08526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

```
@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.1",
    pages = "1--7",
}
```

If you're using the NER dataset:

```
@inproceedings{miranda-2023-developing,
    title = "Developing a Named Entity Recognition Dataset for {T}agalog",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.sealp-1.2",
    doi = "10.18653/v1/2023.sealp-1.2",
    pages = "13--20",
}
```
## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `finetune-gliner` | Finetune the GLiNER model using TLUnified-NER |
| `eval-gliner` | Evaluate trained GLiNER models on the TLUnified-NER and Universal NER test sets |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
@@ -0,0 +1,127 @@
from pathlib import Path
from typing import Dict, Iterable, Optional
from copy import deepcopy

import spacy
import torch
import typer
import srsly
from datasets import Dataset, load_dataset
from spacy.scorer import Scorer
from spacy.tokens import Doc, Span
from spacy.training import Example
from wasabi import msg


def main(
    # fmt: off
    output_path: Path = typer.Argument(..., help="Path to store the metrics in JSON format."),
    model_name: str = typer.Option("ljvmiranda921/tl_gliner_small", show_default=True, help="GliNER model to use for evaluation."),
    dataset: str = typer.Option("ljvmiranda921/tlunified-ner", help="Dataset to evaluate upon."),
    threshold: float = typer.Option(0.5, help="The threshold of the GliNER model (controls the degree to which a hit is considered an entity)."),
    dataset_config: Optional[str] = typer.Option(None, help="Configuration for loading the dataset."),
    chunk_size: int = typer.Option(250, help="Size of the text chunk to be processed at once."),
    label_map: str = typer.Option("person::PER,organization::ORG,location::LOC", help="Mapping between GliNER labels and the dataset's actual labels (separated by a double-colon '::')."),
    # fmt: on
):
    # e.g., "person::PER,organization::ORG" -> {"person": "PER", "organization": "ORG"}
    label_map: Dict[str, str] = process_labels(label_map)
    msg.text(f"Using label map: {label_map}")
|
||
msg.info("Processing test dataset") | ||
ds = load_dataset(dataset, dataset_config, split="test", trust_remote_code=True) | ||
ref_docs = convert_hf_to_spacy_docs(ds) | ||
|
||
msg.info("Loading GliNER model") | ||
nlp = spacy.blank("tl") | ||
nlp.add_pipe( | ||
"gliner_spacy", | ||
config={ | ||
"gliner_model": model_name, | ||
"chunk_size": chunk_size, | ||
"labels": list(label_map.keys()), | ||
"threshold": threshold, | ||
"style": "ent", | ||
"map_location": "cuda" if torch.cuda.is_available() else "cpu", | ||
}, | ||
) | ||
msg.text("Getting predictions") | ||
docs = deepcopy(ref_docs) | ||
pred_docs = list(nlp.pipe(docs)) | ||
pred_docs = [update_entity_labels(doc, label_map) for doc in pred_docs] | ||
|
||
# Get the scores | ||
examples = [ | ||
Example(reference=ref, predicted=pred) for ref, pred in zip(ref_docs, pred_docs) | ||
] | ||
scores = Scorer.score_spans(examples, "ents") | ||
|
||
msg.info(f"Results for {dataset} ({model_name})") | ||
msg.text(scores) | ||
srsly.write_json(output_path, data=scores, indent=2) | ||
msg.good(f"Saving outputs to {output_path}") | ||


def process_labels(label_map: str) -> Dict[str, str]:
    """Split 'gliner_label::dataset_label' pairs into a dict, e.g.
    'person::PER,location::LOC' -> {'person': 'PER', 'location': 'LOC'}."""
    return {m.split("::")[0]: m.split("::")[1] for m in label_map.split(",")}


def convert_hf_to_spacy_docs(dataset: "Dataset") -> Iterable[Doc]:
    """Convert a BIO-tagged Hugging Face dataset into spaCy Docs with gold entities."""
    nlp = spacy.blank("tl")
    examples = dataset.to_list()
    # Map tag indices to entity types, e.g. {1: "PER", 2: "PER", 3: "ORG", ...}
    entity_types = {
        idx: feature.split("-")[1]
        for idx, feature in enumerate(dataset.features["ner_tags"].feature.names)
        if feature != "O"  # don't include empty
    }
    msg.text(f"Using entity types: {entity_types}")

    docs = []
    for example in examples:
        tokens = example["tokens"]
        ner_tags = example["ner_tags"]
        doc = Doc(nlp.vocab, words=tokens)

        entities = []
        start_idx = None
        entity_type = None

        for idx, tag in enumerate(ner_tags):
            if tag in entity_types:
                if start_idx is None:
                    # Start of a new entity
                    start_idx = idx
                    entity_type = entity_types[tag]
                elif entity_type != entity_types.get(tag, None):
                    # The entity type changed: close the current span and open
                    # a new one. Note that B-/I- prefixes are collapsed above,
                    # so adjacent entities of the same type get merged.
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = idx
                    entity_type = entity_types[tag]
            else:
                # 'O' tag: close any open entity span
                if start_idx is not None:
                    entities.append(Span(doc, start_idx, idx, label=entity_type))
                    start_idx = None

        # Close an entity that runs to the end of the sequence
        if start_idx is not None:
            entities.append(Span(doc, start_idx, len(tokens), label=entity_type))
        doc.ents = entities
        docs.append(doc)

    return docs


def update_entity_labels(doc: Doc, label_mapping: Dict[str, str]) -> Doc:
    """Rename predicted labels (e.g. 'person' -> 'PER') on a fresh copy of the Doc."""
    updated_ents = []
    for ent in doc.ents:
        new_label = label_mapping.get(ent.label_.lower(), ent.label_)
        updated_span = Span(doc, ent.start, ent.end, label=new_label)
        updated_ents.append(updated_span)

    new_doc = Doc(
        doc.vocab,
        words=[token.text for token in doc],
        spaces=[token.whitespace_ for token in doc],
    )
    new_doc.ents = updated_ents
    return new_doc


if __name__ == "__main__":
    typer.run(main)
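
# Example invocation (hypothetical: the script's filename is not shown in this
# commit, and the output path is illustrative; option names follow Typer's
# kebab-case defaults):
#
#   python evaluate.py metrics/tl_gliner_small.json \
#       --model-name ljvmiranda921/tl_gliner_small \
#       --dataset ljvmiranda921/tlunified-ner \
#       --threshold 0.5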