diff --git a/README.md b/README.md
index 9ff4898..270c053 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,7 @@ reproduction of results, and guides on usage.
 > a citrus fruit native to the Philippines and used in traditional Filipino cuisine.
 ## 📰 News
+- [2024-08-01] Released new NER-only models based on [GLiNER](https://github.com/urchade/GLiNER)! You can find the models in [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87). The Span-Marker and calamanCy models are still superior, but GLiNER offers a lot of extensibility to unseen entity labels. You can find the training pipeline [here](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0-gliner).
 - [2023-12-05] We released the paper [**calamanCy: A Tagalog Natural Language Processing Toolkit**](https://aclanthology.org/2023.nlposs-1.1/), which will be presented at the NLP-OSS workshop at EMNLP 2023! Feel free to check out the [Tagalog NLP collection in HuggingFace](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87).
 - [2023-11-01] The named entity recognition (NER) dataset used to train the NER component of calamanCy now has a corresponding paper: [**Developing a Named Entity Recognition Dataset for Tagalog**](https://aclanthology.org/2023.sealp-1.2/)! It will be presented at the SEALP workshop at IJCNLP-AACL 2023! The dataset is also available [in HuggingFace](https://huggingface.co/datasets/ljvmiranda921/tlunified-ner).
diff --git a/experiments/refresh_evals_0924/project.yml b/experiments/refresh_evals_0924/project.yml
new file mode 100644
index 0000000..0accd19
--- /dev/null
+++ b/experiments/refresh_evals_0924/project.yml
@@ -0,0 +1 @@
+title: "Benchmarking new models on TLUnified-NER data"
diff --git a/experiments/refresh_evals_0924/requirements.txt b/experiments/refresh_evals_0924/requirements.txt
new file mode 100644
index 0000000..f7ddeff
--- /dev/null
+++ b/experiments/refresh_evals_0924/requirements.txt
@@ -0,0 +1,3 @@
+spacy
+spacy-llm==0.7.2
+datasets
\ No newline at end of file
diff --git a/models/v0.1.0-gliner/.gitignore b/models/v0.1.0-gliner/.gitignore
new file mode 100644
index 0000000..6f8aca1
--- /dev/null
+++ b/models/v0.1.0-gliner/.gitignore
@@ -0,0 +1 @@
+metrics
\ No newline at end of file
diff --git a/models/v0.1.0-gliner/README.md b/models/v0.1.0-gliner/README.md
new file mode 100644
index 0000000..13e12e6
--- /dev/null
+++ b/models/v0.1.0-gliner/README.md
@@ -0,0 +1,120 @@
+
+
+# 🪐 Weasel Project: Release v0.1.0-gliner
+
+This is a spaCy project that trains and evaluates the new v0.1.0-gliner models.
+[GLiNER](https://github.com/urchade/GLiNER) (Generalist and Lightweight Model for Named Entity Recognition) is a powerful model capable of identifying any entity type using a BERT-like encoder.
+In this project, we finetune the GLiNER model on the TLUnified-NER dataset.
+
+To replicate training, first install the required dependencies:
+
+```sh
+pip install -r requirements.txt
+```
+
+## Training
+
+To train a GLiNER model, run the `finetune-gliner` workflow and pass the model size:
+
+```sh
+# Available options: 'small', 'medium', 'large'
+python -m spacy project run finetune-gliner . --vars.size small
+```
+
+The models are currently based on the [v2.5 version of GLiNER](https://huggingface.co/collections/urchade/gliner-v25-66743e64ab975c859119d1eb).
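+
+Under the hood, the `finetune-gliner` workflow calls `train.py`, which converts TLUnified-NER's BIO-tagged examples into GLiNER's span-based training format. As a rough illustration (the sentence below is a made-up example), a single converted record looks like this:
+
+```python
+# One record in GLiNER's training format, as produced by format_to_gliner()
+# in train.py; "ner" spans are [start, end, label] with inclusive token indices.
+record = {
+    "tokenized_text": ["Nagpunta", "si", "Juan", "sa", "Maynila", "."],
+    "ner": [
+        [2, 2, "person"],    # "Juan"
+        [4, 4, "location"],  # "Maynila"
+    ],
+}
+```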
+
+## Evaluation
+
+To perform evals, run the `eval-gliner` workflow and pass the model size:
+
+```sh
+# Available options: 'small', 'medium', 'large'
+python -m spacy project run eval-gliner . --vars.size small
+```
+
+This will evaluate on TLUnified-NER's test set ([Miranda, 2023](https://aclanthology.org/2023.sealp-1.2.pdf)) and the Tagalog subsets of
+Universal NER ([Mayhew et al., 2024](https://aclanthology.org/2024.naacl-long.243/)).
+
+The evaluation results for TLUnified-NER are shown in the table below (reported numbers are F1-scores):
+
+| | PER | ORG | LOC | Overall |
+|------------------|-------|-------|-------|---------|
+| [tl_gliner_small](https://huggingface.co/ljvmiranda921/tl_gliner_small) | 86.76 | 78.72 | 86.78 | 84.83 |
+| [tl_gliner_medium](https://huggingface.co/ljvmiranda921/tl_gliner_medium) | 87.46 | 79.71 | 86.75 | 85.40 |
+| [tl_gliner_large](https://huggingface.co/ljvmiranda921/tl_gliner_large) | 86.75 | 80.20 | 86.76 | 85.72 |
+| [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) | 91.95 | **84.84** | 88.92 | 88.03 |
+| [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified) | **92.57** | 82.04 | **90.56** | **89.62** |
+
+In general, GLiNER gets decent scores, but nothing beats regular finetuning on BERT-based models, as seen in [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) and [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified).
+The performance on Universal NER is generally worse (the best F1-score is around 50%) compared to the results reported in the Universal NER paper, which likewise finetuned RoBERTa-based models.
+One possible reason is that the annotation guidelines for TLUnified-NER are looser: we include some entity mentions that Universal NER ignores.
+At the same time, the text distributions of the two datasets are widely different.
+
+Nevertheless, I'm still releasing these GLiNER models as they are very extensible to other entity types (and it's also nice to have a finetuned version of GLiNER for Tagalog!).
+I haven't done any extensive hyperparameter tuning here, so it would be nice if someone could contribute better config parameters to bump up these scores.
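+
+To illustrate the extensibility mentioned above, here is a minimal sketch that queries the model for entity types it never saw during finetuning, via the same `gliner_spacy` component used in `evaluate.py` (the label set and sentence below are illustrative):
+
+```python
+import spacy
+
+# Zero-shot NER: "date" was never part of the finetuning label set.
+nlp = spacy.blank("tl")
+nlp.add_pipe(
+    "gliner_spacy",
+    config={
+        "gliner_model": "ljvmiranda921/tl_gliner_small",
+        "labels": ["person", "organization", "location", "date"],
+        "threshold": 0.5,
+        "style": "ent",
+    },
+)
+doc = nlp("Pumunta si Juan sa Maynila noong Enero 2024.")
+print([(ent.text, ent.label_) for ent in doc.ents])
+```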
+ +## Citation + +Please cite the following papers when using these models: + +``` +@misc{zaratiana2023gliner, + title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer}, + author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois}, + year={2023}, + eprint={2311.08526}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +``` + +``` +@inproceedings{miranda-2023-calamancy, + title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit", + author = "Miranda, Lester James", + booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)", + month = dec, + year = "2023", + address = "Singapore, Singapore", + publisher = "Empirical Methods in Natural Language Processing", + url = "https://aclanthology.org/2023.nlposs-1.1", + pages = "1--7", +} +``` + +If you're using the NER dataset: + +``` +@inproceedings{miranda-2023-developing, + title = "Developing a Named Entity Recognition Dataset for {T}agalog", + author = "Miranda, Lester James", + booktitle = "Proceedings of the First Workshop in South East Asian Language Processing", + month = nov, + year = "2023", + address = "Nusa Dua, Bali, Indonesia", + publisher = "Association for Computational Linguistics", + url = "https://aclanthology.org/2023.sealp-1.2", + doi = "10.18653/v1/2023.sealp-1.2", + pages = "13--20", +} +``` + + +## 📋 project.yml + +The [`project.yml`](project.yml) defines the data assets required by the +project, as well as the available commands and workflows. For details, see the +[Weasel documentation](https://github.com/explosion/weasel). + +### ⏯ Commands + +The following commands are defined by the project. They +can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run). +Commands are only re-run if their inputs have changed. 
+
+| Command | Description |
+| --- | --- |
+| `finetune-gliner` | Finetune the GLiNER model using TLUnified-NER |
+| `eval-gliner` | Evaluate trained GLiNER models on the TLUnified-NER and Universal NER test sets |
+
\ No newline at end of file
diff --git a/models/v0.1.0-gliner/evaluate.py b/models/v0.1.0-gliner/evaluate.py
new file mode 100644
index 0000000..40ec7ed
--- /dev/null
+++ b/models/v0.1.0-gliner/evaluate.py
@@ -0,0 +1,127 @@
+from copy import deepcopy
+from pathlib import Path
+from typing import Dict, Iterable, Optional
+
+import spacy
+import srsly
+import torch
+import typer
+from datasets import Dataset, load_dataset
+from spacy.scorer import Scorer
+from spacy.tokens import Doc, Span
+from spacy.training import Example
+from wasabi import msg
+
+
+def main(
+    # fmt: off
+    output_path: Path = typer.Argument(..., help="Path to store the metrics in JSON format."),
+    model_name: str = typer.Option("ljvmiranda921/tl_gliner_small", show_default=True, help="GLiNER model to use for evaluation."),
+    dataset: str = typer.Option("ljvmiranda921/tlunified-ner", help="Dataset to evaluate upon."),
+    threshold: float = typer.Option(0.5, help="The threshold of the GLiNER model (controls the degree to which a hit is considered an entity)."),
+    dataset_config: Optional[str] = typer.Option(None, help="Configuration for loading the dataset."),
+    chunk_size: int = typer.Option(250, help="Size of the text chunk to be processed at once."),
+    label_map: str = typer.Option("person::PER,organization::ORG,location::LOC", help="Mapping between GLiNER labels and the dataset's actual labels (separated by a double-colon '::')."),
+    # fmt: on
+):
+    label_map: Dict[str, str] = process_labels(label_map)
+    msg.text(f"Using label map: {label_map}")
+
+    msg.info("Processing test dataset")
+    ds = load_dataset(dataset, dataset_config, split="test", trust_remote_code=True)
+    ref_docs = convert_hf_to_spacy_docs(ds)
+
+    msg.info("Loading GLiNER model")
+    nlp = spacy.blank("tl")
+    nlp.add_pipe(
+        "gliner_spacy",
+        config={
+            "gliner_model": model_name,
+            "chunk_size": chunk_size,
+            "labels": list(label_map.keys()),
+            "threshold": threshold,
+            "style": "ent",
+            "map_location": "cuda" if torch.cuda.is_available() else "cpu",
+        },
+    )
+    msg.text("Getting predictions")
+    # Run the GLiNER pipeline on copies of the reference docs so the
+    # gold-standard entities stay intact for scoring.
+    docs = deepcopy(ref_docs)
+    pred_docs = list(nlp.pipe(docs))
+    pred_docs = [update_entity_labels(doc, label_map) for doc in pred_docs]
+
+    # Get the scores
+    examples = [
+        Example(reference=ref, predicted=pred) for ref, pred in zip(ref_docs, pred_docs)
+    ]
+    scores = Scorer.score_spans(examples, "ents")
+
+    msg.info(f"Results for {dataset} ({model_name})")
+    msg.text(scores)
+    srsly.write_json(output_path, data=scores, indent=2)
+    msg.good(f"Saved outputs to {output_path}")
+
+
+def process_labels(label_map: str) -> Dict[str, str]:
+    return {m.split("::")[0]: m.split("::")[1] for m in label_map.split(",")}
+
+
+def convert_hf_to_spacy_docs(dataset: "Dataset") -> Iterable[Doc]:
+    nlp = spacy.blank("tl")
+    examples = dataset.to_list()
+    entity_types = {
+        idx: feature.split("-")[1]
+        for idx, feature in enumerate(dataset.features["ner_tags"].feature.names)
+        if feature != "O"  # skip the "O" (outside) tag
+    }
+    msg.text(f"Using entity types: {entity_types}")
+
+    docs = []
+    for example in examples:
+        tokens = example["tokens"]
+        ner_tags = example["ner_tags"]
+        doc = Doc(nlp.vocab, words=tokens)
+
+        entities = []
+        start_idx = None
+        entity_type = None
+
+        for idx, tag in enumerate(ner_tags):
+            if tag in entity_types:
+                if start_idx is None:
+                    start_idx = idx
+                    entity_type = entity_types[tag]
+                # Close the current entity and start a new one whenever the
+                # type changes or a B- tag opens a new entity of the same
+                # type (otherwise adjacent entities would get merged).
+                elif (
+                    entity_type != entity_types[tag]
+                    or dataset.features["ner_tags"].feature.names[tag].startswith("B-")
+                ):
+                    entities.append(Span(doc, start_idx, idx, label=entity_type))
+                    start_idx = idx
+                    entity_type = entity_types[tag]
+            else:
+                if start_idx is not None:
+                    entities.append(Span(doc, start_idx, idx, label=entity_type))
+                    start_idx = None
+
+        if start_idx is not None:
+            entities.append(Span(doc, start_idx, len(tokens), label=entity_type))
+        doc.ents = entities
+        docs.append(doc)
+
+    return docs
+
+
+def update_entity_labels(doc: Doc, label_mapping: Dict[str, str]) -> Doc:
+    # Span labels are immutable, so rebuild the Doc and re-attach the ents
+    # with the mapped (dataset-style) labels.
+    new_doc = Doc(
+        doc.vocab,
+        words=[token.text for token in doc],
+        spaces=[token.whitespace_ for token in doc],
+    )
+    updated_ents = []
+    for ent in doc.ents:
+        new_label = label_mapping.get(ent.label_.lower(), ent.label_)
+        updated_ents.append(Span(new_doc, ent.start, ent.end, label=new_label))
+    new_doc.ents = updated_ents
+    return new_doc
+
+
+if __name__ == "__main__":
+    typer.run(main)
diff --git a/models/v0.1.0-gliner/project.yml b/models/v0.1.0-gliner/project.yml
new file mode 100644
index 0000000..a7b60e1
--- /dev/null
+++ b/models/v0.1.0-gliner/project.yml
@@ -0,0 +1,160 @@
+title: "Release v0.1.0-gliner"
+description: |
+  This is a spaCy project that trains and evaluates the new v0.1.0-gliner models.
+  [GLiNER](https://github.com/urchade/GLiNER) (Generalist and Lightweight Model for Named Entity Recognition) is a powerful model capable of identifying any entity type using a BERT-like encoder.
+  In this project, we finetune the GLiNER model on the TLUnified-NER dataset.
+
+  To replicate training, first install the required dependencies:
+
+  ```sh
+  pip install -r requirements.txt
+  ```
+
+  ## Training
+
+  To train a GLiNER model, run the `finetune-gliner` workflow and pass the model size:
+
+  ```sh
+  # Available options: 'small', 'medium', 'large'
+  python -m spacy project run finetune-gliner . --vars.size small
+  ```
+
+  The models are currently based on the [v2.5 version of GLiNER](https://huggingface.co/collections/urchade/gliner-v25-66743e64ab975c859119d1eb).
+
+  ## Evaluation
+
+  To perform evals, run the `eval-gliner` workflow and pass the model size:
+
+  ```sh
+  # Available options: 'small', 'medium', 'large'
+  python -m spacy project run eval-gliner . --vars.size small
+  ```
+
+  This will evaluate on TLUnified-NER's test set ([Miranda, 2023](https://aclanthology.org/2023.sealp-1.2.pdf)) and the Tagalog subsets of
+  Universal NER ([Mayhew et al., 2024](https://aclanthology.org/2024.naacl-long.243/)).
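+
+  The `eval-gliner` workflow is a thin wrapper around `evaluate.py`, so you can also invoke the script directly, e.g., to score one model on a single Universal NER subset (the output path below is just an example):
+
+  ```sh
+  python evaluate.py metrics/tl_gliner_small_uner_tl_trg.json \
+    --model-name ljvmiranda921/tl_gliner_small \
+    --dataset universalner/universal_ner \
+    --dataset-config tl_trg \
+    --label-map person::PER,location::LOC,organization::ORG
+  ```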
+
+  The evaluation results for TLUnified-NER are shown in the table below (reported numbers are F1-scores):
+
+  | | PER | ORG | LOC | Overall |
+  |------------------|-------|-------|-------|---------|
+  | [tl_gliner_small](https://huggingface.co/ljvmiranda921/tl_gliner_small) | 86.76 | 78.72 | 86.78 | 84.83 |
+  | [tl_gliner_medium](https://huggingface.co/ljvmiranda921/tl_gliner_medium) | 87.46 | 79.71 | 86.75 | 85.40 |
+  | [tl_gliner_large](https://huggingface.co/ljvmiranda921/tl_gliner_large) | 86.75 | 80.20 | 86.76 | 85.72 |
+  | [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) | 91.95 | **84.84** | 88.92 | 88.03 |
+  | [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified) | **92.57** | 82.04 | **90.56** | **89.62** |
+
+  In general, GLiNER gets decent scores, but nothing beats regular finetuning on BERT-based models, as seen in [tl_calamancy_trf](https://huggingface.co/ljvmiranda921/tl_calamancy_trf) and [span-marker](https://huggingface.co/tomaarsen/span-marker-roberta-tagalog-base-tlunified).
+  The performance on Universal NER is generally worse (the best F1-score is around 50%) compared to the results reported in the Universal NER paper, which likewise finetuned RoBERTa-based models.
+  One possible reason is that the annotation guidelines for TLUnified-NER are looser: we include some entity mentions that Universal NER ignores.
+  At the same time, the text distributions of the two datasets are widely different.
+
+  Nevertheless, I'm still releasing these GLiNER models as they are very extensible to other entity types (and it's also nice to have a finetuned version of GLiNER for Tagalog!).
+  I haven't done any extensive hyperparameter tuning here, so it would be nice if someone could contribute better config parameters to bump up these scores.
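+
+  To illustrate the extensibility mentioned above, here is a minimal sketch that queries the model for entity types it never saw during finetuning, via the same `gliner_spacy` component used in `evaluate.py` (the label set and sentence below are illustrative):
+
+  ```python
+  import spacy
+
+  # Zero-shot NER: "date" was never part of the finetuning label set.
+  nlp = spacy.blank("tl")
+  nlp.add_pipe(
+      "gliner_spacy",
+      config={
+          "gliner_model": "ljvmiranda921/tl_gliner_small",
+          "labels": ["person", "organization", "location", "date"],
+          "threshold": 0.5,
+          "style": "ent",
+      },
+  )
+  doc = nlp("Pumunta si Juan sa Maynila noong Enero 2024.")
+  print([(ent.text, ent.label_) for ent in doc.ents])
+  ```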
+
+  ## Citation
+
+  Please cite the following papers when using these models:
+
+  ```
+  @misc{zaratiana2023gliner,
+      title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
+      author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
+      year={2023},
+      eprint={2311.08526},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+  }
+  ```
+
+  ```
+  @inproceedings{miranda-2023-calamancy,
+      title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
+      author = "Miranda, Lester James",
+      booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
+      month = dec,
+      year = "2023",
+      address = "Singapore, Singapore",
+      publisher = "Empirical Methods in Natural Language Processing",
+      url = "https://aclanthology.org/2023.nlposs-1.1",
+      pages = "1--7",
+  }
+  ```
+
+  If you're using the NER dataset:
+
+  ```
+  @inproceedings{miranda-2023-developing,
+      title = "Developing a Named Entity Recognition Dataset for {T}agalog",
+      author = "Miranda, Lester James",
+      booktitle = "Proceedings of the First Workshop in South East Asian Language Processing",
+      month = nov,
+      year = "2023",
+      address = "Nusa Dua, Bali, Indonesia",
+      publisher = "Association for Computational Linguistics",
+      url = "https://aclanthology.org/2023.sealp-1.2",
+      doi = "10.18653/v1/2023.sealp-1.2",
+      pages = "13--20",
+  }
+  ```
+
+vars:
+  version: 0.1.0
+  # Training
+  size: small
+  num_steps: 10000
+  batch_size: 8
+
+directories:
+  - "checkpoints"
+  - "models"
+  - "metrics"
+
+env:
+  HF_TOKEN: HF_TOKEN
+  TOKENIZERS_PARALLELISM: TOKENIZERS_PARALLELISM
+
+commands:
+  - name: "finetune-gliner"
+    help: "Finetune the GLiNER model using TLUnified-NER"
+    script:
+      - mkdir -p models/gliner_${vars.size}
+      - mkdir -p checkpoints/ckpt_gliner_${vars.size}
+      - >-
+        python train.py
+        gliner-community/gliner_${vars.size}-v2.5
+        models/gliner_${vars.size}
+        --checkpoint-dir checkpoints/ckpt_gliner_${vars.size}
+        --push-to-hub ljvmiranda921/tl_gliner_${vars.size}
+        --num-steps ${vars.num_steps}
+        --batch-size ${vars.batch_size}
+        --dataset ljvmiranda921/tlunified-ner
+    outputs:
+      - models/gliner_${vars.size}
+      - checkpoints/ckpt_gliner_${vars.size}
+
+  - name: "eval-gliner"
+    help: "Evaluate trained GLiNER models on the TLUnified-NER and Universal NER test sets"
+    script:
+      # Each run writes to a distinct metrics file; otherwise the two
+      # Universal NER configs would overwrite each other's results.
+      # TLUnified-NER
+      - >-
+        python evaluate.py
+        metrics/model___tl_gliner_${vars.size}_dataset___ljvmiranda921-tlunified-ner.json
+        --model-name ljvmiranda921/tl_gliner_${vars.size}
+        --dataset ljvmiranda921/tlunified-ner
+        --label-map person::PER,location::LOC,organization::ORG
+      # Universal NER (tl_trg)
+      - >-
+        python evaluate.py
+        metrics/model___tl_gliner_${vars.size}_dataset___universalner-universal_ner-tl_trg.json
+        --model-name ljvmiranda921/tl_gliner_${vars.size}
+        --dataset universalner/universal_ner
+        --dataset-config tl_trg
+        --label-map person::PER,location::LOC,organization::ORG
+      # Universal NER (tl_ugnayan)
+      - >-
+        python evaluate.py
+        metrics/model___tl_gliner_${vars.size}_dataset___universalner-universal_ner-tl_ugnayan.json
+        --model-name ljvmiranda921/tl_gliner_${vars.size}
+        --dataset universalner/universal_ner
+        --dataset-config tl_ugnayan
+        --label-map person::PER,location::LOC,organization::ORG
diff --git a/models/v0.1.0-gliner/requirements.txt b/models/v0.1.0-gliner/requirements.txt
new file mode 100644
index 0000000..603c273
--- /dev/null
+++ b/models/v0.1.0-gliner/requirements.txt
@@ -0,0 +1,7 @@
+gliner==0.2.8
+accelerate
+spacy
+gliner-spacy
+huggingface_hub
+datasets
+conllu
\ No newline at end of file
diff --git a/models/v0.1.0-gliner/train.py b/models/v0.1.0-gliner/train.py
new file mode 100644
index 0000000..8ae46a7
--- /dev/null
+++ b/models/v0.1.0-gliner/train.py
@@ -0,0 +1,134 @@
+import os
+from pathlib import Path
+from typing import Optional
+
+import torch
+import typer
+from datasets import load_dataset
+from gliner import GLiNER
+from gliner.data_processing.collator import DataCollator
+from gliner.training import Trainer, TrainingArguments
+from wasabi import msg
+
+
+def main(
+    # fmt: off
+    base_model: str = typer.Argument(..., help="Base model used for training."),
+    output_dir: Path = typer.Argument(..., help="Path to store the output model."),
+    checkpoint_dir: Path = typer.Option(Path("checkpoints"), help="Path for storing checkpoints."),
+    push_to_hub: Optional[str] = typer.Option(None, help="If set, will upload the trained model to the provided HuggingFace model namespace."),
+    num_steps: int = typer.Option(500, help="Number of steps to run training."),
+    batch_size: int = typer.Option(8, help="Batch size used for training."),
+    dataset: str = typer.Option("ljvmiranda921/tlunified-ner", help="Path to the TLUnified-NER dataset."),
+    # fmt: on
+):
+
+    if push_to_hub:
+        api_token = os.getenv("HF_TOKEN")
+        if not api_token:
+            msg.fail("HF_TOKEN is missing! Won't be able to --push-to-hub", exits=1)
+
+    # Load and format the dataset
+    msg.info(f"Formatting the {dataset} dataset")
+    ds = load_dataset(dataset)
+
+    def format_to_gliner(example):
+        # TLUnified-NER's tag ids: odd ids are B- tags, even ids are I- tags.
+        id2label = {
+            1: "person",
+            2: "person",
+            3: "organization",
+            4: "organization",
+            5: "location",
+            6: "location",
+        }
+
+        tokens = example["tokens"]
+        ner_tags = example["ner_tags"]
+
+        ner = []
+        current_entity = None
+        for idx, tag in enumerate(ner_tags):
+            if tag in id2label:
+                if current_entity is None:
+                    current_entity = [idx, idx, id2label[tag]]
+                # Extend the current entity only on its I- tag (even ids in
+                # this scheme); a B- tag starts a new entity even when the
+                # type stays the same.
+                elif id2label[tag] == current_entity[2] and tag % 2 == 0:
+                    current_entity[1] = idx
+                else:
+                    ner.append(current_entity)
+                    current_entity = [idx, idx, id2label[tag]]
+            else:
+                if current_entity is not None:
+                    ner.append(current_entity)
+                    current_entity = None
+
+        if current_entity is not None:
+            ner.append(current_entity)
+
+        return {"tokenized_text": tokens, "ner": ner}
+
+    train_dataset = [format_to_gliner(eg) for eg in ds["train"].to_list()]
+    eval_dataset = [format_to_gliner(eg) for eg in ds["validation"].to_list()]
+
+    # Perform training
+    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    model = GLiNER.from_pretrained(base_model)
+
+    data_collator = DataCollator(
+        model.config,
+        data_processor=model.data_processor,
+        prepare_labels=True,
+    )
+    model.to(device)
+
+    data_size = len(train_dataset)
+    num_batches = data_size // batch_size
+    num_epochs = max(1, num_steps // num_batches)
+
+    msg.info(
+        f"Finetuning the {base_model} model, saving checkpoints to {checkpoint_dir}"
+    )
+
+    training_args = TrainingArguments(
+        output_dir=str(checkpoint_dir),
+        learning_rate=5e-6,
+        weight_decay=0.01,
+        others_lr=1e-5,
+        others_weight_decay=0.01,
+        lr_scheduler_type="linear",  # alternative: "cosine"
+        warmup_ratio=0.1,
+        per_device_train_batch_size=batch_size,
+        per_device_eval_batch_size=batch_size,
+        num_train_epochs=num_epochs,
+        evaluation_strategy="steps",
+        save_steps=num_steps * 2,
+        save_total_limit=10,
+        dataloader_num_workers=0,
+        use_cpu=False,
+        report_to="none",
+        load_best_model_at_end=True,
+    )
+
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        tokenizer=model.data_processor.transformer_tokenizer,
+        data_collator=data_collator,
+    )
+
+    trainer.train()
+    trainer.save_model(str(output_dir))
+    msg.good(f"Best model saved to {output_dir}")
+
+    if push_to_hub:
+        msg.info("Pushing model to the HuggingFace Hub")
+        model = GLiNER.from_pretrained(str(output_dir))
+        model.push_to_hub(push_to_hub, token=api_token)
+
+
+if __name__ == "__main__":
+    typer.run(main)
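+
+# Example invocation, mirroring the `finetune-gliner` command in project.yml
+# (the project defaults there are --num-steps 10000 and --batch-size 8):
+#
+#   python train.py gliner-community/gliner_small-v2.5 models/gliner_small \
+#       --checkpoint-dir checkpoints/ckpt_gliner_small \
+#       --num-steps 10000 --batch-size 8 \
+#       --dataset ljvmiranda921/tlunified-ner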