Releases: ljvmiranda921/calamanCy
Release 0.2.0
Hi everyone, I am excited to release the v0.2.0 models for calamanCy!
[See full blogpost] This has been a long time coming as I've been preparing for this release since the end of 2023. I am excited to highlight three features for this version:
- Improved syntactic parsing from a larger treebank. Previously, we trained our dependency parser and morphological annotation models on a smaller treebank (~150 examples combined). Now, we have access to UD-NewsCrawl, an expert-annotated treebank with 100x more examples! This allows us to train better models for dependency parsing, POS tagging, and morphological annotation!
- Updated spaCy components. Thanks to the larger treebank, we now have the means to train a lemmatizer using spaCy's neural edit-tree lemmatization approach. This lemmatizer removes the need to handcraft rules and relies solely on statistical methods. In addition, the tl_calamancy_trf pipeline now uses the modern mDeBERTa-v3 pretrained model as its base.
- New NER evaluations. New datasets have been built since the last release of calamanCy and I've incorporated them here. These include Universal NER (Mayhew et al., 2024) and TF-NERD (Ramos et al., 2024). I've also removed the TRG and Ugnayan treebanks from the training set and treated them as test sets (as they should be).
You can find all the models in this Hugging Face collection:
Model | Pipelines | Description |
---|---|---|
tl_calamancy_md (214 MB) | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Using floret vectors (50k keys) |
tl_calamancy_lg (482 MB) | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Using fastText vectors (714k keys) |
tl_calamancy_trf (1.7 GB) | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mdeberta-v3-base as context vectors. |
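As a rough sketch of how the components listed in the table surface at the token level: once a pipeline is loaded, each spaCy token carries the lemma, POS tag, morphological features, and dependency label predicted by the trainable_lemmatizer, tagger, morphologizer, and parser. The helper names `token_rows` and `annotate` are hypothetical, and `nlp` is assumed to be an already loaded tl_calamancy pipeline; only the standard spaCy token attributes are relied on.

```python
def token_rows(doc):
    """Collect (text, lemma, POS, morphology, dependency) for each token.

    `doc` is assumed to be a spaCy Doc (or any iterable of token-like
    objects exposing the standard spaCy token attributes).
    """
    return [
        (token.text, token.lemma_, token.pos_, str(token.morph), token.dep_)
        for token in doc
    ]


def annotate(nlp, text):
    """Run a loaded pipeline on `text` and return one row per token.

    `nlp` is assumed to be a loaded tl_calamancy pipeline; this helper is
    illustrative and not part of calamanCy itself.
    """
    return token_rows(nlp(text))
```

Named entities from the ner component are available separately through `doc.ents`, where each span exposes `text` and `label_`.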
Full Changelog: 0.2.0...0.2.0
Release 0.1.0
Hi everyone, I'm happy to share the first minor release of calamanCy!
This release adds our first tl_calamancy models with varying sizes to suit any performance or accuracy requirements. The table below shows more information about these pipelines.
Models
The models are hosted on Hugging Face, but you can also use the calamancy library to download and access them.
Model | Pipelines | Description |
---|---|---|
tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys) |
tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k keys) |
tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
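Loading one of these models through the calamancy library might look like the sketch below. It assumes the library is installed (for example via `pip install calamancy`), that `calamancy.load` accepts a versioned name such as `tl_calamancy_md-0.1.0`, and that it downloads the model on first use; check the project README to confirm these details. The helpers `versioned_model_name` and `load_model` are hypothetical names introduced for illustration.

```python
MODEL_SIZES = ("md", "lg", "trf")  # sizes listed in the table above


def versioned_model_name(size: str, version: str = "0.1.0") -> str:
    """Build a name like 'tl_calamancy_md-0.1.0' (assumed naming scheme)."""
    if size not in MODEL_SIZES:
        raise ValueError(f"Unknown size {size!r}; expected one of {MODEL_SIZES}")
    return f"tl_calamancy_{size}-{version}"


def load_model(size: str = "md", version: str = "0.1.0"):
    """Load a calamanCy pipeline by size and version.

    The import is deferred so this module can be imported even when
    calamancy is not installed.
    """
    import calamancy  # assumption: provides load(name)

    return calamancy.load(versioned_model_name(size, version))
```

The md and lg models are the CPU-oriented choices, while trf is intended for GPU use, so `load_model("md")` is a reasonable default for a first try.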
Data sources
The table below shows the data sources used to train the pipelines. Note that the Ugnayan treebank is not licensed for commercial use while TLUnified is under GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application. I'd definitely want to gain access to commercial-friendly datasets (or develop my own). If you have any leads or just wanna help out, feel free to contact me by e-mail (ljvmiranda at gmail dot com)!
Source | Authors | License |
---|---|---|
TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
UD_Tagalog-TRG | Stephanie Samson, Daniel Zeman, and Mary Ann Tan | CC BY-SA 3.0 |
UD_Tagalog-Ugnayan | Angelina Aquino | CC BY-NC-SA 4.0 |
Next steps
Over the past few months, I found two annotators and ran a small annotation project to re-annotate TLUnified. I learned a lot from this process and I'll be sharing what I learned in a blog post very soon. In the medium term, I want to re-annotate TLUnified again with more fine-grained entity types and perhaps create our own treebank.
I am still in the process of testing these models, so expect a few more patch releases in the future. I'm quite ahead of my self-imposed August deadline, but I want to release early and often, so here it goes. If you find any issues, feel free to post them in the Issue tracker.
Full Changelog: https://github.com/ljvmiranda921/calamanCy/commits/0.1.0