Hi everyone, I am excited to release the v0.2.0 models for calamanCy!
[See full blogpost] This has been a long time coming, as I've been preparing for this release since the end of 2023. I'd like to highlight three features in this version:
- **Improved syntactic parsing from a larger treebank.** Previously, we trained our dependency parser and morphological annotation models on a small treebank (~150 examples combined). Now, we have access to UD-NewsCrawl, an expert-annotated treebank with 100x more examples! This allows us to train better syntactic parsing models for dependency parsing, POS tagging, and morphological annotation!
- **Updated spaCy components.** With the larger treebank, we now have the means to train a lemmatizer using spaCy's neural edit-tree lemmatization approach. This lemmatizer removes the need to handcraft rules, relying instead purely on statistical methods (see the usage sketch after this list). In addition, the `tl_calamancy_trf` pipeline now uses the modern mDeBERTa-v3 pretrained model as its base.
- **New NER evaluations.** New datasets have been released since the last calamanCy version, and I've incorporated them here. These include Universal NER (Mayhew et al., 2024) and TF-NERD (Ramos et al., 2024). I've also removed the TRG and Ugnayan treebanks from the training set and now treat them purely as test sets (as they should be).
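
To try the updated components, here's a minimal sketch. It assumes the `calamancy` package is installed and that the v0.2.0 models keep the usual `<name>-<version>` naming; adjust the model name if the collection lists it differently:

```python
import calamancy

# Load the CPU-optimized pipeline (downloaded from HuggingFace on first use).
nlp = calamancy.load("tl_calamancy_md-0.2.0")
doc = nlp("Pumunta si Juan sa Maynila kahapon.")

# Lemmas come from the new edit-tree lemmatizer; POS tags, morphological
# features, and dependencies come from components trained on UD-NewsCrawl.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph, token.dep_)

# Named entities from the NER component.
for ent in doc.ents:
    print(ent.text, ent.label_)
```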
You can find all the models in this HuggingFace collection:
| Model | Pipelines | Description |
|---|---|---|
| tl_calamancy_md (214 MB) | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model using floret vectors (50k keys) |
| tl_calamancy_lg (482 MB) | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model using fastText vectors (714k keys) |
| tl_calamancy_trf (1.7 GB) | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model using mdeberta-v3-base for context-sensitive vectors |
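
For the transformer pipeline, you'll want a GPU. Here's a hedged sketch, assuming `spacy[transformers]` is installed alongside calamanCy and the model name follows the same convention as above:

```python
import spacy
import calamancy

# Use the GPU if one is available; otherwise spaCy falls back to CPU.
spacy.prefer_gpu()

nlp = calamancy.load("tl_calamancy_trf-0.2.0")  # model name assumed, as above
doc = nlp("Nagbigay ng pahayag ang pangulo kahapon.")
print([(token.text, token.lemma_) for token in doc])
```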
Full Changelog: 0.2.0...0.2.0