Releases · ljvmiranda921/calamanCy

Release 0.2.0

Model	Pipelines	Description
tl_calamancy_md (214 MB)	tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner	CPU-optimized Tagalog NLP model. Uses floret vectors (50k keys).
tl_calamancy_lg (482 MB)	tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner	CPU-optimized large Tagalog NLP model. Uses fastText vectors (714k keys).
tl_calamancy_trf (1.7 GB)	transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner	GPU-optimized transformer Tagalog NLP model. Uses mdeberta-v3-base as context vectors.


Release 0.1.0

Hi everyone, I'm happy to share the first minor release of calamanCy!

This release adds our first tl_calamancy models, in several sizes to suit different performance and accuracy requirements. The table below shows more information about these pipelines.

Models

The models are hosted on Hugging Face, and you can also use the calamancy library to download and load them.

Model	Pipelines	Description
tl_calamancy_md (73.7 MB)	tok2vec, tagger, morphologizer, parser, ner	CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys).
tl_calamancy_lg (431.9 MB)	tok2vec, tagger, morphologizer, parser, ner	CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys).
tl_calamancy_trf (775.6 MB)	transformer, tagger, parser, ner	GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors.
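
As a rough sketch of what downloading and loading a model through the library might look like: the snippet below assumes the package is installed with `pip install calamancy`, that `calamancy.load()` accepts a versioned name of the form `<model>-<version>` (for example `tl_calamancy_md-0.1.0`), and that it returns a regular spaCy pipeline. The helper names `model_id`, `load_pipeline`, and `demo` are hypothetical and only there for illustration.

```python
# Model names taken from the 0.1.0 table above; descriptions are abbreviated.
MODELS = {
    "tl_calamancy_md": "CPU-optimized, floret vectors",
    "tl_calamancy_lg": "CPU-optimized, fastText vectors",
    "tl_calamancy_trf": "GPU-optimized, roberta-tagalog-base",
}


def model_id(name: str, version: str = "0.1.0") -> str:
    """Build a versioned identifier such as 'tl_calamancy_md-0.1.0'.

    The '<name>-<version>' format is an assumption about what
    calamancy.load() expects.
    """
    if name not in MODELS:
        raise ValueError(f"Unknown model {name!r}; expected one of {sorted(MODELS)}")
    return f"{name}-{version}"


def load_pipeline(name: str, version: str = "0.1.0"):
    """Download (if needed) and load a pipeline via the calamancy library."""
    # Imported lazily so the helpers above work without calamancy installed.
    import calamancy

    return calamancy.load(model_id(name, version))


def demo() -> None:
    """Load the medium model and print its components and entities."""
    nlp = load_pipeline("tl_calamancy_md")
    # pipe_names is a standard spaCy attribute listing the pipeline components.
    print(nlp.pipe_names)
    doc = nlp("Ako si Juan de la Cruz")
    for ent in doc.ents:
        print(ent.text, ent.label_)
```

Calling `demo()` needs network access the first time, since the model has to be downloaded. The md and lg models should run fine on CPU, while the trf model is the one to pick when a GPU is available.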

Data sources

The table below shows the data sources used to train the pipelines. Note that the Ugnayan treebank is not licensed for commercial use, while TLUnified is under the GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application. I'd love to gain access to commercial-friendly datasets (or develop my own). If you have any leads or just want to help out, feel free to contact me by e-mail (ljvmiranda at gmail dot com)!

Source	Authors	License
TLUnified Dataset	Jan Christian Blaise Cruz and Charibeth Cheng	GNU GPL 3.0
UD_Tagalog-TRG	Stephanie Samson, Daniel Zeman, and Mary Ann Tan	CC BY-SA 3.0
UD_Tagalog-Ugnayan	Angelina Aquino	CC BY-NC-SA 4.0
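
If you want to keep track of these licenses in your own application, one option is to record the table as data and flag the non-commercial sources. The sketch below is only an illustration: the `DataSource` class and the string check for the Creative Commons "NC" element are assumptions of mine, not something the calamancy library provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    license: str

    @property
    def non_commercial(self) -> bool:
        # Creative Commons licenses mark the non-commercial clause with "NC",
        # e.g. "CC BY-NC-SA 4.0". This is a plain string check, nothing more.
        return "NC" in self.license.split("-")


# Sources and licenses as listed in the table above.
SOURCES = [
    DataSource("TLUnified Dataset", "GNU GPL 3.0"),
    DataSource("UD_Tagalog-TRG", "CC BY-SA 3.0"),
    DataSource("UD_Tagalog-Ugnayan", "CC BY-NC-SA 4.0"),
]

# Names of the sources that restrict commercial use.
restricted = [source.name for source in SOURCES if source.non_commercial]
print(restricted)
```

This only catches the non-commercial clause. The GPL and ShareAlike terms on the other sources carry their own obligations, which a check like this does not cover.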

Next steps

Over the past few months, I found two annotators and ran a small annotation project to re-annotate TLUnified. I learned a lot from this process, and I'll share what I learned in a blog post very soon. In the medium term, I want to re-annotate TLUnified again with more fine-grained entity types and perhaps create our own treebank.

I am still testing these models, so expect a few more patch releases in the future. I'm well ahead of my self-imposed August deadline, but I want to release early and often, so here it goes. If you find any issues, feel free to post them in the issue tracker.

Full Changelog: https://github.com/ljvmiranda921/calamanCy/commits/0.1.0
