[WIP] Update
ljvmiranda921 committed Jan 4, 2025
1 parent 21e43a9 commit b1f5003
Showing 1 changed file with 167 additions and 1 deletion: models/v0.2.0/project.yml
@@ -14,12 +14,178 @@ description: |
test sets (as intended).
- **Better evaluations**: Aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we have also included Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as our test set for evaluating the NER component.
- **Improved base model for tl_calamancy_trf**: Based on internal evaluations, we are now using [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as our source of context-sensitive vectors for tl_calamancy_trf.
- **No more pretraining**: We found that pretraining offers only marginal performance gains (0-1%) relative to the significant effort and time it requires. Hence, to make the whole pipeline easier to train, we removed it from the calamanCy recipe.
The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.
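As a quick reference, here is a minimal loading sketch (assuming the `calamancy` Python package still exposes a `load()` helper as in previous releases; the exact version suffix below is illustrative):
```python
import calamancy

# The version suffix is illustrative; check the HuggingFace collection for the
# model names that are actually published.
nlp = calamancy.load("tl_calamancy_md-0.2.0")

doc = nlp("Si Juan ay pumunta sa Maynila.")
print([(ent.text, ent.label_) for ent in doc.ents])
```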
## Set-up
You can use this project to replicate the pipelines shipped with this release.
First, install the required dependencies:
```
pip install -r requirements.txt
```
Then run the set-up commands:
```
python -m spacy project assets
python -m spacy project run setup
```
This step downloads all assets and prepares the datasets and binaries for
training. You can then train a pipeline by passing its name to the spaCy
project command. For example, to train `tl_calamancy_md`, run the
corresponding workflow like so:
```
python -m spacy project run tl-calamancy-md
```
## Model information
The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).
| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf () | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mDeBERTa-v3 (base) for context-sensitive vectors. |
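As a rough usage sketch (assuming one of the packaged pipelines above is installed in your environment), the pipelines map directly onto spaCy components:
```python
import spacy

# Assumes the packaged pipeline (e.g., tl_calamancy_md) is already installed
# in the current environment; the name mirrors the table above.
nlp = spacy.load("tl_calamancy_md")
print(nlp.pipe_names)  # e.g., tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner

doc = nlp("Umuwi si Maria sa Lungsod ng Quezon kahapon.")
for token in doc:
    print(token.text, token.pos_, str(token.morph), token.dep_, token.lemma_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```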
vars:
  # Versioning
  version: 0.2.0
  # Training
  lang: "tl"
  gpu_id: 0
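  # Set gpu_id to -1 to run training on CPU instead of GPU.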

directories:
  - "assets"
  - "configs"
  - "corpus"
  - "packages"
  - "scripts"
  - "training"
  - "vectors"

assets:
  - dest: assets/corpus.tar.gz
    description: "Annotated TLUnified corpora in spaCy format with train, dev, and test splits."
    url: "https://storage.googleapis.com/ljvmiranda/calamanCy/tl_tlunified_gold/v${vars.dataset_version}/corpus.tar.gz"
  - dest: assets/tl_newscrawl-ud-train.conllu
    description: "Train dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-train.conllu
  - dest: assets/tl_newscrawl-ud-dev.conllu
    description: "Dev dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-dev.conllu
  - dest: assets/tl_newscrawl-ud-test.conllu
    description: "Test dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-test.conllu
  - dest: assets/tl_trg-ud-test.conllu
    description: "Test dataset for TRG"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-TRG/refs/heads/master/tl_trg-ud-test.conllu
  - dest: assets/tl_ugnayan-ud-test.conllu
    description: "Test dataset for Ugnayan"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-Ugnayan/refs/heads/master/tl_ugnayan-ud-test.conllu
  - dest: "assets/fasttext.tl.gz"
    description: "Tagalog fastText vectors provided from the fastText website (trained from CommonCrawl and Wikipedia)."
    url: "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.tl.300.vec.gz"
  - dest: "assets/floret"
    description: "Floret repository for training floret and fastText models."
    git:
      repo: "https://github.com/explosion/floret"
      branch: "main"
      path: ""

workflows:
  setup:
    - "setup-finetuning-data"
    - "setup-fasttext-vectors"
    - "build-floret"

commands:
  - name: "setup-finetuning-data"
    help: "Prepare the Tagalog corpora used for training various spaCy components"
    script:
      # ner: Extract Tagalog corpora
      - mkdir -p corpus/ner
      - "tar -xzvf assets/corpus.tar.gz -C corpus/ner"
      # parser, tagger, morph: Convert treebank into spaCy format
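      # Note: --merge-subtokens merges CoNLL-U subtokens, and --n-sents 1 (used
      # for the test-only conversions below) keeps one sentence per Doc.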
      - mkdir -p corpus/treebank
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-train.conllu corpus/treebank
        --converter conllu
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-dev.conllu corpus/treebank
        --converter conllu
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_ugnayan-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_trg-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
    deps:
      - assets/corpus.tar.gz
      - assets/tl_newscrawl-ud-train.conllu
      - assets/tl_newscrawl-ud-dev.conllu
      - assets/tl_newscrawl-ud-test.conllu
      - assets/tl_ugnayan-ud-test.conllu
      - assets/tl_trg-ud-test.conllu
    outputs:
      - corpus/ner/train.spacy
      - corpus/ner/dev.spacy
      - corpus/ner/test.spacy
      - corpus/treebank/tl_newscrawl-ud-train.spacy
      - corpus/treebank/tl_newscrawl-ud-dev.spacy
      - corpus/treebank/tl_newscrawl-ud-test.spacy
      - corpus/treebank/tl_ugnayan-ud-test.spacy
      - corpus/treebank/tl_trg-ud-test.spacy

- name: "setup-fasttext-vectors"
help: "Make fastText vectors spaCy compatible"
script:
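      # Decompress the fastText .vec.gz file, then convert it into a
      # spaCy-compatible vectors directory with `spacy init vectors`.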
      - gzip -d -f assets/fasttext.tl.gz
      - mkdir -p vectors/fasttext-tl
      - >-
        python -m spacy init vectors
        tl assets/fasttext.tl vectors/fasttext-tl
    deps:
      - assets/fasttext.tl.gz
    outputs:
      - vectors/fasttext-tl

- name: "build-floret"
help: "Build floret binary for training fastText / floret vectors"
script:
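      # Compile the floret binary from the cloned repository; the executable is
      # what later trains the floret / fastText vectors.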
      - make -C assets/floret
      - chmod +x assets/floret/floret
    deps:
      - assets/floret
    outputs:
      - assets/floret/floret
