Setup initial spaCy project for v0.2.0
ljvmiranda921 committed Jan 5, 2025
1 parent 5dddd9c commit bd55668
Showing 9 changed files with 1,262 additions and 0 deletions.
118 changes: 118 additions & 0 deletions models/v0.2.0/README.md
@@ -0,0 +1,118 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.2.0

This is a spaCy project that trains the v0.2.0 models for calamanCy.
Here are some of the major changes in this release:

- **Included trainable lemmatizer in the pipeline**: instead of a rules-based
lemmatizer, we are now using the [neural edit-tree
lemmatizer](https://explosion.ai/blog/edit-tree-lemmatizer).
- **Trained on UD-NewsCrawl**: this is a major update, as we are now training
our parser, tagger, and morphologizer components on the larger
[UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)
  treebank. Our training data has now grown from 150+ sentences to 15,000! From
this point forward, we will be using the UD-TRG and UD-Ugnayan treebanks as
test sets (as intended).
- **Better evaluations**: Aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we have also included Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as our test set for evaluating the NER component.
- **Improved base model for tl_calamancy_trf**: Based on internal evaluations, we are now using [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as our source of context-sensitive vectors for tl_calamancy_trf.
- **Simpler pipelines, no more pretraining**: We found that pretraining offers only marginal performance gains (0-1%) relative to the effort and time it requires. Hence, to make the whole pipeline easier to train, we removed it from the calamanCy recipe.

The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.
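
Once a released model is installed, using it is standard spaCy. Below is a minimal usage sketch: the model name, example sentence, and printed attributes are illustrative only, and assume one of the packaged pipelines (e.g. `tl_calamancy_md`) is available in your environment.

```python
import spacy

# Illustrative only: assumes the packaged model has been installed in the environment.
nlp = spacy.load("tl_calamancy_md")

doc = nlp("Pumunta si Juan sa Maynila kahapon.")
for token in doc:
    # Lemmas come from the new edit-tree lemmatizer; POS tags and morphological
    # features from the tagger/morphologizer.
    print(token.text, token.lemma_, token.pos_, token.morph)

# Entities predicted by the ner component.
print([(ent.text, ent.label_) for ent in doc.ents])
```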

## Set-up

You can use this project to replicate the pipelines shipped with this calamanCy release.
First, you need to install the required dependencies:

```
pip install -r requirements.txt
```

Then run the set-up commands:

```
python -m spacy project assets
python -m spacy project run setup
```

This step downloads all assets and prepares all the datasets and binaries for
training use. For example, to train `tl_calamancy_md`, run the following command:

```
bash scripts/tl_calamancy_md.sh
```


## Model information

The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).


| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf () | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mDeBERTa-v3 (base) for context-sensitive vectors. |
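
As a hedged sketch of how to inspect that pipeline metadata programmatically (the model name below is a placeholder for whichever pipeline you have installed):

```python
import spacy

nlp = spacy.load("tl_calamancy_md")  # placeholder: any installed calamanCy pipeline

print(nlp.pipe_names)               # component order, e.g. ['tok2vec', 'trainable_lemmatizer', ...]
print(nlp.meta["version"])          # packaged model version
print(nlp.meta.get("sources", []))  # training data sources, if recorded in the metadata
```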


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `setup-finetuning-data` | Prepare the Tagalog corpora used for training various spaCy components |
| `setup-fasttext-vectors` | Make fastText vectors spaCy compatible |
| `build-floret` | Build floret binary for training fastText / floret vectors |
| `train-vectors-md` | Train medium-sized word vectors (200 dims, 200k keys) using the floret binary. |
| `train-parser` | Train a trainable_lemmatizer, parser, tagger, and morphologizer using the Universal Dependencies treebanks |
| `train-parser-trf` | Train a trainable_lemmatizer, parser, tagger, and morphologizer (transformer variant) using the Universal Dependencies treebanks |
| `train-ner` | Train the ner component |
| `train-ner-trf` | Train the ner component (transformer variant) |
| `assemble` | Assemble pipelines to create a single spaCy pipeline |
| `assemble-trf` | Assemble pipelines to create a single spaCy pipeline (transformer variant) |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `setup` | `setup-finetuning-data` &rarr; `setup-fasttext-vectors` &rarr; `build-floret` &rarr; `train-vectors-md` |
| `tl-calamancy` | `train-parser` &rarr; `train-ner` &rarr; `assemble` |
| `tl-calamancy-trf` | `train-parser-trf` &rarr; `train-ner-trf` &rarr; `assemble` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/tlunified_raw_text.txt` | URL | Pre-converted raw text from TLUnified in JSONL format (1.1 GB). |
| `assets/corpus.tar.gz` | URL | Annotated TLUnified corpora in spaCy format with train, dev, and test splits. |
| `assets/tl_newscrawl-ud-train.conllu` | URL | Training split of the UD-NewsCrawl treebank |
| `assets/tl_newscrawl-ud-dev.conllu` | URL | Development split of the UD-NewsCrawl treebank |
| `assets/tl_newscrawl-ud-test.conllu` | URL | Test split of the UD-NewsCrawl treebank |
| `assets/tl_trg-ud-test.conllu` | URL | Test split of the UD-TRG treebank |
| `assets/tl_ugnayan-ud-test.conllu` | URL | Test split of the UD-Ugnayan treebank |
| `assets/fasttext.tl.gz` | URL | Tagalog fastText vectors provided on the fastText website (trained on CommonCrawl and Wikipedia). |
| `assets/floret` | Git | Floret repository for training floret and fastText models. |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
38 changes: 38 additions & 0 deletions models/v0.2.0/configs/assemble.cfg
@@ -0,0 +1,38 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["tok2vec", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[initialize]
vectors = ${paths.parser_model}

[components]

[components.tok2vec]
source = ${paths.parser_model}
component = "tok2vec"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
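
The `paths.parser_model` and `paths.ner_model` placeholders above are filled in at assembly time. A minimal sketch of how they might be resolved with `spacy.util.load_config` follows; the directory paths are hypothetical, and the actual assembly in this project presumably goes through the `assemble` command defined in `project.yml`.

```python
from spacy import util

# Hypothetical paths; the real values are supplied by the project's assemble command.
config = util.load_config(
    "configs/assemble.cfg",
    overrides={
        "paths.parser_model": "training/parser/model-best",
        "paths.ner_model": "training/ner/model-best",
    },
    interpolate=True,
)
print(config["nlp"]["pipeline"])  # tok2vec, trainable_lemmatizer, tagger, morphologizer, parser, ner
```
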
35 changes: 35 additions & 0 deletions models/v0.2.0/configs/assemble_trf.cfg
@@ -0,0 +1,35 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["transformer", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.transformer]
source = ${paths.parser_model}
component = "transformer"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
145 changes: 145 additions & 0 deletions models/v0.2.0/configs/ner.cfg
@@ -0,0 +1,145 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "tl"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
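
For reference, a config like the one above can also be run directly from Python rather than through the project commands. The following is a hedged sketch: the output directory, corpus paths, and vectors path are placeholders, and in this repo the training is presumably driven by `weasel run train-ner` with values supplied from `project.yml`.

```python
from spacy.cli.train import train

# Hypothetical invocation; paths are placeholders for the project's actual assets.
train(
    "configs/ner.cfg",
    output_path="training/ner",
    overrides={
        "paths.train": "corpus/ner-train.spacy",
        "paths.dev": "corpus/ner-dev.spacy",
        "paths.vectors": "vectors/tl-md",  # static vectors for MultiHashEmbed
    },
    use_gpu=-1,  # CPU training; set to a CUDA device id for GPU
)
```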
