Setup initial spaCy project for v0.2.0
ljvmiranda921 committed Jan 5, 2025
1 parent 5dddd9c commit bd55668
Showing 9 changed files with 1,262 additions and 0 deletions.
118 changes: 118 additions & 0 deletions models/v0.2.0/README.md
@@ -0,0 +1,118 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.2.0

This is a spaCy project that trains the v0.2.0 models for calamanCy.
Here are some of the major changes in this release:

- **Included trainable lemmatizer in the pipeline**: instead of a rules-based
lemmatizer, we are now using the [neural edit-tree
lemmatizer](https://explosion.ai/blog/edit-tree-lemmatizer).
- **Trained on UD-NewsCrawl**: this is a major update, as we are now training
our parser, tagger, and morphologizer components on the larger
[UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)
  treebank. Our training data has now grown from 150+ sentences to 15,000! From
this point forward, we will be using the UD-TRG and UD-Ugnayan treebanks as
test sets (as intended).
- **Better evaluations**: Aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we have also included Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as our test set for evaluating the NER component.
- **Improved base model for tl_calamancy_trf**: Based on internal evaluations, we are now using [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as our source of context-sensitive vectors for tl_calamancy_trf.
- **Simpler pipelines, no more pretraining**: We found that pretraining offers only marginal performance gains (0-1%) relative to the effort and time it requires. Hence, to make the whole pipeline easier to train, we removed it from the calamanCy recipe.

The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.
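
Once a released model is installed, using it is standard spaCy. Below is a minimal usage sketch: the model name, example sentence, and printed attributes are illustrative only, and assume one of the packaged pipelines (e.g. `tl_calamancy_md`) is available in your environment.

```python
import spacy

# Illustrative only: assumes the packaged model has been installed in the environment.
nlp = spacy.load("tl_calamancy_md")

doc = nlp("Pumunta si Juan sa Maynila kahapon.")
for token in doc:
    # Lemmas come from the new edit-tree lemmatizer; POS tags and morphological
    # features from the tagger/morphologizer.
    print(token.text, token.lemma_, token.pos_, token.morph)

# Entities predicted by the ner component.
print([(ent.text, ent.label_) for ent in doc.ents])
```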

## Set-up

You can use this project to replicate the pipelines shipped with this calamanCy release.
First, you need to install the required dependencies:

```
pip install -r requirements.txt
```

Then run the set-up commands:

```
python -m spacy project assets
python -m spacy project run setup
```

This step downloads all assets and prepares all the datasets and binaries for
training use. For example, to train `tl_calamancy_md`, run the following command:

```
bash scripts/tl_calamancy_md.sh
```


## Model information

The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).


| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf () | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mDeBERTa-v3 (base) for context-sensitive vectors. |
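
As a hedged sketch of how to inspect that pipeline metadata programmatically (the model name below is a placeholder for whichever pipeline you have installed):

```python
import spacy

nlp = spacy.load("tl_calamancy_md")  # placeholder: any installed calamanCy pipeline

print(nlp.pipe_names)               # component order, e.g. ['tok2vec', 'trainable_lemmatizer', ...]
print(nlp.meta["version"])          # packaged model version
print(nlp.meta.get("sources", []))  # training data sources, if recorded in the metadata
```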


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `setup-finetuning-data` | Prepare the Tagalog corpora used for training various spaCy components |
| `setup-fasttext-vectors` | Make fastText vectors spaCy compatible |
| `build-floret` | Build floret binary for training fastText / floret vectors |
| `train-vectors-md` | Train medium-sized word vectors (200 dims, 200k keys) using the floret binary. |
| `train-parser` | Train a trainable_lemmatizer, parser, tagger, and morphologizer using the Universal Dependencies treebanks |
| `train-parser-trf` | Train a trainable_lemmatizer, parser, tagger, and morphologizer (transformer variant) using the Universal Dependencies treebanks |
| `train-ner` | Train the ner component |
| `train-ner-trf` | Train the ner component (transformer variant) |
| `assemble` | Assemble pipelines to create a single spaCy pipeline |
| `assemble-trf` | Assemble pipelines to create a single spaCy pipeline (transformer variant) |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `setup` | `setup-finetuning-data` &rarr; `setup-fasttext-vectors` &rarr; `build-floret` &rarr; `train-vectors-md` |
| `tl-calamancy` | `train-parser` &rarr; `train-ner` &rarr; `assemble` |
| `tl-calamancy-trf` | `train-parser-trf` &rarr; `train-ner-trf` &rarr; `assemble` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/tlunified_raw_text.txt` | URL | Pre-converted raw text from TLUnified in JSONL format (1.1 GB). |
| `assets/corpus.tar.gz` | URL | Annotated TLUnified corpora in spaCy format with train, dev, and test splits. |
| `assets/tl_newscrawl-ud-train.conllu` | URL | Training split of the UD-NewsCrawl treebank |
| `assets/tl_newscrawl-ud-dev.conllu` | URL | Development split of the UD-NewsCrawl treebank |
| `assets/tl_newscrawl-ud-test.conllu` | URL | Test split of the UD-NewsCrawl treebank |
| `assets/tl_trg-ud-test.conllu` | URL | Test split of the UD-TRG treebank |
| `assets/tl_ugnayan-ud-test.conllu` | URL | Test split of the UD-Ugnayan treebank |
| `assets/fasttext.tl.gz` | URL | Tagalog fastText vectors provided on the fastText website (trained on CommonCrawl and Wikipedia). |
| `assets/floret` | Git | Floret repository for training floret and fastText models. |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
38 changes: 38 additions & 0 deletions models/v0.2.0/configs/assemble.cfg
@@ -0,0 +1,38 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["tok2vec", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[initialize]
vectors = ${paths.parser_model}

[components]

[components.tok2vec]
source = ${paths.parser_model}
component = "tok2vec"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
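
The `paths.parser_model` and `paths.ner_model` placeholders above are filled in at assembly time. A minimal sketch of how they might be resolved with `spacy.util.load_config` follows; the directory paths are hypothetical, and the actual assembly in this project presumably goes through the `assemble` command defined in `project.yml`.

```python
from spacy import util

# Hypothetical paths; the real values are supplied by the project's assemble command.
config = util.load_config(
    "configs/assemble.cfg",
    overrides={
        "paths.parser_model": "training/parser/model-best",
        "paths.ner_model": "training/ner/model-best",
    },
    interpolate=True,
)
print(config["nlp"]["pipeline"])  # tok2vec, trainable_lemmatizer, tagger, morphologizer, parser, ner
```
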
35 changes: 35 additions & 0 deletions models/v0.2.0/configs/assemble_trf.cfg
@@ -0,0 +1,35 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["transformer", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.transformer]
source = ${paths.parser_model}
component = "transformer"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
145 changes: 145 additions & 0 deletions models/v0.2.0/configs/ner.cfg
@@ -0,0 +1,145 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "tl"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
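
For reference, a config like the one above can also be run directly from Python rather than through the project commands. The following is a hedged sketch: the output directory, corpus paths, and vectors path are placeholders, and in this repo the training is presumably driven by `weasel run train-ner` with values supplied from `project.yml`.

```python
from spacy.cli.train import train

# Hypothetical invocation; paths are placeholders for the project's actual assets.
train(
    "configs/ner.cfg",
    output_path="training/ner",
    overrides={
        "paths.train": "corpus/ner-train.spacy",
        "paths.dev": "corpus/ner-dev.spacy",
        "paths.vectors": "vectors/tl-md",  # static vectors for MultiHashEmbed
    },
    use_gpu=-1,  # CPU training; set to a CUDA device id for GPU
)
```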
