Commit bd55668 (1 parent: 5dddd9c)
Setup initial spaCy project for v0.2.0

Showing 9 changed files with 1,262 additions and 0 deletions.

<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.2.0

This is a spaCy project that trains the v0.2.0 models for calamanCy.
Here are some of the major changes in this release:

- **Included trainable lemmatizer in the pipeline**: instead of a rules-based
  lemmatizer, we are now using the [neural edit-tree
  lemmatizer](https://explosion.ai/blog/edit-tree-lemmatizer).
- **Trained on UD-NewsCrawl**: this is a major update, as we are now training
  our parser, tagger, and morphologizer components on the larger
  [UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)
  treebank. Our training dataset has now increased from 150+ to 15,000! From
  this point forward, we will be using the UD-TRG and UD-Ugnayan treebanks as
  test sets (as intended).
- **Better evaluations**: aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we have also included Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as our test set for evaluating the NER component.
- **Improved base model for tl_calamancy_trf**: based on internal evaluations, we are now using [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as our source of context-sensitive vectors for tl_calamancy_trf.
- **Simpler pipelines, no more pretraining**: we found that pretraining doesn't really offer huge performance gains (0-1%) given the effort and time needed to do it. Hence, for ease of training the whole pipeline, we removed it from the calamanCy recipe.

The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.

## Set-up

You can use this project to replicate the pipelines shipped with this release.
First, you need to install the required dependencies:

```
pip install -r requirements.txt
```

Then run the set-up commands:

```
python -m spacy project assets
python -m spacy project run setup
```

This step downloads all assets and prepares all the datasets and binaries for
training use. For example, if you want to train `tl_calamancy_md`, run the following command:

```
bash scripts/tl_calamancy_md.sh
```
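
If you want to sanity-check the prepared data before kicking off a training run, the converted corpora can be opened as spaCy `DocBin` files. The sketch below is illustrative only: the `corpus/train.spacy` path and the assumption that the setup step writes `.spacy` files are placeholders, not taken from this project.

```python
import spacy
from spacy.tokens import DocBin

# Illustrative path: adjust to wherever the setup step writes the converted corpora.
nlp = spacy.blank("tl")  # blank Tagalog pipeline, used only to provide a vocab
doc_bin = DocBin().from_disk("corpus/train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))

print(f"{len(docs)} training documents")
print(docs[0][:10])  # first ten tokens of the first document
```
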

## Model information

The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).

| Model | Pipelines | Description |
|---------------------|------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| tl_calamancy_md ()  | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner       | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys).   |
| tl_calamancy_lg ()  | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner       | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf () | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner   | GPU-optimized transformer Tagalog NLP model. Uses mdeberta-v3-base as context vectors.                     |
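
If you only want to use a released pipeline rather than retrain it, it loads like any other spaCy package. A minimal sketch, assuming the medium pipeline is already installed locally under the package name `tl_calamancy_md` (the install route, e.g. from the HuggingFace repositories linked above, is up to you):

```python
import spacy

nlp = spacy.load("tl_calamancy_md")  # assumes the package is installed
doc = nlp("Nagtanim si Juan ng mais sa bukid.")

print(nlp.pipe_names)  # the components listed in the table above
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
print(doc.ents)
```
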

## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `setup-finetuning-data` | Prepare the Tagalog corpora used for training various spaCy components |
| `setup-fasttext-vectors` | Make fastText vectors spaCy compatible |
| `build-floret` | Build floret binary for training fastText / floret vectors |
| `train-vectors-md` | Train medium-sized word vectors (200 dims, 200k keys) using the floret binary. |
| `train-parser` | Train a trainable_lemmatizer, parser, tagger, and morphologizer using the Universal Dependencies treebanks |
| `train-parser-trf` | Train a trainable_lemmatizer, parser, tagger, and morphologizer using the Universal Dependencies treebanks |
| `train-ner` | Train the ner component |
| `train-ner-trf` | Train the ner component |
| `assemble` | Assemble pipelines to create a single spaCy pipeline |
| `assemble-trf` | Assemble pipelines to create a single spaCy pipeline |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `setup` | `setup-finetuning-data` → `setup-fasttext-vectors` → `build-floret` → `train-vectors-md` |
| `tl-calamancy` | `train-parser` → `train-ner` → `assemble` |
| `tl-calamancy-trf` | `train-parser-trf` → `train-ner-trf` → `assemble` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/tlunified_raw_text.txt` | URL | Pre-converted raw text from TLUnified in JSONL format (1.1 GB). |
| `assets/corpus.tar.gz` | URL | Annotated TLUnified corpora in spaCy format with train, dev, and test splits. |
| `assets/tl_newscrawl-ud-train.conllu` | URL | Train dataset for NewsCrawl |
| `assets/tl_newscrawl-ud-dev.conllu` | URL | Dev dataset for NewsCrawl |
| `assets/tl_newscrawl-ud-test.conllu` | URL | Test dataset for NewsCrawl |
| `assets/tl_trg-ud-test.conllu` | URL | Test dataset for TRG |
| `assets/tl_ugnayan-ud-test.conllu` | URL | Test dataset for Ugnayan |
| `assets/fasttext.tl.gz` | URL | Tagalog fastText vectors provided on the fastText website (trained on CommonCrawl and Wikipedia). |
| `assets/floret` | Git | Floret repository for training floret and fastText models. |

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->

```
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["tok2vec", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[initialize]
vectors = ${paths.parser_model}

[components]

[components.tok2vec]
source = ${paths.parser_model}
component = "tok2vec"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
```
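
The config above appears to be the assembly config for the CPU pipelines: every component is sourced from an already trained pipeline (the lemmatizer/tagger/morphologizer/parser pipeline via `paths.parser_model`, the NER pipeline via `paths.ner_model`) and merged into a single pipeline, with `replace_listeners` giving the sourced `ner` its own copy of the tok2vec layer so it no longer depends on the donor pipeline. The `null` paths are filled in at run time through config overrides. A small sketch of how those overrides resolve, using a hypothetical config filename and placeholder model paths:

```python
from spacy.util import load_config

# Hypothetical values: the real paths come from the project's assemble command.
overrides = {
    "paths.parser_model": "training/parser/model-best",
    "paths.ner_model": "training/ner/model-best",
}
config = load_config("assemble.cfg", overrides=overrides, interpolate=True)

# After interpolation, each sourced component points at a concrete pipeline on disk.
print(config["components"]["parser"]["source"])  # training/parser/model-best
print(config["components"]["ner"]["source"])     # training/ner/model-best
```

A config filled in like this is what an assembly step such as `spacy assemble` consumes to produce a single saved pipeline directory.
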

```
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["transformer", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.transformer]
source = ${paths.parser_model}
component = "transformer"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
```
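
The transformer variant follows the same pattern, with `paths.parser_model` now pointing at a transformer-based pipeline. Once assembled, the result can be inspected like any spaCy pipeline; a quick check, assuming the assembled output was written to a hypothetical `models/tl_calamancy_trf` directory and a GPU is available:

```python
import spacy

spacy.require_gpu()  # the trf pipeline is intended to run on GPU
nlp = spacy.load("models/tl_calamancy_trf")  # hypothetical output directory

print(nlp.pipe_names)           # transformer, trainable_lemmatizer, tagger, ...
nlp.analyze_pipes(pretty=True)  # shows which attributes each component assigns
```
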

```
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "tl"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
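
The config above appears to be the training configuration consumed by the `train-ner` command: a `tok2vec` plus transition-based `ner` pipeline, scored purely on entity F-score, trained with word-count batches whose size follows a compounding schedule. As an illustration of the `[training.batcher.size]` block, the batch size starts at 100 words and grows by 0.1% per batch until it is capped at 1,000. The sketch below only mimics the arithmetic of `compounding.v1`; it is illustrative Python, not the thinc implementation.

```python
# A sketch of the compounding batch-size schedule configured above
# (start=100, stop=1000, compound=1.001).
def compounding(start: float, stop: float, compound: float):
    value = start
    while True:
        yield min(value, stop)
        value *= compound

sizes = compounding(100, 1000, 1.001)
print([round(next(sizes), 3) for _ in range(5)])  # [100.0, 100.1, 100.2, 100.3, 100.401]

# Growing by 0.1% per batch, the size only reaches the 1,000-word cap after
# roughly log(10) / log(1.001) ≈ 2,300 batches.
```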