[WIP] Update
ljvmiranda921 committed Jan 4, 2025
1 parent 21e43a9 commit b1f5003
Showing 1 changed file with 167 additions and 1 deletion: models/v0.2.0/project.yml
@@ -14,12 +14,178 @@ description: |
test sets (as intended).
- **Better evaluations**: Aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we have also included Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as our test set for evaluating the NER component.
- **Improved base model for tl_calamancy_trf**: Based on internal evaluations, we are now using [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as our source of context-sensitive vectors for tl_calamancy_trf.
- **No more pretraining**: We found that pretraining offers only marginal performance gains (0-1%) relative to the significant effort and time it requires. Hence, to make the whole pipeline easier to train, we removed it from the calamanCy recipe.
The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.
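As a quick reference, here is a minimal loading sketch (assuming the `calamancy` Python package still exposes a `load()` helper as in previous releases; the exact version suffix below is illustrative):
```python
import calamancy

# The version suffix is illustrative; check the HuggingFace collection for the
# model names that are actually published.
nlp = calamancy.load("tl_calamancy_md-0.2.0")

doc = nlp("Si Juan ay pumunta sa Maynila.")
print([(ent.text, ent.label_) for ent in doc.ents])
```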
## Set-up
You can use this project to replicate the pipelines shipped with this release.
First, install the required dependencies:
```
pip install -r requirements.txt
```
Then run the set-up commands:
```
python -m spacy project assets
python -m spacy project run setup
```
This step downloads all assets and prepares the datasets and binaries for
training. You can then train a pipeline by passing its name to the spaCy
project command. For example, to train `tl_calamancy_md`, run the
corresponding workflow like so:
```
python -m spacy project run tl-calamancy-md
```
## Model information
The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).
| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg () | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf () | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mDeBERTa-v3 (base) for context-sensitive vectors. |
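As a rough usage sketch (assuming one of the packaged pipelines above is installed in your environment), the pipelines map directly onto spaCy components:
```python
import spacy

# Assumes the packaged pipeline (e.g., tl_calamancy_md) is already installed
# in the current environment; the name mirrors the table above.
nlp = spacy.load("tl_calamancy_md")
print(nlp.pipe_names)  # e.g., tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner

doc = nlp("Umuwi si Maria sa Lungsod ng Quezon kahapon.")
for token in doc:
    print(token.text, token.pos_, str(token.morph), token.dep_, token.lemma_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```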
vars:
  # Versioning
  version: 0.2.0
  # Training
  lang: "tl"
  gpu_id: 0
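  # Set gpu_id to -1 to run training on CPU instead of GPU.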

directories:
  - "assets"
  - "configs"
  - "corpus"
  - "packages"
  - "scripts"
  - "training"
  - "vectors"

assets:
  - dest: assets/corpus.tar.gz
    description: "Annotated TLUnified corpora in spaCy format with train, dev, and test splits."
    url: "https://storage.googleapis.com/ljvmiranda/calamanCy/tl_tlunified_gold/v${vars.dataset_version}/corpus.tar.gz"
  - dest: assets/tl_newscrawl-ud-train.conllu
    description: "Train dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-train.conllu
  - dest: assets/tl_newscrawl-ud-dev.conllu
    description: "Dev dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-dev.conllu
  - dest: assets/tl_newscrawl-ud-test.conllu
    description: "Test dataset for NewsCrawl"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-NewsCrawl/refs/heads/dev/tl_newscrawl-ud-test.conllu
  - dest: assets/tl_trg-ud-test.conllu
    description: "Test dataset for TRG"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-TRG/refs/heads/master/tl_trg-ud-test.conllu
  - dest: assets/tl_ugnayan-ud-test.conllu
    description: "Test dataset for Ugnayan"
    url: https://raw.githubusercontent.com/UniversalDependencies/UD_Tagalog-Ugnayan/refs/heads/master/tl_ugnayan-ud-test.conllu
  - dest: "assets/fasttext.tl.gz"
    description: "Tagalog fastText vectors provided from the fastText website (trained from CommonCrawl and Wikipedia)."
    url: "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.tl.300.vec.gz"
  - dest: "assets/floret"
    description: "Floret repository for training floret and fastText models."
    git:
      repo: "https://github.com/explosion/floret"
      branch: "main"
      path: ""

workflows:
  setup:
    - "setup-finetuning-data"
    - "setup-fasttext-vectors"
    - "build-floret"

commands:
  - name: "setup-finetuning-data"
    help: "Prepare the Tagalog corpora used for training various spaCy components"
    script:
      # ner: Extract Tagalog corpora
      - mkdir -p corpus/ner
      - "tar -xzvf assets/corpus.tar.gz -C corpus/ner"
      # parser, tagger, morph: Convert treebank into spaCy format
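      # Note: --merge-subtokens merges CoNLL-U subtokens, and --n-sents 1 (used
      # for the test-only conversions below) keeps one sentence per Doc.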
      - mkdir -p corpus/treebank
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-train.conllu corpus/treebank
        --converter conllu
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-dev.conllu corpus/treebank
        --converter conllu
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_newscrawl-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_ugnayan-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
      - >-
        python -m spacy convert
        assets/tl_trg-ud-test.conllu corpus/treebank
        --converter conllu
        --n-sents 1
        --morphology
        --merge-subtokens
    deps:
      - assets/corpus.tar.gz
      - assets/tl_newscrawl-ud-train.conllu
      - assets/tl_newscrawl-ud-dev.conllu
      - assets/tl_newscrawl-ud-test.conllu
      - assets/tl_ugnayan-ud-test.conllu
      - assets/tl_trg-ud-test.conllu
    outputs:
      - corpus/ner/train.spacy
      - corpus/ner/dev.spacy
      - corpus/ner/test.spacy
      - corpus/treebank/tl_newscrawl-ud-train.spacy
      - corpus/treebank/tl_newscrawl-ud-dev.spacy
      - corpus/treebank/tl_newscrawl-ud-test.spacy
      - corpus/treebank/tl_ugnayan-ud-test.spacy
      - corpus/treebank/tl_trg-ud-test.spacy

- name: "setup-fasttext-vectors"
help: "Make fastText vectors spaCy compatible"
script:
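      # Decompress the fastText .vec.gz file, then convert it into a
      # spaCy-compatible vectors directory with `spacy init vectors`.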
      - gzip -d -f assets/fasttext.tl.gz
      - mkdir -p vectors/fasttext-tl
      - >-
        python -m spacy init vectors
        tl assets/fasttext.tl vectors/fasttext-tl
    deps:
      - assets/fasttext.tl.gz
    outputs:
      - vectors/fasttext-tl

- name: "build-floret"
help: "Build floret binary for training fastText / floret vectors"
script:
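      # Compile the floret binary from the cloned repository; the executable is
      # what later trains the floret / fastText vectors.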
      - make -C assets/floret
      - chmod +x assets/floret/floret
    deps:
      - assets/floret
    outputs:
      - assets/floret/floret
