
Merge branch 'v4'
anne17 committed Dec 7, 2020
2 parents 99fa166 + 40b05c5 commit 0f25e65
Showing 604 changed files with 20,456 additions and 775,600 deletions.
18 changes: 18 additions & 0 deletions .editorconfig
@@ -0,0 +1,18 @@
# https://editorconfig.org/

root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true

[*.py]
indent_style = space
indent_size = 4
max_line_length = 120
trim_trailing_whitespace = true

[*.yaml]
indent_style = space
indent_size = 4
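These settings are enforced by EditorConfig-aware editors, but the intent of the `[*.py]` section can be sketched in plain Python. The checker below is an illustration of what the rules above demand (final newline, LF endings, no trailing whitespace, spaces for indentation, 120-character lines), not EditorConfig tooling:

```python
# Minimal sketch of the checks implied by the .editorconfig above for a *.py
# file. Real editors apply these rules via their own EditorConfig support;
# this is only an illustration.

def check_py_source(text, max_line_length=120):
    """Return a list of rule violations for a Python source string."""
    problems = []
    if not text.endswith("\n"):
        problems.append("missing final newline")      # insert_final_newline
    if "\r\n" in text:
        problems.append("line endings are not LF")    # end_of_line = lf
    for no, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            problems.append(f"line {no}: trailing whitespace")
        if len(line) > max_line_length:
            problems.append(f"line {no}: exceeds {max_line_length} characters")
        indent = len(line) - len(line.lstrip(" \t"))
        if "\t" in line[:indent]:
            problems.append(f"line {no}: tab used for indentation")
    return problems

print(check_py_source("def f():\n\treturn 1 \n"))
# → ['line 2: trailing whitespace', 'line 2: tab used for indentation']
```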
6 changes: 3 additions & 3 deletions .gitattributes
@@ -1,4 +1,4 @@
models/*.pickle filter=lfs diff=lfs merge=lfs -text
models/*.model filter=lfs diff=lfs merge=lfs -text
models/suc3.morphtable.words filter=lfs diff=lfs merge=lfs -text
*.xml filter=lfs diff=lfs merge=lfs -text
tests/**/gold_export/** filter=lfs diff=lfs merge=lfs -text
tests/**/gold_sparv-workdir/** filter=lfs diff=lfs merge=lfs -text
tests/**/source/** filter=lfs diff=lfs merge=lfs -text
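A quick way to sanity-check attribute patterns like the ones above (assuming plain `git` is installed; `git check-attr` only reads the rules, so git-lfs itself is not needed):

```bash
# Sketch: verify how .gitattributes rules route paths through the LFS filter.
# Uses a throwaway repository so nothing in a real checkout is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .
cat > .gitattributes <<'EOF'
*.xml filter=lfs diff=lfs merge=lfs -text
tests/**/source/** filter=lfs diff=lfs merge=lfs -text
EOF
git check-attr filter -- corpus.xml tests/mini/source/doc.txt README.md
# corpus.xml: filter: lfs
# tests/mini/source/doc.txt: filter: lfs
# README.md: filter: unspecified
```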
69 changes: 33 additions & 36 deletions .gitignore
@@ -1,39 +1,36 @@
sparv/__pycache__
sparv/freeling.py
sparv/util/__pycache__/
# Sparv's data directory
data

# Snakemake
.snakemake

# Example corpora zip
tests/example_corpora.zip

# Editors
.idea/
.vscode

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Environments
.env
.venv
env/
venv/
*.pyc
ENV/
env.bak/
venv.bak/

models/freeling
models/treetagger
models/wsd
models/bettertokenizer.sv.saldo-tokens
models/blingbring.pickle
models/blingbring.txt
models/dalin.pickle
models/dalinm.xml
models/diapivot.pickle
models/diapivot.xml
models/geo.pickle
models/geo_alternateNames.txt
models/geo_cities1000.txt
models/hunpos.saldo.suc-tags.morphtable
models/hunpos.dalinm-swedberg.saldo.suc-tags.morphtable
models/nst_utf8.txt
models/saldo.compound.pickle
models/saldo.pickle
models/saldom.xml
models/sensaldo-base*
models/sensaldo.pickle
models/stats.pickle
models/stats_all.txt
models/swedberg.pickle
models/swedbergm.xml
models/swefn.pickle
models/swefn.xml
models/swemalt-1.7.2.mco
# Distribution / packaging
build/
dist/
*.egg-info/
*.egg
MANIFEST*

bin/maltparser-1.7.2/
bin/wsd/
bin/word_alignment/
bin/treetagger/
# Unit test / coverage reports
.pytest_cache/
97 changes: 61 additions & 36 deletions README.md
@@ -1,46 +1,71 @@
# Språkbanken's Sparv Pipeline

The Sparv Pipeline is a corpus annotation pipeline created by [Språkbanken](https://spraakbanken.gu.se/).
The source code is made available under the [MIT license](https://opensource.org/licenses/MIT).
The Sparv pipeline is a corpus annotation tool run from the command line. Additional documentation can be found here:
https://spraakbanken.gu.se/en/tools/sparv/pipeline.

Additional documentation can be found here:
https://spraakbanken.gu.se/en/tools/sparv/pipeline
Check the [changelog](changelog.md) to see what's new!

For questions, problems or suggestions contact:
[email protected]
Sparv is developed by [Språkbanken](https://spraakbanken.gu.se/). The source code is available under the [MIT
license](https://opensource.org/licenses/MIT).

If you have any questions, problems or suggestions please contact <[email protected]>.

## Prerequisites

* A Unix-like environment (e.g. Linux, OS X)
* [Python 3.4](http://python.org/) or newer
* [GNU Make](https://www.gnu.org/software/make/)
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
* A Unix-like environment (e.g. Linux, OS X or [Windows Subsystem for
Linux](https://docs.microsoft.com/en-us/windows/wsl/about)). *Note:* Most of Sparv should work on Windows as well,
but we cannot guarantee this since we do not test our software on Windows.
* [Python 3.6.1](http://python.org/) or newer
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) (if you want to run
Swedish dependency parsing, Swedish word sense disambiguation or the Stanford Parser)

## Installation

* Before cloning the git repository make sure you have
[Git Large File Storage](https://git-lfs.github.com/)
installed (`apt install git-lfs`). Some files will not be downloaded correctly otherwise.
* After cloning, set variables in `makefiles/Makefile.config` (especially `SPARV_PIPELINE_PATH`).
* Add `SPARV_MAKEFILES` to your environment variables and point its path
to the `makefiles` directory.
* Create a Python 3 virtual environment and install the requirements:

```
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
deactivate
```
* Build the pipeline models:
```
make -C models/ all
# Optional: remove unnecessary files to save disk space
make -C models/ space
```
## Installation of additional software
The Sparv Pipeline can be used together with several plugins and third-party software. Please check https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation for more details!
The Sparv pipeline can be installed using [pip](https://pip.pypa.io/en/stable/installing). We recommend using
[pipx](https://pipxproject.github.io/pipx/) so that you can install the `sparv` command globally:

```bash
python3 -m pip install --user pipx
python3 -m pipx ensurepath
pipx install sparv-pipeline
```

Alternatively, you can install Sparv from the latest release on GitHub:

```bash
pipx install https://github.com/spraakbanken/sparv-pipeline/archive/latest.tar.gz
```

Now you should be ready to run the `sparv` command! Try it by typing `sparv --help`.

The Sparv Pipeline can be used together with several plugins and third-party software. Please check the [Sparv user
manual](https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation) for more details!

## Roadmap

* Export of corpus metadata to META-SHARE format
* Support for Swedish historical texts
* Support for parallel corpora
* Preprocessing of input data with automatic chunking

## Running tests

If you want to run the tests you will need to clone this project from
[GitHub](https://github.com/spraakbanken/sparv-pipeline) since the test data is not distributed with pip.

Before cloning the repository with [git](https://git-scm.com/downloads) make sure you have [Git Large File
Storage](https://git-lfs.github.com/) installed (`apt install git-lfs`). Some files will not be downloaded correctly
otherwise.

We recommend that you set up a virtual environment and install the dependencies (including the dev dependencies) listed
in `setup.py`:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .[dev]
```

Now, with the virtual environment activated, you can run `pytest` from the sparv-pipeline directory. You can run
particular tests using the provided markers (e.g. `pytest -m swe` to run the Swedish tests only) or via substring
matching (e.g. `pytest -k "not slow"` to skip the slow tests).
28 changes: 0 additions & 28 deletions bin/analyze_xml

This file was deleted.

77 changes: 0 additions & 77 deletions bin/xml_extract

This file was deleted.

42 changes: 42 additions & 0 deletions changelog.md
@@ -0,0 +1,42 @@
# Changelog

## version 4.0.0 (2020-12-07)

- This version contains a complete make-over of the Sparv pipeline!
- Everything is written in Python now (no more Makefiles or bash code).
- Increased platform independence
- This facilitates creating new modules, debugging, and maintenance.

- Easier installation process; Sparv is now on PyPI!
- New plugin system facilitates installation of Sparv plugins (like FreeLing).

- New format for corpus config files
- The new format is YAML, which is easier to write and more human-readable than makefiles.
- There is a command-line wizard which helps you create corpus config files.
- You no longer have to specify XML elements and attributes that should be kept from the original files. The XML
parser now parses all existing elements and their attributes by default. Their original names will be kept and
included in the export (unless you explicitly override this behaviour in the corpus config).

- Improved interface
- New command line interface with help messages
- Better feedback with a progress bar instead of illegible log output (log output is still available, though)
- More helpful error messages

- New corpus import and export formats
- Import of plain text files
- Export to CSV (a user-friendly, non-technical column format)
- Export to (Språkbanken Text's version of) the CoNLL-U format
- Export of corpus statistics (word frequency lists)

- Updated models and tools for processing Swedish corpora
- Sparv now uses Stanza with newly trained models and higher accuracy for POS-tagging and dependency parsing on
Swedish texts.

- Better support for annotating other (i.e. non-Swedish) languages
- Integrated Stanford Parser for English analysis (POS-tags, baseforms, dependency parsing, named-entity recognition).
- Added named-entity recognition for FreeLing languages.
- If a language is supported by different annotation tools, you can now choose which tool to use.

- Improved code modularity
- Increased independence between modules and language models
- This facilitates adding new annotation modules and import/export formats.
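The default XML-parsing behaviour described in the changelog (every element and attribute is kept unless the corpus config overrides it) amounts to walking the document and recording everything it contains. A minimal sketch of that idea using only the standard library; this is a conceptual illustration, not Sparv's actual importer:

```python
# Walk an XML document and record every element name with its attributes,
# mirroring the "keep everything by default" behaviour conceptually.
import xml.etree.ElementTree as ET

doc = """<text title="demo">
  <paragraph><sentence id="s1">Hello <w pos="NN">world</w></sentence></paragraph>
</text>"""

root = ET.fromstring(doc)
elements = {}
for elem in root.iter():  # iter() includes the root element itself
    elements.setdefault(elem.tag, set()).update(elem.attrib)

for tag in sorted(elements):
    print(f"{tag}: {', '.join(sorted(elements[tag])) or '-'}")
# → paragraph: -
#   sentence: id
#   text: title
#   w: pos
```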
