Showing 604 changed files with 20,456 additions and 775,600 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# https://editorconfig.org/ | ||
|
||
root = true | ||
|
||
[*] | ||
charset = utf-8 | ||
end_of_line = lf | ||
insert_final_newline = true | ||
|
||
[*.py] | ||
indent_style = space | ||
indent_size = 4 | ||
max_line_length = 120 | ||
trim_trailing_whitespace = true | ||
|
||
[*.yaml] | ||
indent_style = space | ||
indent_size = 4 |
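To see how the sections above combine for a given file, here is a minimal Python sketch. It is an approximation only: it uses the standard-library `configparser` and `fnmatch` modules rather than a real EditorConfig implementation, so it handles the simple `*` and `*.py` style patterns used here but not the full EditorConfig glob syntax. The function name `settings_for` and the `__top__` placeholder section are made up for this illustration.

```python
import configparser
from fnmatch import fnmatch

# The settings from the diff above, copied verbatim.
EDITORCONFIG = """\
# https://editorconfig.org/

root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true

[*.py]
indent_style = space
indent_size = 4
max_line_length = 120
trim_trailing_whitespace = true

[*.yaml]
indent_style = space
indent_size = 4
"""


def settings_for(filename, text=EDITORCONFIG):
    """Merge every section whose pattern matches filename (later sections win)."""
    parser = configparser.ConfigParser(interpolation=None)
    # configparser requires every key to sit inside a section, so the
    # top-level "root = true" line goes into a hypothetical placeholder section.
    parser.read_string("[__top__]\n" + text)
    merged = {}
    for section in parser.sections():
        if section != "__top__" and fnmatch(filename, section):
            merged.update(parser[section])
    return merged


if __name__ == "__main__":
    for name in ("corpus.py", "config.yaml", "README.md"):
        print(name, settings_for(name))
```

Under these assumptions, a Python file picks up both the `[*]` defaults and the `[*.py]` settings, while a Markdown file only gets the `[*]` defaults.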
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
models/*.pickle filter=lfs diff=lfs merge=lfs -text | ||
models/*.model filter=lfs diff=lfs merge=lfs -text | ||
models/suc3.morphtable.words filter=lfs diff=lfs merge=lfs -text | ||
*.xml filter=lfs diff=lfs merge=lfs -text | ||
tests/**/gold_export/** filter=lfs diff=lfs merge=lfs -text | ||
tests/**/gold_sparv-workdir/** filter=lfs diff=lfs merge=lfs -text | ||
tests/**/source/** filter=lfs diff=lfs merge=lfs -text |
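Files matching these patterns are stored through Git LFS, so a clone made without Git LFS installed contains small text pointer files instead of the real models and test data. The sketch below shows one way to detect that situation from Python. It relies on the fact that a Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`; the function names `is_lfs_pointer` and `looks_unfetched` are hypothetical helpers for this illustration and are not part of Sparv.

```python
from pathlib import Path

# First line of every Git LFS pointer file (spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(data: bytes) -> bool:
    """Return True if data begins like a Git LFS pointer file."""
    return data.startswith(LFS_POINTER_PREFIX)


def looks_unfetched(path) -> bool:
    """Read only the first bytes of a file and check for a pointer header."""
    with Path(path).open("rb") as handle:
        return is_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))
```

If `looks_unfetched` returns `True` for a model or test file, installing Git LFS and fetching the LFS objects for the clone should replace the pointer with the actual content.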
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,39 +1,36 @@ | ||
sparv/__pycache__ | ||
sparv/freeling.py | ||
sparv/util/__pycache__/ | ||
# Sparv's data directory | ||
data | ||
|
||
# Snakemake | ||
.snakemake | ||
|
||
# Example corpora zip | ||
tests/example_corpora.zip | ||
|
||
# Editors | ||
.idea/ | ||
.vscode | ||
|
||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
*.pyc | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
models/freeling | ||
models/treetagger | ||
models/wsd | ||
models/bettertokenizer.sv.saldo-tokens | ||
models/blingbring.pickle | ||
models/blingbring.txt | ||
models/dalin.pickle | ||
models/dalinm.xml | ||
models/diapivot.pickle | ||
models/diapivot.xml | ||
models/geo.pickle | ||
models/geo_alternateNames.txt | ||
models/geo_cities1000.txt | ||
models/hunpos.saldo.suc-tags.morphtable | ||
models/hunpos.dalinm-swedberg.saldo.suc-tags.morphtable | ||
models/nst_utf8.txt | ||
models/saldo.compound.pickle | ||
models/saldo.pickle | ||
models/saldom.xml | ||
models/sensaldo-base* | ||
models/sensaldo.pickle | ||
models/stats.pickle | ||
models/stats_all.txt | ||
models/swedberg.pickle | ||
models/swedbergm.xml | ||
models/swefn.pickle | ||
models/swefn.xml | ||
models/swemalt-1.7.2.mco | ||
# Distribution / packaging | ||
build/ | ||
dist/ | ||
*.egg-info/ | ||
*.egg | ||
MANIFEST* | ||
|
||
bin/maltparser-1.7.2/ | ||
bin/wsd/ | ||
bin/word_alignment/ | ||
bin/treetagger/ | ||
# Unit test / coverage reports | ||
.pytest_cache/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,46 +1,71 @@ | ||
# Språkbanken's Sparv Pipeline | ||
|
||
The Sparv Pipeline is a corpus annotation pipeline created by [Språkbanken](https://spraakbanken.gu.se/). | ||
The source code is made available under the [MIT license](https://opensource.org/licenses/MIT). | ||
The Sparv pipeline is a corpus annotation tool run from the command line. Additional documentation can be found here: | ||
https://spraakbanken.gu.se/en/tools/sparv/pipeline. | ||
|
||
Additional documentation can be found here: | ||
https://spraakbanken.gu.se/en/tools/sparv/pipeline | ||
Check the [changelog](changelog.md) to see what's new! | ||
|
||
For questions, problems or suggestions contact: | ||
[email protected] | ||
Sparv is developed by [Språkbanken](https://spraakbanken.gu.se/). The source code is available under the [MIT | ||
license](https://opensource.org/licenses/MIT). | ||
|
||
If you have any questions, problems or suggestions please contact <[email protected]>. | ||
|
||
## Prerequisites | ||
|
||
* A Unix-like environment (e.g. Linux, OS X) | ||
* [Python 3.4](http://python.org/) or newer | ||
* [GNU Make](https://www.gnu.org/software/make/) | ||
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) | ||
* A Unix-like environment (e.g. Linux, OS X or [Windows Subsystem for | ||
Linux](https://docs.microsoft.com/en-us/windows/wsl/about)) *Note:* Most things within Sparv should work in a Windows | ||
environment as well but we cannot guarantee anything since we do not test our software on Windows. | ||
* [Python 3.6.1](http://python.org/) or newer | ||
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) (if you want to run | ||
Swedish dependency parsing, Swedish word sense disambiguation or the Stanford Parser) | ||
|
||
## Installation | ||
|
||
* Before cloning the git repository make sure you have | ||
[Git Large File Storage](https://git-lfs.github.com/) | ||
installed (`apt install git-lfs`). Some files will not be downloaded correctly otherwise. | ||
* After cloning, set variables in `makefiles/Makefile.config` (especially `SPARV_PIPELINE_PATH`). | ||
* Add `SPARV_MAKEFILES` to your environment variables and point its path | ||
to the `makefiles` directory. | ||
* Create a Python 3 virtual environment and install the requirements: | ||
|
||
``` | ||
python3 -m venv venv | ||
source venv/bin/activate | ||
pip install --upgrade pip | ||
pip install -r requirements.txt | ||
deactivate | ||
``` | ||
* Build the pipeline models: | ||
``` | ||
make -C models/ all | ||
# Optional: remove unnecessary files to save disk space | ||
make -C models/ space | ||
``` | ||
## Installation of additional software | ||
The Sparv Pipeline can be used together with several plugins and third-party software. Please check https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation for more details! | ||
The Sparv pipeline can be installed using [pip](https://pip.pypa.io/en/stable/installing). We even recommend using | ||
[pipx](https://pipxproject.github.io/pipx/) so that you can install the `sparv` command globally: | ||
|
||
```bash | ||
python3 -m pip install --user pipx | ||
python3 -m pipx ensurepath | ||
pipx install sparv-pipeline | ||
``` | ||
|
||
Alternatively you can install Sparv from the latest release from GitHub: | ||
|
||
```bash | ||
pipx install https://github.com/spraakbanken/sparv-pipeline/archive/latest.tar.gz | ||
``` | ||
|
||
Now you should be ready to run the Sparv command! Try it by typing `sparv --help`. | ||
|
||
The Sparv Pipeline can be used together with several plugins and third-party software. Please check the [Sparv user | ||
manual](https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation) for more details! | ||
|
||
## Roadmap | ||
|
||
* Export of corpus metadata to META-SHARE format | ||
* Support for Swedish historic texts | ||
* Support for parallel corpora | ||
* Preprocessing of indata with automatic chunking | ||
|
||
## Running tests | ||
|
||
If you want to run the tests you will need to clone this project from | ||
[GitHub](https://github.com/spraakbanken/sparv-pipeline) since the test data is not distributed with pip. | ||
|
||
Before cloning the repository with [git](https://git-scm.com/downloads) make sure you have [Git Large File | ||
Storage](https://git-lfs.github.com/) installed (`apt install git-lfs`). Some files will not be downloaded correctly | ||
otherwise. | ||
|
||
We recommend that you set up a virtual environment and install the dependencies (including the dev dependencies) listed | ||
in `setup.py`: | ||
|
||
```bash | ||
python3 -m venv venv | ||
source venv/bin/activate | ||
pip install -e .[dev] | ||
``` | ||
|
||
Now with the virtual environment activated you can run `pytest` from the sparv-pipeline directory. You can run | ||
particular tests using the provided markers (e.g. `pytest -m swe` to run the Swedish tests only) or via substring | ||
matching (e.g. `pytest -k "not slow"` to skip the slow tests). |
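The prerequisites listed in the new README can be checked up front with a short script. The following is a sketch under stated assumptions, not something shipped with Sparv: the function name `missing_prerequisites` is invented here, and it assumes the relevant executables are called `java` and `git-lfs` on the `PATH`.

```python
import shutil
import sys

# Minimum Python version stated in the README.
MIN_PYTHON = (3, 6, 1)


def missing_prerequisites(version=None, which=shutil.which):
    """Return a list of human-readable problems (empty if none were found).

    version and which can be overridden, which makes the check easy to test.
    """
    version = tuple(version or sys.version_info[:3])
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python 3.6.1 or newer is required")
    if which("java") is None:
        # Java is only needed for some annotators according to the README.
        problems.append("java not found (needed for some parsers only)")
    if which("git-lfs") is None:
        # Git LFS matters when cloning the repository to run the tests.
        problems.append("git-lfs not found (needed before cloning for tests)")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("warning:", problem)
```

Only the Python version is a hard requirement for a pip-based install; the other two entries are reported as warnings because the README describes them as conditional.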
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Changelog | ||
|
||
## version 4.0.0 (2020-12-07) | ||
|
||
- This version contains a complete make-over of the Sparv pipeline! | ||
- Everything is written in Python now (no more Makefiles or bash code). | ||
- Increased platform independence | ||
- This facilitates creating new modules, debugging, and maintenance. | ||
|
||
- Easier installation process: Sparv is now on PyPI! | ||
- New plugin system facilitates installation of Sparv plugins (like FreeLing). | ||
|
||
- New format for corpus config files | ||
- The new format is YAML, which is easier to write and more human-readable than Makefiles. | ||
- There is a command-line wizard which helps you create corpus config files. | ||
- You no longer have to specify XML elements and attributes that should be kept from the original files. The XML | ||
parser now parses all existing elements and their attributes by default. Their original names will be kept and | ||
included in the export (unless you explicitly override this behaviour in the corpus config). | ||
|
||
- Improved interface | ||
- New command line interface with help messages | ||
- Better feedback with progress bar instead of illegible log output (log output is still available though) | ||
- More helpful error messages | ||
|
||
- New corpus import and export formats | ||
- Import of plain text files | ||
- Export to CSV (a user-friendly, non-technical column format) | ||
- Export to (Språkbanken Text version of) CoNLL-U format | ||
- Export to corpus statistics (word frequency lists) | ||
|
||
- Updated models and tools for processing Swedish corpora | ||
- Sparv now uses Stanza with newly trained models and higher accuracy for POS-tagging and dependency parsing on | ||
Swedish texts. | ||
|
||
- Better support for annotating other (i.e. non-Swedish) languages | ||
- Integrated Stanford Parser for English analysis (POS-tags, baseforms, dependency parsing, named-entity recognition). | ||
- Added named-entity recognition for FreeLing languages. | ||
- If a language is supported by different annotation tools, you can now choose which tool to use. | ||
|
||
- Improved code modularity | ||
- Increased independence between modules and language models | ||
- This facilitates adding new annotation modules and import/export formats. |
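As an illustration of what the "corpus statistics (word frequency lists)" export in the changelog refers to, here is a minimal Python sketch of building a frequency list and writing it as tab-separated text. This is not Sparv's implementation: the function names `frequency_list` and `to_tsv`, the column headers, and the sort order are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def frequency_list(tokens):
    """Count tokens and sort by descending count, then alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_tsv(rows):
    """Serialise (token, count) rows as tab-separated text with a header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    tokens = "det var en gång en katt".split()
    print(to_tsv(frequency_list(tokens)), end="")
```

A real export would work on annotated tokens (for example lemmas or part-of-speech tags) rather than on whitespace-split strings.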