
Merge branch 'v4'
anne17 committed Dec 7, 2020
2 parents 99fa166 + 40b05c5 commit 0f25e65
Showing 604 changed files with 20,456 additions and 775,600 deletions.
18 changes: 18 additions & 0 deletions .editorconfig
@@ -0,0 +1,18 @@
# https://editorconfig.org/

root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true

[*.py]
indent_style = space
indent_size = 4
max_line_length = 120
trim_trailing_whitespace = true

[*.yaml]
indent_style = space
indent_size = 4
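These settings are enforced by EditorConfig-aware editors, but the intent of the `[*.py]` section can be sketched in plain Python. The checker below is an illustration of what the rules above demand (final newline, LF endings, no trailing whitespace, spaces for indentation, 120-character lines), not EditorConfig tooling:

```python
# Minimal sketch of the checks implied by the .editorconfig above for a *.py
# file. Real editors apply these rules via their own EditorConfig support;
# this is only an illustration.

def check_py_source(text, max_line_length=120):
    """Return a list of rule violations for a Python source string."""
    problems = []
    if not text.endswith("\n"):
        problems.append("missing final newline")      # insert_final_newline
    if "\r\n" in text:
        problems.append("line endings are not LF")    # end_of_line = lf
    for no, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip():
            problems.append(f"line {no}: trailing whitespace")
        if len(line) > max_line_length:
            problems.append(f"line {no}: exceeds {max_line_length} characters")
        indent = len(line) - len(line.lstrip(" \t"))
        if "\t" in line[:indent]:
            problems.append(f"line {no}: tab used for indentation")
    return problems

print(check_py_source("def f():\n\treturn 1 \n"))
# → ['line 2: trailing whitespace', 'line 2: tab used for indentation']
```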
6 changes: 3 additions & 3 deletions .gitattributes
@@ -1,4 +1,4 @@
models/*.pickle filter=lfs diff=lfs merge=lfs -text
models/*.model filter=lfs diff=lfs merge=lfs -text
models/suc3.morphtable.words filter=lfs diff=lfs merge=lfs -text
*.xml filter=lfs diff=lfs merge=lfs -text
tests/**/gold_export/** filter=lfs diff=lfs merge=lfs -text
tests/**/gold_sparv-workdir/** filter=lfs diff=lfs merge=lfs -text
tests/**/source/** filter=lfs diff=lfs merge=lfs -text
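A quick way to sanity-check attribute patterns like the ones above (assuming plain `git` is installed; `git check-attr` only reads the rules, so git-lfs itself is not needed):

```bash
# Sketch: verify how .gitattributes rules route paths through the LFS filter.
# Uses a throwaway repository so nothing in a real checkout is touched.
repo=$(mktemp -d)
cd "$repo"
git init --quiet .
cat > .gitattributes <<'EOF'
*.xml filter=lfs diff=lfs merge=lfs -text
tests/**/source/** filter=lfs diff=lfs merge=lfs -text
EOF
git check-attr filter -- corpus.xml tests/mini/source/doc.txt README.md
# corpus.xml: filter: lfs
# tests/mini/source/doc.txt: filter: lfs
# README.md: filter: unspecified
```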
69 changes: 33 additions & 36 deletions .gitignore
@@ -1,39 +1,36 @@
sparv/__pycache__
sparv/freeling.py
sparv/util/__pycache__/
# Sparv's data directory
data

# Snakemake
.snakemake

# Example corpora zip
tests/example_corpora.zip

# Editors
.idea/
.vscode

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Environments
.env
.venv
env/
venv/
*.pyc
ENV/
env.bak/
venv.bak/

models/freeling
models/treetagger
models/wsd
models/bettertokenizer.sv.saldo-tokens
models/blingbring.pickle
models/blingbring.txt
models/dalin.pickle
models/dalinm.xml
models/diapivot.pickle
models/diapivot.xml
models/geo.pickle
models/geo_alternateNames.txt
models/geo_cities1000.txt
models/hunpos.saldo.suc-tags.morphtable
models/hunpos.dalinm-swedberg.saldo.suc-tags.morphtable
models/nst_utf8.txt
models/saldo.compound.pickle
models/saldo.pickle
models/saldom.xml
models/sensaldo-base*
models/sensaldo.pickle
models/stats.pickle
models/stats_all.txt
models/swedberg.pickle
models/swedbergm.xml
models/swefn.pickle
models/swefn.xml
models/swemalt-1.7.2.mco
# Distribution / packaging
build/
dist/
*.egg-info/
*.egg
MANIFEST*

bin/maltparser-1.7.2/
bin/wsd/
bin/word_alignment/
bin/treetagger/
# Unit test / coverage reports
.pytest_cache/
97 changes: 61 additions & 36 deletions README.md
@@ -1,46 +1,71 @@
# Språkbanken's Sparv Pipeline

The Sparv Pipeline is a corpus annotation pipeline created by [Språkbanken](https://spraakbanken.gu.se/).
The source code is made available under the [MIT license](https://opensource.org/licenses/MIT).
The Sparv pipeline is a corpus annotation tool run from the command line. Additional documentation can be found here:
https://spraakbanken.gu.se/en/tools/sparv/pipeline.

Additional documentation can be found here:
https://spraakbanken.gu.se/en/tools/sparv/pipeline
Check the [changelog](changelog.md) to see what's new!

For questions, problems or suggestions contact:
[email protected]
Sparv is developed by [Språkbanken](https://spraakbanken.gu.se/). The source code is available under the [MIT
license](https://opensource.org/licenses/MIT).

If you have any questions, problems or suggestions please contact <[email protected]>.

## Prerequisites

* A Unix-like environment (e.g. Linux, OS X)
* [Python 3.4](http://python.org/) or newer
* [GNU Make](https://www.gnu.org/software/make/)
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
* A Unix-like environment (e.g. Linux, OS X or [Windows Subsystem for
Linux](https://docs.microsoft.com/en-us/windows/wsl/about)). *Note:* Most of Sparv should work on Windows as well,
but we cannot guarantee this since we do not test our software on Windows.
* [Python 3.6.1](http://python.org/) or newer
* [Java](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) (if you want to run
Swedish dependency parsing, Swedish word sense disambiguation or the Stanford Parser)

## Installation

* Before cloning the git repository make sure you have
[Git Large File Storage](https://git-lfs.github.com/)
installed (`apt install git-lfs`). Some files will not be downloaded correctly otherwise.
* After cloning, set variables in `makefiles/Makefile.config` (especially `SPARV_PIPELINE_PATH`).
* Add `SPARV_MAKEFILES` to your environment variables and point its path
to the `makefiles` directory.
* Create a Python 3 virtual environment and install the requirements:

```
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
deactivate
```
* Build the pipeline models:
```
make -C models/ all
# Optional: remove unnecessary files to save disk space
make -C models/ space
```
## Installation of additional software
The Sparv Pipeline can be used together with several plugins and third-party software. Please check https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation for more details!
The Sparv pipeline can be installed using [pip](https://pip.pypa.io/en/stable/installing). We recommend using
[pipx](https://pipxproject.github.io/pipx/) so that you can install the `sparv` command globally:

```bash
python3 -m pip install --user pipx
python3 -m pipx ensurepath
pipx install sparv-pipeline
```

Alternatively, you can install Sparv from the latest release on GitHub:

```bash
pipx install https://github.com/spraakbanken/sparv-pipeline/archive/latest.tar.gz
```

Now you should be ready to run the `sparv` command! Try it by typing `sparv --help`.

The Sparv Pipeline can be used together with several plugins and third-party software. Please check the [Sparv user
manual](https://spraakbanken.gu.se/en/tools/sparv/pipeline/installation) for more details!

## Roadmap

* Export of corpus metadata to META-SHARE format
* Support for Swedish historical texts
* Support for parallel corpora
* Preprocessing of input data with automatic chunking

## Running tests

If you want to run the tests you will need to clone this project from
[GitHub](https://github.com/spraakbanken/sparv-pipeline) since the test data is not distributed with pip.

Before cloning the repository with [git](https://git-scm.com/downloads) make sure you have [Git Large File
Storage](https://git-lfs.github.com/) installed (`apt install git-lfs`). Some files will not be downloaded correctly
otherwise.

We recommend that you set up a virtual environment and install the dependencies (including the dev dependencies) listed
in `setup.py`:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .[dev]
```

Now, with the virtual environment activated, you can run `pytest` from the sparv-pipeline directory. You can run
particular tests using the provided markers (e.g. `pytest -m swe` to run the Swedish tests only) or via substring
matching (e.g. `pytest -k "not slow"` to skip the slow tests).
28 changes: 0 additions & 28 deletions bin/analyze_xml

This file was deleted.

77 changes: 0 additions & 77 deletions bin/xml_extract

This file was deleted.

42 changes: 42 additions & 0 deletions changelog.md
@@ -0,0 +1,42 @@
# Changelog

## version 4.0.0 (2020-12-07)

- This version contains a complete make-over of the Sparv pipeline!
- Everything is written in Python now (no more Makefiles or bash code).
- Increased platform independence
- This facilitates creating new modules, debugging, and maintenance.

- Easier installation process; Sparv is now on PyPI!
- New plugin system facilitates installation of Sparv plugins (like FreeLing).

- New format for corpus config files
- The new format is YAML, which is easier to write and more human-readable than makefiles.
- There is a command-line wizard which helps you create corpus config files.
- You no longer have to specify XML elements and attributes that should be kept from the original files. The XML
parser now parses all existing elements and their attributes by default. Their original names will be kept and
included in the export (unless you explicitly override this behaviour in the corpus config).

- Improved interface
- New command line interface with help messages
- Better feedback with a progress bar instead of illegible log output (log output is still available, though)
- More helpful error messages

- New corpus import and export formats
- Import of plain text files
- Export to CSV (a user-friendly, non-technical column format)
- Export to (Språkbanken Text's version of) the CoNLL-U format
- Export of corpus statistics (word frequency lists)

- Updated models and tools for processing Swedish corpora
- Sparv now uses Stanza with newly trained models and higher accuracy for POS-tagging and dependency parsing on
Swedish texts.

- Better support for annotating other (i.e. non-Swedish) languages
- Integrated Stanford Parser for English analysis (POS-tags, baseforms, dependency parsing, named-entity recognition).
- Added named-entity recognition for FreeLing languages.
- If a language is supported by different annotation tools, you can now choose which tool to use.

- Improved code modularity
- Increased independence between modules and language models
- This facilitates adding new annotation modules and import/export formats.
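The default XML-parsing behaviour described in the changelog (every element and attribute is kept unless the corpus config overrides it) amounts to walking the document and recording everything it contains. A minimal sketch of that idea using only the standard library; this is a conceptual illustration, not Sparv's actual importer:

```python
# Walk an XML document and record every element name with its attributes,
# mirroring the "keep everything by default" behaviour conceptually.
import xml.etree.ElementTree as ET

doc = """<text title="demo">
  <paragraph><sentence id="s1">Hello <w pos="NN">world</w></sentence></paragraph>
</text>"""

root = ET.fromstring(doc)
elements = {}
for elem in root.iter():  # iter() includes the root element itself
    elements.setdefault(elem.tag, set()).update(elem.attrib)

for tag in sorted(elements):
    print(f"{tag}: {', '.join(sorted(elements[tag])) or '-'}")
# → paragraph: -
#   sentence: id
#   text: title
#   w: pos
```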
