Merge branch 'dev'

spraakbanken · Aug 10, 2022 · fc82ae1
2 parents 7293d42 + 1863ada
commit fc82ae1

Showing 395 changed files with 7,480 additions and 5,777 deletions.
diff --git a/.gitignore b/.gitignore
@@ -34,3 +34,4 @@ MANIFEST*
 
 # Unit test / coverage reports
 .pytest_cache/
+logs/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,11 +1,118 @@
 # Changelog
 
+## [5.0.0]
+
+### Added
+
+- Added a [quick start guide](https://spraakbanken.gu.se/sparv/#/user-manual/quick-start) in the documentation.
+- Added importers for more file formats: docx and odt.
+- Added support for [language
+  varieties](https://spraakbanken.gu.se/sparv/#/developers-guide/writing-sparv-plugins?id=languages-and-varieties).
+- Re-introduced analyses for [Old Swedish and Swedish from the
+  1800s](https://spraakbanken.gu.se/sparv/#/developers-guide/writing-sparv-plugins?id=languages-and-varieties).
+- Added a more flexible stats export which lets you choose which annotations to include in the frequency list.
+- Added installer for stats export.
+- Added Stanza support for English.
+- Added better install and uninstall instructions for plugins.
+- Added support for [XML
+  namespaces](https://spraakbanken.gu.se/sparv/#/user-manual/corpus-configuration?id=xml-namespaces).
+- Added explicit `ref` annotations (indexing tokens within sentences) for Stanza, Malt and Stanford.
+- Added a `--reset` flag to the `sparv setup` command for resetting the data directory setting.
+- Added a separate installer for installing scrambled CWB files.
+- A warning message is printed when Sparv discovers source files that don't match the file extension in the corpus
+  config.
+- An error message is shown if unknown exporters are listed under `export.default`.
+- Allow source annotations named "not".
+- Added a source filename annotator.
+- Show an error message if the user specifies an invalid installation.
+- Added a `--stats` flag to several commands, showing a summary after completion of time spent per annotator.
+- Added `stanza.max_token_length` option.
+- Added Hunpos-backoff annotation for Stanza msd and pos.
+- Added `--force` flag to `run-rule` and `create-file` commands to force recreation of the listed targets.
+- Added a new exporter which produces a YAML file with info about the Sparv version and annotation date.
+  This info is also added to the combined XML exports.
+- Exit with an error message if a required executable is missing.
+- Show a warning if an installed plugin is incompatible with Sparv.
+- Introduced compression of annotation files in sparv-workdir. The type of compression can be configured (or disabled)
+  by using the `sparv.compression` variable. `gzip` is used by default.
+- Added flags `--rerun-incomplete` and `--mark-complete` to the `sparv run` command for handling incomplete output files.
+- Several exporters now show a warning if a token annotation isn't included in the list of export annotations.
+- Added `get_size()` to the `Annotation` and `AnnotationAllSourceFiles` classes, to get the size (number of values)
+  for an annotation.
+- Added support for [individual progress bars for
+  annotators](https://spraakbanken.gu.se/sparv/#/developers-guide/writing-sparv-plugins?id=progress-bar).
+- Added `SourceAnnotationsAllSourceFiles` class.
+
+### Changed
+
+- Significantly improved the CLI startup time.
+- Replaced the `--verbose` flag with `--simple` and made verbose the default mode.
+- Everything needed by Sparv modules (including `utils`) is now available through the `sparv.api` package.
+- Empty corpus config files are treated as missing config files.
+- Moved CWB corpus installer from `korp` module into `cwb` module.
+  This led to some name changes of variables used in the corpus config:
+    - `korp.remote_cwb_datadir` is now called `cwb.remote_data_dir`
+    - `korp.remote_cwb_registry` is now called `cwb.remote_registry_dir`
+    - `korp.remote_host` has been split into `korp.remote_host` (host for SQL files) and `cwb.remote_host` (host for CWB
+       files)
+    - install target `korp:install_corpus` has been renamed and split into `cwb:install_corpus` and 
+      `cwb:install_corpus_scrambled`
+- Renamed the following stats exports:
+    - `stats_export:freq_list` is now called `stats_export:sbx_freq_list`
+    - `stats_export:freq_list_simple` is now called `stats_export:sbx_freq_list_simple`
+    - `stats_export:install_freq_list` is now called `stats_export:install_sbx_freq_list`
+    - `stats_export:freq_list_fsv` is now called `stats_export:sbx_freq_list_fsv`
+- Now incrementally compresses bz2 files in compressed XML export to avoid memory problems with large files.
+- Corpus source files are now called "source files" instead of "documents". Consequently, the `--doc/-d` flag has been
+  renamed to `--file/-f`.
+- `import.document_annotation` has been renamed to `import.text_annotation`, and all references to "document" as a text
+  unit have been changed to "text".
+- Minimum Python version is now 3.6.2.
+- Removed Python 2 dependency for hfst-SweNER.
+- Tweaked compound analysis to make it less slow and added option to disable using source text as lexicon.
+- `cwb` module now exports to regular export directory instead of CWB's own directories.
+- Removed ability to use absolute path for exports.
+- Renamed the installer `xml_export:install_original` to `xml_export:install`. The configuration variables
+  `xml_export.export_original_host` and `xml_export.export_original_path` have been changed to
+  `xml_export.export_host` and `xml_export.export_path` respectively. The configuration variables for the scrambled
+  installer have been changed from `xml_export.export_host` and `xml_export.export_path` to
+  `xml_export.export_scrambled_host` and `xml_export.export_scrambled_path` respectively.
+- Removed `header_annotations` configuration variable from `export` (it is still available as
+  `xml_export.header_annotations`).
+- All export files must now be written to subdirectories, and each subdirectory must use the exporter's module name as
+  prefix (or be equal to the module name).
+- Empty attributes are no longer included in the csv export.
+- When Sparv crashes due to unexpected errors, the traceback is now hidden from the user unless the `--log debug`
+  argument is used.
+- If the `-j`/`--cores` option is used without an argument, all available CPU cores are used.
+- Importers are now required to write a source structure file.
+- CWB installation now also works locally.
+
+### Fixed
+
+- Fixed rule ambiguity problems (functions with an order higher than 1 were not accessible).
+- Automatically download correct Hunpos model depending on the Hunpos version installed.
+- Stanza can now handle tokens containing whitespace.
+- Fixed a bug which led to computing the source file list multiple times.
+- Fixed a few date related crashes in the `cwb` module.
+- Fixed installation of compressed, scrambled XML export.
+- Fixed bug in PunctuationTokenizer leading to orphaned tokens.
+- Fixed crash when scrambling nested spans by only scrambling the outermost ones.
+- Fixed crash in xml_import when no elements are imported.
+- Fixed crash on empty sentences in Stanza.
+- Better handling of empty XML elements in XML export.
+- Faulty custom modules now result in a warning instead of a crash.
+- Notify user when SweNER crashes.
+- Fixed crash when config file can't be read due to file permissions.
+- Fixed bug where `geo:contextual` would only work for sentences.
+- Fixed crash on systems with encodings other than UTF-8.
+
 ## [4.1.1] - 2021-09-20
 
 ### Fixed
 
 - Workaround for bug in some versions of Python 3.8 and 3.9.
-- Fixed bugs in segmenter module.
+- Fixed bugs in `segmenter` module.
 
 ## [4.1.0] - 2021-04-14
 

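The configuration renames listed under "Changed" for 5.0.0 can be applied mechanically to an existing corpus config. The following is a minimal sketch and not part of Sparv: `migrate_config`, `RENAMES` and the two helper functions are hypothetical names, and the config is assumed to be already loaded as nested dictionaries. Only the plain renames are covered. The split of `korp.remote_host` and the renamed install targets and stats exports still need manual review.

```python
import copy

# (old dotted key, new dotted key), taken from the 5.0.0 changelog entries.
RENAMES = [
    ("korp.remote_cwb_datadir", "cwb.remote_data_dir"),
    ("korp.remote_cwb_registry", "cwb.remote_registry_dir"),
    ("import.document_annotation", "import.text_annotation"),
    ("xml_export.export_host", "xml_export.export_scrambled_host"),
    ("xml_export.export_path", "xml_export.export_scrambled_path"),
    ("xml_export.export_original_host", "xml_export.export_host"),
    ("xml_export.export_original_path", "xml_export.export_path"),
]


def _pop(config, dotted):
    """Remove and return the value at a dotted key (None if it is absent)."""
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return None
    return node.pop(leaf, None)


def _set(config, dotted, value):
    """Set the value at a dotted key, creating parent sections as needed."""
    *parents, leaf = dotted.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def migrate_config(old):
    """Return a copy of a 4.x-style config dict using the 5.0.0 key names."""
    new = copy.deepcopy(old)
    # Remove every old key first and write afterwards, so that names which
    # are reused (xml_export.export_host) do not overwrite each other.
    moved = [(target, _pop(new, source)) for source, target in RENAMES]
    for target, value in moved:
        if value is not None:
            _set(new, target, value)
    return new
```

Collecting all old values before writing any new ones matters here because `xml_export.export_host` is both an old name (scrambled installer) and a new name (regular installer).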
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ If you have any questions, problems or suggestions please contact <sb-sparv@sven
 * A Unix-like environment (e.g. Linux, OS X or [Windows Subsystem for
   Linux](https://docs.microsoft.com/en-us/windows/wsl/about)) *Note:* Most of Sparv's features should work in a Windows
   environment as well, but since we don't do any testing on Windows we cannot guarantee anything.
-* [Python 3.6.1](http://python.org/) or newer
+* [Python 3.6.2](http://python.org/) or newer
 
 ## Installation
 
@@ -24,7 +24,7 @@ Sparv is available on [PyPI](https://pypi.org/project/sparv-pipeline/) and can b
 We recommend using pipx, which will install Sparv in an isolated environment while still making it available to be run
 from anywhere.
 
-```bash
+```
 python3 -m pip install --user pipx
 python3 -m pipx ensurepath
 pipx install sparv-pipeline
@@ -47,7 +47,7 @@ otherwise.
 We recommend that you set up a virtual environment and install the dependencies (including the dev dependencies) listed
 in `setup.py`:
 
-```bash
+```
 python3 -m venv venv
 source venv/bin/activate
 pip install -e .[dev]

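The raised minimum version (Python 3.6.2) can be checked before installing. This is a small sketch, not code from the Sparv repository; `MIN_PYTHON` and `python_is_supported` are hypothetical names chosen for illustration.

```python
import sys

# Minimum version stated in the README after this change.
MIN_PYTHON = (3, 6, 2)


def python_is_supported(version_info=None):
    """Return True if the given (or the running) Python meets MIN_PYTHON."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro); tuples compare element by element.
    return tuple(version_info[:3]) >= MIN_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        sys.exit("Python %d.%d.%d or newer is required" % MIN_PYTHON)
```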
diff --git a/docs/README.md b/docs/README.md
@@ -6,7 +6,7 @@ Sparv's documentation is written in markdown and can be rendered as HTML or PDF.
 ## Setup HTML documentation
 
 Create symlinks to documentation directories if needed:
-```bash
+```
 cd docsify
 ln -s ../user-manual user-manual
 ln -s ../developers-guide developers-guide
@@ -15,19 +15,19 @@ cd ..
 ```
 
 Set Sparv version number:
-```bash
+```
 cd docsify
 ./set_version.sh
 cd ..
 ```
 
-Serve documentation with python:
-```bash
+Serve documentation with Python (from the `docs` directory):
+```
 python3 -m http.server --directory docsify 3000
 ```
 
-*or* with docsify:
-```bash
+*or* with docsify (from the `docs` directory):
+```
 npm i docsify-cli -g
 docsify serve docsify --port 3000
 ```
@@ -42,8 +42,13 @@ Sync HTML documentation to server:
 
 ## Render documentation as PDF
 
-Convert User Manual and Developer's Guide from markdown to PDF (requires markdown and latex):
-```bash
+Install requirements (markdown and latex):
+```
+sudo apt-get install markdown pandoc texlive-latex-base texlive-fonts-recommended texlive-fonts-extra texlive-latex-extra
+```
+
+Convert User Manual and Developer's Guide from markdown to PDF:
+```
 cd md2pdf
 ./make_pdf.sh
 ```

diff --git a/docs/developers-guide/config-parameters.md b/docs/developers-guide/config-parameters.md
@@ -79,22 +79,21 @@ for the `import` and the `export` categories that are inherited by importers and
 
 Inheritable configuration keys for `import`:
 
-| config key | description |
-|:-----------|:------------|
-|`document_annotation` | The annotation representing one text document. Any text-level annotations will be attached to this annotation.
-|`encoding`            | Encoding of source document. Defaults to UTF-8.
-|`keep_control_chars`  | Set to True if control characters should not be removed from the text.
-|`normalize`           | Normalize input using any of the following forms: 'NFC', 'NFKC', 'NFD', and 'NFKD'.
-|`source_dir`          | The path to the directory containing the source documents relative to the corpus directory.
+| config key     | description                                    |
+|:---------------|:-----------------------------------------------|
+|`text_annotation`    | The annotation representing one text. Any text-level annotations will be attached to this annotation.
+|`encoding`           | Encoding of source file. Defaults to UTF-8.
+|`keep_control_chars` | Set to True if control characters should not be removed from the text.
+|`normalize`          | Normalize input using any of the following forms: 'NFC', 'NFKC', 'NFD', and 'NFKD'.
+|`source_dir`         | The path to the directory containing the source files relative to the corpus directory.
 
 Inheritable configuration keys for `export`:
 
-| config key | description  |
-|:-----------|:-------------|
+| config key | description            |
+|:-----------|:-----------------------|
 |`default`                  | Exports to create by default when running 'sparv run'.
-|`source_annotations`       | List of annotations from the original document to be kept.
+|`source_annotations`       | List of annotations from the source file to be kept.
 |`annotations`              | List of automatic annotations to include.
-|`header_annotations`       | List of header elements from the original document to include in the export.
 |`word`                     | The token strings to be included in the export.
 |`remove_module_namespaces` | Set to false if module namespaces should be kept in the export.
 |`sparv_namespace`          | A string representing the namespace to be added to all annotations created by Sparv.

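The four values accepted by the `normalize` key match the names of the standard Unicode normalization forms. The sketch below shows what those forms do, using Python's `unicodedata` module. It assumes the config values map directly onto these forms; how Sparv applies them internally is not shown in this diff, and `normalize_text` is a hypothetical helper.

```python
import unicodedata

# The forms named in the `normalize` description.
FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def normalize_text(text, form="NFC"):
    """Normalize text to one of the four Unicode normalization forms."""
    if form not in FORMS:
        raise ValueError("Unknown normalization form: %r" % (form,))
    return unicodedata.normalize(form, text)


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the single-character ligature 'fi'

# NFC composes the base letter and the accent into one code point.
composed = normalize_text(decomposed, "NFC")
# NFD does the opposite and splits the composed character again.
split_again = normalize_text(composed, "NFD")
# NFKC additionally replaces compatibility characters such as ligatures.
expanded = normalize_text(ligature, "NFKC")
```

The compatibility forms (`NFKC`, `NFKD`) change more of the text than the canonical ones, so they are the ones most likely to make the normalized text differ visibly from the source file.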
diff --git a/docs/developers-guide/general-concepts.md b/docs/developers-guide/general-concepts.md
@@ -3,8 +3,8 @@ This section will give a brief overview of how Sparv modules work and introduce
 provided in the following chapters.
 
 The Sparv Pipeline is comprised of some core functionality and many different modules containing Sparv functions that
-serve different purposes like reading and parsing source documents, building or downloading models, producing
-annotations and producing output documents that contain the source text and annotations. All of these modules (i.e. the
+serve different purposes like reading and parsing source files, building or downloading models, producing
+annotations and producing output files that contain the source text and annotations. All of these modules (i.e. the
 code inside the `sparv/modules` directory) are replaceable. A Sparv function is decorated with a special
 [decorator](developers-guide/sparv-decorators) that tells Sparv what purpose it serves. A function's parameters hold
 information about what input is needed in order to run the function and what output is produced by it. The Sparv core