Merge branch 'dev'
anne17 committed Nov 3, 2022
2 parents 8c7e1f9 + cd5366b commit d3eb0db
Showing 49 changed files with 1,425 additions and 359 deletions.
36 changes: 35 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,40 @@
# Changelog

## [5.0.0]
## [5.1.0] - 2022-11-03

### Added

- Added exporter for Korp frontend config files.
- Added the `--keep-going` flag, which makes Sparv continue with other independent tasks when a task fails.
- Added an [overview of some of the built-in
annotations](https://spraakbanken.gu.se/sparv/#/user-manual/available-analyses) in the documentation.
- Added `AnnotationName` and `ExportAnnotationNames` classes, to be used instead of the `is_input` parameter.
- Lists of annotations can now be used as input and output for annotators by using the `List` type hint.
- Added support for optional annotator outputs.
- Added support for uninstallers using the `@uninstaller` decorator.
- Added `Marker` and `OutputMarker` classes, to be used mainly by installers and uninstallers.
- Added a new annotator `misc:concat2` which concatenates two or more annotations with an optional separator.
- Added a `remove` method to the `Annotation` classes for removing annotation files.
- Added a metadata field: `short_description`.
- Added a setting for truncating the annotations `misc_head` and `misc_tail` to avoid crashes by cwb-encode.

### Changed

- Removed the `is_input` parameter from the `ExportAnnotationsAllSourceFiles` class as it didn't make sense.
- Installers and uninstallers are now required to create markers.
- Removed Korp modes info from CWB info file as it is included in the Korp config.
- Disabled highlighting of numbers in the log output because it was confusing.
- Slightly improved the `sbx_freq_list_date` exporter.
- The util functions `install_directory` and `install_file` have been replaced by the more general `install_path`.

### Fixed

- Fixed 'maximum recursion depth exceeded' problem by upgrading Stanza.
- The preloader now respects the compression setting.
- Fixed progress bars not working when running preloaded annotators.
- Fixed a rare logging crash.

## [5.0.0] - 2022-08-10

### Added

78 changes: 68 additions & 10 deletions docs/developers-guide/sparv-classes.md
@@ -20,7 +20,6 @@ annotation is needed as input for a function, e.g. `Annotation("<token:word>")`.

- `name`: The name of the annotation.
- `source_file`: The name of the source file.
- `is_input`: If set to `False` the annotation won't be added to the rule's input. Default: `True`

**Properties:**

@@ -32,6 +31,7 @@ annotation is needed as input for a function, e.g. `Annotation("<token:word>")`.

- `split()`: Split name into annotation name and attribute.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.
- `read(allow_newlines: bool = False)`: Yield each line from the annotation.
- `get_children(child: BaseAnnotation, orphan_alert=False, preserve_parent_annotation_order=False)`: Return two lists.
The first one is a list with n (= total number of parents) elements where every element is a list of indices in the
@@ -70,6 +70,7 @@ require the specified annotation for every source file in the corpus.
- `read_spans(source_file: str, decimals=False, with_annotation_name=False)`: Yield the spans of the annotation.
- `create_empty_attribute(source_file: str)`: Return a list filled with None of the same size as this annotation.
- `exists(source_file: str)`: Return True if annotation file exists.
- `remove(source_file: str)`: Remove annotation file.
- `get_size(source_file: str)`: Get the number of values.


@@ -91,6 +92,19 @@ file).

- `split()`: Split name into annotation name and attribute.
- `read()`: Read arbitrary corpus level string data from annotation file.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.


## AnnotationName
Use this class when only the name of an annotation is of interest, not the actual data. The annotation will not be added
as a prerequisite for the annotator, meaning that the use of `AnnotationName` will not automatically trigger the
creation of the referenced annotation.

**Arguments:**

- `name`: The name of the annotation.
- `source_file`: The name of the source file.
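
As a rough illustration of the distinction described above, the sketch below uses hypothetical stand-in classes (deliberately not imported from `sparv.api`) and a hypothetical helper `rule_inputs` to show how a pipeline could collect prerequisites from an annotator's default arguments: `Annotation` parameters count as inputs, while `AnnotationName` parameters only carry a name. How Sparv actually builds its rules is not shown in this diff, and the annotation name `<sentence>` is only an example.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    """Stand-in: the annotation's data is needed, so it is a prerequisite."""
    name: str


@dataclass(frozen=True)
class AnnotationName:
    """Stand-in: only the name is of interest, so nothing has to be created."""
    name: str


def rule_inputs(func: Callable) -> List[str]:
    """Collect the names of parameters whose default value is an Annotation."""
    params = inspect.signature(func).parameters.values()
    return [p.default.name for p in params if isinstance(p.default, Annotation)]


def my_annotator(word: Annotation = Annotation("<token:word>"),
                 sentence: AnnotationName = AnnotationName("<sentence>")):
    """Hypothetical annotator: reads words, refers to sentences by name only."""


print(rule_inputs(my_annotator))  # ['<token:word>']
```

Only `<token:word>` ends up in the list of inputs, so in this model the sentence annotation would never be created just because its name is referenced.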


## AnnotationData
@@ -111,6 +125,7 @@ This class represents an annotation holding arbitrary data, i.e. data that is no

- `split()`: Split name into annotation name and attribute.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.
- `read(source_file: Optional[str] = None)`: Read arbitrary string data from annotation file.


@@ -131,8 +146,9 @@ file in the corpus.
**Methods:**

- `split()`: Split name into annotation name and attribute.
- `exists()`: Return True if annotation file exists.
- `read(source_file: Optional[str] = None)`: Read arbitrary string data from annotation file.
- `exists(source_file: str)`: Return True if annotation file exists.
- `remove(source_file: str)`: Remove annotation file.
- `read(source_file: str)`: Read arbitrary string data from annotation file.


## Binary
@@ -182,25 +198,29 @@ An instance of this class represents an export file. This class is used to defin

## ExportAnnotations
List of annotations to be included in the export. This list is defined in the corpus configuration. Annotation files
for the current source file will automatically be added as dependencies when using this class, unless `is_input` is set
to `False`.
for the current source file will automatically be added as dependencies when using this class.

**Arguments:**

- `config_name`: The config variable pointing out what annotations to include.
- `is_input`: If set to `False` the annotations won't be added to the rule's input. Default: `True`


## ExportAnnotationsAllSourceFiles
List of annotations to be included in the export. This list is defined in the corpus configuration. Annotation files
for _all_ source files will automatically be added as dependencies when using this class, unless `is_input` is set to
`False`. With `is_input` set to `False`, there is no difference between using `ExportAnnotationsAllSourceFiles` and
`ExportAnnotations`.
for _all_ source files will automatically be added as dependencies when using this class.

**Arguments:**

- `config_name`: The config variable pointing out what annotations to include.


## ExportAnnotationNames
List of annotations to be included in the export. This list is defined in the corpus configuration. Unlike
`ExportAnnotations`, the annotations will not be added as dependencies when using this class.

**Arguments:**

- `config_name`: The config variable pointing out what annotations to include.
- `is_input`: If set to `False` the annotations won't be added to the rule's input. Default: `True`


## ExportInput
@@ -225,13 +245,29 @@ List of header annotation names for a given source file.
- `read()`: Read the headers file and return a list of header annotation names.
- `write(header_annotations: List[str])`: Write headers file.
- `exists()`: Return True if headers file exists for this source file.
- `remove()`: Remove headers file.


## Language
An instance of this class holds information about the language of the corpus. This information is retrieved from the
corpus configuration and is specified as an ISO 639-3 code.


## Marker
Similar to `AnnotationCommonData`, but usually without any actual data. Markers are simply used to tell if something has
been run. They are created by using `OutputMarker`.

**Arguments:**

- `name`: The name of the marker.

**Methods:**

- `read()`: Read arbitrary corpus level string data from marker file.
- `exists()`: Return True if marker file exists.
- `remove()`: Remove marker file.


## Model
An instance of this class holds a path to a model file relative to the Sparv model directory. This class is typically
used as input to annotator functions.
@@ -291,6 +327,7 @@ Regular annotation or attribute used as output (e.g. of an annotator function).
- `write(values, append: bool = False, allow_newlines: bool = False, source_file: Optional[str] = None)`: Write an
annotation to file. Existing annotation will be overwritten. 'values' should be a list of values.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.


## OutputAllSourceFiles
@@ -308,6 +345,7 @@ file must be specified for all actions.
- `write(values, source_file: str, append: bool = False, allow_newlines: bool = False)`: Write an annotation to file.
Existing annotation will be overwritten. 'values' should be a list of values.
- `exists(source_file: str)`: Return True if annotation file exists.
- `remove(source_file: str)`: Remove annotation file.


## OutputCommonData
@@ -322,6 +360,8 @@ Similar to [`OutputData`](#outputdata) but for a data annotation that is valid f

- `split()`: Split name into annotation name and attribute.
- `write(value, append: bool = False)`: Write arbitrary corpus level string data to annotation file.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.


## OutputData
@@ -339,6 +379,7 @@ is used as output.
- `split()`: Split name into annotation name and attribute.
- `write(value, append: bool = False)`: Write arbitrary corpus level string data to annotation file.
- `exists()`: Return True if annotation file exists.
- `remove()`: Remove annotation file.


## OutputDataAllSourceFiles
@@ -355,6 +396,23 @@ but the source file must be specified for all actions.
- `split()`: Split name into annotation name and attribute.
- `write(value, source_file: str, append: bool = False)`: Write arbitrary corpus level string data to annotation file.
- `exists(source_file: str)`: Return True if annotation file exists.
- `remove(source_file: str)`: Remove annotation file.


## OutputMarker
Similar to `OutputCommonData`, but usually without any actual data. Markers are simply used to tell that something has
been run, usually used by functions that don't have any natural output, like installers and uninstallers.

**Arguments:**

- `name`: The name of the marker.
- `cls`: The annotation class of the output.
- `description`: An optional description.

**Methods:**

- `write(value = "")`: Write arbitrary corpus level string data to marker file. Usually called without arguments.
- `exists()`: Return True if marker file exists.
- `remove()`: Remove marker file.
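
To make the marker idea concrete, here is a minimal sketch of the write/exists/remove life cycle. `FileMarker` is a hypothetical stand-in built on `pathlib`, not the `OutputMarker` class itself; it only assumes that a marker is a (usually empty) file whose existence signals that something has been run. The marker file name is borrowed from the installer example in this commit.

```python
import tempfile
from pathlib import Path


class FileMarker:
    """Stand-in for a marker: a (usually empty) file whose existence is the signal."""

    def __init__(self, directory: Path, name: str):
        self.path = directory / name

    def write(self, value: str = "") -> None:
        """Create the marker file, normally without any content."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(value, encoding="utf-8")

    def exists(self) -> bool:
        return self.path.is_file()

    def remove(self) -> None:
        """Delete the marker file; a missing file is not an error."""
        self.path.unlink(missing_ok=True)


def marker_lifecycle() -> list:
    """Return the marker's existence before writing, after writing and after removal."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = FileMarker(Path(tmp), "xml_export.install_export_pretty_marker")
        states = [marker.exists()]
        marker.write()  # usually called without arguments
        states.append(marker.exists())
        marker.remove()
        states.append(marker.exists())
    return states


print(marker_lifecycle())  # [False, True, False]
```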


## Source
41 changes: 39 additions & 2 deletions docs/developers-guide/sparv-decorators.md
@@ -114,14 +121,21 @@ def freq_list_simple(corpus: Corpus = Corpus(),
```

## @installer
A function decorated with `@installer` is used to copy a corpus export to a remote server.
A function decorated with `@installer` is used to deploy the corpus or related files to a remote location. For example,
the XML output could be copied to a web server, or SQL data could be inserted into a database.

Every installer needs to create a marker of the type `OutputMarker` at the end of a successful installation. Simply
call the `write()` method on the marker to create the required empty file.

It is recommended that an installer removes any related uninstaller's marker, to enable uninstallation.

**Arguments:**

- `description`: Description of the installer. Used for displaying help texts in the CLI.
- `name`: Optional name to use instead of the function name.
- `config`: List of Config instances defining config options for the installer.
- `language`: List of supported languages. If no list is supplied all languages are supported.
- `uninstaller`: Name of related uninstaller.

**Example:**
```python
@@ -131,12 +138,42 @@ A function decorated with `@installer` is used to copy a corpus export to a remo
])
def install(corpus: Corpus = Corpus(),
xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
out: OutputCommonData = OutputCommonData("xml_export.install_export_pretty_marker"),
out: OutputMarker = OutputMarker("xml_export.install_export_pretty_marker"),
export_path: str = Config("xml_export.export_path"),
host: str = Config("xml_export.export_host")):
...
```

## @uninstaller
A function decorated with `@uninstaller` is used to undo what an installer has done, e.g. remove corpus files from a
remote location or delete corpus data from a database.

Every uninstaller needs to create a marker of the type `OutputMarker` at the end of a successful uninstallation.
Simply call the `write()` method on the marker to create the required empty file.

It is recommended that an uninstaller removes any related installer's marker, to enable re-installation.

**Arguments:**

- `description`: Description of the uninstaller. Used for displaying help texts in the CLI.
- `name`: Optional name to use instead of the function name.
- `config`: List of Config instances defining config options for the uninstaller.
- `language`: List of supported languages. If no list is supplied all languages are supported.

**Example:**
```python
@uninstaller("Remove compressed XML from remote host", config=[
Config("xml_export.export_host", "", description="Remote host to remove XML export from."),
Config("xml_export.export_path", "", description="Path on remote host to remove XML export from.")
])
def uninstall(corpus: Corpus = Corpus(),
xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
out: OutputMarker = OutputMarker("xml_export.uninstall_export_pretty_marker"),
export_path: str = Config("xml_export.export_path"),
host: str = Config("xml_export.export_host")):
...
```
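
The recommendation that installers and uninstallers remove each other's markers can be sketched as follows. This is an illustration with plain files only: the helper `finish` and the marker file names are hypothetical, and how an installer gets hold of its uninstaller's marker in Sparv is not shown in this diff.

```python
import tempfile
from pathlib import Path


def finish(own_marker: Path, counterpart_marker: Path) -> None:
    """Record a successful run and invalidate the opposite action's marker."""
    own_marker.write_text("", encoding="utf-8")  # the required empty marker file
    counterpart_marker.unlink(missing_ok=True)   # lets the opposite action run again


def demo() -> list:
    """Return (install marker exists, uninstall marker exists) after each step."""
    with tempfile.TemporaryDirectory() as tmp:
        installed = Path(tmp) / "install_marker"
        uninstalled = Path(tmp) / "uninstall_marker"
        history = []
        finish(installed, uninstalled)  # end of a successful installation
        history.append((installed.exists(), uninstalled.exists()))
        finish(uninstalled, installed)  # end of a successful uninstallation
        history.append((installed.exists(), uninstalled.exists()))
    return history


print(demo())  # [(True, False), (False, True)]
```

Because each step deletes the other side's marker, a build system that skips tasks whose output already exists will treat the opposite action as not yet done and run it when asked.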

## @modelbuilder
A function decorated with `@modelbuilder` is used to set up a model used by other Sparv components (typically
annotators). Setting up a model could for example mean downloading a file, unzipping it, converting it into a different
26 changes: 13 additions & 13 deletions docs/developers-guide/utilities.md
@@ -88,28 +88,28 @@ Reorder chunks according to `chunk_order` and open/close tags in the correct ord
- `chunk_order`: Annotation containing the new order of the chunk.


## Install Utils
`sparv.api.util.install` provides util functions used for installing corpora onto remote locations.
## Install/Uninstall Utils
`sparv.api.util.install` provides util functions used for installing and uninstalling corpora, either locally or
remotely.


### install_directory()
Rsync every file from a local directory to a target host. The target path is extracted from filenames by replacing "#"
with "/".
### install_path()
Transfer a file or directory to a target destination, optionally on a different host.

**Arguments:**

- `host`: The remote host to install to.
- `directory`: The directory to sync.
- `source_path`: Path to the local file or directory to sync.
- `host` (optional): The remote host to install to.
- `target_path`: The name of the target file or directory.


### install_file()
Rsync a file to a target host.
### uninstall_path()
Remove a file or directory, optionally on a different host.

**Arguments:**

- `local_file`: Path to the local file to sync.
- `host`: The remote host to install to.
- `remote_file`: The name of the resulting file on the remote host.
- `path`: Path to the file or directory to remove.
- `host` (optional): The remote host on which the file or directory is located.
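
As a sketch of what "a file or directory" means for these two utilities, the local case could look roughly like this. The functions `copy_path` and `remove_path` are hypothetical stand-ins written with `shutil`; they are not the implementation of `install_path()` or `uninstall_path()`, and the remote-host case is left out entirely.

```python
import shutil
import tempfile
from pathlib import Path


def copy_path(source_path: Path, target_path: Path) -> None:
    """Copy a file or a whole directory tree to target_path (local case only)."""
    target_path.parent.mkdir(parents=True, exist_ok=True)
    if source_path.is_dir():
        shutil.copytree(source_path, target_path, dirs_exist_ok=True)
    else:
        shutil.copy2(source_path, target_path)


def remove_path(path: Path) -> None:
    """Remove a file or directory; do nothing if it does not exist."""
    if path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()


def demo() -> list:
    """Install a directory, check the copy, then uninstall it again."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "export" / "corpus.xml"
        source.parent.mkdir()
        source.write_text("<corpus/>", encoding="utf-8")
        target = Path(tmp) / "installed"
        copy_path(source.parent, target)
        copied = (target / "corpus.xml").exists()
        remove_path(target)
        return [copied, target.exists()]


print(demo())  # [True, False]
```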


### install_mysql()
@@ -176,7 +176,7 @@ Call Java with a jar file, command line arguments and stdin. Returns a pair `(st


### clear_directory()
Create a new empty directory. Remove it's contents if it already exists.
Create a new empty directory. Remove its contents if it already exists.

**Arguments:**

2 changes: 1 addition & 1 deletion docs/docsify/_coverpage.md
@@ -4,7 +4,7 @@

> Språkbanken's text analysis tool
<p class="version"> version 5.0.0 </p>
<p class="version"> version 5.1.0 </p>

<p class="links">
<a class="button" target="_blank" href="https://github.com/spraakbanken/sparv-pipeline">Sparv on GitHub</a>
1 change: 1 addition & 0 deletions docs/docsify/_sidebar.md
@@ -4,6 +4,7 @@
- [Running Sparv](user-manual/running-sparv.md)
- [Requirements for Source Files](user-manual/requirements-for-source-files.md)
- [Corpus Configuration](user-manual/corpus-configuration.md)
- [Available analyses](user-manual/available-analyses.md)
- Developer's Guide
- [General Concepts](developers-guide/general-concepts.md)
- [Writing Sparv Plugins](developers-guide/writing-sparv-plugins.md)
3 changes: 3 additions & 0 deletions docs/docsify/sparv-pipeline.md
@@ -9,6 +9,9 @@ or suggestions please contact <[email protected]>.

This documentation is also available as PDF. You can download the [user manual](https://github.com/spraakbanken/sparv-pipeline/releases/latest/download/user-manual.pdf) and the [developer's guide](https://github.com/spraakbanken/sparv-pipeline/releases/latest/download/developers-guide.pdf) from the [latest Sparv release on GitHub](https://github.com/spraakbanken/sparv-pipeline/releases/latest).

Cite Sparv: *[Martin Hammarstedt, Anne Schumacher, Lars Borin, Markus Forsberg (2022): Sparv 5 User Manual](https://gup.ub.gu.se/publication/318405?lang=en)* &nbsp; [![BibTeX](_media/bibtex.png)](https://spraakbanken.gu.se/en/research/publications/bibtex/318405)

> [!TIP]
> Did you know that you can get notified about new Sparv releases by subscribing to our GitHub repository? Here's how:
> 1. Log in to GitHub
Binary file added docs/images/bibtex.png
3 changes: 1 addition & 2 deletions docs/md2pdf/.gitignore
@@ -1,5 +1,4 @@
*.pdf
user-manual.md
*.md
user-manual.tex
developers-guide.md
developers-guide.tex