Releases: ufal/udpipe

UDPipe 2.1.0

16 Nov 12:47

Compared to UDPipe 2.0.0:

  • Add support for using a morphological dictionary via ufal.morphodita during prediction – if the dictionary returns some analyses for a given form, we return the one most probable according to the predicted logits.
  • Add support for --no_single_root in the evaluation script.
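
The dictionary-constrained prediction in the first item can be sketched roughly as follows. This is an illustration, not the UDPipe 2 implementation: the function name `pick_analysis`, the plain-list representation of the logits, and the fallback to the unconstrained argmax when the dictionary offers nothing usable are all assumptions.

```python
from typing import Dict, Iterable, List


def pick_analysis(logits: List[float], tag_ids: Dict[str, int],
                  dictionary_tags: Iterable[str]) -> str:
    """Return the highest-scoring tag, restricted to dictionary analyses if any."""
    # Keep only dictionary analyses that the model knows how to score.
    allowed = [tag for tag in dictionary_tags if tag in tag_ids]
    # Assumption: with no usable analyses, fall back to the unconstrained prediction.
    candidates = allowed if allowed else list(tag_ids)
    return max(candidates, key=lambda tag: logits[tag_ids[tag]])


# Hypothetical tag inventory and logits for a single word form.
tag_ids = {"NOUN": 0, "VERB": 1, "ADJ": 2}
logits = [0.2, 1.5, 0.9]
print(pick_analysis(logits, tag_ids, ["NOUN", "ADJ"]))  # dictionary excludes VERB
print(pick_analysis(logits, tag_ids, []))               # no analyses available
```

With these made-up numbers, the dictionary overrides the model's top choice (VERB) in the first call, because VERB is not among the analyses returned for the form.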

UDPipe 1.3.1

16 Nov 09:01

Maintenance release of UDPipe1.

Changes since UDPipe 1.3.0:

  • Update MorphoDiTa to 1.11.2.

UDPipe 1.3.0

16 Feb 18:24

Maintenance release of UDPipe1.

Changes since UDPipe 1.2.0:

  • Fix issues reported by UndefinedBehaviorSanitizer and AddressSanitizer.
  • Add segment_size and learning_rate_final parameters to tokenizer training.
  • Add several options to udpipe_server.
  • Fix bug in returning the trained model as a string; use bytes instead.
  • Fix a bug where newlines after URLs/emails were treated as plain spaces.
  • Fix a silent error on aarch64 caused by assuming char is signed.
  • On Windows, the file paths are now UTF-8 encoded, instead of ANSI. This change affects the API, binary arguments, and program outputs.
  • The Windows binaries are now compiled with VS 2019; systems older than Windows 7 are no longer supported.
  • Add ARM64 macOS build.
  • Python wheels are provided for Python 3.6-3.11.

UDPipe 2.0.0

05 Aug 10:49

Compared to UDPipe 1:

  • UDPipe 2 is Python-only and tested only in Linux,
  • UDPipe 2 is meant as a research tool, not as a user-friendly UDPipe 1 replacement,
  • UDPipe 2 achieves much better accuracy, but requires a GPU to run at a reasonable speed,
  • UDPipe 2 does not perform tokenization by itself – it uses UDPipe 1 for that.

UDPipe 2 is available as a REST service running at https://lindat.mff.cuni.cz/services/udpipe. If you like, you can use the udpipe2_client.py script to interact with it.
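
If you would rather call the service from your own code than through udpipe2_client.py, a request can be assembled along the following lines. This is only a sketch: the `/api/process` path, the field names (`data`, `model`, `tokenizer`, `tagger`, `parser`) and the `result` key of the JSON response are assumptions based on the service's REST API documentation, and the helper names `build_request` and `process` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

SERVICE_URL = "https://lindat.mff.cuni.cz/services/udpipe"


def build_request(text: str, model: str = "english",
                  service_url: str = SERVICE_URL) -> urllib.request.Request:
    """Build a POST request asking the service to tokenize, tag and parse `text`."""
    # Assumption: endpoint path and field names follow the service's REST API docs.
    fields = {"data": text, "model": model, "tokenizer": "", "tagger": "", "parser": ""}
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(service_url + "/api/process", data=body)


def process(text: str, model: str = "english") -> str:
    """Send the request and return the service output (requires network access)."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        # Assumption: the JSON response carries the CoNLL-U output under "result".
        return json.load(response)["result"]
```

Passing a different `service_url` to `build_request` points the same request at a locally running server instead of the public service.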

However, if you prefer to run UDPipe 2 locally, you can use this release.

Running Inference with Existing Models

To run UDPipe 2, you need to first download a model from the list of UDPipe 2 models. Then you can run UDPipe 2 as a local REST server, and use the udpipe2_client.py script to interact with it (in the same way as with the official service).

To run the server, use the udpipe2_server.py script.

  • Install the packages listed in requirements.txt. While only TF 1 is supported for model training (ancient, I know), you can also use TF 2 for inference.
  • The script has the following required options:
    • port: the port to listen on. We use SO_REUSEPORT to allow multiple processes to run concurrently, supporting seamless upgrades;
    • default_model: model name to use when no model is specified in the request;
    • models: each model is then a quadruple of the following parameters (each published model contains a file MODEL.txt with these parameters):
      • model names: any number of model names separated by :; furthermore, any hyphen-separated prefix of any model name can also be used as a name (e.g., czech-pdt-ud-2.10-220711:cs_pdt-ud-2.10-220711:cs:ces:cze);
      • model path: path to the model directory;
      • treebank name: because multiple treebanks can be handled by a single model, we need to specify a treebank name to use (this also specifies which tokenizer to use from the model directory);
      • acknowledgements: a URL to the model's acknowledgements.
  • The script has the following optional parameters:
    • --batch_size: batch size to use (default 32);
    • --logfile: if specified, log to this file instead of standard error;
    • --max_request_size: maximum request size, in bytes (default 4MB);
    • --preload_models: list of models to preload (or all) immediately after start (default none);
    • --threads: number of threads to use (default is to use all physical cores);
    • --wembedding_server: for deployment purposes, it might be useful to compute the contextualized embeddings (mBERT, RobeCzech) not in the UDPipe 2 service, but in a specialized service – see https://github.com/ufal/wembedding_service for documentation of the wembeddings service (default is to compute the embeddings directly in the UDPipe 2 service).
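
To make the model-name rule concrete, the sketch below enumerates every name under which a model would be reachable, given its colon-separated names field. The function `expand_model_names` is hypothetical and not part of udpipe2_server.py; how the server resolves a prefix shared by several models is not covered here.

```python
from typing import List


def expand_model_names(names_field: str) -> List[str]:
    """Return every name a model answers to, given its colon-separated names field."""
    aliases: List[str] = []
    for name in names_field.split(":"):
        parts = name.split("-")
        # Every hyphen-separated prefix of a name is itself a valid name.
        for end in range(1, len(parts) + 1):
            alias = "-".join(parts[:end])
            if alias not in aliases:
                aliases.append(alias)
    return aliases


names = "czech-pdt-ud-2.10-220711:cs_pdt-ud-2.10-220711:cs:ces:cze"
for alias in expand_model_names(names):
    print(alias)
```

For the example names field above, this lists `czech`, `czech-pdt`, `czech-pdt-ud` and so on up to the full name, the corresponding `cs_pdt` prefixes, and the short codes `cs`, `ces` and `cze`.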

The service can be stopped by a SIGINT (Ctrl+C) signal or by a SIGUSR1 signal. Once such a signal is received, the service stops accepting new requests, but waits until all existing connections are handled and closed.
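
The stop behaviour can be pictured with the following simplified sketch. It is not taken from udpipe2_server.py: the names `stop_requested`, `install_stop_handlers` and `serve` are made up, and a real server would additionally track open connections rather than process one request at a time.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request; the serving loop reacts to it."""
    stop_requested.set()


def install_stop_handlers() -> None:
    # Both signals trigger the same graceful stop (must run in the main thread).
    for sig in (signal.SIGINT, signal.SIGUSR1):
        signal.signal(sig, request_stop)


def serve(handle_one_request) -> None:
    """Handle requests until a stop is requested."""
    while not stop_requested.is_set():
        # A request that has already started always runs to completion;
        # the flag is only checked between requests.
        handle_one_request()
```

Because the handler only sets a flag, work that is already in progress is never interrupted; the loop simply declines to start anything new.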

The models are loaded on demand, but they are never freed. If a GPU is available, then all computation is performed on it (and an OOM might occur if too many models are loaded). If you would like to run BERT on a GPU and the remaining computation on a CPU, you could use a GPU-enabled wembeddings service plus a CPU-only UDPipe 2 service.

UDPipe 1.2.0

02 Aug 19:08

Changes since UDPipe 1.1.0:

  • On-demand loading of models in REST server, with a pool of least recently used models.
  • Make GRU tokenizer dimension configurable (16, 24, 64 supported).
  • Track paragraph boundaries even under normalized_spaces.
  • Support experimental sentence segmentation using the tokenizer and the parser jointly.
  • Add EPE output format.
  • Make default model in REST server explicit.
  • Support pre-filling according to URL params in the webapp.
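
The on-demand loading with a least-recently-used pool mentioned in the first item can be illustrated as follows. UDPipe 1 itself is written in C++, so this Python sketch only conveys the idea; the class `ModelPool`, its `capacity` parameter and the `loader` callback are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Generic, List, TypeVar

M = TypeVar("M")


class ModelPool(Generic[M]):
    """Load models on demand and keep only the most recently used ones."""

    def __init__(self, loader: Callable[[str], M], capacity: int) -> None:
        self._loader = loader
        self._capacity = capacity
        self._models: "OrderedDict[str, M]" = OrderedDict()

    def get(self, name: str) -> M:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = self._loader(name)  # first request for this model: load it now
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model

    def loaded(self) -> List[str]:
        """Names of the models currently held, least recently used first."""
        return list(self._models)
```

A pool created with `capacity=2` that serves requests for `cs`, `en`, `cs` and `de` in that order ends up holding `cs` and `de`, since `en` was the least recently used model when the third one was loaded.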

UDPipe 1.1.0

29 Mar 10:24

Changes since UDPipe 1.0.0:

  • Morphodita_parsito models (now version 3) require at least UDPipe version 1.1.0.
  • CoNLL-U v2 format is supported. Notably, spaces in forms and lemmas are now allowed, as are empty nodes.
  • Support options for input_format and output_format instances.
  • Preserve all spacing when tokenizing.
  • Optionally generate document-level token ranges in the original text.
  • Optionally respect given segmentation during tokenization.
  • Tokenizer can be trained to allow spaces in tokens (default if there are forms with spaces in the training data).
  • Parser can be trained to always return one root per sentence (default).
  • Improve input_format API to allow inter-block state (for correct tracking of inter-sentence spaces and document-level offsets).
  • Improve output_format API to support begin/end document marks and to allow state in the output_format instance (to allow numbering output sentences, for example).
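
As an illustration of how the document-level token ranges might be consumed, the sketch below reads them back from CoNLL-U output. It assumes that the ranges appear in the MISC column as `TokenRange=start:end`, that the offsets count characters of the original text, and that every token line has all ten columns; the function name `token_ranges` and the sample rows are made up.

```python
from typing import List, Tuple


def token_ranges(conllu: str) -> List[Tuple[str, int, int]]:
    """Extract (form, start, end) from the TokenRange entries of the MISC column."""
    ranges = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comments
        columns = line.split("\t")
        form, misc = columns[1], columns[9]
        for item in misc.split("|"):
            if item.startswith("TokenRange="):
                start, end = item[len("TokenRange="):].split(":")
                ranges.append((form, int(start), int(end)))
    return ranges


# Hand-written sample rows; only FORM and MISC are filled in.
text = "Hello world."
rows = [
    ["1", "Hello", "_", "_", "_", "_", "_", "_", "_", "TokenRange=0:5"],
    ["2", "world", "_", "_", "_", "_", "_", "_", "_", "SpaceAfter=No|TokenRange=6:11"],
    ["3", ".", "_", "_", "_", "_", "_", "_", "_", "TokenRange=11:12"],
]
conllu = "\n".join("\t".join(row) for row in rows)
for form, start, end in token_ranges(conllu):
    print(form, repr(text[start:end]))
```

Slicing the original text with each range recovers the token exactly as it appeared in the input, which is what makes the ranges useful for aligning annotations back to the source document.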

UDPipe 1.0.0

27 May 06:54
  • Initial public release.