Releases: OpenNMT/Tokenizer

Tokenizer 1.29.0

08 Oct 15:18

Changes

  • [Python] Drop support for Python 3.5

New features

  • [Python] Build wheels for Python 3.10
  • [Python] Add tokenization method Tokenizer.tokenize_batch
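
A minimal sketch of how calling code might adopt the batch method while still working with older releases. The helper name `tokenize_lines` is hypothetical, and the sketch assumes `tokenize_batch` mirrors `tokenize` by returning a (tokens, features) pair; check the installed version's documentation before relying on that shape.

```python
from typing import Any, List, Sequence


def tokenize_lines(tokenizer: Any, lines: Sequence[str]) -> List[List[str]]:
    """Tokenize several lines, preferring the batch method when it exists.

    `tokenizer` is expected to behave like a pyonmttok.Tokenizer instance.
    """
    if hasattr(tokenizer, "tokenize_batch"):
        # Assumption: tokenize_batch returns (batch_tokens, batch_features),
        # mirroring the (tokens, features) pair returned by tokenize.
        batch_tokens, _features = tokenizer.tokenize_batch(list(lines))
        return [list(tokens) for tokens in batch_tokens]
    # Fallback for releases before 1.29.0: one tokenize call per line.
    return [list(tokenizer.tokenize(line)[0]) for line in lines]
```

With the package installed, this could be called as `tokenize_lines(pyonmttok.Tokenizer("conservative"), ["Hello World!", "How are you?"])`.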

Tokenizer 1.28.1

30 Sep 16:29

Fixes and improvements

  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)

Tokenizer 1.28.0

17 Sep 08:27

Changes

  • [C++] Remove the SpaceTokenizer class that is not meant to be public and can be confused with the "space" tokenization mode

New features

  • Build Python wheels for Windows
  • Add option tokens_delimiter to configure how tokens are delimited in tokenized files (default is a space)
  • Expose option with_separators in Python and CLI to include whitespace characters in the tokenized output
  • [Python] Add package version information in pyonmttok.__version__
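
As a small illustration of the version attribute, the sketch below reads `pyonmttok.__version__` defensively. The function name `pyonmttok_version` is hypothetical; it returns `None` when the package is not installed or when the installed release predates 1.28.0 and lacks the attribute.

```python
from typing import Optional


def pyonmttok_version() -> Optional[str]:
    """Return the installed pyonmttok version string, or None if unknown."""
    try:
        import pyonmttok
    except ImportError:
        # The package is not installed in this environment.
        return None
    # Releases before 1.28.0 do not define __version__.
    return getattr(pyonmttok, "__version__", None)
```

A caller could log `pyonmttok_version() or "unknown"` at startup to record which release produced a given tokenization.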

Fixes and improvements

  • Fix detokenization when option with_separators is enabled

Tokenizer 1.27.0

30 Aug 09:54

Changes

  • Linux Python wheels are now compiled with manylinux2010 and require pip >= 19.0 for installation
  • macOS Python wheels now require macOS >= 10.14
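
The pip requirement can be checked before installation. The following is a rough sketch, not part of the project: the function name `pip_supports_manylinux2010` is hypothetical, and it only compares the leading major and minor numbers of the version string against 19.0.

```python
import re


def pip_supports_manylinux2010(pip_version: str) -> bool:
    """Return True if the given pip version is at least 19.0."""
    # Read the leading "major[.minor]" part and ignore any suffix
    # such as a patch number or a pre-release tag.
    match = re.match(r"(\d+)(?:\.(\d+))?", pip_version)
    if match is None:
        raise ValueError(f"Unrecognized pip version: {pip_version!r}")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return (major, minor) >= (19, 0)
```

In practice the argument would typically come from `pip.__version__` or from the output of `pip --version`.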

Fixes and improvements

  • Fix casing resolution when some letters do not have case information
  • Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
  • Improve error message when setting invalid segment_alphabet or lang options
  • Update SentencePiece to 0.1.96
  • [Python] Improve declaration of functions and classes for better type hints and checks
  • [Python] Update ICU to 69.1

Tokenizer 1.26.4

25 Jun 09:04

Fixes and improvements

  • Fix a regression introduced in the previous version for preserved tokens that are not segmented by BPE

Tokenizer 1.26.3

24 Jun 11:11

Fixes and improvements

  • Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached

Tokenizer 1.26.2

08 Jun 14:23

Fixes and improvements

  • Fix a divergence with the SentencePiece output when the spacer is detached from the word

Tokenizer 1.26.1

31 May 10:54

Fixes and improvements

  • Fix application of the BPE vocabulary when using preserve_segmented_tokens and a subword appears without joiner in the vocabulary
  • Fix compilation with ICU versions older than 60

Tokenizer 1.26.0

19 Apr 08:28

New features

  • Add lang tokenization option to apply language-specific case mappings

Fixes and improvements

  • Use ICU to convert strings to Unicode values instead of a custom implementation

Tokenizer 1.25.0

15 Mar 09:20

New features

  • Add training flag in tokenization methods to disable subword regularization during inference
  • [Python] Implement __len__ method in the Token class

Fixes and improvements

  • Raise an error when enabling case_markup with incompatible tokenization modes "space" and "none"
  • [Python] Improve parallelization when Tokenizer.tokenize is called from multiple Python threads (the Python GIL is now released)
  • [Python] Clean up some manual Python <-> C++ type conversions