Releases: OpenNMT/Tokenizer
Tokenizer 1.29.0
Changes
- [Python] Drop support for Python 3.5
New features
- [Python] Build wheels for Python 3.10
- [Python] Add tokenization method `Tokenizer.tokenize_batch`
Tokenizer 1.28.1
Fixes and improvements
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)
Tokenizer 1.28.0
Changes
- [C++] Remove the `SpaceTokenizer` class, which is not meant to be public and can be confused with the "space" tokenization mode
New features
- Build Python wheels for Windows
- Add option `tokens_delimiter` to configure how tokens are delimited in tokenized files (default is a space)
- Expose option `with_separators` in Python and CLI to include whitespace characters in the tokenized output
- [Python] Add package version information in `pyonmttok.__version__`
Fixes and improvements
- Fix detokenization when option `with_separators` is enabled
Tokenizer 1.27.0
Changes
- Linux Python wheels are now compiled with `manylinux2010` and require `pip` >= 19.0 for installation
- macOS Python wheels now require macOS >= 10.14
Fixes and improvements
- Fix casing resolution when some letters do not have case information
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
- Improve error message when setting invalid `segment_alphabet` or `lang` options
- Update SentencePiece to 0.1.96
- [Python] Improve declaration of functions and classes for better type hints and checks
- [Python] Update ICU to 69.1
Tokenizer 1.26.4
Fixes and improvements
- Fix a regression introduced in the previous version for preserved tokens that are not segmented by BPE
Tokenizer 1.26.3
Fixes and improvements
- Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached
Tokenizer 1.26.2
Fixes and improvements
- Fix a divergence with the SentencePiece output when the spacer is detached from the word
Tokenizer 1.26.1
Fixes and improvements
- Fix application of the BPE vocabulary when using `preserve_segmented_tokens` and a subword appears without a joiner in the vocabulary
- Fix compilation with ICU versions older than 60
Tokenizer 1.26.0
New features
- Add `lang` tokenization option to apply language-specific case mappings
Fixes and improvements
- Use ICU to convert strings to Unicode values instead of a custom implementation
Tokenizer 1.25.0
New features
- Add `training` flag in tokenization methods to disable subword regularization during inference
- [Python] Implement `__len__` method in the `Token` class
Fixes and improvements
- Raise an error when enabling `case_markup` with the incompatible tokenization modes "space" and "none"
- [Python] Improve parallelization when `Tokenizer.tokenize` is called from multiple Python threads (the Python GIL is now released)
- [Python] Clean up some manual Python <-> C++ type conversions