Releases: OpenNMT/Tokenizer
Tokenizer 1.29.0
Changes
- [Python] Drop support for Python 3.5
New features
- [Python] Build wheels for Python 3.10
- [Python] Add tokenization method `Tokenizer.tokenize_batch`
Tokenizer 1.28.1
Fixes and improvements
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence (version 1.27.0 contained a partial fix for this bug)
Tokenizer 1.28.0
Changes
- [C++] Remove the `SpaceTokenizer` class, which is not meant to be public and can be confused with the "space" tokenization mode
New features
- Build Python wheels for Windows
- Add option `tokens_delimiter` to configure how tokens are delimited in tokenized files (default is a space)
- Expose option `with_separators` in Python and CLI to include whitespace characters in the tokenized output
- [Python] Add package version information in `pyonmttok.__version__`
Fixes and improvements
- Fix detokenization when option `with_separators` is enabled
Tokenizer 1.27.0
Changes
- Linux Python wheels are now compiled with `manylinux2010` and require `pip` >= 19.0 for installation
- macOS Python wheels now require macOS >= 10.14
Fixes and improvements
- Fix casing resolution when some letters do not have case information
- Fix detokenization when a token includes a fullwidth percent sign (％) that is not used as an escape sequence
- Improve error message when setting invalid `segment_alphabet` or `lang` options
- Update SentencePiece to 0.1.96
- [Python] Improve declaration of functions and classes for better type hints and checks
- [Python] Update ICU to 69.1
Tokenizer 1.26.4
Fixes and improvements
- Fix a regression introduced in the previous version for preserved tokens that are not segmented by BPE
Tokenizer 1.26.3
Fixes and improvements
- Fix another divergence with the SentencePiece output when there is only one subword and the spacer is detached
Tokenizer 1.26.2
Fixes and improvements
- Fix a divergence with the SentencePiece output when the spacer is detached from the word
Tokenizer 1.26.1
Fixes and improvements
- Fix application of the BPE vocabulary when using `preserve_segmented_tokens` and a subword appears without a joiner in the vocabulary
- Fix compilation with ICU versions older than 60
Tokenizer 1.26.0
New features
- Add `lang` tokenization option to apply language-specific case mappings
Fixes and improvements
- Use ICU to convert strings to Unicode values instead of a custom implementation
Tokenizer 1.25.0
New features
- Add `training` flag in tokenization methods to disable subword regularization during inference
- [Python] Implement `__len__` method in the `Token` class
Fixes and improvements
- Raise an error when enabling `case_markup` with the incompatible tokenization modes "space" and "none"
- [Python] Improve parallelization when `Tokenizer.tokenize` is called from multiple Python threads (the Python GIL is now released)
- [Python] Clean up some manual Python <-> C++ type conversions