Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check that processors add the number of tokens they say they will #1312

Closed
wants to merge 3 commits into from

Conversation

boyleconnor
Copy link
Contributor

@boyleconnor boyleconnor commented Aug 4, 2023

Fixes #1314

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@boyleconnor boyleconnor marked this pull request as draft August 5, 2023 18:18
@Narsil
Copy link
Collaborator

Narsil commented Aug 7, 2023

No we cannot panic like that it's not OK.

I'm very fine if the reported numbers is not the perfect one, panicking randomly is not acceptable for it

@boyleconnor
Copy link
Contributor Author

@Narsil, I'm very surprised to read this:

I'm very fine if the reported numbers is not the perfect one

If there is a mismatch between added_tokens() and the number of tokens actually added by processor.process_encodings(), then (if truncation is enabled) Tokenizer.encode() will truncate to the wrong length, which I think most Hugging Face users would consider completely unacceptable behavior (e.g. imagine if you are truncating the tokens on some input before sending it to an expensive API).

IIUC the only way that this failure (added_tokens() != final_length - original_length) would be possible is if a Hugging Face contributor makes a mistake when implementing a Processor for a new tokenization system and the value of added_tokens() doesn't match the actual number of added tokens. I don't see how this could be "randomly" triggered without a significant systemic error being present in the tokenizers code base.

@boyleconnor
Copy link
Contributor Author

@ArthurZucker do you want to share thoughts on this PR directly? I believe you mentioned on the issue that you would check it out

@github-actions github-actions bot added the Stale label Dec 14, 2023
@ArthurZucker
Copy link
Collaborator

Hey! I don't think a new processor is planned, which is why it's not really worth the effort for now, but thanks a lot for wanting to contribute 🤗 I'm sorry in the delayed answer

@boyleconnor
Copy link
Contributor Author

Okay, I will close this PR, then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants