
Give error when initializing tokenizer with too high stride #1306

Merged: 5 commits into huggingface:main on Jul 28, 2023

Conversation

@boyleconnor (Contributor) commented on Jul 27, 2023

Related to #1269. Implements the improvements requested there.

Enabling truncation with bad parameters (specifically stride and max_length) on a tokenizer now raises an error (a proper Python exception explaining how the number of added special tokens and max_length relate to stride).

Note that I didn't remove the original assert!(), for a few reasons:

  1. I assumed that the number of added special tokens is always lowest when is_pair = false. I'm not sure whether this is a true or justified assumption.
  2. I have not implemented any error output for the Node bindings. I haven't even looked at them yet, but I imagine they currently enable truncation with bad params and ignore the Err output of the function. It might be nice to implement that as part of this PR.
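The constraint this PR enforces can be sketched in Python. This is an illustration only: the names `validate_truncation` and `n_added_tokens` are made up for the sketch, not the actual tokenizers API.

```python
# A minimal sketch of the check this PR adds on the Rust side, translated
# to Python. Names here are illustrative, not the real API.
def validate_truncation(max_length: int, stride: int, n_added_tokens: int) -> None:
    # The stride must leave room for real content after the
    # post-processor's special tokens are accounted for.
    effective_max_length = max_length - n_added_tokens
    if stride >= effective_max_length:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length "
            f"({max_length}) minus the number of added special tokens "
            f"({n_added_tokens})"
        )

validate_truncation(max_length=512, stride=128, n_added_tokens=2)  # accepted
```

Previously a parameter set that violated this would only hit the assert!() at encode time; with this change, the failure surfaces when truncation is configured.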

@HuggingFaceDocBuilderDev commented on Jul 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@Narsil (Collaborator) left a comment

Thanks for the PR.

The code is good. It's a breaking change, but only on the Rust side.

However, I'm very curious when this is actually an issue. Having such large strides really seems like a bug in the first place to me; it almost looks like you're trying to get a sliding window out of it. Is that it?

    Ok(())
}

/// Disable truncation
#[pyo3(text_signature = "(self)")]
fn no_truncation(&mut self) {
-    self.tokenizer.with_truncation(None);
+    let _ = self.tokenizer.with_truncation(None);
@Narsil (Collaborator) commented on this change:

Can you expect here? no_truncation should never fail, but it's better to actually fail than be in the dark.

/// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
    if let Some(trunc_params) = &trunc {
        let n_added_tokens = self.get_n_added_tokens(false);
@Narsil (Collaborator) commented on this change:

Why false? It depends.

@boyleconnor (Contributor, author) replied:

I assumed that fewer tokens are added when is_pair=false (I'm not actually sure whether this is a good assumption). So, when with_truncation() is called, we check whether there is any possible case in which this combination of max_length and stride is valid for this tokenizer, and fail if there isn't.

Another reason I kept the original assert!() in the inference part of the code is that any additional special token(s) when is_pair=true could push it over the edge and make the stride too large for the max_length.
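The two-stage behavior described above can be illustrated numerically. The added-token counts below assume a BERT-style template and are an assumption for illustration, not values taken from the PR:

```python
# Assumed counts for a BERT-style post-processor (illustration only):
#   single sequence: [CLS] ... [SEP]        -> 2 added tokens
#   pair:            [CLS] A [SEP] B [SEP]  -> 3 added tokens
N_ADDED_SINGLE, N_ADDED_PAIR = 2, 3

max_length, stride = 6, 3

# with_truncation() checks the most permissive case (is_pair=false),
# so these parameters are accepted:
assert stride < max_length - N_ADDED_SINGLE  # 3 < 4

# ...but encoding a pair would still trip the original assert!(), because
# the extra [SEP] shrinks the effective max length down to the stride:
assert not (stride < max_length - N_ADDED_PAIR)  # 3 < 3 is false
```

This is why the early check alone cannot replace the assert!(): parameters valid for single sequences can still be invalid for pairs.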

@boyleconnor (Contributor, author) replied:

@Narsil, can you confirm whether my assumption in the above comment is a good one (i.e. is added_tokens(false) always less than added_tokens(true))?

@boyleconnor (Contributor, author) commented on Jul 27, 2023

> Having such large strides really seems a bug in the first place to me

You are right that this should probably never happen if someone is reading the documentation carefully (and properly understands HuggingFace's idiosyncratic definition of "stride").

However, this issue might come up when a user is trying out toy examples. E.g., if you're playing around with tokenizers and want to see example outputs, you might try running something like the following

import tokenizers

tokenizer = tokenizers.Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(5, stride=3)
print(tokenizer.encode("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."))

would lead to this error. I think users would be particularly inclined to "play around" with tokenizers like this (I certainly did) since the interplay of "stride", "max length", and "added tokens" is hard to wrap your head around without examples.

@Narsil Narsil merged commit c2664ae into huggingface:main Jul 28, 2023
11 of 12 checks passed