
Give error when initializing tokenizer with too high stride #1306

Merged: 5 commits into huggingface:main on Jul 28, 2023

Conversation

@boyleconnor (Contributor) commented on Jul 27, 2023

Related to #1269. Implements the improvements requested there.

Enabling truncation with bad parameters (specifically stride and max_length) on a tokenizer now raises an error (a proper Python exception explaining how the number of added special tokens and max_length relate to stride).

Note that I didn't remove the original assert!(), for a few reasons:

  1. I assumed that the number of added special tokens is always lowest when is_pair = false. I'm not sure whether this is a true or justified assumption.
  2. I have not implemented any error output for the Node bindings. I haven't even looked at them yet, but I imagine they currently enable truncation with bad params and ignore the Err output of the function. It might be nice to implement that as part of this PR.
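The constraint this PR enforces can be sketched in Python. This is an illustration only: the names `validate_truncation` and `n_added_tokens` are made up for the sketch, not the actual tokenizers API.

```python
# A minimal sketch of the check this PR adds on the Rust side, translated
# to Python. Names here are illustrative, not the real API.
def validate_truncation(max_length: int, stride: int, n_added_tokens: int) -> None:
    # The stride must leave room for real content after the
    # post-processor's special tokens are accounted for.
    effective_max_length = max_length - n_added_tokens
    if stride >= effective_max_length:
        raise ValueError(
            f"stride ({stride}) must be strictly less than max_length "
            f"({max_length}) minus the number of added special tokens "
            f"({n_added_tokens})"
        )

validate_truncation(max_length=512, stride=128, n_added_tokens=2)  # accepted
```

Previously a parameter set that violated this would only hit the assert!() at encode time; with this change, the failure surfaces when truncation is configured.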

@HuggingFaceDocBuilderDev commented on Jul 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@Narsil (Collaborator) left a comment

Thanks for the PR.

The code is good. It's a breaking change, but only on the Rust side.

However, I'm very curious when this is actually an issue. Having such large strides really seems like a bug in the first place to me; it almost looks like you're trying to get a sliding window out of it. Is that it?

    Ok(())
}

/// Disable truncation
#[pyo3(text_signature = "(self)")]
fn no_truncation(&mut self) {
-    self.tokenizer.with_truncation(None);
+    let _ = self.tokenizer.with_truncation(None);
@Narsil (Collaborator) commented on this change:

Can you expect here? no_truncation should never fail, but it's better to actually fail than be in the dark.

/// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
    if let Some(trunc_params) = &trunc {
        let n_added_tokens = self.get_n_added_tokens(false);
@Narsil (Collaborator) commented on this change:

Why false? It depends.

@boyleconnor (Contributor, author) replied:

I assumed that fewer tokens are added when is_pair=false (I'm not actually sure whether this is a good assumption). So, when with_truncation() is called, we check whether there is any possible case in which this combination of max_length and stride is valid for this tokenizer, and fail if there isn't.

Another reason I kept the original assert!() in the inference part of the code is that any additional special token(s) when is_pair=true could push it over the edge and make the stride too large for the max_length.
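The two-stage behavior described above can be illustrated numerically. The added-token counts below assume a BERT-style template and are an assumption for illustration, not values taken from the PR:

```python
# Assumed counts for a BERT-style post-processor (illustration only):
#   single sequence: [CLS] ... [SEP]        -> 2 added tokens
#   pair:            [CLS] A [SEP] B [SEP]  -> 3 added tokens
N_ADDED_SINGLE, N_ADDED_PAIR = 2, 3

max_length, stride = 6, 3

# with_truncation() checks the most permissive case (is_pair=false),
# so these parameters are accepted:
assert stride < max_length - N_ADDED_SINGLE  # 3 < 4

# ...but encoding a pair would still trip the original assert!(), because
# the extra [SEP] shrinks the effective max length down to the stride:
assert not (stride < max_length - N_ADDED_PAIR)  # 3 < 3 is false
```

This is why the early check alone cannot replace the assert!(): parameters valid for single sequences can still be invalid for pairs.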

@boyleconnor (Contributor, author) replied:

@Narsil, can you confirm whether my assumption in the above comment is a good one (i.e. is added_tokens(false) always less than added_tokens(true))?

@boyleconnor (Contributor, author) commented on Jul 27, 2023

> Having such large strides really seems a bug in the first place to me

You are right that this should probably never happen if someone is reading the documentation carefully (and properly understands HuggingFace's idiosyncratic definition of "stride").

However, this issue might come up when a user is trying out toy examples. E.g., if you're playing around with tokenizers and want to see example outputs, you might try running something like the following

import tokenizers

tokenizer = tokenizers.Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(5, stride=3)
print(tokenizer.encode("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."))

would lead to this error. I think users would be particularly inclined to "play around" with tokenizers like this (I certainly did) since the interplay of "stride", "max length", and "added tokens" is hard to wrap your head around without examples.

@Narsil Narsil merged commit c2664ae into huggingface:main Jul 28, 2023
11 of 12 checks passed