
Exception upon attempting to load a Tokenizer from file #566

Closed
joepalermo opened this issue Dec 16, 2020 · 31 comments

@joepalermo

Hi, I'm attempting to simply serialize and then deserialize a trained tokenizer. When I run the following code:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Train a small BPE tokenizer, save it, then try to reload it
tokenizer = Tokenizer(BPE())
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

I get the following traceback:

Traceback (most recent call last):
...
    tokenizer = Tokenizer.from_file(save_to_filepath)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1 column 5408
@n1t0
Member

n1t0 commented Jan 6, 2021

Hi @joepalermo, would you mind sharing the resulting tokenizer.json file? It would be very helpful for us to debug this.

@joepalermo
Author

joepalermo commented Jan 19, 2021

@n1t0 Thanks for your help.

GitHub isn't letting me attach a .json file to a comment, so I'll just paste the contents of it here:

{"version":"1.0","truncation":null,"padding":null,"added_tokens":[],"normalizer":null,"pre_tokenizer":null,"post_processor":null,"decoder":null,"model":{"dropout":null,"unk_token":null,"continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"\n":0," ":1,"(":2,")":3,"":4,"+":5,",":6,"-":7,".":8,"/":9,"0":10,"1":11,"2":12,"3":13,"4":14,"5":15,"6":16,"7":17,"8":18,"9":19,";":20,"=":21,"?":22,"C":23,"D":24,"F":25,"G":26,"I":27,"L":28,"S":29,"W":30,"a":31,"b":32,"c":33,"d":34,"e":35,"f":36,"g":37,"h":38,"i":39,"j":40,"k":41,"l":42,"m":43,"n":44,"o":45,"p":46,"q":47,"r":48,"s":49,"t":50,"u":51,"v":52,"w":53,"x":54,"y":55,"z":56," -":57,"e ":58,"t ":59," +":60," =":61," + ":62," - ":63,". ":64,";\n":65,"**":66,"Le":67,"Let ":68," = ":69,".;\n":70,"s ":71,"th":72," = -":73,"iv":74,"the ":75,"2":76,"r ":77,"of":78,". Let ":79,"d ":80,"?;\n":81,"at":82,"2":83,"of ":84,"3":85,"de":86,"or ":87,"4":88,"os":89,"pos":90,"(-":91,"5*":92,"Su":93,"ppos":94,"Suppos":95,"is ":96,"n ":97,"be ":98,"nd ":99,"co":100," a":101,"at ":102,"Wh":103,"What ":104,"ul":105," be ":106," - 1":107," + 1":108,"e -":109,"com":110,"3":111,"st ":112,") = ":113,"What is ":114,"ac":115,"act":116," f":117,"So":118,"lv":119,"Solv":120,"al":121,"ive ":122,") = -":123,"ate ":124,"mo":125,"commo":126,"common ":127,"in":128,"0":129,"Suppose ":130,"Cal":131,"cul":132,"Calcul":133,"Calculate ":134,"div":135,"divi":136," for ":137,"What is the ":138,"riv":139,"ative ":140,"deriv":141,"derivative ":142," and ":143,")/":144,"re":145,"or of ":146,"Is ":147,"). ":148,", ":149,"he":150,"im":151,"pr":152,"prim":153,"2 + ":154,"st common ":155,"fact":156,").;\n":157,"Suppose -":158,"Calculate the ":159," - 2":160,"6":161,"prime ":162," = 0":163," + 2":164,"Solve ":165,"2 - ":166,"or":167,", -":168,"derivative of ":169,"4":170,"10":171,"7":172,"ir":173,"y ":174,"r w":175,"d b":176,"ain":177,"main":178,"the prime ":179,"der w":180,"ded b":181,"is divi":182,"remain":183,"factor":184,"the prime factor":185,"der whe":186,"is divided b":187,"remainder whe":188,"the prime factors ":189,"12":190,"remainder when ":191,"the prime factors of ":192,"is divided by ":193,"min":194,"ti":195,"er":196," is divided by ":197,"Solve -":198,") be ":199,") be the ":200," w":201,"). Let ":202,"le ":203,"mul":204,"ple ":205," - 3":206,"tiple ":207,"multiple ":208,"rt ":209,"multiple of ":210,"8":211," + 3":212,"of -":213,"est common ":214,"11":215," a ":216," wrt ":217," - 2":218,"/2":219,". Suppose ":220," + 2":221,"(-2":222,". Is ":223,"9":224,". What is the ":225,"Fi":226,"Find ":227,"(-1":228,")?;\n":229," - 4":230,"/3":231,"derivative of -":232," + 4":233," - 3":234,"5":235,"eco":236,"seco":237,"second ":238," + 3":239,"0 = ":240,"0 = -":241,"Find the ":242," - -":243,"thir":244,"third ":245,"15":246,". Calculate the ":247,"13":248," + 4":249,"sor of ":250,"divisor of ":251," + -":252,"14":253," - 4*":254,"ghe":255,"hi":256,"ghest common ":257,"highest common ":258,". D":259,"no":260,"deno":261,"common deno":262,"minat":263,"common denominat":264,". Suppose -":265,"1*":266,"ar":267,"What ar":268,"What are ":269,"e?;\n":270,"16":271,"ber":272,"mber":273,"nu":274,"What are the prime factors of ":275,"mber?;\n":276,"number?;\n":277,"Li":278,"List ":279},"merges":[" -","e ","t "," +"," ="," + "," - ",". ","; \n","* ","L e","Le t "," = ",". ;\n","s ","t h"," = -","i v","th e ","2 ","r ","o f",". Let ","d ","? 
;\n","a t"," 2","of ","3 ","d e","o r ","4 ","o s","p os","( -","5 ","S u","p pos","Su ppos","i s ","n ","b e ","n d ","c o"," a","a t ","W h","Wh at ","u l"," be "," - 1"," + 1","e -","co m"," 3","s t ",") = ","What is ","a c","ac t"," f","S o","l v","So lv","a l","iv e ",") = -","at e ","m o","com mo","commo n ","i n","0 ","Suppos e ","C al","c ul","Cal cul","Calcul ate ","d iv","div i"," f or ","What is the ","r iv","at ive ","de riv","deriv ative "," a nd ",") /","r e","or of ","I s ",") . ",", ","h e","i m","p r","pr im","2 + ","st common ","f act",") .;\n","Suppos e -","Calculate the "," - 2","6 ","prim e "," = 0"," + 2","Solv e ","2 - ","o r",", -","derivative of "," 4","1 0","7 ","i r","y ","r w","d b","a in","m ain","the prime ","de r w","de d b","is divi","re main","fact or","the prime factor","der w he","is divi ded b","remain der whe","the prime factor s ","1 2","remainder whe n ","the prime factors of ","is divided b y ","m in","t i","e r"," is divided by ","Solv e -",") be ",") be the "," w",") . Let ","l e ","m ul","p le "," - 3","ti ple ","mul tiple ","r t ","multiple of ","8 "," + 3","of -","e st common ","1 1"," a "," w rt "," - 2","/ 2",". Suppose "," + 2","(- 2",". Is ","9 ",". What is the ","F i","Fi nd ","(- 1",") ?;\n"," - 4","/ 3","derivative of -"," + 4"," - 3"," 5","e co","s eco","seco nd "," + 3","0 = ","0 = -","Find the "," - -","th ir","thir d ","1 5",". Calculate the ","1 3"," + 4","s or of ","divi sor of "," + -","1 4"," - 4","g he","h i","ghe st common ","hi ghest common ",". D","n o","de no","common deno","min at","common deno minat",". Suppose -","1 *","a r","What ar","What ar e ","e ?;\n","1 6","b er","m ber","n u","What are the prime factors of ","mber ?;\n","nu mber?;\n","L i","Li st "]}}

@joepalermo
Author

joepalermo commented Jan 19, 2021

This is really confusing because I don't think I'm doing anything unusual.

Also note that I tried unpickling the tokenizer object, and it gives a similar error: Exception: Error while attempting to unpickle Tokenizer: data did not match any variant of untagged enum ModelWrapper at line 1 column 5304

@lukas-blecher

I've had the same issue. Try adding a pre_tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=280)
tokenizer.train(trainer, ["preprocessing/corpus/corpus.txt"])
save_to_filepath = 'preprocessing/tokenizer.json'
tokenizer.save(save_to_filepath)
tokenizer = Tokenizer.from_file(save_to_filepath)

@Hustcw

Hustcw commented Apr 15, 2021

Any update on this problem? I've had the same issue.

@n1t0
Member

n1t0 commented Apr 15, 2021

Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer?

I believe this issue is related to this one: #645

@Hustcw

Hustcw commented Apr 17, 2021

Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer?

I believe this issue is related to this one: #645

Yes, I've used a pre-tokenizer. I found that this problem is caused by merges containing more than one space, as mentioned in #645.
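
For reference, a quick way to scan a saved file for such merges (a minimal sketch assuming the 0.10-era format, where each merge is stored as a single "left right" string; the path is taken from the original report):

import json

with open("preprocessing/tokenizer.json", encoding="utf-8") as f:
    model = json.load(f)["model"]

# Merges should be "left right" with exactly one separating space; anything
# else is the kind of entry that breaks deserialization.
bad_merges = [m for m in model["merges"] if m.count(" ") != 1]
print(len(bad_merges), "suspicious merges, e.g.", bad_merges[:5])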

@ejohb

ejohb commented Nov 30, 2021

Having the same problem. I already have a pre-tokenizer added.

@ejohb

ejohb commented Dec 1, 2021

Having the same problem. I already have a pre-tokenizer added.

After some fiddling, the problem occurs only when I remove pre_tokenizers.Whitespace() and add pre_tokenizers.Split(pattern='\w+|[^\w\s]+', behavior='isolated') in its place.

@ruitedk6

In case this might be of help to others:
I was getting this error when using the SentenceTransformers library, and in my case upgrading tokenizers to version 0.10.3 fixed the issue:

pip install tokenizers==0.10.3

If anyone is getting this error, I recommend also taking a look at the dependency requirements (e.g., which version of the tokenizers library is required).

@duskybomb

Yes, @ejohb is right. The problem occurs when using pre_tokenizers.Split() :/

@Narsil
Collaborator

Narsil commented May 2, 2022

@duskybomb Does the problem still exist on the latest 0.12.1? I can't seem to reproduce.

@duskybomb

@Narsil yes, it is still there in 0.12.1. The error when I was trying to load: Exception: data did not match any variant of untagged enum ModelWrapper at line 59999 column 3.
This is the pre-tokenizer I was using: tokenizer.pre_tokenizer = Split(pattern="<BREAK>", behavior="removed")

Also, I am not sure if this is desired or not, but the vocab had <BREAK> merged with tokens despite using the removed behavior,
e.g. <BREAK>small<BREAK>, with small being the actual token.

@Narsil
Collaborator

Narsil commented May 2, 2022

Do you have a simple reproducible script?
Here is the script I tried to use to reproduce, but it seems to be working properly:

from tokenizers import trainers, models, Tokenizer, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    special_tokens=["<unk>", "<pad>", "<sep>"],
    vocab_size=8000,
)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern="\w+|[^\w\s]+", behavior="isolated")
tokenizer.add_special_tokens(["<sep>"])
tokenizer.add_tokens(["<sep>"])


def iterator_over_seqs():
    with open("data/big.txt", "r") as f:
        for line in f:
            yield "ABCEFGH"


tokenizer.train_from_iterator(iterator=iterator_over_seqs(), trainer=trainer)
tokenizer.save("tok.json", pretty=True)
encoded = tokenizer.encode("ABCD<sep>EFGH")
tok = Tokenizer.from_file("tok.json")  # This is what is supposed to fail no ? It doesn't here.
print(encoded.ids)

@yechong316

yechong316 commented Jul 29, 2022

I also encountered the same problem. The JSON file is attached below as a .txt (please rename it back to .json):
tokenizer-wiki.txt

@Narsil
Collaborator

Narsil commented Aug 1, 2022

Hi @yechong316,

It seems your file contains merges which are not acceptable in the current deployed version of tokenizers.

Those merges contain multiple spaces: "e s " for instance (line 9499).
Producing such merges should not be possible from within the library, hence the limitation, so this is expected if you created the merges yourself in some manner (see the sketch after the list below).

  • If it was done within the library, a reproducible script would be super helpful to reproduce and fix.
  • In general this is not a limitation of the underlying BPE model but really a self-imposed limitation within the library. We can definitely lift this limitation (see "Merges cannot handle tokens containing spaces" #909 if you want to try it out, though it will need the merges to be rewritten in a different way). It's not currently merged, as changing anything regarding serialization requires a great deal of care to make sure we're not breaking anything in a backward-incompatible way. But if there's enough attention for this feature, it can definitely be added!
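
To illustrate the limitation (my rough reading of the format, not the library's exact code): each merge is serialized as a single "left right" string and split on a space when the file is read back, so a token that itself contains a space can no longer be recovered as exactly two parts.

# A normal merge splits cleanly into its two sides...
print("Le t".split(" "))   # ['Le', 't']

# ...but the merge "e s " mentioned above (line 9499) does not: there is no
# unambiguous way to tell where the left token ends and the right one begins.
print("e s ".split(" "))   # ['e', 's', ''] -- not two parts, so loading fails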

@bashFish

bashFish commented Nov 9, 2022

Just to complement what @Narsil said:
there are several "whitespace characters" usable in the tokenizer file, e.g. "Ġ" (Unicode: ord("Ġ") = 288), which in turn can be used in the merges.

Also, in case you removed some of your vocab entries, be sure all merges are still possible; if some can't be resolved after altering the file, it will throw the same error.
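
A sketch of that second check (assuming the older format where each merge is a single "left right" string; the file name is hypothetical): verify that both sides of every merge, and their concatenation, are still present in the vocab.

import json

with open("tokenizer.json", encoding="utf-8") as f:
    model = json.load(f)["model"]

vocab = model["vocab"]
for merge in model["merges"]:
    left, sep, right = merge.partition(" ")
    # A merge is only resolvable if both parts and the merged token still exist.
    if not sep or left not in vocab or right not in vocab or (left + right) not in vocab:
        print("unresolvable merge:", repr(merge))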

@nihirv

nihirv commented Nov 22, 2022

Hi, I'm running into the same issue. However, I explicitly want to have multiple whitespaces in my merges. Could someone point me in the right direction on how I could do this?

@davidgilbertson

This is still an issue in 0.13.2.

To reproduce:

from tokenizers import Tokenizer, models, trainers

bpe_model = models.BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model=bpe_model)
tokenizer.train_from_iterator(
    iterator=["test~ing lick~ing kick~ing"],
    trainer=trainers.BpeTrainer(),
)

path = "my_tokenizer.json"
tokenizer.save(path)

tok_loaded = Tokenizer.from_file(path)

In this particular case, tokenizer.pre_tokenizer = Whitespace() is a workaround.

@Narsil
Collaborator

Narsil commented Mar 7, 2023

Have you checked out the PR that fixes it?
#909

It's not going to be merged anytime soon, since it changes the on-disk format of the tokenizer, so we need a compelling reason to go through the pain of making this change.

If any model that requires it gets merged into transformers, for instance, that would be a very valid reason!

In the meantime, the PR should work.

@ashutoshsaboo

ashutoshsaboo commented Mar 7, 2023

Hi @Narsil: I think I have a very weird issue, which seems similar to the error stack trace above in this issue. Here are the steps:

  1. So I trained an instance of a custom XLMRobertaTokenizerFast tokenizer from scratch on my multilingual corpus. Note that I trained it with transformers-4.26.0 in a Python 3.7 conda environment on a different EC2 instance. After I had trained this tokenizer, I loaded it in a separate script using XLMRobertaTokenizerFast.from_pretrained() and it worked fine without any errors.
  2. Now, a few days later, I had to change my instance for certain reasons. I'm on a different instance that doesn't have Python 3.7, only Python 3.6, and the latest version supported for Python 3.6 is transformers-4.18.0, which is installed on this instance. The same saved tokenizer, which loaded perfectly with the 4.26.0 version as mentioned above, now fails when loaded with the same function, XLMRobertaTokenizerFast.from_pretrained(). I also tried it on transformers==4.2.1 just to double-check whether it was a bug in the 4.26.0 version or not. The error stack trace on both of the transformers versions tried on Python 3.6 is as below:
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 59 column 3

Is this expected? Are tokenizers supposed to be backwards incompatible across different transformers library versions? Installing Python 3.7 from scratch isn't trivial on this instance, so I'd appreciate any help or workaround that can be applied here. While training the tokenizer I didn't do anything extravagant: I initialised a SentencePieceBPETokenizer() and just trained it from scratch by invoking .train() on my corpus.

Strangely, the model trained on the Python 3.7 instance loads perfectly on the Python 3.6 instance, so the issue is only with the tokenizer.

@Narsil, I'd appreciate your help on this. I can't post the tokenizer here due to confidentiality reasons, but if you need any other info from me to help with this, please feel free to ask.

@Narsil
Collaborator

Narsil commented Mar 7, 2023

Can you check your tokenizers versions? I think they are (probably) not the same major version.

tokenizers is designed to be backwards compatible, but you're talking here about forward compatibility (some artefact created with a newer version working on an older version).

I can't tell you exactly what's going on, but the pre_tokenizer in the JSON file cannot be read by your older version. We did change the layout at some point, but again, in a backward compatible fashion (older JSON files are still read, but the newer layout is what gets written to disk).

It's probably not too hard to modify the file from the 3.7 environment to be loadable in your 3.6 environment. Just train a dummy model in the same fashion and look at how it's saved on disk in the old version. Can you do exactly the same thing? I'm not sure; it depends on the options you chose and whether they were only implemented later.
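
For example, one rough way to follow that suggestion is to train a dummy tokenizer with the old version and diff the top-level sections of the two JSON files (file names here are hypothetical):

import json

with open("dummy_old_version.json", encoding="utf-8") as f:
    old = json.load(f)
with open("tokenizer_new_version.json", encoding="utf-8") as f:
    new = json.load(f)

# Compare which fields each version writes for these sections.
for section in ("normalizer", "pre_tokenizer", "post_processor", "decoder"):
    old_keys = set((old.get(section) or {}).keys())
    new_keys = set((new.get(section) or {}).keys())
    if old_keys != new_keys:
        print(section, "| only in old:", old_keys - new_keys, "| only in new:", new_keys - old_keys)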

Have you tried using pyenv? It's usually pretty good at installing different Python versions on most systems (not sure it works in your case).

Does that make sense?

If you happen to modify a JSON file manually, please double-check the output of the tokenizer afterwards; it's easy to introduce subtle bugs without realizing.

@ashutoshsaboo

ashutoshsaboo commented Mar 7, 2023

Woohoo, editing the JSON worked! :D Many thanks! @Narsil, as a suggestion: should these forward-compatibility changes across tokenizers versions be documented more specifically somewhere, so they're easily accessible?

FYI: I just had to add "str_rep": "▁" in the decoder as well as the pre_tokenizer keys of the Python 3.7-trained tokenizer.json to get it to work on the 3.6 version.
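
For reference, a minimal sketch of that kind of edit (assuming the pre_tokenizer and decoder entries are single Metaspace objects rather than Sequences, and a hypothetical file path); as suggested above, verify the tokenizer's output afterwards:

import json

path = "tokenizer.json"
with open(path, encoding="utf-8") as f:
    tok = json.load(f)

# Add back the "str_rep" field that the older tokenizers version expects.
for key in ("pre_tokenizer", "decoder"):
    entry = tok.get(key)
    if isinstance(entry, dict) and "str_rep" not in entry:
        entry["str_rep"] = "▁"

with open(path, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False)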

@Narsil
Collaborator

Narsil commented Mar 7, 2023

should these forward-compatibility changes across tokenizers versions be documented more specifically somewhere, so they're easily accessible?

There's a changelog + releases: https://github.com/huggingface/tokenizers/releases?page=2. That should be enough (but not necessarily easily discoverable).

Please triple check the output ids before claiming victory :)

@ashutoshsaboo

ashutoshsaboo commented Mar 7, 2023 via email

@Narsil
Collaborator

Narsil commented Mar 8, 2023

I mean that the encodings are exactly the same on a large enough subset of text (tokenizer.encode(mystring)).
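
For example, a rough check along those lines (the file names and the sample corpus are hypothetical):

from tokenizers import Tokenizer

original = Tokenizer.from_file("tokenizer_original.json")
edited = Tokenizer.from_file("tokenizer_edited.json")

# Compare token ids line by line over a reasonably large sample of text.
with open("sample_corpus.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if original.encode(line).ids != edited.encode(line).ids:
            print(f"Mismatch on line {lineno}: {line!r}")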

@delgermurun

delgermurun commented Jul 16, 2023

I am having this problem. Here is the reproducible script:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split

# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
t = """First Citizen:
Before we proceed any further, hear me speak.

..."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
tokenizer.pre_tokenizer = Split("\w+|[^\w\s]+", behavior="isolated")

tokenizer.train_from_iterator(
    iterator=[t],
    trainer=trainer,
)

tokenizer.save("tokenizer.json")

It works fine if I use the trained tokenizer directly (not loading from the file):

print(tokenizer.encode("""especially       against Caius Marcius?

All:
Against""").tokens)

Output: ['es', 'p', 'ec', 'i', 'all', 'y ', ' ', ' ', ' ', ' ', ' ', ' a', 'gainst ', 'Caius Marc', 'i', 'us', '?\n\nAll:\n', 'A', 'gain', 'st']

But loading the tokenizer from the file fails.

tokenizer = Tokenizer.from_file("tokenizer.json")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[88], line 1
----> 1 tokenizer = Tokenizer.from_file("tokenizer.json")

Exception: data did not match any variant of untagged enum ModelWrapper at line 382 column 3

Version: tokenizers==0.13.3

@Narsil
Collaborator

Narsil commented Jul 17, 2023

Can you open a new issue please?

It's not really good practice to resurrect old threads, as it pollutes searches with potentially irrelevant content and makes your issue, which is likely a new bug, less discoverable for others. (Of course it's good to search beforehand to prevent duplicates, but when the thread is super old or closed, you can most likely create a new thread and link the old one you found, just in case we want to merge.)

@Narsil
Collaborator

Narsil commented Jul 17, 2023

OK, I looked at this issue (I will copy it into a new issue once there's one).

The error is because of the current tokenizer format, which expects the tokens in the merges part of the file not to contain any spaces.
There's a very old draft PR #909 that I made that can unlock that use case.

This wasn't implemented at the time, because changing the format is a pretty risky change for backward compatibility, and there didn't seem to be any real world use case.

@mpjanus

mpjanus commented Sep 21, 2023

I had the same error when loading Llama 2 models. Upgrading to transformers==4.33.2 and tokenizers==0.13.3 solved it for me.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 30, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 5, 2024