Potential issues with HF GPT2 Models #4

Open
aitorormazabal opened this issue Nov 24, 2022 · 0 comments

Hello,

I am using the GPT2 models available on HF and running into a few issues. First, there seems to be a problem with the tokenizer. Trying to calculate perplexity with the evaluate module as follows:

from evaluate import load
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=["Hola, como estas?"], model_id="PlanTL-GOB-ES/gpt2-base-bne", device="cpu")

Gives the following error:

 ...
  File "/ikerlariak/aormazabal024/PhD/Poetry-Generation/demo/poetry-env-traganarru/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

This seems to be related to the special tokens <pad>, <s>, </s> and <unk> not being properly set (even though they are used by the evaluate module): the only special token added in the tokenizer is <|endoftext|>. One can manually fix this for the local snapshot:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('[snapshot-path]')
tokenizer.pad_token = '<pad>'
tokenizer.bos_token = '</s>'
tokenizer.eos_token = '</s>'
tokenizer.unk_token = '<unk>'
tokenizer.save_pretrained('[snapshot-path]')
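
To double-check that the fix took, one can reload the saved tokenizer and inspect it (a quick check; special_tokens_map and pad_token are just the standard transformers attributes, and before the fix only <|endoftext|> shows up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('[snapshot-path]')
print(tokenizer.special_tokens_map)  # should now include <pad>, </s> and <unk>
print(tokenizer.pad_token)           # '<pad>' after the fix, None before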

However, even after fixing this, I am getting quite high perplexities compared to the 10-13 reported in the paper, for every sentence I try (assuming per-word perplexity is what is reported). Is it possible something went wrong when converting from fairseq to HF, and are the original fairseq models available somewhere to compare against? Or maybe I am making a mistake when calculating the ppl: was there any tokenization applied to the text apart from BPE (e.g. replacing newlines with </s>, which is pretty standard in fairseq)?
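
For reference, this is roughly the manual computation I am comparing against (a minimal sketch; it assumes plain per-token perplexity over the BPE tokens, i.e. the exponential of the mean negative log-likelihood, which may well differ from whatever per-word normalization was used in the paper):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PlanTL-GOB-ES/gpt2-base-bne"  # or the locally fixed snapshot
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

input_ids = tokenizer("Hola, como estas?", return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids the model returns the mean cross-entropy over the
    # shifted tokens, i.e. the average negative log-likelihood per BPE token.
    loss = model(input_ids, labels=input_ids).loss

print("per-token ppl:", torch.exp(loss).item())

If the reported numbers are per word rather than per BPE token, or the evaluation text was preprocessed differently, that alone could explain part of the gap.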
