
RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight #645

Open
mcao516 opened this issue Jul 7, 2022 · 9 comments
Labels: bug, good first issue, help wanted

Comments

mcao516 commented Jul 7, 2022

Describe the bug
RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([25216, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).
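A quick way to see what shape the downloaded checkpoint actually contains is to load the embedding shard(s) directly with PyTorch. This is a minimal sketch, assuming the DeepSpeed pipeline layout of the slim weights (embedding stored in `layer_00-model_*-model_states.pt` files, one per model-parallel rank); the directory path is a placeholder and the key name is taken from the error message above:

```python
import glob
import torch

# Placeholder path; point this at the downloaded slim checkpoint directory,
# e.g. .../20B_checkpoints/global_step150000
CKPT_DIR = "./20B_checkpoints/global_step150000"

# In a DeepSpeed pipeline checkpoint the embedding layer is usually saved as
# layer_00-model_00-model_states.pt, layer_00-model_01-model_states.pt, ...
# (one file per model-parallel rank).
for path in sorted(glob.glob(f"{CKPT_DIR}/layer_00-model_*-model_states.pt")):
    state = torch.load(path, map_location="cpu")
    rows, cols = state["word_embeddings.weight"].shape
    print(f"{path}: word_embeddings.weight ({rows}, {cols})")

# Two shards of (25216, 6144) mean the checkpoint stores a padded vocabulary of
# 2 * 25216 = 50432 split across 2 model-parallel ranks, which will not match a
# model built with a different model-parallel size or vocabulary padding.
```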

To Reproduce

  1. Download Slim weights
  2. Update the vocabulary file and checkpoint paths in ./configs/20B.yml (HFTokenizer is used; see the tokenizer check sketched after this list)
  3. Run: ./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt
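As a related sanity check for step 2, the vocabulary size of the tokenizer file referenced by the config can be read directly with the Hugging Face tokenizers library. A minimal sketch, assuming the slim weights ship a 20B_tokenizer.json (adjust the path to whatever vocab file you actually set):

```python
from tokenizers import Tokenizer

# Placeholder path; use the file you point the config's vocab entry at
tok = Tokenizer.from_file("./20B_checkpoints/20B_tokenizer.json")
print("tokenizer vocab size:", tok.get_vocab_size())

# The model's word_embeddings.weight has this many rows after padding (the exact
# padding depends on the model-parallel size; see the discussion further down
# the thread), so comparing this number against the checkpoint shape helps
# narrow down the cause of the mismatch.
```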

Screenshots
[screenshot of the error traceback]

Environment (please complete the following information):

  • GPUs: 2x RTX8000 (48G)
mcao516 added the bug label on Jul 7, 2022
jdagdelen commented Jul 16, 2022

I'm experiencing this too and I'm not sure what I'm doing wrong. I downloaded the weights from here, which is the "fixed" link from #646. However, I also downloaded the slim weights, and those seem to load OK, although the output from the model is gibberish.

FayZ676 commented Dec 9, 2022

I am getting the same problem when trying to train a 1-3B model.

To Reproduce:

  1. Download Slim weights
  2. Update ./configs/1-3B.yml as shown in the screenshots below.
  3. Run python ./deepy.py train.py -d configs 1-3B.yml

Screenshots:
[screenshots of the modified 1-3B.yml]

Environment:

  • GPUs: 4x 3090 (96G)

@binglun30

I also had the same problem: when using a single machine to load the slim weights downloaded from GitHub, a similar error was reported. Here is a screenshot of the error message:
[screenshot of the error message]

Environment:

GPUs: 4x 3090 (96G)

djaym7 commented Apr 19, 2023

What's the solution? And why was this closed?

@StellaAthena
Member

@djaym7 Thanks for saying something. I don't recall closing this and have reopened it.

StellaAthena reopened this on Apr 19, 2023
StellaAthena added the good first issue and help wanted labels on Apr 30, 2023
@StellaAthena
Member

@FayZ676 the URL you’re linking to does not contain the weights for a 1.3B model; it contains the weights for a 20B model. That’s why you’re getting a size mismatch: it’s quite simply the wrong size. I suspect this is unrelated to the problems the others are having.

@leclem so that change allows you to finetune the 20B model? Can you post a WandB link showing it training so I can check that the loss etc. are as expected?

@shaunstoltz

I have the same issue when trying to train. I downloaded the slim weights and, using ./configs/20B.yml, running "python3 ./deepy.py train.py ./configs/20B.yml" gives this error:

RuntimeError: Error(s) in loading state_dict for EmbeddingPipe:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([12608, 6144]) from checkpoint, the shape in current model is torch.Size([12672, 6144]).

@dashstander
Contributor

I suspect that this is an error that has to do with model parallelism. @shaunstoltz how many GPUs were you loading the model onto / what was the model parallelism setting?
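For what it's worth, the numbers reported in this thread line up with a vocabulary-padding / model-parallelism mismatch. GPT-NeoX (following Megatron) pads the tokenizer vocabulary up to a multiple of make_vocab_size_divisible_by × model-parallel size before sharding the embedding. The sketch below is a simplified illustration of that padding, not the library code itself, and the tokenizer size and divisor are assumptions chosen to match the 20B setup:

```python
def padded_vocab_size(vocab_size: int, divisible_by: int, mp_size: int) -> int:
    """Round the vocabulary up so it divides evenly into aligned
    per-rank slices (Megatron/GPT-NeoX-style padding, simplified)."""
    multiple = divisible_by * mp_size
    while vocab_size % multiple != 0:
        vocab_size += 1
    return vocab_size

TOKENIZER_VOCAB = 50277  # assumed 20B tokenizer size (any value in 50177..50304 gives the same results)
DIVISIBLE_BY = 128       # assumed make_vocab_size_divisible_by from 20B.yml

for mp in (1, 2, 4):
    padded = padded_vocab_size(TOKENIZER_VOCAB, DIVISIBLE_BY, mp)
    print(f"mp={mp}: padded vocab {padded}, embedding rows per rank {padded // mp}")

# mp=1 -> 50304 rows             (the "current model" shape in the original report)
# mp=2 -> 50432 -> 25216 / rank  (the shape stored in the 20B slim checkpoint)
# mp=4 -> 50688 -> 12672 / rank  vs. 12608 = 50432 / 4 when the mp=2 checkpoint
#                                is resharded, matching the later 12608/12672 report
```

If that is indeed the cause, making sure the model-parallel size (and hence the padded vocabulary) matches the one the checkpoint was saved with should make the shapes line up.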

@diazero-security

Does anyone have a solution for this problem?
