MiniLLM: can you share detailed steps to reproduce the result? #298
Comments
Where is the “gpt2-base” model? Is it init-gpt2-120M? I downloaded init-gpt2-120M, MiniLLM-gpt2-120M, and SFT-gpt2-120M, but there is no “gpt2-base” model.
In 3.2 "Change Model Parallel Size" (python3 tools/convert_mp.py), why is gpt2 used as the example but the command switches to llama? The code does not include a results folder after downloading. What is the order of operations? We cannot reproduce the result by following the steps in the readme.
Hi! The processed data downloaded from this link should be put under this path. The "maximum sequence length" notice is a warning from the Transformers tokenizer, triggered when we tokenize a long document from openwebtext. In the data processing scripts, we construct "chunks" with a max length of 1024 by merging the excess tokens of the current document into the next chunk (lines 68-78).
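For intuition, here is a minimal sketch of that chunking behaviour (not the repo's actual data-processing code; `build_chunks` and the example documents are hypothetical). Encoding a long document is what prints the warning, but the leftover tokens simply start the next 1024-token chunk, so the warning is harmless:

```python
from transformers import GPT2Tokenizer

MAX_LENGTH = 1024  # chunk size used for the pretraining data

def build_chunks(documents, tokenizer):
    """Pack token ids into fixed-size chunks, carrying leftovers forward."""
    chunks, buffer = [], []
    for doc in documents:
        # Encoding a document longer than the model's max length is what
        # triggers the "maximum sequence length" tokenizer warning.
        buffer.extend(tokenizer.encode(doc))
        while len(buffer) >= MAX_LENGTH:
            chunks.append(buffer[:MAX_LENGTH])
            buffer = buffer[MAX_LENGTH:]  # excess tokens start the next chunk
    if buffer:
        chunks.append(buffer)  # final, possibly shorter, chunk
    return chunks

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
docs = ["a long openwebtext document ...", "the next document ..."]
print([len(c) for c in build_chunks(docs, tokenizer)])
```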
The gpt2-base model is the official pre-trained gpt2-base model released by OpenAI, which can be downloaded from this repo. It serves as the initialization for SFT (to train init-gpt2-120M and SFT-gpt2-120M).
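If you do not already have that checkpoint, one way to fetch it is via the Hugging Face Hub, where "gpt2" is OpenAI's official 124M-parameter release (assuming that is the checkpoint meant by "this repo"; the local directory name below is only a placeholder):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" on the Hugging Face Hub is OpenAI's official 124M-parameter checkpoint.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# The local path is an assumption; point the SFT scripts at whatever
# directory they expect for the base checkpoint.
model.save_pretrained("checkpoints/gpt2-base")
tokenizer.save_pretrained("checkpoints/gpt2-base")
```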
Generally, gpt2 does not need model parallelism because it is small enough to fit on common GPUs. Moreover, gpt2's vocabulary size is 50257, which is not divisible by 2, so it cannot be directly parallelized across multiple GPUs. We use gpt2 as an example because it is small and convenient for quick reproduction. The instructions for LLaMA are almost the same as for gpt2.
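As a quick illustration (not repo code): tensor parallelism typically splits the vocabulary embedding evenly across ranks, so the vocab size must be divisible by the model-parallel size, and 50257 is odd:

```python
# Check whether GPT-2's vocabulary can be split evenly across model-parallel ranks.
VOCAB_SIZE = 50257

for mp_size in (1, 2, 4):
    if VOCAB_SIZE % mp_size == 0:
        print(f"mp_size={mp_size}: splits evenly ({VOCAB_SIZE // mp_size} rows per rank)")
    else:
        padded = ((VOCAB_SIZE + mp_size - 1) // mp_size) * mp_size
        print(f"mp_size={mp_size}: needs padding to {padded} before splitting")
```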
We have provided more details on the downloaded data and model paths in the readme.
I was confused by the readme text, especially which model and dataset should be downloaded to which directory. Can you share detailed steps?
I run "bash scripts/gpt2/tools/process_data_dolly.sh " right, but when run "bash scripts/gpt2/tools/process_data_pretrain.sh" it report error: "Token indices sequence length is longer than the specified maximum sequence length for this model (1186 > 1024). Running this sequence through the model will result in indexing errors"
But in 2.1 The processed data can be downloaded from the following links: dolly. I use this link to download processed data, why its length exceeds maximum sequence length?