
MiniLLM: can you share detailed steps to reproduce the results? #298

Open
LinaZhangCoding opened this issue Jan 6, 2025 · 6 comments

Comments

@LinaZhangCoding

I was confused by the README, especially about which model and dataset should be downloaded to which directory. Can you share detailed steps?

I run "bash scripts/gpt2/tools/process_data_dolly.sh " right, but when run "bash scripts/gpt2/tools/process_data_pretrain.sh" it report error: "Token indices sequence length is longer than the specified maximum sequence length for this model (1186 > 1024). Running this sequence through the model will result in indexing errors"

But section 2.1 says "The processed data can be downloaded from the following links: dolly." I used that link to download the processed data, so why does its length exceed the maximum sequence length?

@LinaZhangCoding
Author

Where is the "gpt2-base" model? Is it init-gpt2-120M? I downloaded "init-gpt2-120M", "MiniLLM-gpt2-120M", and "SFT-gpt2-120M", but there is no "gpt2-base" model.

@LinaZhangCoding
Author

3.2 Change Model Parallel Size
You can increase/decrease the tensor parallel sizes with

python3 tools/convert_mp.py \
    --input_path results/llama/train/minillm/7B-init-13B-sft \
    --source_mp_size 1 \
    --target_mp_size 4 \
    --model_type llama # choose from opt and llama
To use the model with Model Parallel, we provide two example scripts for training and evaluation

Why use gpt2 as the example but switch to llama here? The code does not include a results folder after downloading. What is the order of operations? We cannot reproduce the results by following the steps shown in the README.

@t1101675
Contributor

I was confused by the README, especially about which model and dataset should be downloaded to which directory. Can you share detailed steps?

I run "bash scripts/gpt2/tools/process_data_dolly.sh " right, but when run "bash scripts/gpt2/tools/process_data_pretrain.sh" it report error: "Token indices sequence length is longer than the specified maximum sequence length for this model (1186 > 1024). Running this sequence through the model will result in indexing errors"

But section 2.1 says "The processed data can be downloaded from the following links: dolly." I used that link to download the processed data, so why does its length exceed the maximum sequence length?

Hi! The processed data downloaded from this link should be put under the path /PATH/TO/LMOps/minillm/processed_data/dolly. After downloading this data, you do not need to run bash scripts/gpt2/tools/process_data_dolly.sh again.
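
As a quick sanity check (a hedged sketch only; the path is the placeholder from above and the directory listing is purely illustrative), you can confirm the data landed in the expected directory before launching anything:

from pathlib import Path

# The downloaded dolly data should be extracted under
# <repo root>/minillm/processed_data/dolly (placeholder path below).
data_dir = Path("/PATH/TO/LMOps/minillm/processed_data/dolly")
assert data_dir.is_dir(), f"expected processed dolly data at {data_dir}"
print(sorted(p.name for p in data_dir.iterdir()))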

The "maxmimum sequence length" notice is a warning from the Transformers tokenizer trggiered hwne we are tokenizing a long document from openwebtext. In the data processing scripts, we will construct "chunks" with a max length 1024 by merging the exceed tokens in the current document to the next document (lines 68-78).

@t1101675
Contributor

Where is the "gpt2-base" model? Is it init-gpt2-120M? I downloaded "init-gpt2-120M", "MiniLLM-gpt2-120M", and "SFT-gpt2-120M", but there is no "gpt2-base" model.

The gpt2-base model is the official pre-trained gpt2-base model released by OpenAI, which can be downloaded from this repo. It serves as the initialization for SFT (to train init-gpt2-120M and SFT-gpt2-120M).
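
For example (assuming the Hugging Face "gpt2" checkpoint is what serves as gpt2-base; the local save path below is hypothetical), the checkpoint can be fetched with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the official 124M GPT-2 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Save them locally (hypothetical path) so the SFT scripts can point at it.
model.save_pretrained("checkpoints/gpt2-base")
tokenizer.save_pretrained("checkpoints/gpt2-base")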

@t1101675
Contributor

3.2 Change Model Parallel Size
You can increase/decrease the tensor parallel sizes with

python3 tools/convert_mp.py \
    --input_path results/llama/train/minillm/7B-init-13B-sft \
    --source_mp_size 1 \
    --target_mp_size 4 \
    --model_type llama # choose from opt and llama

To use the model with Model Parallel, we provide two example scripts for training and evaluation

Why use gpt2 as the example but switch to llama here? The code does not include a results folder after downloading. What is the order of operations? We cannot reproduce the results by following the steps shown in the README.

Generally, gpt2 does not need model parallelism because it is small enough to fit on common GPUs. Moreover, gpt2's vocabulary size is 50257, which is not divisible by 2, so the model cannot be directly parallelized across multiple GPUs. We use gpt2 as the example because it is small and convenient for quick reproduction. The instructions for LLaMA are almost the same as for gpt2.
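
To make the divisibility point concrete, here is a small illustrative check (the helper name is made up): tensor parallelism splits the vocabulary embedding across ranks, so the vocabulary size must divide evenly by the model-parallel size.

def vocab_shardable(vocab_size: int, mp_size: int) -> bool:
    # The embedding matrix is split along the vocabulary dimension,
    # so every rank must receive the same number of rows.
    return vocab_size % mp_size == 0

print(vocab_shardable(50257, 2))  # False: GPT-2's vocab does not split evenly
print(vocab_shardable(32000, 4))  # True: LLaMA's 32000-token vocab shards cleanly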

@t1101675
Contributor

We have provided more details on the downloaded data and model paths in the README.
