
MiniLLM: can you share detailed steps to reproduce the results? #298

Open
LinaZhangCoding opened this issue Jan 6, 2025 · 6 comments

Comments

@LinaZhangCoding

I was confused by the README, especially about which model and dataset should be downloaded to which directory. Can you share detailed steps?

I run "bash scripts/gpt2/tools/process_data_dolly.sh " right, but when run "bash scripts/gpt2/tools/process_data_pretrain.sh" it report error: "Token indices sequence length is longer than the specified maximum sequence length for this model (1186 > 1024). Running this sequence through the model will result in indexing errors"

But section 2.1 says "The processed data can be downloaded from the following links: dolly." I used that link to download the processed data, so why does its length exceed the maximum sequence length?

@LinaZhangCoding
Author

Where is the "gpt2-base" model? Is it init-gpt2-120M? I downloaded "init-gpt2-120M", "MiniLLM-gpt2-120M", and "SFT-gpt2-120M", but there is no "gpt2-base" model.

@LinaZhangCoding
Author

3.2 Change Model Parallel Size
You can increase/decrease the tensor parallel sizes with

python3 tools/convert_mp.py \
    --input_path results/llama/train/minillm/7B-init-13B-sft \
    --source_mp_size 1 \
    --target_mp_size 4 \
    --model_type llama # choose from opt and llama
To use the model with Model Parallel, we provide two example scripts for training and evaluation

Why use gpt2 as the example but switch to llama here? The code does not include a results folder after downloading. What is the order of operations? We cannot reproduce the results by following the steps shown in the README.

@t1101675
Contributor

I was confused by the README, especially about which model and dataset should be downloaded to which directory. Can you share detailed steps?

I run "bash scripts/gpt2/tools/process_data_dolly.sh " right, but when run "bash scripts/gpt2/tools/process_data_pretrain.sh" it report error: "Token indices sequence length is longer than the specified maximum sequence length for this model (1186 > 1024). Running this sequence through the model will result in indexing errors"

But section 2.1 says "The processed data can be downloaded from the following links: dolly." I used that link to download the processed data, so why does its length exceed the maximum sequence length?

Hi! The processed data downloaded from this link should be put under the path /PATH/TO/LMOps/minillm/processed_data/dolly. After downloading this data, you do not need to run bash scripts/gpt2/tools/process_data_dolly.sh again.
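
As a quick sanity check (a hedged sketch only; the path is the placeholder from above and the directory listing is purely illustrative), you can confirm the data landed in the expected directory before launching anything:

from pathlib import Path

# The downloaded dolly data should be extracted under
# <repo root>/minillm/processed_data/dolly (placeholder path below).
data_dir = Path("/PATH/TO/LMOps/minillm/processed_data/dolly")
assert data_dir.is_dir(), f"expected processed dolly data at {data_dir}"
print(sorted(p.name for p in data_dir.iterdir()))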

The "maxmimum sequence length" notice is a warning from the Transformers tokenizer trggiered hwne we are tokenizing a long document from openwebtext. In the data processing scripts, we will construct "chunks" with a max length 1024 by merging the exceed tokens in the current document to the next document (lines 68-78).

@t1101675
Contributor

Where is the "gpt2-base" model? Is it init-gpt2-120M? I downloaded "init-gpt2-120M", "MiniLLM-gpt2-120M", and "SFT-gpt2-120M", but there is no "gpt2-base" model.

The gpt2-base model is the official pre-trained gpt2-base model released by OpenAI, which can be downloaded from this repo. It serves as the initialization for SFT (to train init-gpt2-120M and SFT-gpt2-120M).
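
For example (assuming the Hugging Face "gpt2" checkpoint is what serves as gpt2-base; the local save path below is hypothetical), the checkpoint can be fetched with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the official 124M GPT-2 checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Save them locally (hypothetical path) so the SFT scripts can point at it.
model.save_pretrained("checkpoints/gpt2-base")
tokenizer.save_pretrained("checkpoints/gpt2-base")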

@t1101675
Contributor

3.2 Change Model Parallel Size
You can increase/decrease the tensor parallel sizes with

python3 tools/convert_mp.py \
    --input_path results/llama/train/minillm/7B-init-13B-sft \
    --source_mp_size 1 \
    --target_mp_size 4 \
    --model_type llama # choose from opt and llama

To use the model with Model Parallel, we provide two example scripts for training and evaluation

Why use gpt2 as the example but switch to llama here? The code does not include a results folder after downloading. What is the order of operations? We cannot reproduce the results by following the steps shown in the README.

Generally, gpt2 does not need model parallelism because it is small enough to fit on common GPUs. Moreover, gpt2's vocabulary size is 50257, which is not divisible by 2, so the model cannot be directly parallelized across multiple GPUs. We use gpt2 as the example because it is small and convenient for quick reproduction. The instructions for LLaMA are almost the same as for gpt2.
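
To make the divisibility point concrete, here is a small illustrative check (the helper name is made up): tensor parallelism splits the vocabulary embedding across ranks, so the vocabulary size must divide evenly by the model-parallel size.

def vocab_shardable(vocab_size: int, mp_size: int) -> bool:
    # The embedding matrix is split along the vocabulary dimension,
    # so every rank must receive the same number of rows.
    return vocab_size % mp_size == 0

print(vocab_shardable(50257, 2))  # False: GPT-2's vocab does not split evenly
print(vocab_shardable(32000, 4))  # True: LLaMA's 32000-token vocab shards cleanly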

@t1101675
Contributor

We have provided more details on the downloaded data and model paths in the README.
