Tingchen Fu‡†, Deng Cai‡, Lemao Liu‡, Shuming Shi‡, Rui Yan†
† Gaoling School of Artificial Intelligence, Renmin University of China
‡ Tencent AI Lab
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction
- Contents
- Overview
- Step 1: Data Distributing
- Step 2: Training Sub-Models
- Step 3: Model Merging
- License
- Contact
Supervised fine-tuning (SFT) on an instruction-following corpus is a crucial approach to aligning large language models (LLMs). However, the performance of LLMs on standard knowledge and reasoning benchmarks tends to deteriorate in the later stages of the SFT process, echoing the phenomenon of alignment tax. Based on our pilot study, we hypothesize that data biases are probably one cause of this phenomenon. To address the issue, we introduce a simple disperse-then-merge framework composed of three steps:
- Data Distributing: We disperse the instruction-following data into portions.
- Training Sub-Models: We train multiple sub-models using different data portions.
- Model Merging: We merge the sub-models into a single model via model merging techniques.
We use allenai/tulu-v2-sft-mixture as our instruction-following data and keep only the first query-response turn when a dialogue session contains multiple turns. The statistics of our data are shown as follows:
Use the following command to preprocess the instruction data:
python3 -u preprocess.py \
--dataset allenai/tulu-v2-sft-mixture \
--n_cluster 4 \
--output_file PATH_TO_OUTPUT \
--output_cluster PATH_TO_OUTPUT_CLUSTER
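The actual dispersing logic lives in preprocess.py; the sketch below is only an illustration of what such a step could look like. It keeps the first query-response turn of each tulu-v2 session and assigns examples to n_cluster portions with k-means over TF-IDF features. The feature choice, the k-means clustering, and the output file layout are assumptions made for illustration, not the repository's actual implementation.

# disperse_sketch.py: illustration only; preprocess.py implements the real dispersing step.
import json
from datasets import load_dataset
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

N_CLUSTER = 4

dataset = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

examples = []
for sample in dataset:
    messages = sample["messages"]
    # Keep only the first query-response turn of each (possibly multi-turn) session.
    query = next((m["content"] for m in messages if m["role"] == "user"), None)
    response = next((m["content"] for m in messages if m["role"] == "assistant"), None)
    if query and response:
        examples.append({"instruction": query, "output": response})

# Assign every example to one of N_CLUSTER portions.
# TF-IDF + k-means is one plausible choice; the real criterion may differ.
features = TfidfVectorizer(max_features=4096).fit_transform(
    e["instruction"] for e in examples
)
cluster_ids = KMeans(n_clusters=N_CLUSTER, n_init=10).fit_predict(features)

with open("tulu_v2.json", "w") as f:
    json.dump(examples, f)
with open("tulu_v2_cluster.json", "w") as f:
    json.dump(cluster_ids.tolist(), f)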
Next, we fine-tune the sub-models on their respective data portions using the following command, setting --train_file to the preprocessed data (e.g., ${RUN_DIR}/data/tulu_v2.json) and --cluster_file to the cluster assignments (e.g., ${RUN_DIR}/data/tulu_v2_cluster.json):
RUN_DIR="$PWD"
for cluster_id in 0 1 2 3
do
torchrun --nproc_per_node=8 --master_port=1234 code/submodel_training.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--label_names labels \
--train_file PATH_TO_TRAIN_DATA \
--cluster_file PATH_TO_TRAIN_CLUSTER \
--data_ratio 1.0 \
--cluster_id ${cluster_id} \
--do_train \
--fp16 True \
--bf16 False \
--load_in_4bit False \
--bnb_4bit_quant_type nf4 \
--bnb_4bit_compute_type float16 \
--peft_type lora \
--lora_modules gate_proj,up_proj,down_proj \
--lora_alpha 16 \
--lora_r 16 \
--lora_dropout 0.05 \
--block_size 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-6 \
--gradient_accumulation_steps 2 \
--num_train_epochs 1 \
--learning_rate 3e-4 \
--lr_scheduler_type cosine \
--max_grad_norm 0.3 \
--weight_decay 0.0 \
--warmup_steps 100 \
--warmup_ratio 0 \
--logging_steps 10 \
--save_total_limit 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 1000 \
--eval_steps 1000 \
--metric_for_best_model loss \
--output_dir PATH_TO_SAVE_DIR \
--ddp_find_unused_parameters False \
--overwrite_output_dir
done
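For reference, the LoRA-related flags above roughly correspond to the following peft configuration. This is a sketch of what submodel_training.py presumably constructs internally, not the training script itself.

# Rough equivalent of the LoRA flags above; submodel_training.py may differ in detail.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=16,                                                   # --lora_r 16
    lora_alpha=16,                                          # --lora_alpha 16
    lora_dropout=0.05,                                      # --lora_dropout 0.05
    target_modules=["gate_proj", "up_proj", "down_proj"],   # --lora_modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable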
After training, we merge the sub-models into a single model using the following command:
python -u code/script/merge.py \
--ckpt_paths PATH_TO_SUBMODEL_CHECKPOINTS \
--output_path PATH_TO_MERGED_CHECKPOINT \
--mix_ratio 1,1,1,1
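The actual merging logic lives in code/script/merge.py; the sketch below shows what uniform weight averaging with --mix_ratio 1,1,1,1 amounts to, assuming each sub-model checkpoint already has its LoRA adapter merged into the base weights. The checkpoint paths are placeholders.

# Sketch of weighted parameter averaging; see code/script/merge.py for the real implementation.
import torch
from transformers import AutoModelForCausalLM

# Placeholders for the four sub-model checkpoints produced in Step 2.
ckpt_paths = ["PATH_TO_SUBMODEL_0", "PATH_TO_SUBMODEL_1",
              "PATH_TO_SUBMODEL_2", "PATH_TO_SUBMODEL_3"]
mix_ratio = [1.0, 1.0, 1.0, 1.0]          # --mix_ratio 1,1,1,1
weights = [r / sum(mix_ratio) for r in mix_ratio]

# Accumulate a weighted average of all parameters, starting from the first sub-model.
merged = AutoModelForCausalLM.from_pretrained(ckpt_paths[0], torch_dtype=torch.float32)
avg_state = {k: weights[0] * v for k, v in merged.state_dict().items()}
for path, w in zip(ckpt_paths[1:], weights[1:]):
    sub_state = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32).state_dict()
    for k in avg_state:
        avg_state[k] += w * sub_state[k]

merged.load_state_dict(avg_state)
merged.save_pretrained("PATH_TO_MERGED_CHECKPOINT")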
The work is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.
Please feel free to reach out to us if you have any questions or suggestions. You can contact us via the following email address: [email protected]