Tingchen Fu‡†, Deng Cai‡, Lemao Liu‡, Shuming Shi‡, Rui Yan†
† Gaoling School of Artificial Intelligence, Renmin University of China
‡ Tencent AI Lab
- Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction
- Contents
- Overview
- Step 1: Data Distributing
- Step 2: Training Sub-Models
- Step 3: Model Merging
- License
- Contact
Supervised fine-tuning (SFT) on an instruction-following corpus is a crucial approach to aligning large language models (LLMs). However, the performance of LLMs on standard knowledge and reasoning benchmarks tends to deteriorate in the later stages of the SFT process, echoing the phenomenon of alignment tax. Based on our pilot study, we hypothesize that data biases are probably one cause of this phenomenon. To address the issue, we introduce a simple disperse-then-merge framework composed of three steps:
- Data Distributing: We disperse the instruction-following data into portions.
- Training Sub-Models: We train multiple sub-models using different data portions.
- Model Merging: We merge the sub-models into a single model via model merging techniques.
We use allenai/tulu-v2-sft-mixture as our instruction-following data and keep only the first query-response turn when a dialogue session contains multiple turns. The statistics of our data are shown as follows:
Use the following command to preprocess the instruction data:
python3 -u preprocess.py \
--dataset allenai/tulu-v2-sft-mixture \
--n_cluster 4 \
--output_file PATH_TO_OUTPUT \
--output_cluster PATH_TO_OUTPUT_CLUSTER
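The actual dispersing logic lives in preprocess.py; the sketch below is only an illustration of what such a step could look like. It keeps the first query-response turn of each tulu-v2 session and assigns examples to n_cluster portions with k-means over TF-IDF features. The feature choice, the k-means clustering, and the output file layout are assumptions made for illustration, not the repository's actual implementation.

# disperse_sketch.py: illustration only; preprocess.py implements the real dispersing step.
import json
from datasets import load_dataset
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

N_CLUSTER = 4

dataset = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

examples = []
for sample in dataset:
    messages = sample["messages"]
    # Keep only the first query-response turn of each (possibly multi-turn) session.
    query = next((m["content"] for m in messages if m["role"] == "user"), None)
    response = next((m["content"] for m in messages if m["role"] == "assistant"), None)
    if query and response:
        examples.append({"instruction": query, "output": response})

# Assign every example to one of N_CLUSTER portions.
# TF-IDF + k-means is one plausible choice; the real criterion may differ.
features = TfidfVectorizer(max_features=4096).fit_transform(
    e["instruction"] for e in examples
)
cluster_ids = KMeans(n_clusters=N_CLUSTER, n_init=10).fit_predict(features)

with open("tulu_v2.json", "w") as f:
    json.dump(examples, f)
with open("tulu_v2_cluster.json", "w") as f:
    json.dump(cluster_ids.tolist(), f)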
Next, we fine-tune the sub-models on their respective data portions using the following command, setting --train_file to the preprocessed data (e.g., ${RUN_DIR}/data/tulu_v2.json) and --cluster_file to the cluster assignments (e.g., ${RUN_DIR}/data/tulu_v2_cluster.json):
RUN_DIR="$PWD"
for cluster_id in 0 1 2 3
do
torchrun --nproc_per_node=8 --master_port=1234 code/submodel_training.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--label_names labels \
--train_file PATH_TO_TRAIN_DATA \
--cluster_file PATH_TO_TRAIN_CLUSTER \
--data_ratio 1.0 \
--cluster_id ${cluster_id} \
--do_train \
--fp16 True \
--bf16 False \
--load_in_4bit False \
--bnb_4bit_quant_type nf4 \
--bnb_4bit_compute_type float16 \
--peft_type lora \
--lora_modules gate_proj,up_proj,down_proj \
--lora_alpha 16 \
--lora_r 16 \
--lora_dropout 0.05 \
--block_size 1024 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-6 \
--gradient_accumulation_steps 2 \
--num_train_epochs 1 \
--learning_rate 3e-4 \
--lr_scheduler_type cosine \
--max_grad_norm 0.3 \
--weight_decay 0.0 \
--warmup_steps 100 \
--warmup_ratio 0 \
--logging_steps 10 \
--save_total_limit 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 1000 \
--eval_steps 1000 \
--metric_for_best_model loss \
--output_dir PATH_TO_SAVE_DIR \
--ddp_find_unused_parameters False \
--overwrite_output_dir
done
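For reference, the LoRA-related flags above roughly correspond to the following peft configuration. This is a sketch of what submodel_training.py presumably constructs internally, not the training script itself.

# Rough equivalent of the LoRA flags above; submodel_training.py may differ in detail.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=16,                                                   # --lora_r 16
    lora_alpha=16,                                          # --lora_alpha 16
    lora_dropout=0.05,                                      # --lora_dropout 0.05
    target_modules=["gate_proj", "up_proj", "down_proj"],   # --lora_modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable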
After training, we merge the sub-models into a single model using the following command:
python -u code/script/merge.py \
--ckpt_paths PATH_TO_SUBMODEL_CHECKPOINTS \
--output_path PATH_TO_MERGED_CHECKPOINT \
--mix_ratio 1,1,1,1
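The actual merging logic lives in code/script/merge.py; the sketch below shows what uniform weight averaging with --mix_ratio 1,1,1,1 amounts to, assuming each sub-model checkpoint already has its LoRA adapter merged into the base weights. The checkpoint paths are placeholders.

# Sketch of weighted parameter averaging; see code/script/merge.py for the real implementation.
import torch
from transformers import AutoModelForCausalLM

# Placeholders for the four sub-model checkpoints produced in Step 2.
ckpt_paths = ["PATH_TO_SUBMODEL_0", "PATH_TO_SUBMODEL_1",
              "PATH_TO_SUBMODEL_2", "PATH_TO_SUBMODEL_3"]
mix_ratio = [1.0, 1.0, 1.0, 1.0]          # --mix_ratio 1,1,1,1
weights = [r / sum(mix_ratio) for r in mix_ratio]

# Accumulate a weighted average of all parameters, starting from the first sub-model.
merged = AutoModelForCausalLM.from_pretrained(ckpt_paths[0], torch_dtype=torch.float32)
avg_state = {k: weights[0] * v for k, v in merged.state_dict().items()}
for path, w in zip(ckpt_paths[1:], weights[1:]):
    sub_state = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32).state_dict()
    for k in avg_state:
        avg_state[k] += w * sub_state[k]

merged.load_state_dict(avg_state)
merged.save_pretrained("PATH_TO_MERGED_CHECKPOINT")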
The work is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.
Please feel free to reach out to us if you have any questions or suggestions. You can contact us via the following email address: [email protected]