Prerequisites:
- Python 3.11
- CUDA Toolkit 12 or higher
- gcc-9 or higher
Install dependencies:
pip install -r requirements.txt
Then install the package via:
pip install .
For quantization, AnyPrecision and AutoGPTQ are used. Please see their installation guides for more details.
Here is a minimal working example:
from pmpd import PMPDForCausalLM, Scheduler

# Load a model quantized to 4-, 3-, and 2-bit precision (AnyPrecision format).
use_anyprec = True
model = PMPDForCausalLM(model_path, precisions=[4, 3, 2], use_anyprec=use_anyprec).eval().cuda()

# Precisions the scheduler may switch between during decoding.
precisions = [3, 2]
kw_dict = {}
kw_dict['precisions'] = precisions
# Argument for the naive scheduler: steps to spend at the higher precision.
kw_dict['high_bit_steps'] = 10
# Arguments for the kv_cache scheduler.
kw_dict['precision_switch_points'] = precision_switch_points
kw_dict['save_dir'] = classifier_path  # directory of the learned scheduler weights
if use_anyprec:
    config = model.model.config
else:
    config = model.model.models[str(precisions[0])].config
kw_dict['dim'] = config.hidden_size // config.num_attention_heads
kw_dict['num_heads'] = config.num_key_value_heads

# Initialize the scheduler ('naive' or 'kv_cache').
model.scheduler = Scheduler.get_scheduler('naive', **kw_dict)

outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=max_steps,
    prefill_bit=prefill_bit,
    past_key_values=past_key_values)
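Here model_path, precision_switch_points, classifier_path, input_ids, max_steps, prefill_bit, and past_key_values are placeholders to fill in for your setup. As a hedged sketch of how inputs might be prepared and outputs decoded (the tokenizer pairing, the prefill_bit value, and the assumption that generate returns token ids like Hugging Face generate are illustrative, not part of the pmpd API):

from transformers import AutoTokenizer

# Usage sketch only; tokenizer choice and output handling are assumptions.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
input_ids = tokenizer("Summarize the following article:", return_tensors="pt").input_ids.cuda()
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=128,
    prefill_bit=4,
    past_key_values=None)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))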
The model path should point to the directory of the quantized model if using AnyPrecision, or to a JSON configuration file otherwise.
If the AnyPrecision format is not used, a separate model directory is needed for each precision. The JSON configuration file should look like this:
{
"2" : "gptq-Llama-2-7b-chat-hf-2bit",
"3" : "gptq-Llama-2-7b-chat-hf-3bit",
"4" : "gptq-Llama-2-7b-chat-hf-4bit",
"8" : "gptq-Llama-2-7b-chat-hf-8bit",
"model_path" : "meta-llama/Llama-2-7b-chat-hf"
}
Here, gptq-Llama-2-7b-chat-hf-2bit is the folder where the 2-bit model is saved, for instance.
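For the multi-model case, the configuration file itself is passed as the model path. A minimal sketch, assuming the JSON above is saved as gptq-llama-2-7b-chat-config.json (the filename is hypothetical):

from pmpd import PMPDForCausalLM

# Sketch: load the multi-model (non-AnyPrecision) variant from the JSON config.
# The config filename here is a hypothetical example.
model = PMPDForCausalLM('gptq-llama-2-7b-chat-config.json',
                        precisions=[4, 3, 2],
                        use_anyprec=False).eval().cuda()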
Here, we describe an end-to-end PMPD usage example using AnyPrecision as the PTQ method and MobileLLaMA-1.4B as the base model.
To quantize a model to any-precision, run quantize.py with relevant arguments:
python quantize.py mtgv/MobileLLaMA-1.4B-Chat --seed_precision 3 --parent_precision 4
To train the learned scheduler, follow the steps in the scheduler training section below.
To find the optimal number of high-precision model steps on the CNN/DM validation dataset, run:
python3 pmpd/eval/eval.py \
--model-id MobileLLaMA \
--model-path anyprec-model-path \
--bench-name cnn_dm \
--precision-high 4 \
--precision-low 3 \
--steps 0,85,170 \
--gpus 1,2,3 \
--answer-file-dir data/anyprec/ \
--static-search
Three answer files are generated, one each for 0, 85, and 170 high-precision steps. The best high-precision step count is picked based on a custom performance criterion. To evaluate the performance of each high-precision step, refer to the evaluation section below.
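For example, the selection can be automated by scoring each answer file and taking the argmax. A hypothetical sketch, assuming pmpd/eval/get_score.py prints a numeric score last on stdout (both the output format and the answer-file naming below are assumptions):

import subprocess

# Hypothetical selection sketch: score each candidate answer file, keep the best.
candidates = {steps: f'data/anyprec/steps_{steps}.jsonl' for steps in (0, 85, 170)}
scores = {}
for steps, answer_file in candidates.items():
    out = subprocess.run(['python3', 'pmpd/eval/get_score.py', answer_file],
                         capture_output=True, text=True).stdout
    scores[steps] = float(out.strip().split()[-1])  # assumes score is printed last
best_step = max(scores, key=scores.get)
print(f'Best high-precision steps: {best_step}')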
The following command generates answer files on the CNN/DM test dataset:
python3 pmpd/eval/eval.py \
--model-id MobileLLaMA \
--model-path anyprec-model-path \
--fp-model mtgv/MobileLLaMA-1.4B-Chat \
--bench-name cnn_dm \
--precision-high 4 \
--precision-low 3 \
--answer-file-dir data/anyprec/ \
--classifier-path learned-classifier-path \
--high-bit-steps best-high-precision-step \
--baseline \
--kv-scheduler \
--static-scheduler
Replace best-high-precision-step with the optimal number of high-precision steps (an integer) found above, and learned-classifier-path with the directory path of the saved learned scheduler.
See the evaluation section below for how to score the generated answer files.
We currently support the following schedulers:
Naive scheduler: This scheduler runs a high-precision model for a fixed number of steps, then switches to a low-precision model.
The scheduler can be created via
kw_dict = {}
kw_dict['precisions'] = [3,2]
# argument for naive scheduler
kw_dict['high_bit_steps'] = 10
scheduler = Scheduler.get_scheduler('naive', **kw_dict)
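For intuition, the policy this corresponds to can be sketched as a pure function of the decoding step (an illustrative sketch, not the library's implementation; the exact boundary handling is an assumption):

# Illustrative sketch of the naive policy: decode the first high_bit_steps
# tokens at the higher precision, then switch to the lower one.
def naive_precision(step, precisions=(3, 2), high_bit_steps=10):
    return precisions[0] if step < high_bit_steps else precisions[1]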
KV cache scheduler: This is a learned scheduler that takes the KV cache as input to determine whether the precision should be lowered.
This scheduler can be created via
kw_dict = {}
kw_dict['precisions'] = [3,2]
# argument for kv_cache scheduler
kw_dict['precision_switch_points'] = precision_switch_points
kw_dict['save_dir'] = classifier_path
# use_anyprec as set when the model was loaded
if use_anyprec:
    config = model.model.config
else:
    config = model.model.models[str(precisions[0])].config
kw_dict['dim'] = config.hidden_size // config.num_attention_heads
kw_dict['num_heads'] = config.num_key_value_heads
scheduler = Scheduler.get_scheduler('kv_cache', **kw_dict)
Here, classifier_path points to the directory where the learned weights are saved.
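Conceptually, the learned scheduler applies a small classifier to KV-cache features at each decoding step. The sketch below is an assumed architecture for illustration only; the actual classifier in pmpd may differ:

import torch

# Conceptual sketch only; the real classifier architecture is an assumption.
class SwitchClassifier(torch.nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        # dim and num_heads correspond to the kw_dict entries above.
        self.proj = torch.nn.Linear(dim * num_heads, 1)

    def forward(self, kv_features):
        # kv_features: (batch, num_heads * dim) summary of the current KV cache.
        # Returns the probability that precision should be lowered this step.
        return torch.sigmoid(self.proj(kv_features))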
The training dataset is taken from the C4 dataset. To generate the scheduler training dataset, quantized models are needed.
We first generate a tokenized dataset of a fixed length:
python pmpd/train/generate_train_dataset.py --model-path lmsys/vicuna-7b-v1.5 --dir pmpd/train/data --size 256
Then, we use the quantized models to generate labels:
python pmpd/train/generate_train_dataset.py --second-stage --model-path anyprec-vicuna-7b-4-2 --dir pmpd/train/data --dataset pmpd/train/data/1st_stage_train_dataset_vicuna-7b-v1.5_256.pt --size 256 --precisions 3,2 --use-anyprec
The generated second stage dataset can then be used for training.
An example training script:
python pmpd/train/train_scheduler.py \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--data_path data_path.pt \
--output_dir test \
--num_train_epochs 30 \
--gradient_accumulation_steps 32 \
--learning_rate 0.0045 \
--weight_decay 0.08031288528295362 \
--precisions 4,3,2
An example hypertune script:
CUDA_VISIBLE_DEVICES=0 python pmpd/train/hypertune.py \
--model_path mtgv/MobileLLaMA-1.4B-Chat \
--data_path data_path.pt \
--precisions 4,3
The script pmpd/eval/evaluate_generation.py can be used to run multiple evaluations.
python3 pmpd/eval/evaluate_generation.py \
--model-path gptq-vicuna-7b-config.json \
--model-id vicuna-7b \
--bench-name mt_bench \
--answer-file data/gptq/mt_bench/vicuna-7b-kv_cache-4-3-2.jsonl \
--scheduler kv_cache \
--classifier_path test_gptq-vicuna-7b-config.json_1024_1.1_lr_0.005/ \
--precisions 4,3,2 \
--use-multi-model
The --use-multi-model flag indicates that the AnyPrecision format is not used.
The generated answer files can then be evaluated via:
python3 pmpd/eval/get_score.py answer-files-path