Add Qwen benchmarks (#108)
## Summary
This PR adds Qwen2-7B benchmarks.

## Testing Done


- Hardware Type: A100-80G-PCIe
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: shisahni <[email protected]>
shivam15s and shisahni authored Aug 26, 2024
1 parent e8a58aa commit 03fd603
Showing 8 changed files with 36 additions and 6 deletions.
17 changes: 13 additions & 4 deletions examples/huggingface/README.md
@@ -3,7 +3,7 @@
## How to Run
```bash
pip install -r requirements.txt
sh run.sh
sh run_{MODEL}.sh
```

**Notes**
@@ -16,9 +16,18 @@ sh run.sh

## Benchmark Result

### LLaMA
Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

The throughput increases by approximately 20% with more data, but the GPU memory is reduced by 40%. This means you can train the model on smaller GPUs, with larger batch sizes, or with longer sequence lengths at no additional cost.
Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.

![Throughput](img/hf_tps.png)
![GPU Memory Allocated](img/hf_mem_alloc.png)
![Throughput](img/llama_tps.png)
![GPU Memory Allocated](img/llama_mem_alloc.png)

### QWEN
Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 10%, while GPU memory usage drops by 50%.

![Throughput](img/qwen_tps.png)
![GPU Memory Allocated](img/qwen_mem_alloc.png)
examples/huggingface/img/{hf_mem_alloc.png → llama_mem_alloc.png}
File renamed without changes
examples/huggingface/img/{hf_tps.png → llama_tps.png}
File renamed without changes
Binary file added examples/huggingface/img/qwen_mem_alloc.png
Binary file added examples/huggingface/img/qwen_tps.png
examples/huggingface/{run.sh → run_llama.sh}
File renamed without changes.
20 changes: 20 additions & 0 deletions examples/huggingface/run_qwen.sh
@@ -0,0 +1,20 @@
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--model_name "Qwen/Qwen2-7B" \
--bf16 \
--num_train_epochs 1 \
--per_device_train_batch_size 48 \
--per_device_eval_batch_size 64 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger False \
--output_dir alpaca_finetuning
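
For reference, the flags in `run_qwen.sh` map onto `transformers.TrainingArguments` roughly as sketched below. This is an illustrative mapping only, not the committed `training.py` (which parses these values from the CLI); model and dataset setup are elided.

```python
# Hedged sketch: the run_qwen.sh hyperparameters expressed as a
# transformers.TrainingArguments object. Values are copied from the
# script above; nothing here is part of the commit itself.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="alpaca_finetuning",
    bf16=True,
    num_train_epochs=1,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=64,
    eval_strategy="no",
    save_strategy="no",
    learning_rate=6e-6,
    weight_decay=0.05,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=1,
    include_num_input_tokens_seen=True,
    report_to="none",
    fsdp="full_shard auto_wrap",
    fsdp_config="config/fsdp_config.json",
    seed=42,
)
```
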
5 changes: 3 additions & 2 deletions examples/huggingface/training.py
@@ -6,7 +6,7 @@
from callback import EfficiencyCallback
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

from liger_kernel.transformers import apply_liger_kernel_to_llama
import liger_kernel.transformers

# TODO: clean up the code after hf meeting

@@ -54,7 +54,8 @@ def train():
)

if custom_args.use_liger is True:
apply_liger_kernel_to_llama()
# liger_kernel.transformers.apply_liger_kernel_to_llama()
liger_kernel.transformers.apply_liger_kernel_to_qwen2()

trainer = SFTTrainer(
model=model,
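
The change above replaces the module-level `apply_liger_kernel_to_llama` import with a package import so the Qwen2 patch can be called instead. A minimal sketch of how the two patch functions shown in this diff could be selected from the model name follows; the dispatch helper itself is hypothetical and not part of this commit.

```python
# Hypothetical helper (not in this commit): pick the Liger kernel patch that
# matches the model being benchmarked. Both patch functions appear in the
# diff above.
import liger_kernel.transformers

_LIGER_PATCHES = {
    "llama": liger_kernel.transformers.apply_liger_kernel_to_llama,
    "qwen2": liger_kernel.transformers.apply_liger_kernel_to_qwen2,
}


def apply_liger_patch(model_name: str) -> None:
    """Apply the Liger patch whose key appears in model_name, e.g. 'Qwen/Qwen2-7B'."""
    key = next((k for k in _LIGER_PATCHES if k in model_name.lower()), None)
    if key is None:
        raise ValueError(f"No Liger kernel patch registered for {model_name}")
    _LIGER_PATCHES[key]()


# Usage, mirroring the guard in training.py:
# if custom_args.use_liger is True:
#     apply_liger_patch("Qwen/Qwen2-7B")
```
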
