[core / tests] v1 slow tests #1218
@@ -0,0 +1,56 @@ (new file: slow-test runner script for the DPO example)

```bash
#!/bin/bash
# This script runs a DPO example end-to-end on a tiny model using different possible configurations,
# but defaults to QLoRA + PEFT.
OUTPUT_DIR="test_dpo/"
MODEL_NAME="HuggingFaceM4/tiny-random-LlamaForCausalLM"
MAX_STEPS=5
BATCH_SIZE=2
SEQ_LEN=128

# Handle extra arguments in case one passes accelerate configs.
EXTRA_ACCELERATE_ARGS=""
EXTRA_TRAINING_ARGS="""--use_peft \
    --load_in_4bit
"""

# Set your number of GPUs here
NUM_GPUS=2

if [[ "${TRL_ACCELERATE_CONFIG}" == "" ]]; then
    EXTRA_ACCELERATE_ARGS=""
else
    EXTRA_ACCELERATE_ARGS="--config_file $TRL_ACCELERATE_CONFIG"
    # For DeepSpeed configs we need to set the `--fp16` flag to comply with the configs exposed
    # in `examples/accelerate_configs`, since our runners do not support bf16 mixed precision training.
    if [[ $TRL_ACCELERATE_CONFIG == *"deepspeed"* ]]; then
        EXTRA_TRAINING_ARGS="--fp16"
    else
        echo "Keeping QLoRA + PEFT"
    fi
fi

CMD="""
accelerate launch $EXTRA_ACCELERATE_ARGS \
    --num_processes $NUM_GPUS \
    --mixed_precision 'fp16' \
    `pwd`/examples/scripts/dpo.py \
    --model_name_or_path $MODEL_NAME \
    --output_dir $OUTPUT_DIR \
    --max_steps $MAX_STEPS \
    --per_device_train_batch_size $BATCH_SIZE \
    --max_length $SEQ_LEN \
    $EXTRA_TRAINING_ARGS
"""

echo "Starting program..."

{ # try
    echo $CMD
    eval "$CMD"
} || { # catch
    # save log for exception
    echo "Operation Failed!"
    exit 1
}
exit 0
```
> **Review comment on lines +48 to +56:** The bash script will be run on the assumption that:
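For context, a rough usage sketch of this runner: it is driven entirely by the `TRL_ACCELERATE_CONFIG` environment variable. The script path and the DeepSpeed config filename below are illustrative assumptions, not names confirmed by this diff.

```bash
# Default run: falls back to QLoRA + PEFT on the tiny model (no accelerate config set).
bash commands/run_dpo.sh  # hypothetical path for the script above

# DeepSpeed run: the script spots "deepspeed" in the config path and swaps
# the PEFT flags for --fp16 (config filename is a hypothetical example).
TRL_ACCELERATE_CONFIG=examples/accelerate_configs/deepspeed_zero1.yaml \
    bash commands/run_dpo.sh
```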
@@ -0,0 +1,59 @@ (new file: slow-test runner script for the SFT example)

```bash
#!/bin/bash
# This script runs an SFT example end-to-end on a tiny model using different possible configurations,
# but defaults to QLoRA + PEFT.
OUTPUT_DIR="test_sft/"
MODEL_NAME="HuggingFaceM4/tiny-random-LlamaForCausalLM"
DATASET_NAME="imdb"
MAX_STEPS=5
BATCH_SIZE=2
SEQ_LEN=128

# Handle extra arguments in case one passes accelerate configs.
EXTRA_ACCELERATE_ARGS=""
EXTRA_TRAINING_ARGS="""--use_peft \
    --load_in_4bit
"""

# Set your number of GPUs here
NUM_GPUS=2

if [[ "${TRL_ACCELERATE_CONFIG}" == "" ]]; then
    EXTRA_ACCELERATE_ARGS=""
else
    EXTRA_ACCELERATE_ARGS="--config_file $TRL_ACCELERATE_CONFIG"
    # For DeepSpeed configs we need to set the `--fp16` flag to comply with the configs exposed
    # in `examples/accelerate_configs`, since our runners do not support bf16 mixed precision training.
    if [[ $TRL_ACCELERATE_CONFIG == *"deepspeed"* ]]; then
        EXTRA_TRAINING_ARGS="--fp16"
    else
        echo "Keeping QLoRA + PEFT"
    fi
fi

CMD="""
accelerate launch $EXTRA_ACCELERATE_ARGS \
    --num_processes $NUM_GPUS \
    --mixed_precision 'fp16' \
    `pwd`/examples/scripts/sft.py \
    --model_name $MODEL_NAME \
    --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --max_steps $MAX_STEPS \
    --batch_size $BATCH_SIZE \
    --seq_length $SEQ_LEN \
    $EXTRA_TRAINING_ARGS
"""

echo "Starting program..."

{ # try
    echo $CMD
    eval "$CMD"
} || { # catch
    # save log for exception
    echo "Operation Failed!"
    exit 1
}
exit 0
```
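Both runner scripts share the same bash error-handling idiom: the command group `{ ... }` is chained to a fallback group with `||`, so the fallback runs only when the first group exits non-zero. A minimal standalone sketch of just that pattern:

```bash
#!/bin/bash
# try/catch-style grouping: the second block only runs if the first fails.
{ # try
    false  # stand-in for any command that may fail
} || { # catch
    echo "Operation Failed!"
    exit 1
}
exit 0
```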
@@ -55,7 +55,7 @@ RUN source activate trl && \

```diff
     transformers \
     accelerate \
     peft \
-    trl
+    trl[test]@git+https://github.com/huggingface/trl

 RUN source activate trl && \
     pip freeze | grep trl
```

> **Review comment:** It makes sense to always build TRL from source on our docker images, actually.
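The new dependency spec is a standard pip "direct reference", so the Dockerfile change can be reproduced outside Docker with the equivalent standalone command:

```bash
# Install TRL from the main branch with the test extras (same spec as the Dockerfile line).
pip install "trl[test]@git+https://github.com/huggingface/trl"
```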
@@ -0,0 +1,16 @@ (new file: accelerate config)

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: "NO"
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

> **Review comment on `mixed_precision: 'bf16'`:** shouldn't that also be
>
> **Reply:** After thinking a bit, I think we can keep everything bf16, let me push something.
>
> **Review comment on `num_processes: 8`:** isn't that 8 GPUs?
>
> **Reply:** Yes, but it gets overwritten in the shell script to keep the config file untouched.
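The reply above relies on standard `accelerate` behavior: flags passed on the command line take precedence over values in the config file. A hedged sketch of how the runner scripts use this (the config path is illustrative):

```bash
# The YAML sets num_processes: 8, but the CLI flag wins, so the same config
# file works unchanged on a 2-GPU runner.
export TRL_ACCELERATE_CONFIG=path/to/config.yaml  # hypothetical path
NUM_GPUS=2
accelerate launch --config_file $TRL_ACCELERATE_CONFIG \
    --num_processes $NUM_GPUS \
    examples/scripts/sft.py --max_steps 5
```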
@@ -45,7 +45,7 @@ class ScriptArguments:

```diff
     batch_size: Optional[int] = field(default=64, metadata={"help": "the batch size"})
     seq_length: Optional[int] = field(default=512, metadata={"help": "Input sequence length"})
     gradient_accumulation_steps: Optional[int] = field(
-        default=16, metadata={"help": "the number of gradient accumulation steps"}
+        default=1, metadata={"help": "the number of gradient accumulation steps"}
     )
     load_in_8bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 8 bits precision"})
     load_in_4bit: Optional[bool] = field(default=False, metadata={"help": "load the model in 4 bits precision"})
```

@@ -55,14 +55,16 @@ class ScriptArguments:

```diff
     peft_lora_r: Optional[int] = field(default=64, metadata={"help": "the r parameter of the LoRA adapters"})
     peft_lora_alpha: Optional[int] = field(default=16, metadata={"help": "the alpha parameter of the LoRA adapters"})
     logging_steps: Optional[int] = field(default=1, metadata={"help": "the number of logging steps"})
-    use_auth_token: Optional[bool] = field(default=True, metadata={"help": "Use HF auth token to access the model"})
+    use_auth_token: Optional[bool] = field(default=False, metadata={"help": "Use HF auth token to access the model"})
     num_train_epochs: Optional[int] = field(default=3, metadata={"help": "the number of training epochs"})
     max_steps: Optional[int] = field(default=-1, metadata={"help": "the number of training steps"})
     save_steps: Optional[int] = field(
         default=100, metadata={"help": "Number of update steps between two checkpoint saves"}
     )
     save_total_limit: Optional[int] = field(default=10, metadata={"help": "Limits total number of checkpoints."})
     push_to_hub: Optional[bool] = field(default=False, metadata={"help": "Push the model to HF Hub"})
+    fp16: Optional[bool] = field(default=False, metadata={"help": "Whether to activate fp16 mixed precision"})
+    bf16: Optional[bool] = field(default=False, metadata={"help": "Whether to activate bf16 mixed precision"})
     gradient_checkpointing: Optional[bool] = field(
         default=False, metadata={"help": "Whether to use gradient checkpointing or not"}
     )
```

> **Review comment on the new `fp16`/`bf16` fields:** Those were missing before.

@@ -126,6 +128,8 @@

```diff
         push_to_hub=script_args.push_to_hub,
         hub_model_id=script_args.hub_model_id,
         gradient_checkpointing=script_args.gradient_checkpointing,
+        fp16=script_args.fp16,
+        bf16=script_args.bf16,
         # TODO: uncomment that on the next release
         # gradient_checkpointing_kwargs=script_args.gradient_checkpointing_kwargs,
     )
```
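With the new fields forwarded into `TrainingArguments`, mixed precision can now be toggled from the command line. A hedged usage sketch, reusing the tiny-model defaults from the runner scripts above:

```bash
# Run the SFT example with fp16 mixed precision enabled via the new flag.
python examples/scripts/sft.py \
    --model_name HuggingFaceM4/tiny-random-LlamaForCausalLM \
    --dataset_name imdb \
    --max_steps 5 \
    --fp16
```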
> **Review comment:** I propose to support common tests in a follow-up PR; right now many of them fail because of device mismatch problems. I think it is OK not to have them for now, as these tests are run anyway on the CI.
>
> **Reply:** Sounds good, that would indeed be a nice addition, since we had issues in the past with device mismatch which this would have caught.