[core / tests] v1 slow tests #1218
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I left some early comments @lvwerra - I would appreciate it if you could have a look while I continue roughly testing everything!
.github/workflows/slow-tests.yml (Outdated)
# Run only when python files are modified
- "trl/**.py"
- "examples/**.py"
branches: [ add-slow-tests ]
To modify after final pass
pip install -e . --no-deps
pip install pytest-reportlog

- name: Run common tests on single GPU
I propose to support common tests in a follow-up PR. Right now many of them fail because of device mismatch problems; I think it is OK not to have them for now, since these tests are run on the CI anyway.
Sounds good, that would indeed be a nice addition, since we had issues in the past with device mismatches, which this would have caught.
{ # try
    echo $CMD
    eval "$CMD"
} || { # catch
    # save log for exception
    echo "Operation Failed!"
    exit 1
}
exit 0
The bash script is run under the following assumption: if the training fails, it returns exit status 1, otherwise 0. The Makefile command then retrieves the exit status of the previous bash command to determine whether the script failed or not.
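For illustration, anything that consumes this convention (for example, a test or CI step that launches one of the example scripts) only needs to check the return code. A minimal sketch, assuming a hypothetical wrapper script; this is not the PR's actual code:

```python
# Minimal sketch of consuming the exit-status convention: launch the training
# command and treat any non-zero exit status as a failure.
import subprocess

cmd = "bash scripts/run_example.sh"  # hypothetical wrapper that exits 1 on failure
result = subprocess.run(cmd, shell=True)

if result.returncode != 0:
    raise RuntimeError(f"Training command failed with exit status {result.returncode}")
print("Training command succeeded (exit status 0)")
```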
@@ -55,7 +55,7 @@ RUN source activate trl && \
    transformers \
    accelerate \
    peft \
-   trl
+   trl[test]@git+https://github.com/huggingface/trl
It actually makes sense to always build TRL from source in our Docker images.
examples/scripts/dpo.py (Outdated)
I adapted the DPO script to make sure it supports QLoRA; I feel this feature is quite underrated today and should be publicized much more.
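For context, QLoRA here means loading the base model in 4-bit and training LoRA adapters on top. A minimal sketch of that setup, where the checkpoint, hyperparameters, and target modules are illustrative rather than the script's exact values:

```python
# Hedged sketch of a QLoRA setup for DPO: 4-bit quantized base model + LoRA adapters.
# Checkpoint, hyperparameters, and target modules are illustrative only.
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # illustrative checkpoint
    quantization_config=bnb_config,
    device_map={"": 0},
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# The peft_config is then passed to DPOTrainer; with a quantized base model and no
# explicit ref_model, the trainer can fall back to the adapter-disabled base model
# as the implicit reference.
```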
@@ -63,6 +63,8 @@ class ScriptArguments:
    )
    save_total_limit: Optional[int] = field(default=10, metadata={"help": "Limits total number of checkpoints."})
    push_to_hub: Optional[bool] = field(default=False, metadata={"help": "Push the model to HF Hub"})
    fp16: Optional[bool] = field(default=False, metadata={"help": "Whether to activate fp16 mixed precision"})
Those were missing before
scripts/log_reports.py (Outdated)
This is mostly copied and adapted from the equivalent script in peft.
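Roughly, that script reads the JSON-lines file produced by pytest-reportlog and turns failures into a Slack report. A minimal sketch of the parsing side (the file name is illustrative, and this is not the PR's exact code):

```python
# Hedged sketch of the log-parsing step: pytest-reportlog writes one JSON object
# per pytest report, so failed test calls can be collected line by line.
import json
from pathlib import Path

failed_tests = []
for line in Path("log.jsonl").read_text().splitlines():  # illustrative file name
    record = json.loads(line)
    if (
        record.get("$report_type") == "TestReport"
        and record.get("when") == "call"
        and record.get("outcome") == "failed"
    ):
        failed_tests.append(record["nodeid"])

if failed_tests:
    print(f"{len(failed_tests)} failed test(s):")
    print("\n".join(failed_tests))
else:
    print("All tests passed 🎉")
```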
Thanks a lot for working on this @younesbelkada, it's awesome! Left a few small comments. My main question is about how to structure and organize the test suite. At the moment I am a bit confused, since the DPO/SFT tests are actually not in the test suite but in the example scripts.
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
Isn't that 8 GPUs?
Yes, but it gets overwritten in the shell script to keep the config file untouched.
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
Shouldn't that also be fp16?
After thinking a bit, I think we can keep everything bf16; let me push something.
tests/slow/testing_constants.py (Outdated)
# limitations under the License.

# TODO: push them under trl-org
MODELS_TO_TEST = ["HuggingFaceM4/tiny-random-LlamaForCausalLM", "HuggingFaceM4/tiny-random-MistralForCausalLM"]
Maybe two other archs could be Mixtral and Phi
I can add Phi; however, the Mixtral tiny checkpoint is quite large (~200MB), making it longer to test :/
.github/workflows/slow-tests.yml (Outdated)
make slow_tests_single_gpu
make slow_dpo_tests
What do you think about just calling that slow_tests or slow_accelerator_tests, and doing both SFT and DPO together rather than separately? If we add more and more trainers to the tests, it gets a bit confusing.
Thinking about it more, what do you think about renaming/structuring the tests into:
- slow_tests: they actually run the test suite and test small, self-contained parts of the code
- script_tests: they run the example scripts with different options, which tests the pipeline end to end
What do you think? At the moment they are a bit intertwined, which I find a bit confusing.
I agree, let me push a few things.
I modified the make command into a more global command that tests DPO + SFT. In the future, if we want to test, say, KTO, we just need to extend that command.
Looks good, just a few small questions!
Thanks @lvwerra for all your time reviewing this big PR!
* v1 slow tests
* nit
* add qlora tests for DPO
* add decorator
* release memory + log reports
* report to none to avoid seg fault issues
* update setup
* fix
* add exampel testing
* fix nit
* change temp filename
* add workflow file
* fix comment
* add slack push script
* more tests for DPO
* add dpo example tests
* another makefile command
* fix
* add paths + clean up
* nit
* Update slow-tests.yml
* trigger tests
* up
* up
* more fixes
* fix
* final fixes
* minor fixes
* oops
* add more text
* fix
* more
* trigger CI
* up
* fix
* remove
* run the tests on 2 GPUs only
* final fix SFT
* revert config files + address comments
* fix
* add Phi
* final fixes
* final fix
What does this PR do?
This PR makes the current CI stronger and more robust by introducing slow GPU tests, in order to catch issues such as #1216 before each release (this happened twice this week!). The goal is to run as many tests as possible to cover almost all use cases of the most-used classes.
I'll progressively update the PR by:
- Adding the correct workflow file
- Adding the Slack messaging system for getting feedback from the CI (for examples)
- Extending the tests with
and test everything
cc @lvwerra @lewtun
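To make the shape of these slow tests concrete, here is a hedged sketch of what one such GPU test might look like. It is not the PR's exact code: the SFTTrainer arguments may differ across trl versions, and the dataset slice and hyperparameters are illustrative. It parametrizes over the tiny checkpoints listed in tests/slow/testing_constants.py and runs a couple of training steps end to end on GPU, with `report_to="none"` echoing the "report to none to avoid seg fault issues" commit above.

```python
# Hedged sketch of a slow GPU test: parametrized over the tiny checkpoints from
# tests/slow/testing_constants.py, it runs a couple of training steps end to end.
import pytest
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

MODELS_TO_TEST = [
    "HuggingFaceM4/tiny-random-LlamaForCausalLM",
    "HuggingFaceM4/tiny-random-MistralForCausalLM",
]


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a GPU")
@pytest.mark.parametrize("model_id", MODELS_TO_TEST)
def test_sft_trainer_runs_on_gpu(model_id, tmp_path):
    dataset = load_dataset("imdb", split="train[:32]")  # illustrative tiny slice
    args = TrainingArguments(
        output_dir=str(tmp_path),
        per_device_train_batch_size=2,
        max_steps=2,
        report_to="none",  # avoids logger side effects on the CI
    )
    trainer = SFTTrainer(
        model=model_id,
        args=args,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=128,
    )
    trainer.train()
```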