Switch amd-ci to use MI300X runner. #428
Conversation
The tests are failing. That is what I meant by the need for an interactive env to debug. @tjtanaa, can you help us look into the failure since you have AMD GPU access? Thanks!
Yeah, fair, but hopefully once we get past this initial bring-up, it will be easier to identify the issues when they fail. If @tjtanaa is not able to resolve it easily, we can provide dedicated access to one node in the short term to make this initial bring-up easier :)
CI Setup

I am running my tests on MI250 and I managed to run all the tests successfully without hitting the error you are facing. I do not have clear details about the environment on your CI machine, but my suspicion is that the installed PyTorch and Triton versions might not be fully compatible. Could you try reinstalling PyTorch and pinning triton==3.0.0?

Note: You could also try the PyTorch stable version. However, I still use the nightly version because it is what the vLLM Dockerfile.rocm uses, which should be proven stable. Install Triton from upstream.

Side note about the unit test case

In the … With this, all the tests should pass. Have you ever considered running the CI tests within a Docker environment? That could make the env more controlled and replicable on different AMD CI machines.
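As a quick way to act on the "pin triton==3.0.0" suggestion above, a small version-gate helper (hypothetical, not part of Liger-Kernel) could fail fast in CI before the test suite runs. The function and its name are illustrative; it only handles plain `X.Y.Z` version strings.

```python
def meets_min(installed: str, minimum: str = "3.0.0") -> bool:
    """Return True if a plain X.Y.Z version string meets the minimum.

    Hypothetical helper for a CI sanity check; does not handle local
    version suffixes like '3.0.0+rocm' or pre-release tags.
    """
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(installed) >= parse(minimum)

# In CI this would be fed importlib.metadata.version("triton"):
print(meets_min("3.0.0"))  # → True
print(meets_min("2.3.1"))  # → False
```

In a real workflow step this would raise (or exit non-zero) when the check fails, so an incompatible Triton is caught before any kernel test runs.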
@saienduri is it possible to run in Docker?
@ByronHsu I can see from your screenshot that your environment has NVIDIA dependencies installed. I suspect that the
Yes, we can run in Docker. Please provide the image we would like to run within, along with
Glad to know that triton >= 3.0.0 works on AMD. The screenshot is from https://github.com/linkedin/Liger-Kernel/actions/runs/12174881791/job/33957645620?pr=428. I love the Docker idea. @tjtanaa, can you create a Docker image for Liger and hand it over to @saienduri? It would be great if we could publish the image through our CI too.
@saienduri @ByronHsu The following are some additional details, which are also highlighted in the PR description.

Details
sudo docker run \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
liger-ci \
/bin/bash -c "make test-convergence; make test"
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
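The failures above are all attribute lookups that changed names across `transformers` releases. One version-tolerant pattern (a sketch, not Liger's actual fix — `resolve_token_id` and `FakeConfig` are illustrative names) is to try each known attribute name in order:

```python
def resolve_token_id(config, *candidates):
    """Return the first token-id attribute present on `config`, else None.

    Illustrative helper: the candidate names mirror the AttributeError
    hints in the failing tests ('video_token_id' vs 'vision_token_id').
    """
    for name in candidates:
        value = getattr(config, name, None)
        if value is not None:
            return value
    return None

class FakeConfig:
    # Stand-in for Qwen2VLConfig; the id value is made up for illustration.
    vision_token_id = 151654

print(resolve_token_id(FakeConfig(), "video_token_id", "vision_token_id"))
```

Pinning `transformers==4.46.3`, as the PR does, sidesteps the rename entirely; a fallback like this would instead let the tests run on both sides of the API change.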
## Summary

This PR suggests the use of Docker images to run the Liger Kernel CI on AMD machines.

## Details

When running `docker run` you might need `sudo`. In this version, I have not included it. If you find problems using the AMD GPU within the Docker container, please add back the `sudo`. E.g.

```bash
sudo docker run \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  liger-ci \
  /bin/bash -c "make test-convergence; make test"
```

## Additional Details

The `transformers` version has been pinned to `4.46.3` as there are API changes in later versions of `transformers`. E.g., I tested with `transformers==4.47.0` and the following errors occurred.

```bash
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
```

## Testing Done

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

Co-authored-by: tjtanaa <[email protected]>
Thanks @tjtanaa! I have given you maintainer access. You can now push to
@ByronHsu I have treated the

I have also temporarily added the condition to change the

However, I do think that the flakiness requires more investigation later on. It is not flaky in nature; it consistently fails and the numerical values are always the same.
cd ..
python -m pip install pytest pytest-xdist pytest-rerunfailures pytest-flakefinder pytest-cpp
python -m pip uninstall -y torch torchvision
python -m pip install --pre \
This is a bit nasty. Can we have an easy way to install the AMD deps? Like `pip install liger-kernel[amd]`
Addressing this in PR #436.
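For context on what `pip install liger-kernel[amd]` would entail: one common way to wire it up is an optional-dependency group in `pyproject.toml`. The fragment below is a hypothetical sketch — the package names and pins are taken from this thread, not from Liger-Kernel's actual packaging metadata (PR #436 may do this differently).

```toml
# Hypothetical sketch -- not Liger-Kernel's real metadata.
[project.optional-dependencies]
amd = [
    "triton==3.0.0",
    "transformers==4.46.3",
]
```

With such a group declared, `pip install liger-kernel[amd]` installs the base package plus the AMD-specific pins in one step.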
@@ -74,6 +74,7 @@ def forward(self, x):
        return output.type_as(x)

@pytest.mark.flaky(reruns=3, reruns_delay=2)
Will this count as "pass" after all reruns fail?
It will be counted as "FAILED".
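To make the rerun semantics concrete: with pytest-rerunfailures, a test marked `flaky(reruns=3)` gets one initial attempt plus up to 3 retries, is reported PASSED as soon as any attempt passes, and is reported FAILED only when every attempt fails. A small simulation of that logic (not pytest's actual implementation; function names are illustrative):

```python
def run_with_reruns(test_fn, reruns=3):
    """Simulate pytest-rerunfailures: 1 initial attempt + up to `reruns` retries.

    Returns 'PASSED' if any attempt succeeds, 'FAILED' if all attempts fail.
    """
    for _attempt in range(1 + reruns):
        try:
            test_fn()
            return "PASSED"
        except AssertionError:
            continue
    return "FAILED"

calls = {"n": 0}

def passes_on_third_try():
    calls["n"] += 1
    assert calls["n"] >= 3  # fails on the first two attempts

def always_fails():
    assert False

print(run_with_reruns(passes_on_third_try))  # → PASSED
print(run_with_reruns(always_fails))         # → FAILED
```

So the marker only absorbs intermittent failures; a deterministic failure like the one discussed above still surfaces as FAILED after exhausting the reruns.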
Merging for now! Thanks for the great work @saienduri @tjtanaa! Let's resolve my comments in follow-up PRs.
This commit switches the amd-ci workflow to use the MI300X GPU provided by AMD for testing coverage.