
Switch amd-ci to use MI300X runner. #428

Merged: 38 commits from saienduri/amd-ci into main on Dec 7, 2024

Conversation

saienduri (Collaborator)

This commit switches the amd-ci workflow to use the MI300X GPU provided by AMD for test coverage.

@ByronHsu (Collaborator) commented Dec 5, 2024

The tests are failing. That is what I meant about needing an interactive environment to debug. @tjtanaa can you help us look into the failure since you have AMD GPU access? Thanks!

@saienduri (Collaborator, Author) commented Dec 5, 2024

Yeah, fair, but hopefully once we get past this initial bring-up it will be easier to identify issues when tests fail. If @tjtanaa is not able to resolve this easily, we can provide dedicated access to one node in the short term to make the initial bring-up easier :)

@tjtanaa (Collaborator) commented Dec 5, 2024

@ByronHsu

CI Setup

I am running the tests on an MI250 and I managed to run all of them successfully without hitting the error that you are facing.

I do not have clear details about the environment on your CI machine, but my suspicion is that the installed PyTorch and Triton versions might not be fully compatible.

Could you try reinstalling PyTorch and pinning Triton to triton==3.0.0?

```bash
python3 -m pip uninstall -y torch torchvision \
            && python3 -m pip install --pre \
                torch==2.6.0.dev20241113+rocm6.2 \
                'setuptools-scm>=8' \
                torchvision==0.20.0.dev20241113+rocm6.2 \
                --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
```

Note: you could also try the PyTorch stable release. However, I still use this nightly version because it is what the vLLM Dockerfile.rocm uses, which should make it a well-proven choice.
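
For reference, a stable install would look something like the following sketch (PyTorch publishes stable ROCm wheels under a similar index URL; the exact versions are resolved by pip):

```bash
# Stable ROCm build instead of the nightly, from PyTorch's stable rocm6.2 wheel index.
python3 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
```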

Install Triton from upstream with `python3 -m pip install triton==3.0.0` to resolve the issue with Triton's cache_manager (triton-lang/triton#5013).

Side note about a unit test case

In test/transformers/test_cross_entropy.py::test_float32_internal, num_warps needs to be set to 16 on AMD hardware.
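
After changing num_warps, that one test can be re-run in isolation to confirm the fix, e.g.:

```bash
# Run just the affected test by its pytest node ID.
python3 -m pytest -v test/transformers/test_cross_entropy.py::test_float32_internal
```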

With this, all the tests should pass.


Have you ever considered running the CI tests within a Docker environment? That could make the environment more controlled and replicable across different AMD CI machines.

@ByronHsu (Collaborator) commented Dec 5, 2024

@saienduri is it possible to run in docker?
@tjtanaa We are using triton 3.1.0 and torch 2.5.1. Is only 3.1.0 working on AMD?

[screenshot: pip package list from the CI job, showing NVIDIA dependencies installed]

@tjtanaa (Collaborator) commented Dec 5, 2024

@ByronHsu
I have tested triton==3.0.0 and triton==3.1.0 with torch==2.6.0.dev20241113+rocm6.2.
I don't quite understand the question "Is only 3.1.0 working on AMD?". I think any Triton version >= 3.0.0 should work on AMD.

I can see from your screenshot that your environment has NVIDIA dependencies installed, so I suspect the torch in that environment is the NVIDIA build. If you install the torch build for AMD machines, you will also see the pytorch-triton-rocm dependency in your environment.
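
A quick way to check which build is installed (a sketch; torch.version.hip is a standard PyTorch attribute that is a version string on ROCm builds and None on CUDA builds):

```bash
# Print the torch version and its HIP version; HIP prints "None" on a CUDA build.
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
# ROCm torch wheels pull in pytorch-triton-rocm; CUDA wheels pull in nvidia-* packages.
python3 -m pip list | grep -i -E "torch|triton"
```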

@saienduri (Collaborator, Author) commented Dec 5, 2024

Yes, we can run in Docker. Please let us know which image we should run in, along with the `docker pull` and `docker run` commands that work for you @tjtanaa

@ByronHsu (Collaborator) commented Dec 5, 2024

Glad to know that triton >= 3.0.0 works on AMD. The screenshot is from https://github.com/linkedin/Liger-Kernel/actions/runs/12174881791/job/33957645620?pr=428.

I love the Docker idea. @tjtanaa can you create a Docker image for Liger and hand it over to @saienduri? It would be great if we could publish the image through our CI too.

@tjtanaa (Collaborator) commented Dec 6, 2024

@saienduri @ByronHsu
I have opened PR #430 against your branch (saienduri/amd-ci). Can you validate that the steps work? I have validated the Dockerfile.rocm and the command added to .github/workflows/amd-ci.yml. The only thing not tested is whether the amd-ci.yml workflow itself is set up correctly.
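
For orientation, the workflow step reduces to something like the following sketch (not the exact contents of PR #430; the liger-ci tag, device flags, and make targets are taken from the docker run command quoted below):

```bash
# Build the CI image from the ROCm Dockerfile at the repository root.
docker build -f Dockerfile.rocm -t liger-ci .

# Run the test targets inside the container, exposing the AMD GPU devices.
docker run --device /dev/kfd --device /dev/dri --group-add=video \
    liger-ci /bin/bash -c "make test-convergence; make test"
```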

The following are some additional details, which are also highlighted in the PR description.

Details

1. When running the `docker run` command you might need `sudo`. In this version I have not included it; if you have problems using the AMD GPU within the Docker container, add `sudo` back. E.g.
```bash
sudo docker run \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   liger-ci \
   /bin/bash -c "make test-convergence; make test"
```
2. The `transformers` version has been pinned to `4.46.3` because there are API changes in later versions of `transformers`. E.g. I tested with `transformers==4.47.0` and the errors below occurred (a pin command sketch follows the log).
```
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
```
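
The pin itself is a one-liner; where exactly it lives in PR #430 (requirements file vs. Dockerfile.rocm) is not shown here, but the command form would be:

```bash
# Pin transformers to the last release whose Qwen2-VL config attributes match the tests.
python3 -m pip install "transformers==4.46.3"
```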

@ByronHsu (Collaborator) commented Dec 6, 2024

Thanks @tjtanaa! I have given you maintainer access, so you can now push to saienduri/amd-ci in the main repo. In the meantime, @tyler-romero can you look at the qwen2_vl issue?

ByronHsu added the AMD label on Dec 7, 2024
@tjtanaa (Collaborator) commented Dec 7, 2024

@ByronHsu
All tests now pass with the following changes.

I have treated test_rms_norm::test_correctness as a flaky test for now by adding the decorator @pytest.mark.flaky(reruns=3, reruns_delay=2) to the test case.

I have also temporarily added a condition to change num_warps in test/transformers/test_cross_entropy.py::test_float32_internal().

However, I do think the flakiness requires more investigation later on. It is not flaky in nature: it consistently fails, and the numerical values are always the same.
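
One way to investigate later is to hammer the test in a single pytest session with pytest-flakefinder (it is installed in the CI snippet below); the test file path here is inferred from the test name and may differ:

```bash
# Re-run the suspect test 10 times in one session; a deterministic failure with
# identical numerics every run points to a numerical issue rather than true flakiness.
python3 -m pytest test/transformers/test_rms_norm.py -k test_correctness \
    --flake-finder --flake-runs=10
```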

```bash
cd ..
python -m pip install pytest pytest-xdist pytest-rerunfailures pytest-flakefinder pytest-cpp
python -m pip uninstall -y torch torchvision
python -m pip install --pre \
```

Collaborator:

This is a bit nasty. Can we have an easy way to install the AMD dependencies? Like `pip install liger-kernel[amd]`.

Collaborator:

Addressing this in PR #436.

```diff
@@ -74,6 +74,7 @@ def forward(self, x):
         return output.type_as(x)
 
 
+@pytest.mark.flaky(reruns=3, reruns_delay=2)
```

Collaborator:

Will this count as "pass" if all the reruns fail?

Collaborator:

It will be counted as "FAILED".

@ByronHsu (Collaborator) commented Dec 7, 2024

Merging for now! Thanks for the great work @saienduri @tjtanaa! Let's resolve my comments in follow-up PRs.

ByronHsu merged commit 189c411 into main on Dec 7, 2024
5 checks passed
ByronHsu deleted the saienduri/amd-ci branch on Dec 7, 2024