
Switch amd-ci to use MI300X runner. #428

Merged: 38 commits from saienduri/amd-ci into main on Dec 7, 2024

Conversation

saienduri (Collaborator)

This commit switches the amd-ci workflow to use the MI300X GPU provided by AMD for test coverage.

@ByronHsu (Collaborator) commented Dec 5, 2024

The tests are failing. That is what I meant about needing an interactive environment to debug. @tjtanaa can you help us look into the failure since you have AMD GPU access? Thanks!

@saienduri (Collaborator, Author) commented Dec 5, 2024

Yeah, fair, but hopefully once we get past this initial bring-up it will be easier to identify issues when tests fail. If @tjtanaa is not able to resolve this easily, we can provide dedicated access to one node in the short term to make the initial bring-up easier :)

@tjtanaa (Collaborator) commented Dec 5, 2024

@ByronHsu

CI Setup

I am running the tests on an MI250 and I managed to run all of them successfully without hitting the error that you are facing.

I do not have clear details about the environment on your CI machine, but my suspicion is that the installed PyTorch and Triton versions might not be fully compatible.

Could you try reinstalling PyTorch and pinning Triton to triton==3.0.0?

```bash
python3 -m pip uninstall -y torch torchvision \
            && python3 -m pip install --pre \
                torch==2.6.0.dev20241113+rocm6.2 \
                'setuptools-scm>=8' \
                torchvision==0.20.0.dev20241113+rocm6.2 \
                --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
```

Note: you could also try the PyTorch stable release. However, I still use this nightly version because it is what the vLLM Dockerfile.rocm uses, which should make it a well-proven choice.
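
For reference, a stable install would look something like the following sketch (PyTorch publishes stable ROCm wheels under a similar index URL; the exact versions are resolved by pip):

```bash
# Stable ROCm build instead of the nightly, from PyTorch's stable rocm6.2 wheel index.
python3 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
```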

Install Triton from upstream with `python3 -m pip install triton==3.0.0` to resolve the issue with Triton's cache_manager (triton-lang/triton#5013).

Side note about a unit test case

In test/transformers/test_cross_entropy.py::test_float32_internal, num_warps needs to be set to 16 on AMD hardware.
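
After changing num_warps, that one test can be re-run in isolation to confirm the fix, e.g.:

```bash
# Run just the affected test by its pytest node ID.
python3 -m pytest -v test/transformers/test_cross_entropy.py::test_float32_internal
```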

With this, all the tests should pass.


Have you ever considered running the CI tests within a Docker environment? That could make the environment more controlled and replicable across different AMD CI machines.

@ByronHsu (Collaborator) commented Dec 5, 2024

@saienduri is it possible to run in docker?
@tjtanaa We are using triton 3.1.0 and torch 2.5.1. Is only 3.1.0 working on AMD?

[screenshot: pip package list from the CI job, showing NVIDIA dependencies installed]

@tjtanaa (Collaborator) commented Dec 5, 2024

@ByronHsu
I have tested triton==3.0.0 and triton==3.1.0 with torch==2.6.0.dev20241113+rocm6.2.
I don't quite understand the question "Is only 3.1.0 working on AMD?". I think any Triton version >= 3.0.0 should work on AMD.

I can see from your screenshot that your environment has NVIDIA dependencies installed, so I suspect the torch in that environment is the NVIDIA build. If you install the torch build for AMD machines, you will also see the pytorch-triton-rocm dependency in your environment.
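
A quick way to check which build is installed (a sketch; torch.version.hip is a standard PyTorch attribute that is a version string on ROCm builds and None on CUDA builds):

```bash
# Print the torch version and its HIP version; HIP prints "None" on a CUDA build.
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
# ROCm torch wheels pull in pytorch-triton-rocm; CUDA wheels pull in nvidia-* packages.
python3 -m pip list | grep -i -E "torch|triton"
```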

@saienduri (Collaborator, Author) commented Dec 5, 2024

Yes, we can run in Docker. Please let us know which image we should run in, along with the `docker pull` and `docker run` commands that work for you @tjtanaa

@ByronHsu (Collaborator) commented Dec 5, 2024

Glad to know that triton >= 3.0.0 works on AMD. The screenshot is from https://github.com/linkedin/Liger-Kernel/actions/runs/12174881791/job/33957645620?pr=428.

I love the Docker idea. @tjtanaa can you create a Docker image for Liger and hand it over to @saienduri? It would be great if we could publish the image through our CI too.

@tjtanaa (Collaborator) commented Dec 6, 2024

@saienduri @ByronHsu
I have opened PR #430 against your branch (saienduri/amd-ci). Can you validate that the steps work? I have validated the Dockerfile.rocm and the command added to .github/workflows/amd-ci.yml. The only thing not tested is whether the amd-ci.yml workflow itself is set up correctly.
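
For orientation, the workflow step reduces to something like the following sketch (not the exact contents of PR #430; the liger-ci tag, device flags, and make targets are taken from the docker run command quoted below):

```bash
# Build the CI image from the ROCm Dockerfile at the repository root.
docker build -f Dockerfile.rocm -t liger-ci .

# Run the test targets inside the container, exposing the AMD GPU devices.
docker run --device /dev/kfd --device /dev/dri --group-add=video \
    liger-ci /bin/bash -c "make test-convergence; make test"
```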

The following are some additional details, which are also highlighted in the PR description.

Details

1. When running the `docker run` command you might need `sudo`. In this version I have not included it; if you have problems using the AMD GPU within the Docker container, add `sudo` back. E.g.
```bash
sudo docker run \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   liger-ci \
   /bin/bash -c "make test-convergence; make test"
```
2. The `transformers` version has been pinned to `4.46.3` because there are API changes in later versions of `transformers`. E.g. I tested with `transformers==4.47.0` and the errors below occurred (a pin command sketch follows the log).
```
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype0-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_multimodal.py::test_mini_model_multimodal[mini_qwen2_vl-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'video_token_id'. Did you mean: 'vision_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
FAILED test/convergence/test_mini_models_no_logits.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.01-0.1-0.01-0.01-0.01] - AttributeError: 'Qwen2VLConfig' object has no attribute 'image_token_id'. Did you mean: 'pad_token_id'?
```
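
The pin itself is a one-liner; where exactly it lives in PR #430 (requirements file vs. Dockerfile.rocm) is not shown here, but the command form would be:

```bash
# Pin transformers to the last release whose Qwen2-VL config attributes match the tests.
python3 -m pip install "transformers==4.46.3"
```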

@ByronHsu (Collaborator) commented Dec 6, 2024

Thanks @tjtanaa! I have given you maintainer access, so you can now push to saienduri/amd-ci in the main repo. In the meantime, @tyler-romero can you look at the qwen2_vl issue?

ByronHsu added the AMD label on Dec 7, 2024
@tjtanaa (Collaborator) commented Dec 7, 2024

@ByronHsu
All tests now pass with the following changes.

I have treated test_rms_norm::test_correctness as a flaky test for now by adding the decorator @pytest.mark.flaky(reruns=3, reruns_delay=2) to the test case.

I have also temporarily added a condition to change num_warps in test/transformers/test_cross_entropy.py::test_float32_internal().

However, I do think the flakiness requires more investigation later on. It is not flaky in nature: it consistently fails, and the numerical values are always the same.
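
One way to investigate later is to hammer the test in a single pytest session with pytest-flakefinder (it is installed in the CI snippet below); the test file path here is inferred from the test name and may differ:

```bash
# Re-run the suspect test 10 times in one session; a deterministic failure with
# identical numerics every run points to a numerical issue rather than true flakiness.
python3 -m pytest test/transformers/test_rms_norm.py -k test_correctness \
    --flake-finder --flake-runs=10
```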

```bash
cd ..
python -m pip install pytest pytest-xdist pytest-rerunfailures pytest-flakefinder pytest-cpp
python -m pip uninstall -y torch torchvision
python -m pip install --pre \
```

Collaborator:

This is a bit nasty. Can we have an easy way to install the AMD dependencies? Like `pip install liger-kernel[amd]`.

Collaborator:

Addressing this in PR #436.

```diff
@@ -74,6 +74,7 @@ def forward(self, x):
         return output.type_as(x)
 
 
+@pytest.mark.flaky(reruns=3, reruns_delay=2)
```

Collaborator:

Will this count as "pass" if all the reruns fail?

Collaborator:

It will be counted as "FAILED".

@ByronHsu (Collaborator) commented Dec 7, 2024

Merging for now! Thanks for the great work @saienduri @tjtanaa! Let's resolve my comments in follow-up PRs.

ByronHsu merged commit 189c411 into main on Dec 7, 2024
5 checks passed
ByronHsu deleted the saienduri/amd-ci branch on Dec 7, 2024