
[CI] Add ROCm 6.3 CI #506

Status: Open · wants to merge 3 commits into main

Conversation

Collaborator

@tjtanaa tjtanaa commented Dec 30, 2024

Summary

This PR enables ROCm 6.3 CI, given that the PyTorch nightly build now includes ROCm 6.3. However, it should not be merged yet: for unexplained reasons, one of the convergence tests passes on ROCm 6.2 but fails on ROCm 6.3.
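
For reference, a quick way to confirm which ROCm version the installed PyTorch nightly build targets (a minimal sketch; the exact version string depends on the wheel):

    import torch

    # torch.version.hip is a version string (e.g. "6.3.x") on ROCm builds of
    # PyTorch and None on CUDA-only builds.
    print("HIP/ROCm version:", torch.version.hip)
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))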

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Additional Info (2025-01-02)

  • All the tests passed on ROCm 6.2.
  • Only one test failed on ROCm 6.3:
    • FAILED CONVERGENCE TEST
    ====================================================== short test summary info =======================================================
    FAILED test/convergence/test_mini_models.py::test_mini_model[mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: Number of mismatched elements: 2
    Mismatch at index (0, 7): tensor1[(0, 7)] = 3.0651497840881348, tensor2[(0, 7)] = 3.0652356147766113
    Mismatch at index (0, 9): tensor1[(0, 9)] = 1.470238447189331, tensor2[(0, 9)] = 1.4702625274658203
    ======================================== 1 failed, 16 passed, 2 warnings in 94.82s (0:01:34) ==================
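
For context, the assertion above comes from an elementwise tolerance comparison that reports the indices of mismatched values. A minimal sketch of such a check is below (the helper name and default tolerances are illustrative, not Liger-Kernel's actual test/utils.py):

    import torch

    def assert_close_verbose(tensor1, tensor2, rtol=1e-5, atol=1e-8, max_report=5):
        # Elementwise closeness check; collect indices that fall outside tolerance.
        mismatch = ~torch.isclose(tensor1, tensor2, rtol=rtol, atol=atol)
        num_mismatched = int(mismatch.sum())
        if num_mismatched == 0:
            return
        lines = [f"Number of mismatched elements: {num_mismatched}"]
        for idx in mismatch.nonzero()[:max_report]:
            idx = tuple(idx.tolist())
            lines.append(
                f"Mismatch at index {idx}: tensor1[{idx}] = {tensor1[idx].item()}, "
                f"tensor2[{idx}] = {tensor2[idx].item()}"
            )
        raise AssertionError("\n".join(lines))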
    

@tjtanaa tjtanaa requested a review from ByronHsu December 30, 2024 15:53
@jagadish-amd

Hello @tjtanaa
I am able to reproduce the issue and am debugging it.
I had to pin the numpy version to 1.26.4 to run the test.

Command to run the specific test:
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence/test_mini_models.py -k mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05

@jagadish-amd

jagadish-amd commented Jan 9, 2025

Updating the debug status.
I took the ROCm 6.2 wheel from https://download.pytorch.org/whl/nightly/rocm6.2 to check whether any BLAS calls differ from ROCm 6.3.
However, on my MI210 node, I see that the loss has NaN values on ROCm 6.2. The entire loss_list is NaN, both with_liger and without Liger. Hence the check at https://github.com/linkedin/Liger-Kernel/blob/main/test/utils.py#L87 turns out to be False and the test is deemed PASSED.
..
Step 27 True, Loss: nan
Step 28 True, Loss: nan
Step 29 True, Loss: nan
Step 30 True, Loss: nan
Step 31 True, Loss: nan

I guess the ROCm 6.2 results might be a false positive.

(Enabled logging for the passing test case:
HF_DATASETS_OFFLINE=1 python -m pytest -rP --disable-warnings test/convergence/test_mini_models.py -k mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05)
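
A minimal sketch of an explicit finiteness guard that would surface this kind of silent NaN pass (illustrative only; not the current behavior of test/utils.py, and the list names in the usage comment are hypothetical):

    import math

    def assert_losses_finite(loss_list, label=""):
        # Fail fast if any recorded loss is NaN or inf, instead of letting an
        # all-NaN run slip through the downstream comparison.
        for step, loss in enumerate(loss_list):
            assert math.isfinite(loss), f"{label} loss is {loss} at step {step}"

    # Hypothetical usage inside the convergence loop:
    # assert_losses_finite(loss_list_with_liger, label="with_liger")
    # assert_losses_finite(loss_list_without_liger, label="without_liger")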

@tjtanaa
Collaborator Author

tjtanaa commented Jan 9, 2025


@austin362667
@jagadish-amd

I have also tried looking at other PASS cases using ROCm 6.2.

The following PASS cases also contain NaN loss:

test_mini_model[mini_llama3-32-0.0001-dtype0-1e-08-2e-05-0.0001-1e-05-0.005-1e-05]
test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_mllama-32-0.0001-dtype3-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_qwen2-32-0.0001-dtype4-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_qwen2-32-0.0001-dtype5-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-05-0.1-0.005-1e-05-0.005-1e-05] 
test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.05-0.1-0.01-0.01-0.01] 
test_mini_model[mini_mistral-32-0.0001-dtype10-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_mistral-32-0.0001-dtype11-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_gemma1-32-0.0001-dtype12-1e-08-0.0001-0.005-1e-05-0.005-1e-05] 
test_mini_model[mini_gemma1-32-0.0001-dtype13-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_gemma1.1-32-0.0001-dtype14-1e-08-0.0001-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_gemma1.1-32-0.0001-dtype15-0.001-0.01-0.1-0.01-0.01-0.01]

P.S.
@austin362667 should we

  1. disable this test for AMD for now? (A possible skip marker is sketched after this list.)
    FAILED test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AssertionError: Number of mismatched elements: 2
    Mismatch at index (0, 7): tensor1[(0, 7)] = 3.0651497840881348, tensor2[(0, 7)] = 3.0652356147766113
    Mismatch at index (0, 9): tensor1[(0, 9)] = 1.470238447189331, tensor2[(0, 9)] = 1.4702625274658203
    ======================================== 1 failed, 16 passed, 2 warnings in 94.82s (0:01:34) ==================
  2. disable ROCm 6.2 and only run ROCm 6.3?
    REASONS:
    1. @jagadish-amd has validated on MI300X that the convergence tests output NaN when using PyTorch ROCm 6.2, even though all the tests pass.
    2. MI300X tests generate valid (non-NaN) values when using PyTorch ROCm 6.3, and only one test case fails:
      FAILED test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AssertionError: Number of mismatched elements:
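
If option 1 is chosen, one possible way to skip the case on AMD is a ROCm-conditional marker. A minimal sketch (the marker placement, test name, and reason string are illustrative, not the actual parametrization in test_mini_models.py):

    import pytest
    import torch

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    IS_ROCM = torch.version.hip is not None

    @pytest.mark.skipif(IS_ROCM, reason="Known loss mismatch on ROCm; see this PR's discussion")
    def test_mini_model_rocm_placeholder():
        # Placeholder body; in the real suite the marker would go on the
        # failing parametrized mini_mllama case.
        assert True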

@austin362667
Collaborator

austin362667 commented Jan 20, 2025

@tjtanaa @jagadish-amd Thanks for diving into this!
Is it okay to just use ROCm 6.3 without understanding why the bugs occurred in ROCm 6.2? What's your opinion? @hebiao064 @Tcc0403
