
[CI] Add ROCm 6.3 CI #506

Status: Open · wants to merge 3 commits into main

Conversation

Collaborator

@tjtanaa tjtanaa commented Dec 30, 2024

Summary

This PR enables ROCm 6.3 CI, given that the PyTorch nightly build now includes ROCm 6.3. However, it should not be merged yet: for unexplained reasons, one of the convergence tests passes on ROCm 6.2 but fails on ROCm 6.3.
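
For reference, a quick way to confirm which ROCm version the installed PyTorch nightly build targets (a minimal sketch; the exact version string depends on the wheel):

    import torch

    # torch.version.hip is a version string (e.g. "6.3.x") on ROCm builds of
    # PyTorch and None on CUDA-only builds.
    print("HIP/ROCm version:", torch.version.hip)
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))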

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Additional Info (2025-01-02)

  • All the tests passed on ROCm 6.2.
  • Only one test failed on ROCm 6.3:
    • FAILED CONVERGENCE TEST
    ====================================================== short test summary info =======================================================
    FAILED test/convergence/test_mini_models.py::test_mini_model[mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05] - AssertionError: Number of mismatched elements: 2
    Mismatch at index (0, 7): tensor1[(0, 7)] = 3.0651497840881348, tensor2[(0, 7)] = 3.0652356147766113
    Mismatch at index (0, 9): tensor1[(0, 9)] = 1.470238447189331, tensor2[(0, 9)] = 1.4702625274658203
    ======================================== 1 failed, 16 passed, 2 warnings in 94.82s (0:01:34) ==================
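
For context, the assertion above comes from an elementwise tolerance comparison that reports the indices of mismatched values. A minimal sketch of such a check is below (the helper name and default tolerances are illustrative, not Liger-Kernel's actual test/utils.py):

    import torch

    def assert_close_verbose(tensor1, tensor2, rtol=1e-5, atol=1e-8, max_report=5):
        # Elementwise closeness check; collect indices that fall outside tolerance.
        mismatch = ~torch.isclose(tensor1, tensor2, rtol=rtol, atol=atol)
        num_mismatched = int(mismatch.sum())
        if num_mismatched == 0:
            return
        lines = [f"Number of mismatched elements: {num_mismatched}"]
        for idx in mismatch.nonzero()[:max_report]:
            idx = tuple(idx.tolist())
            lines.append(
                f"Mismatch at index {idx}: tensor1[{idx}] = {tensor1[idx].item()}, "
                f"tensor2[{idx}] = {tensor2[idx].item()}"
            )
        raise AssertionError("\n".join(lines))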
    

@tjtanaa tjtanaa requested a review from ByronHsu December 30, 2024 15:53
@jagadish-amd

Hello @tjtanaa
I am able to reproduce the issue and am debugging it.
I had to pin the numpy version to 1.26.4 to run the test.

Command to run the specific test:
HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence/test_mini_models.py -k mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05

@jagadish-amd

jagadish-amd commented Jan 9, 2025

Updating the debug status.
I took the ROCm 6.2 wheel from https://download.pytorch.org/whl/nightly/rocm6.2 to check whether any BLAS calls differ from ROCm 6.3.
However, on my MI210 node, I see that the loss has NaN values on ROCm 6.2. The entire loss_list is NaN, both with_liger and without Liger. Hence the check at https://github.com/linkedin/Liger-Kernel/blob/main/test/utils.py#L87 turns out to be False and the test is deemed PASSED.
..
Step 27 True, Loss: nan
Step 28 True, Loss: nan
Step 29 True, Loss: nan
Step 30 True, Loss: nan
Step 31 True, Loss: nan

I guess the ROCm 6.2 results might be a false positive.

(Enabled logging for the passing test case:
HF_DATASETS_OFFLINE=1 python -m pytest -rP --disable-warnings test/convergence/test_mini_models.py -k mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05)
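
A minimal sketch of an explicit finiteness guard that would surface this kind of silent NaN pass (illustrative only; not the current behavior of test/utils.py, and the list names in the usage comment are hypothetical):

    import math

    def assert_losses_finite(loss_list, label=""):
        # Fail fast if any recorded loss is NaN or inf, instead of letting an
        # all-NaN run slip through the downstream comparison.
        for step, loss in enumerate(loss_list):
            assert math.isfinite(loss), f"{label} loss is {loss} at step {step}"

    # Hypothetical usage inside the convergence loop:
    # assert_losses_finite(loss_list_with_liger, label="with_liger")
    # assert_losses_finite(loss_list_without_liger, label="without_liger")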

@tjtanaa
Collaborator Author

tjtanaa commented Jan 9, 2025


@austin362667
@jagadish-amd

I have also tried looking at other PASS cases using ROCm 6.2.

The following PASS cases also contain NaN loss:

test_mini_model[mini_llama3-32-0.0001-dtype0-1e-08-2e-05-0.0001-1e-05-0.005-1e-05]
test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_mllama-32-0.0001-dtype3-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_qwen2-32-0.0001-dtype4-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_qwen2-32-0.0001-dtype5-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-1e-05-0.1-0.005-1e-05-0.005-1e-05] 
test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.05-0.1-0.01-0.01-0.01] 
test_mini_model[mini_mistral-32-0.0001-dtype10-1e-08-1e-05-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_mistral-32-0.0001-dtype11-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_gemma1-32-0.0001-dtype12-1e-08-0.0001-0.005-1e-05-0.005-1e-05] 
test_mini_model[mini_gemma1-32-0.0001-dtype13-0.001-0.01-0.1-0.01-0.01-0.01]
test_mini_model[mini_gemma1.1-32-0.0001-dtype14-1e-08-0.0001-0.005-1e-05-0.005-1e-05]
test_mini_model[mini_gemma1.1-32-0.0001-dtype15-0.001-0.01-0.1-0.01-0.01-0.01]

P.S.
@austin362667 should we

  1. disable this test for AMD for now? (A possible skip marker is sketched after this list.)
    FAILED test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AssertionError: Number of mismatched elements: 2
    Mismatch at index (0, 7): tensor1[(0, 7)] = 3.0651497840881348, tensor2[(0, 7)] = 3.0652356147766113
    Mismatch at index (0, 9): tensor1[(0, 9)] = 1.470238447189331, tensor2[(0, 9)] = 1.4702625274658203
    ======================================== 1 failed, 16 passed, 2 warnings in 94.82s (0:01:34) ==================
  2. disable ROCm 6.2 and only run ROCm 6.3?
    REASONS:
    1. @jagadish-amd has validated on MI300X that the convergence tests output NaN when using PyTorch ROCm 6.2, even though all the tests pass.
    2. MI300X tests generate valid (non-NaN) values when using PyTorch ROCm 6.3, and only one test case fails:
      FAILED test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] - AssertionError: Number of mismatched elements:
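
If option 1 is chosen, one possible way to skip the case on AMD is a ROCm-conditional marker. A minimal sketch (the marker placement, test name, and reason string are illustrative, not the actual parametrization in test_mini_models.py):

    import pytest
    import torch

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    IS_ROCM = torch.version.hip is not None

    @pytest.mark.skipif(IS_ROCM, reason="Known loss mismatch on ROCm; see this PR's discussion")
    def test_mini_model_rocm_placeholder():
        # Placeholder body; in the real suite the marker would go on the
        # failing parametrized mini_mllama case.
        assert True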

@austin362667
Collaborator

austin362667 commented Jan 20, 2025

@tjtanaa @jagadish-amd Thanks for diving into this!
Is it okay to just use ROCm 6.3 without understanding why the bugs occurred in ROCm 6.2? What's your opinion? @hebiao064 @Tcc0403
