Switch amd-ci to use MI300X runner. #428

Merged: 38 commits, Dec 7, 2024

Commits (38):
- `18bce05` temporarily run amd ci on any path with mi300 runner (saienduri, Dec 5, 2024)
- `171e96d` checkstyle on ubuntu; tests on amd gpu (saienduri, Dec 5, 2024)
- `e985b84` [AMD] [CI] Added Dockerfile and AMD-CI test workflow (#430) (tjtanaa, Dec 6, 2024)
- `f88ed31` skip installing system dependencies (tjtanaa, Dec 6, 2024)
- `6d37fc4` validate if adding sudo to docker build will grant permission (tjtanaa, Dec 6, 2024)
- `b094e7c` check the workspace location and what is in there (tjtanaa, Dec 6, 2024)
- `69872db` fix Dockerfile (tjtanaa, Dec 6, 2024)
- `07a3a62` Skip ci in docker image (tjtanaa, Dec 6, 2024)
- `96e9fe1` temporary fix test (tjtanaa, Dec 6, 2024)
- `163e89b` fix checkstyle (tjtanaa, Dec 6, 2024)
- `28f67c5` upgrade torch (tjtanaa, Dec 6, 2024)
- `36f83d6` temporary skip test_cross_entropy::test_float32_internal (tjtanaa, Dec 6, 2024)
- `cb5e232` check amd ci machine environment (tjtanaa, Dec 6, 2024)
- `1d999d8` muted modal gpu ci while setting up amd ci (tjtanaa, Dec 6, 2024)
- `d0521ad` fix syntax (tjtanaa, Dec 6, 2024)
- `42e12a9` run test using docker (tjtanaa, Dec 6, 2024)
- `cca2aae` skip to test-convergence (tjtanaa, Dec 6, 2024)
- `5971ffa` switch back to not use docker (tjtanaa, Dec 6, 2024)
- `1330456` reenable crossentropy _test_float32_internal test (tjtanaa, Dec 6, 2024)
- `d2571d8` use docker in amd ci (tjtanaa, Dec 6, 2024)
- `223c054` test torch latest dev version (tjtanaa, Dec 6, 2024)
- `4e90d3f` fix test_cross_entropy_test (tjtanaa, Dec 6, 2024)
- `c2cc168` run only failed test (tjtanaa, Dec 6, 2024)
- `b16a7bc` run only failed test (tjtanaa, Dec 6, 2024)
- `9c8d119` downgrade triton to 3.0.0 (tjtanaa, Dec 6, 2024)
- `f70eb6c` turn back triton version to 3.1.0 (tjtanaa, Dec 6, 2024)
- `334b8b5` reenable convergence test (tjtanaa, Dec 6, 2024)
- `44a1335` set pytest num_process to 1 and install amdsmi (tjtanaa, Dec 6, 2024)
- `f0d8b30` set pytest num_process to 1 (tjtanaa, Dec 6, 2024)
- `f34cd33` install pytest plugins (tjtanaa, Dec 6, 2024)
- `333d6ba` downgrade torch 2.6.0 to 20241113 (tjtanaa, Dec 6, 2024)
- `3afa73e` check python environment (tjtanaa, Dec 6, 2024)
- `1f992b9` log more of the CI machine info; set to use 1 gpu only for unittest (tjtanaa, Dec 7, 2024)
- `3107286` show more rocm info; set numpy to 1.26.4 (tjtanaa, Dec 7, 2024)
- `eec5b88` add reruns to test_rms_norm::test_correctness (tjtanaa, Dec 7, 2024)
- `427064a` remove HIP_VISIBLE_DEVICE from Makefile (tjtanaa, Dec 7, 2024)
- `3bd7901` fix amd-ci.yml syntax (tjtanaa, Dec 7, 2024)
- `f6ad875` remove Dockerfile.rocm (tjtanaa, Dec 7, 2024)
75 changes: 51 additions & 24 deletions .github/workflows/amd-ci.yml
@@ -1,23 +1,23 @@
name: GitHub Actions CI (AMD)

# on:
# push:
# branches:
# - main
# paths:
# - "src/**"
# - "test/**"
# pull_request:
# branches:
# - main
# paths:
# - "src/**"
# - "test/**"
on:
push:
branches:
- main
paths:
- "src/**"
- "test/**"
pull_request:
branches:
- main
# paths:
# - "src/**"
# - "test/**"

# concurrency:
# # This causes it to cancel previous in-progress actions on the same PR / branch,
# group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
# cancel-in-progress: true
concurrency:
# This causes it to cancel previous in-progress actions on the same PR / branch,
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
checkstyle:
@@ -36,12 +36,11 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install flake8 isort black

- name: Run checkstyle
run: make checkstyle

tests:
runs-on: ubuntu-latest
runs-on: linux-mi300-gpu-1
needs: [checkstyle]

steps:
@@ -53,12 +52,40 @@
with:
python-version: '3.10'

- name: Install dependencies
- name: Check Docker Version
run: docker version

- name: Check Ubuntu version
run: lsb_release -a

- name: Check Hardware Specs
run: lscpu

- name: ROCM-SMI Output
run: |
python -m pip install --upgrade pip
pip install -e ".[dev]"
rocm-smi
rocm-smi --showproductname

- name: Run tests
- name: Setup Dependencies
run: |
cp -r /opt/rocm/share/amd_smi ./
cd amd_smi
python -m pip install -e .
cd ..
python -m pip install pytest pytest-xdist pytest-rerunfailures pytest-flakefinder pytest-cpp
python -m pip uninstall -y torch torchvision
python -m pip install --pre \
torch==2.6.0.dev20241113+rocm6.2 \
'setuptools-scm>=8' \
torchvision==0.20.0.dev20241113+rocm6.2 \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
python -m pip install triton==3.1.0 transformers==4.46.3
python -m pip install -e .[dev]

Review comment (Collaborator), on the pip install step: this is a bit nasty. Can we have an easy way to install the AMD deps? Like `pip install liger-kernel[amd]`

Reply (Collaborator): Addressing this in PR #436.

- name: List Python Environments
run: python -m pip list

- name: Run Unit Tests
run: |
make test
make test-convergence
make test-convergence
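The `concurrency` block enabled above cancels superseded runs on the same PR or branch. As an illustrative sketch (not part of the PR; the function name is hypothetical), the group-key expression `${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}` behaves like:

```python
def concurrency_group(workflow: str, pr_number, ref: str) -> str:
    # GitHub's `||` falls back to the ref when the event carries no PR
    # number, so pushes key on the branch ref and pull requests key on
    # the PR number. When a newer run starts with the same key, the
    # in-progress run sharing that key is cancelled.
    return f"{workflow}-{pr_number if pr_number else ref}"

print(concurrency_group("GitHub Actions CI (AMD)", 428, "refs/pull/428/merge"))
# -> GitHub Actions CI (AMD)-428
print(concurrency_group("GitHub Actions CI (AMD)", None, "refs/heads/main"))
# -> GitHub Actions CI (AMD)-refs/heads/main
```

This is why two quick pushes to this PR leave only the latest CI run alive, while runs for different PRs never cancel each other.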
5 changes: 3 additions & 2 deletions test/transformers/test_cross_entropy.py
@@ -12,6 +12,7 @@
from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
from liger_kernel.transformers.functional import liger_cross_entropy
from liger_kernel.utils import infer_device
from liger_kernel.ops.utils import is_hip

device = infer_device()
set_seed(42)
@@ -763,7 +764,7 @@ def test_float32_internal():
RETURN_Z_LOSS=0, # False
HAS_SOFTCAPPING=False,
BLOCK_SIZE=BLOCK_SIZE,
num_warps=32,
num_warps=32 if not is_hip() else 16,
)

# Run kernel for float32
@@ -787,7 +788,7 @@ def test_float32_internal():
RETURN_Z_LOSS=0, # False
HAS_SOFTCAPPING=False,
BLOCK_SIZE=BLOCK_SIZE,
num_warps=32,
num_warps=32 if not is_hip() else 16,
)

torch.allclose(X_bf16, X_fp32.bfloat16())
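The `num_warps` change above halves the warp count when the kernel runs on ROCm. A minimal sketch of the selection logic (the helper name `pick_num_warps` is hypothetical; the real test inlines the expression with `is_hip()`):

```python
def pick_num_warps(on_hip: bool) -> int:
    # AMD CDNA GPUs such as the MI300X execute 64-lane wavefronts, versus
    # 32-lane warps on NVIDIA, so the maximum num_warps Triton supports is
    # lower on a HIP (ROCm) build; the test drops from 32 to 16 there.
    return 16 if on_hip else 32

print(pick_num_warps(False))  # -> 32 (CUDA build)
print(pick_num_warps(True))   # -> 16 (ROCm build)
```

A typical `is_hip()` implementation just checks whether the installed PyTorch is a ROCm build (for example, whether `torch.version.hip` is set); the exact check used by the repository is in `liger_kernel.ops.utils`.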
1 change: 1 addition & 0 deletions test/transformers/test_rms_norm.py
@@ -74,6 +74,7 @@ def forward(self, x):
return output.type_as(x)


@pytest.mark.flaky(reruns=3, reruns_delay=2)
Review comment (Collaborator): will this count as "pass" after all reruns fail?

Reply (Collaborator): It will be counted as "FAILED".

@pytest.mark.parametrize(
"bs, sl, hd",
[
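To clarify the rerun semantics discussed in the review thread above, here is a small sketch of how `pytest-rerunfailures` reports a test marked `@pytest.mark.flaky(reruns=3)` (`run_with_reruns` is an illustrative stand-in, not pytest's API):

```python
def run_with_reruns(test, reruns: int) -> str:
    # pytest-rerunfailures retries a failing test up to `reruns` extra
    # times: a single passing attempt yields PASSED, and the test is
    # reported FAILED only when all 1 + reruns attempts fail.
    for _ in range(1 + reruns):
        if test():
            return "PASSED"
    return "FAILED"

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    return attempts["n"] >= 3  # fails twice, then passes

print(run_with_reruns(flaky, 3))          # -> PASSED
print(run_with_reruns(lambda: False, 3))  # -> FAILED
```

So the marker added to `test_rms_norm::test_correctness` masks intermittent failures on the MI300X runner, but a test that fails on every attempt still fails the CI job, as the reply confirms.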