Switch amd-ci to use MI300X runner. #428
Merged
38 commits
18bce05 (saienduri) temporarily run amd ci on any path with mi300 runner
171e96d (saienduri) checkstyle on ubuntu; tests on amd gpu
e985b84 (tjtanaa) [AMD] [CI] Added Dockerfile and AMD-CI test workflow (#430)
f88ed31 (tjtanaa) skip installing system dependencies
6d37fc4 (tjtanaa) validate if adding sudo to docker build will grant permission
b094e7c (tjtanaa) check the workspace location and what is in there
69872db (tjtanaa) fix Dockerfile
07a3a62 (tjtanaa) Skip ci in docker image
96e9fe1 (tjtanaa) temporary fix test
163e89b (tjtanaa) fix checkstyle
28f67c5 (tjtanaa) upgrade torch
36f83d6 (tjtanaa) temporary skip test_cross_entropy::test_float32_internal
cb5e232 (tjtanaa) check amd ci machine environment
1d999d8 (tjtanaa) muted modal gpu ci while setting up amd ci
d0521ad (tjtanaa) fix syntax
42e12a9 (tjtanaa) run test using docker
cca2aae (tjtanaa) skip to test-convergence
5971ffa (tjtanaa) switch back to not use docker
1330456 (tjtanaa) reenable crossentropy _test_float32_internal test
d2571d8 (tjtanaa) use docker in amd ci
223c054 (tjtanaa) test torch latest dev version
4e90d3f (tjtanaa) fix test_cross_entropy_test
c2cc168 (tjtanaa) run only failed test
b16a7bc (tjtanaa) run only failed test
9c8d119 (tjtanaa) downgrade triton to 3.0.0
f70eb6c (tjtanaa) turn back triton version to 3.1.0
334b8b5 (tjtanaa) reenable convergence test
44a1335 (tjtanaa) set pytest num_process to 1 and install amdsmi
f0d8b30 (tjtanaa) set pytest num_process to 1
f34cd33 (tjtanaa) install pytest plugins
333d6ba (tjtanaa) downgrade torch 2.6.0 to 20241113
3afa73e (tjtanaa) check python environment
1f992b9 (tjtanaa) log more of the CI machine info; set to use 1 gpu only for unittest
3107286 (tjtanaa) show more rocm info; set numpy to 1.26.4
eec5b88 (tjtanaa) add reruns to test_rms_norm::test_correctness
427064a (tjtanaa) remove HIP_VISIBLE_DEVICE from Makefile
3bd7901 (tjtanaa) fix amd-ci.yml syntax
f6ad875 (tjtanaa) remove Dockerfile.rocm
@@ -74,6 +74,7 @@ def forward(self, x):
         return output.type_as(x)


+@pytest.mark.flaky(reruns=3, reruns_delay=2)
 @pytest.mark.parametrize(
     "bs, sl, hd",
     [

Review comment: Will this count as "pass" after all rerun "fails"?
Reply: It will be counted as "FAILED".
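The rerun behaviour discussed in the review thread can be illustrated with a small standalone simulation. This is only a sketch of the semantics described in the reply (a test that still fails after all reruns is reported as failed), assuming the `flaky` marker comes from the pytest-rerunfailures plugin. The helper `run_with_reruns` and the two sample tests are hypothetical names written for this illustration; they are not part of pytest, the plugin, or this PR.

```python
import time


def run_with_reruns(test_fn, reruns=3, reruns_delay=0.0):
    """Hypothetical helper: run test_fn once, then retry up to `reruns` times.

    Returns (outcome, attempts). The outcome is "PASSED" as soon as one
    attempt succeeds, and "FAILED" if every attempt raises AssertionError.
    """
    attempts = 0
    for attempt in range(1 + reruns):
        attempts += 1
        try:
            test_fn()
            return "PASSED", attempts
        except AssertionError:
            # Wait before the next attempt, but not after the last one.
            if attempt < reruns:
                time.sleep(reruns_delay)
    return "FAILED", attempts


def always_fails():
    # Stands in for a test that is genuinely broken.
    assert False, "numerical mismatch"


def make_flaky(fail_times):
    """Build a test that fails `fail_times` times and then passes."""
    state = {"calls": 0}

    def test():
        state["calls"] += 1
        assert state["calls"] > fail_times, "transient failure"

    return test


if __name__ == "__main__":
    print(run_with_reruns(always_fails, reruns=3))
    print(run_with_reruns(make_flaky(2), reruns=3))
```

With `reruns=3` a test gets at most four attempts in this sketch: the original run plus three reruns. A test that is broken on every attempt still ends up failed, so the marker hides only intermittent failures.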
This is a bit nasty. Can we have an easy way to install the AMD dependencies? Like
pip install liger-kernel[amd]
Addressing this in PR #436.