[WIP] Training on AMD / ROCm #302
Conversation
Do they only need to update to ROCm 6.0.5?
PyTorch 2.3.0 was released, which seems to have improved built-in FlashAttention support on ROCm. We need to wait for the new rocm/pytorch tag. Also, I haven't seen any update to the LUMI driver yet.
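As a quick way to check the claim above on a given PyTorch build, one can ask PyTorch whether its fused (Flash) scaled-dot-product attention backend is enabled. This is a hedged sketch using the public `torch.backends.cuda.flash_sdp_enabled()` query from PyTorch 2.x (on ROCm builds the GPU is still reported through the `torch.cuda` namespace); the helper name `flash_sdp_status` is ours:

```python
# Hedged sketch: report whether PyTorch's fused (Flash) scaled-dot-product
# attention backend is active. Degrades gracefully if torch or a GPU is absent.
def flash_sdp_status() -> str:
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():  # ROCm builds also report via torch.cuda
        return "no GPU visible"
    return ("flash SDP enabled"
            if torch.backends.cuda.flash_sdp_enabled()
            else "flash SDP disabled")

print(flash_sdp_status())
```

Running this inside the rocm/pytorch container should tell you whether the fused path is actually available before launching a training job.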
This is the current information from the LUMI team.
Things are moving at LUMI, but at a glacial pace: "We will be taking the system offline for maintenance starting on Monday, 19 August 2024. LUMI won't be accessible, as this will affect all the partitions."
I could resolve the memory issue but not the multi-GPU one.
Fixed now in #344 with the latest LUMI upgrade. |
Currently, there are the following issues with PyTorch on LUMI / AMD MI250x:
With the custom flash attention package, single-GPU training works OK though.
I followed the LUMI pytorch setup from here: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/#example-for-distributed-learning
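For the multi-GPU case, launchers like the one in the LUMI docs typically start one process per GCD/GPU via `srun` and then let the training script read its rank from the environment. The sketch below is an assumption-laden illustration, not the LUMI recipe itself: the `SLURM_*` variables are standard SLURM, the `slurm_to_torch_env` helper is ours, and deriving `MASTER_ADDR` from the node list is site-specific, so it is only passed through here:

```python
import os

# Hedged sketch: map the environment SLURM sets for each srun task
# (one task per GPU) onto the rendezvous variables torch.distributed expects.
def slurm_to_torch_env(env: dict) -> dict:
    return {
        "RANK": env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],   # total number of tasks
        "LOCAL_RANK": env["SLURM_LOCALID"],  # rank within the node
        # MASTER_ADDR/MASTER_PORT must point at rank 0; resolving the first
        # node of SLURM_NODELIST is site-specific, so we only pass through.
        "MASTER_ADDR": env.get("MASTER_ADDR", "localhost"),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),
    }

if __name__ == "__main__":
    fallback = {"SLURM_PROCID": "0", "SLURM_NTASKS": "1", "SLURM_LOCALID": "0"}
    mapping = slurm_to_torch_env(os.environ if "SLURM_PROCID" in os.environ
                                 else fallback)
    print(mapping)
    # In the training script, after exporting these variables:
    #   torch.distributed.init_process_group(backend="nccl")  # RCCL on ROCm
```

On ROCm the `"nccl"` backend string selects RCCL, so existing NCCL-based DDP code usually works unchanged once the rendezvous variables are set.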