
[WIP] Training on AMD / ROCm #302

Closed
wants to merge 58 commits into from

Conversation

jpata
Owner

@jpata jpata commented Mar 22, 2024

Currently, there are the following issues with pytorch on LUMI / AMD MI250x:

With the custom flash attention package, single-GPU training works OK, though.

I followed the LUMI pytorch setup from here: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/#example-for-distributed-learning
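The distributed-learning setup in the linked LUMI docs derives each process's rank from environment variables that Slurm (or torchrun) provides. As an illustrative sketch only (this is not the PR's code; the variable names are the standard Slurm/torchrun ones), rank resolution for such a launch can look like:

```python
import os

def get_rank_info(env=None):
    """Derive (global rank, world size, local rank) from the environment.

    Prefers the torchrun-style variables and falls back to the Slurm ones,
    as in typical multi-GPU Slurm launch scripts. Illustrative sketch only.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", env.get("SLURM_PROCID", "0")))
    world_size = int(env.get("WORLD_SIZE", env.get("SLURM_NTASKS", "1")))
    local_rank = int(env.get("LOCAL_RANK", env.get("SLURM_LOCALID", "0")))
    return rank, world_size, local_rank

# Example: a Slurm job step with 8 tasks, this process being task 3
# (local task 1 on its node):
print(get_rank_info({"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}))
# → (3, 8, 1)
```

The tuple returned here would then feed `torch.distributed.init_process_group` and the device selection for each rank.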

Joosep Pata added 2 commits March 22, 2024 20:57
@paolodalberto

paolodalberto commented Mar 25, 2024

Do they only need to update to rocm 6.0.5?
Instead of waiting for a new Docker image, you could do what others have done: use an up-to-date ROCm repository rather than wait for a proper Docker image.

@jpata
Owner Author

jpata commented Apr 26, 2024

PyTorch 2.3.0 has been released, and it appears to include improved built-in FlashAttention support on ROCm:
https://github.com/pytorch/pytorch/releases/tag/v2.3.0
[Screenshot of the PyTorch 2.3.0 release notes]

Need to wait for the new rocm/pytorch tag:
https://hub.docker.com/r/rocm/pytorch/tags

Also I haven't seen any update to the LUMI driver yet:
https://lumi-supercomputer.github.io/LUMI-training-materials/User-Updates/
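Since PyTorch 2.0, the built-in attention entry point is `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused flash kernel where the backend supports it (per the release notes above, on ROCm from 2.3.0). A minimal sketch, assuming `torch` is installed; on CPU this falls back to the math backend, so it runs anywhere:

```python
import torch
import torch.nn.functional as F

# Tiny self-attention call: batch=1, heads=8, sequence=128, head_dim=64.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# On CPU this dispatches to the math backend; on a GPU build with
# flash support (ROCm from torch 2.3.0), the fused flash-attention
# kernel can be selected for supported shapes and dtypes.
out = F.scaled_dot_product_attention(q, k, v)
print(tuple(out.shape))  # → (1, 8, 128, 64)
```

Using this built-in entry point would remove the need for the custom flash attention package mentioned above.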

@jpata
Owner Author

jpata commented Apr 29, 2024

This is the current information from the LUMI team.

"Thank you for sending this improvement request. We are very aware of issues associated with ROCm environment not being updated frequently. The next system upgrade that would happen in summer will include GPU driver and ROCm update. Details will be announced whenever precise schedule is decided."

@jpata
Owner Author

jpata commented Jun 12, 2024

Things are moving at LUMI but at a glacial pace:

"We will be taking the system offline for maintenance starting on Monday, 19 August, 2024. LUMI won't be accessible as this will affect all the partitions.
Significant parts of the system software will be updated, and more particularly the system software stack, in order to get a more stable and up-to-date system after the break. We expect the system to be back in production on Monday, 9 September, 2024."

@jpata jpata self-assigned this Jun 12, 2024
@jpata jpata marked this pull request as draft June 12, 2024 11:40
@paolodalberto

I could resolve the memory issue, but not the multi-GPU one.

@jpata
Owner Author

jpata commented Sep 16, 2024

Fixed now in #344 with the latest LUMI upgrade.

@jpata jpata closed this Sep 16, 2024