
[WIP] Training on AMD / ROCm #302

Closed
wants to merge 58 commits into from

Conversation

jpata
Owner

@jpata jpata commented Mar 22, 2024

Currently, there are the following issues with pytorch on LUMI / AMD MI250x:

With the custom flash attention package, single-GPU training works OK, though.

I followed the LUMI pytorch setup from here: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/#example-for-distributed-learning
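The distributed-learning setup in the linked LUMI docs derives each process's rank from environment variables that Slurm (or torchrun) provides. As an illustrative sketch only (this is not the PR's code; the variable names are the standard Slurm/torchrun ones), rank resolution for such a launch can look like:

```python
import os

def get_rank_info(env=None):
    """Derive (global rank, world size, local rank) from the environment.

    Prefers the torchrun-style variables and falls back to the Slurm ones,
    as in typical multi-GPU Slurm launch scripts. Illustrative sketch only.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", env.get("SLURM_PROCID", "0")))
    world_size = int(env.get("WORLD_SIZE", env.get("SLURM_NTASKS", "1")))
    local_rank = int(env.get("LOCAL_RANK", env.get("SLURM_LOCALID", "0")))
    return rank, world_size, local_rank

# Example: a Slurm job step with 8 tasks, this process being task 3
# (local task 1 on its node):
print(get_rank_info({"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}))
# → (3, 8, 1)
```

The tuple returned here would then feed `torch.distributed.init_process_group` and the device selection for each rank.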

Joosep Pata added 2 commits March 22, 2024 20:57
@paolodalberto

paolodalberto commented Mar 25, 2024

Do they only need to update to rocm 6.0.5?
Instead of waiting for a new Docker image, you could do what others have done: use an up-to-date ROCm repository rather than wait for a proper Docker image.

@jpata
Owner Author

jpata commented Apr 26, 2024

PyTorch 2.3.0 has been released, and it appears to include improved built-in FlashAttention support on ROCm:
https://github.com/pytorch/pytorch/releases/tag/v2.3.0
[Screenshot of the PyTorch 2.3.0 release notes]

Need to wait for the new rocm/pytorch tag:
https://hub.docker.com/r/rocm/pytorch/tags

Also I haven't seen any update to the LUMI driver yet:
https://lumi-supercomputer.github.io/LUMI-training-materials/User-Updates/
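Since PyTorch 2.0, the built-in attention entry point is `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused flash kernel where the backend supports it (per the release notes above, on ROCm from 2.3.0). A minimal sketch, assuming `torch` is installed; on CPU this falls back to the math backend, so it runs anywhere:

```python
import torch
import torch.nn.functional as F

# Tiny self-attention call: batch=1, heads=8, sequence=128, head_dim=64.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# On CPU this dispatches to the math backend; on a GPU build with
# flash support (ROCm from torch 2.3.0), the fused flash-attention
# kernel can be selected for supported shapes and dtypes.
out = F.scaled_dot_product_attention(q, k, v)
print(tuple(out.shape))  # → (1, 8, 128, 64)
```

Using this built-in entry point would remove the need for the custom flash attention package mentioned above.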

@jpata
Owner Author

jpata commented Apr 29, 2024

This is the current information from the LUMI team.

"Thank you for sending this improvement request. We are very aware of issues associated with ROCm environment not being updated frequently. The next system upgrade that would happen in summer will include GPU driver and ROCm update. Details will be announced whenever precise schedule is decided."

@jpata
Owner Author

jpata commented Jun 12, 2024

Things are moving at LUMI but at a glacial pace:

"We will be taking the system offline for maintenance starting on Monday, 19 August, 2024. LUMI won't be accessible as this will affect all the partitions.
Significant parts of the system software will be updated, and more particularly the system software stack, in order to get a more stable and up-to-date system after the break. We expect the system to be back in production on Monday, 9 September, 2024."

@jpata jpata self-assigned this Jun 12, 2024
@jpata jpata marked this pull request as draft June 12, 2024 11:40
@paolodalberto

I could resolve the memory issue, but not the multi-GPU one.

@jpata
Owner Author

jpata commented Sep 16, 2024

Fixed now in #344 with the latest LUMI upgrade.

@jpata jpata closed this Sep 16, 2024