
[train] Add TorchAwsNeuronXLABackend and XLAConfig #39130

Merged
33 commits merged into ray-project:master on Apr 3, 2024

Conversation

chappidim
Contributor

Why are these changes needed?

This change adds a new TorchXLAConfig and a Neuron backend that performs the XLA setup. The backend initializes each worker with the appropriate environment variables and the distributed process group.
Please note that torch-neuronx (provided by AWS) doesn't support PJRT and currently uses the XRT server.
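At a glance, the user-facing flow looks like the minimal sketch below. This is a condensed, hedged version of the manual test further down; the TorchXLAConfig import path and the neuron_cores resource name are taken from that test, and the training body is only a stub.

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch.config import TorchXLAConfig


def train_func():
    # Regular torch_xla training code; by the time this runs, the backend has
    # already set the Neuron/XRT environment variables and initialized the
    # distributed group on every worker.
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
trainer.fit()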

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Manual testing

Manual testing

  • Single trn1.2xl machine with 2 neuron_cores
  • Train function: all_reduce (script below)
import os

import torch

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch.config import TorchXLAConfig
ray.init()


class NN(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.layer1 = torch.nn.Linear(4, 4)
        self.nl1 = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(4, 2)
        self.nl2 = torch.nn.Tanh()

    def forward(self, x):
        x = self.nl1(self.layer1(x))
        return self.nl2(self.layer2(x))


def log(txt):
    rank = os.environ.get("RANK", "unk")
    print(f"{rank}: {txt}", flush=True)


def train_func():
    import torch_xla.core.xla_model as xm

    log("before 1st rendezvous")
    xm.rendezvous('first')
    device = xm.xla_device()
    for c in range(1000):
        ones = torch.ones((2, 3))
        xones = ones.to(device)
        # Sum across all workers: each worker contributes ones, so every
        # element of the result should equal WORLD_SIZE.
        result = xm.all_reduce('sum', xones)
        xm.mark_step()
        result_cpu = result.cpu()
        expected = torch.ones((2, 3)) * int(os.environ.get("WORLD_SIZE", 0))
        log(f"result: {c}: {result}  result.size(): {result.size()}")
        assert torch.all(result_cpu == expected), f'ERROR: {result_cpu} != {expected}'
    log("before final rendezvous")
    xm.rendezvous('last')
    log("done!")


train_dataset = ray.data.from_items([1, 2, 3])
assert train_dataset.count() == 3
# One Ray Train worker per NeuronCore; TorchXLAConfig selects the new XLA backend.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
    datasets={"train": train_dataset},
)
result = trainer.fit()
print(result)

  • Train results

 Memory Usage Summary
   Host Used Memory:    Total 426.0MB   Tensors 0.0B     Constants 0.0B      DMA Buffers 128.0KB   App. Memory 425.9MB
   Device Used Memory:  Total 1.5GB     Tensors 120.0B   Constants 334.0MB   Model Code 192.3MB    Runtime Memory 2.1KB   Model Scratchpad 1.0GB

 Memory Usage Details
   Model ID        Device Memory   Host Memory
   [-] ND 0        1.5GB           40.0KB
       [+] NC 0    775.2MB         20.0KB
       [+] NC 1    775.2MB         20.0KB


python/ray/train/torch/config.py — 4 resolved review threads (outdated)
@cosnicolaou

Any progress on this?

@woshiyyya
Member

@cosnicolaou we are working on making multi-node training work. We will merge this PR after the release test passes.


stale bot commented Mar 17, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

The stale bot added and later removed the stale label (Mar 17, 2024).
@chappidim requested a review from a team as a code owner on March 19, 2024 02:06
Signed-off-by: woshiyyya <[email protected]>

# Compile the extracted graphs. This must run at end of training.
if backend_config.neuron_parallel_compile:
    worker_group.execute(_neuron_compile_extracted_graphs)
Member

Why are we compiling the graph at the end? How would people use the compiled graph?

Contributor

This code only runs during a pre-compilation step, which happens when neuron_parallel_compile is set to True. _neuron_compile_extracted_graphs() must run at the end of the job (after a small number of training iterations), once all the graphs have been encountered.

See: 1 and 2 and 3.

After precompilation, the user simply runs again without neuron_parallel_compile to use the cached graphs, which skips _neuron_compile_extracted_graphs() (see the sketch below).
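To make this two-phase flow concrete, here is a hedged sketch from the user's side. It assumes neuron_parallel_compile is exposed as a TorchXLAConfig field (as the quoted snippet suggests) and reuses train_func and ScalingConfig from the manual test above; it is an illustration, not the merged code.

# Run 1: graph extraction + pre-compilation. Keep the training loop short;
# at the end of this run the backend invokes _neuron_compile_extracted_graphs().
precompile_trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(neuron_parallel_compile=True),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
precompile_trainer.fit()

# Run 2: real training. With the flag off, the cached compiled graphs are
# reused and no end-of-job compilation step runs.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
result = trainer.fit()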

Contributor

@woshiyyya does the above make sense?

Member

Ok that makes sense to me. Can we add a docstring for neuron_parallel_compile in TorchConfig to explain the behavior?
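Such a docstring could look roughly like the sketch below. This is only an illustration: the dataclass layout, the TorchConfig base class, and the wording are assumptions rather than the merged code.

from dataclasses import dataclass

from ray.train.torch import TorchConfig


@dataclass
class TorchXLAConfig(TorchConfig):
    """Configuration for the AWS Neuron XLA backend.

    Args:
        neuron_parallel_compile: If True, the job runs in graph-extraction
            mode and the extracted graphs are compiled once at the end of
            the (short) run. Subsequent runs with the flag left at False
            reuse the cached compiled graphs, so no end-of-job compilation
            step is needed during real training.
    """

    neuron_parallel_compile: bool = False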

Member

@woshiyyya left a comment

The CI tests are failing. Can you fix them before we merge?

@woshiyyya
Member

Two Train premerge tests [1, 2] are failing.

[Screenshot of the failing premerge tests, 2024-03-25 12:22 PM]

@justinvyu changed the title from "Add TorchAwsNeuronXLABackend and XLAConfig" to "[train] Add TorchAwsNeuronXLABackend and XLAConfig" on Apr 3, 2024
@justinvyu merged commit 0115c3b into ray-project:master on Apr 3, 2024
5 checks passed
Development

Successfully merging this pull request may close these issues.

[Train] Add TorchXLA config and backend
10 participants