
[train] Add TorchAwsNeuronXLABackend and XLAConfig #39130

Merged
33 commits merged into ray-project:master on Apr 3, 2024

Conversation

chappidim
Contributor

Why are these changes needed?

This change adds a new TorchXLAConfig and a Neuron backend that performs the XLA setup. The backend initializes each worker with the appropriate environment variables and the distributed process group.
Please note that torch-neuronx (provided by AWS) doesn't support PJRT and currently uses the XRT server.
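At a glance, the user-facing flow looks like the minimal sketch below. This is a condensed, hedged version of the manual test further down; the TorchXLAConfig import path and the neuron_cores resource name are taken from that test, and the training body is only a stub.

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch.config import TorchXLAConfig


def train_func():
    # Regular torch_xla training code; by the time this runs, the backend has
    # already set the Neuron/XRT environment variables and initialized the
    # distributed group on every worker.
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
    ...


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
trainer.fit()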

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Manual testing

Manual testing

  • Single trn1.2xl machine with 2 neuron_cores
  • Train function: all_reduce (script below)
import os

import torch

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch.config import TorchXLAConfig
ray.init()


class NN(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.layer1 = torch.nn.Linear(4, 4)
        self.nl1 = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(4, 2)
        self.nl2 = torch.nn.Tanh()

    def forward(self, x):
        x = self.nl1(self.layer1(x))
        return self.nl2(self.layer2(x))


def log(txt):
    rank = os.environ.get("RANK", "unk")
    print(f"{rank}: {txt}", flush=True)


def train_func():
    import torch_xla.core.xla_model as xm

    log("before 1st rendezvous")
    xm.rendezvous('first')
    device = xm.xla_device()
    for c in range(1000):
        ones = torch.ones((2, 3))
        xones = ones.to(device)
        # Sum across all workers: each worker contributes ones, so every
        # element of the result should equal WORLD_SIZE.
        result = xm.all_reduce('sum', xones)
        xm.mark_step()
        result_cpu = result.cpu()
        expected = torch.ones((2, 3)) * int(os.environ.get("WORLD_SIZE", 0))
        log(f"result: {c}: {result}  result.size(): {result.size()}")
        assert torch.all(result_cpu == expected), f'ERROR: {result_cpu} != {expected}'
    log("before final rendezvous")
    xm.rendezvous('last')
    log("done!")


train_dataset = ray.data.from_items([1, 2, 3])
assert train_dataset.count() == 3
# One Ray Train worker per NeuronCore; TorchXLAConfig selects the new XLA backend.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
    datasets={"train": train_dataset},
)
result = trainer.fit()
print(result)

  • Train results

 Memory Usage Summary
   Host Used Memory:    Total 426.0MB   Tensors 0.0B     Constants 0.0B      DMA Buffers 128.0KB   App. Memory 425.9MB
   Device Used Memory:  Total 1.5GB     Tensors 120.0B   Constants 334.0MB   Model Code 192.3MB    Runtime Memory 2.1KB   Model Scratchpad 1.0GB

 Memory Usage Details
   Model ID        Device Memory   Host Memory
   [-] ND 0        1.5GB           40.0KB
       [+] NC 0    775.2MB         20.0KB
       [+] NC 1    775.2MB         20.0KB


python/ray/train/torch/config.py — 4 resolved review threads (outdated)
@cosnicolaou

Any progress on this?

@woshiyyya
Member

@cosnicolaou we are working on making multi-node training work. We will merge this PR after the release test passes.


stale bot commented Mar 17, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

The stale bot added and later removed the stale label (Mar 17, 2024).
@chappidim requested a review from a team as a code owner on March 19, 2024 02:06
Signed-off-by: woshiyyya <[email protected]>

# Compile the extracted graphs. This must run at end of training.
if backend_config.neuron_parallel_compile:
    worker_group.execute(_neuron_compile_extracted_graphs)
Member

Why are we compiling the graph at the end? How would people use the compiled graph?

Contributor

This code only runs during a pre-compilation step, which happens when neuron_parallel_compile is set to True. _neuron_compile_extracted_graphs() must run at the end of the job (after a small number of training iterations), once all the graphs have been encountered.

See: 1 and 2 and 3.

After precompilation, the user simply runs again without neuron_parallel_compile to use the cached graphs, which skips _neuron_compile_extracted_graphs() (see the sketch below).
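To make this two-phase flow concrete, here is a hedged sketch from the user's side. It assumes neuron_parallel_compile is exposed as a TorchXLAConfig field (as the quoted snippet suggests) and reuses train_func and ScalingConfig from the manual test above; it is an illustration, not the merged code.

# Run 1: graph extraction + pre-compilation. Keep the training loop short;
# at the end of this run the backend invokes _neuron_compile_extracted_graphs().
precompile_trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(neuron_parallel_compile=True),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
precompile_trainer.fit()

# Run 2: real training. With the flag off, the cached compiled graphs are
# reused and no end-of-job compilation step runs.
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    torch_config=TorchXLAConfig(),
    scaling_config=ScalingConfig(num_workers=2, resources_per_worker={"neuron_cores": 1}),
)
result = trainer.fit()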

Contributor

@woshiyyya does the above make sense?

Member

Ok that makes sense to me. Can we add a docstring for neuron_parallel_compile in TorchConfig to explain the behavior?
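Such a docstring could look roughly like the sketch below. This is only an illustration: the dataclass layout, the TorchConfig base class, and the wording are assumptions rather than the merged code.

from dataclasses import dataclass

from ray.train.torch import TorchConfig


@dataclass
class TorchXLAConfig(TorchConfig):
    """Configuration for the AWS Neuron XLA backend.

    Args:
        neuron_parallel_compile: If True, the job runs in graph-extraction
            mode and the extracted graphs are compiled once at the end of
            the (short) run. Subsequent runs with the flag left at False
            reuse the cached compiled graphs, so no end-of-job compilation
            step is needed during real training.
    """

    neuron_parallel_compile: bool = False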

Member

@woshiyyya left a comment

The CI tests are failing. Can you fix them before we merge?

@woshiyyya
Member

Two Train premerge tests [1, 2] are failing.

[Screenshot of the failing premerge tests, 2024-03-25 12:22 PM]

@justinvyu changed the title from "Add TorchAwsNeuronXLABackend and XLAConfig" to "[train] Add TorchAwsNeuronXLABackend and XLAConfig" on Apr 3, 2024
@justinvyu merged commit 0115c3b into ray-project:master on Apr 3, 2024
5 checks passed
Development

Successfully merging this pull request may close these issues.

[Train] Add TorchXLA config and backend
10 participants