[train] Add TorchAwsNeuronXLABackend and XLAConfig #39130
Conversation
Force-pushed from 3ff1bdc to 6d83e06 (compare)
Signed-off-by: woshiyyya <[email protected]>
Any progress on this?
@cosnicolaou we are working on making multi-node training work. We will merge this PR after the release test passes.
Update _TorchAwsNeuronXLABackend to enable neuron_parallel_compile
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Signed-off-by: Yunxuan Xiao <[email protected]>
Signed-off-by: woshiyyya <[email protected]>
# Compile the extracted graphs. This must run at the end of training.
if backend_config.neuron_parallel_compile:
    worker_group.execute(_neuron_compile_extracted_graphs)
Why are we compiling the graphs at the end? How would people use the compiled graphs?
This code only runs during a pre-compilation step, which happens when neuron_parallel_compile is set to true. _neuron_compile_extracted_graphs() must run at the end of the job (after a short number of training iterations), once all the graphs have been encountered. After pre-compilation, the user simply runs without neuron_parallel_compile to use the cached graphs, which avoids _neuron_compile_extracted_graphs() entirely.
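For illustration, a minimal sketch of the two-phase workflow this enables. The import path and trainer wiring here are assumptions for illustration; only the neuron_parallel_compile flag itself comes from this PR.

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.torch.xla import TorchXLAConfig  # assumed import path

def train_loop_per_worker(config):
    # Run a short number of training iterations so that every distinct
    # XLA graph gets traced and extracted.
    ...

# Phase 1: pre-compilation run. Graphs are extracted during training and
# compiled at the end of the job via _neuron_compile_extracted_graphs().
TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchXLAConfig(neuron_parallel_compile=True),
    scaling_config=ScalingConfig(num_workers=2),
).fit()

# Phase 2: the actual training run. Without the flag, the workers reuse
# the cached compiled graphs and the compile step is skipped.
TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchXLAConfig(neuron_parallel_compile=False),
    scaling_config=ScalingConfig(num_workers=2),
).fit()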
@woshiyyya does the above make sense?
Ok, that makes sense to me. Can we add a docstring for neuron_parallel_compile in TorchConfig to explain the behavior?
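One possible docstring, sketched from the discussion above; the dataclass shape, class name, and default value are assumptions, not the merged code.

from dataclasses import dataclass

@dataclass
class TorchXLAConfig:
    """Configuration for torch XLA setup on AWS Neuron.

    Attributes:
        neuron_parallel_compile: If True, the job runs as a pre-compilation
            step: a short number of training iterations extracts the XLA
            graphs, and all extracted graphs are compiled at the end of the
            job and stored in the Neuron cache. A subsequent run with this
            flag unset then reuses the cached compiled graphs and skips
            compilation at runtime.
    """

    neuron_parallel_compile: bool = False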
The CI tests are failing. Can you fix them before we merge?
Signed-off-by: woshiyyya <[email protected]>
Why are these changes needed?
This change adds a new TorchXLA config and a NeuronBackend with XLA setup. We initialize the setup with the appropriate environment variables and the distributed process group. Please note that torch-neuronx (provided by AWS) doesn't support PJRT and currently uses the XRT server.
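An illustrative sketch of the kind of per-worker setup such a backend performs: rendezvous environment variables plus an XLA process group. The function name and the exact variables are assumptions for illustration, not the PR's code.

import os

import torch.distributed as dist
import torch_xla.distributed.xla_backend  # noqa: F401  # registers the "xla" backend

def _setup_xla_worker(master_addr: str, master_port: int, rank: int, world_size: int) -> None:
    # Rendezvous variables consumed by the process group initialization.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    # torch-neuronx talks to the XRT server rather than PJRT, so additional
    # XRT/Neuron runtime variables would be configured here as well.
    dist.init_process_group("xla", rank=rank, world_size=world_size)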
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I have added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- Testing strategy: manual testing.