
2.6 backport PR request list #8455

Open
tengyifei opened this issue Dec 5, 2024 · 27 comments

Comments

@tengyifei
Collaborator

tengyifei commented Dec 5, 2024

This is a tracker for backports/cherry-picks into 2.6. For any PRs you want to backport to 2.6, please reply with the following:

  • Original PR link (this PR should merge into master)
  • Reason to backport
  • 2.6 backport PR link (a separate PR should be created, and that PR should merge into r2.6)

This process is similar to the backport request thread for the 2.5 release: #7977

Please note the criteria for cherry-picking into 2.6: #7203

@tengyifei
Collaborator Author

tengyifei commented Dec 11, 2024

@lsy323 libtpu pin update to 0.0.6: #8480.
Reason: pick stable libtpu for Trillium
Backport PR link: manually pushed to r2.6 since there's no delta between these branches

@yaochengji
Collaborator

Original PR: Fix a DDP graph capture issue #8489
Reason: The DDP result before this patch is wrong
Backport PR link: #8500

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: Compute and hash buffer_donor_indices for step marker #8467
Reason: Fixes tensor corruption issue
Backport PR: #8503

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: xm.save() should not set sync_xla_data=True when sync'ing. #8484
Reason: Fixes tensor corruption issues, easily reproducible by running huggingface tutorials.
Backport PR: #8504

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: Add xm.xla_device_kind() to return XLA device kind string. #8493
Reason: Requested by Neuron customers. Feature available in JAX but not PyTorch/XLA. Should be very low risk.
Backport PR: #8506

@jeffhataws
Collaborator

Original PR: When modifying IR node, make sure to not lose the read_only bit #8505
Reason: Fixes a bug where 0-dimensional tensors result in aliasing errors
Backport PR: TBD

@mcuiaws
Contributor

mcuiaws commented Dec 20, 2024

> Original PR: When modifying IR node, make sure to not lose the read_only bit #8505
> Reason: Fixed a bug where 0-dimensional tensors result in aliasing errors
> Backport PR: TBD

Backport PR: #8508

@avizon-aws
Collaborator

avizon-aws commented Dec 20, 2024

Cherry-pick PR for softmax autocast:
Reason: Fixes precision issues caused by softmax being computed in BF16; softmax was not part of the autocast policy, which led to convergence issues.
Original PR: #8509
Cherry-pick PR: #8511

@savitha-aws
Contributor

Original PR: Add xla autocast support, update autocast APIs in checkpointing #8523
Reason: Adds missing XLA autocast support in gradient checkpointing and updates deprecated APIs for CUDA and CPU autocast.
Backport PR: #8527

@rpsilva-aws
Collaborator

Original PR: Introduce deterministic hash for user computations #8539
Original issue: #8537
Reason: Fixes a day-one bug with the user computation hash
Backport PR: #8554

@rpsilva-aws
Collaborator

Original PR: Metadata agnostic hash for user computations #8550
Original feature: #8538
Reason: Makes the user computation cache agnostic to OpMetadata in the HLO module proto, which does not influence the execution semantics of the computation.
Backport PR: TBD (dependency on #8554)

@rpsilva-aws
Collaborator

> Original PR: Metadata agnostic hash for user computations #8550
> Original feature: #8538
> Reason: User computation is cache agnostic to OpMetadata in the HLO module proto, not influencing execution semantics of the computation.
> Backport PR: TBD (dependency on #8554)

Backport PR: #8557

@tengyifei
Collaborator Author

Original PR: #8521
Reason: prerequisite to fix DDP tests in CI
Backport PR link: #8563

@rpsilva-aws
Collaborator

Original PR: #8551
Reason: Minimal binding to help debug HLO proto binaries
Backport PR: #8564

@tengyifei
Collaborator Author

Original PR: #8558
Reason: fix DDP tests in CI
Backport PR link: #8567

@tengyifei
Collaborator Author

Original PR: #8524
Reason: low-risk perf optimization for the experimental scan feature
Backport PR link: #8572

@tengyifei
Collaborator Author

Original PR: #8576
Reason: fix broken r2-6-0-rc5-libtpu-3-10-tpuvm build (http://shortn/_U3q858o5GH)
Backport PR link: #8580

@tengyifei
Collaborator Author

Original PR: #8529
Reason: fix scan crash when an output aliases an input
Backport PR link: #8581

@rpsilva-aws
Collaborator

rpsilva-aws commented Jan 15, 2025

Original PR: #8571
Reason: Allow customers to implicitly downcast to INT32, and make Neuron an opt-in.
Backport PR link: Pending approvals

@rpsilva-aws
Collaborator

> Original PR: #8571
> Reason: Allow customers to implicitly downcast to INT32, and make Neuron an opt-in.
> Backport PR link: Pending approvals

Backport PR: #8589

@rpsilva-aws
Collaborator

Original PR: #8561
Reason: Safe test-only improvements with added coverage for checkpointing and A/B testing.
Backport PR: #8590

@tengyifei
Collaborator Author

Original PR: #8530
Reason: fix scan crash when output dtype is not float32
Backport PR link: #8604

@tengyifei
Collaborator Author

Original PR: #8555
Reason: without this, scan is unusable under debugging env vars
Backport PR link: #8605

@rpsilva-aws
Collaborator

rpsilva-aws commented Jan 22, 2025

Original PR: #8584
Reason: Required for Neuron - safe and well-scoped interface for gradient accumulation using XLA's while loop.
Backport PR link: #8608

@tengyifei
Collaborator Author

Original PR: #8562
Reason: low-risk perf optimization for the experimental scan feature
Backport PR link: #8611

@zpcore
Collaborator

zpcore commented Jan 23, 2025

Original PR: #8609
Reason: Needed for multi-pod runs with a temporary libtpu hack.
Backport PR link: #8618

@jeffhataws
Collaborator

Original PR: #8622
Reason: In earlier torch-xla versions, nprocs != 1 or None was ignored. In 2.6 this becomes a ValueError, but the error message was not clear. This change improves the messaging to be clearer and more actionable. It would be good to cherry-pick, since customers will start to see errors when nprocs is not 1 or None (whereas before 2.6 there was only a warning).
Backport PR link: #8623
