
2.6 backport PR request list #8455

Open
tengyifei opened this issue Dec 5, 2024 · 27 comments

Comments

@tengyifei
Collaborator

tengyifei commented Dec 5, 2024

This is a tracker for backports/cherry-picks into 2.6. For any PRs you want to backport to 2.6, please reply with the following:

  • Original PR link (this PR should merge into master)
  • Reason to backport
  • 2.6 backport PR link (a separate PR should be created, and that PR should merge into r2.6)

This process is similar to the backport request thread for the 2.5 release: #7977

Please note the criteria for cherry-picking into 2.6: #7203

@tengyifei
Collaborator Author

tengyifei commented Dec 11, 2024

@lsy323 libtpu pin update to 0.0.6: #8480.
Reason: pick stable libtpu for Trillium
Backport PR link: manually pushed to r2.6 since there's no delta between these branches

@yaochengji
Collaborator

Original PR: Fix a DDP graph capture issue #8489
Reason: The DDP result before this patch is wrong
Backport PR link: #8500

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: Compute and hash buffer_donor_indices for step marker #8467
Reason: Fixes tensor corruption issue
Backport PR: #8503

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: xm.save() should not set sync_xla_data=True when sync'ing. #8484
Reason: Fixes tensor corruption issues, easily reproducible by running huggingface tutorials.
Backport PR: #8504

@mcuiaws
Contributor

mcuiaws commented Dec 18, 2024

Original PR: Add xm.xla_device_kind() to return XLA device kind string. #8493
Reason: Requested by Neuron customers. Feature available in JAX but not PyTorch/XLA. Should be very low risk.
Backport PR: #8506

@jeffhataws
Collaborator

Original PR: When modifying IR node, make sure to not lose the read_only bit #8505
Reason: Fixes a bug where 0-dimensional tensors result in aliasing errors
Backport PR: TBD

@mcuiaws
Contributor

mcuiaws commented Dec 20, 2024

> Original PR: When modifying IR node, make sure to not lose the read_only bit #8505
> Reason: Fixed a bug where 0-dimensional tensors result in aliasing errors
> Backport PR: TBD

Backport PR: #8508

@avizon-aws
Collaborator

avizon-aws commented Dec 20, 2024

Cherry-pick PR for softmax autocast:
Reason: Fixes precision issues caused by softmax being computed in BF16; softmax was not part of the autocast policy, which led to convergence issues.
Original PR: #8509
Cherry-pick PR: #8511

@savitha-aws
Contributor

Original PR: Add xla autocast support, update autocast APIs in checkpointing #8523
Reason: Adds missing XLA autocast support in gradient checkpointing and updates deprecated APIs for CUDA and CPU autocast.
Backport PR: #8527

@rpsilva-aws
Collaborator

Original PR: Introduce deterministic hash for user computations #8539
Original issue: #8537
Reason: Fixes a day-one bug with the user computation hash
Backport PR: #8554

@rpsilva-aws
Collaborator

Original PR: Metadata agnostic hash for user computations #8550
Original feature: #8538
Reason: Makes the user computation cache agnostic to OpMetadata in the HLO module proto, which does not influence the execution semantics of the computation.
Backport PR: TBD (dependency on #8554)

@rpsilva-aws
Collaborator

> Original PR: Metadata agnostic hash for user computations #8550
> Original feature: #8538
> Reason: User computation is cache agnostic to OpMetadata in the HLO module proto, not influencing execution semantics of the computation.
> Backport PR: TBD (dependency on #8554)

Backport PR: #8557

@tengyifei
Collaborator Author

Original PR: #8521
Reason: prerequisite to fix DDP tests in CI
Backport PR link: #8563

@rpsilva-aws
Collaborator

Original PR: #8551
Reason: Minimal binding to help debug HLO proto binaries
Backport PR: #8564

@tengyifei
Collaborator Author

Original PR: #8558
Reason: fix DDP tests in CI
Backport PR link: #8567

@tengyifei
Collaborator Author

Original PR: #8524
Reason: low-risk perf optimization for the experimental scan feature
Backport PR link: #8572

@tengyifei
Collaborator Author

Original PR: #8576
Reason: fix broken r2-6-0-rc5-libtpu-3-10-tpuvm build (http://shortn/_U3q858o5GH)
Backport PR link: #8580

@tengyifei
Collaborator Author

Original PR: #8529
Reason: fix scan crash when an output aliases an input
Backport PR link: #8581

@rpsilva-aws
Collaborator

rpsilva-aws commented Jan 15, 2025

Original PR: #8571
Reason: Allow customers to implicitly downcast to INT32, and make Neuron an opt-in.
Backport PR link: Pending approvals

@rpsilva-aws
Collaborator

> Original PR: #8571
> Reason: Allow customers to implicitly downcast to INT32, and make Neuron an opt-in.
> Backport PR link: Pending approvals

Backport PR: #8589

@rpsilva-aws
Collaborator

Original PR: #8561
Reason: Safe test-only improvements with added coverage for checkpointing and A/B testing.
Backport PR: #8590

@tengyifei
Collaborator Author

Original PR: #8530
Reason: fix scan crash when output dtype is not float32
Backport PR link: #8604

@tengyifei
Collaborator Author

Original PR: #8555
Reason: without this, scan is unusable under debugging env vars
Backport PR link: #8605

@rpsilva-aws
Collaborator

rpsilva-aws commented Jan 22, 2025

Original PR: #8584
Reason: Required for Neuron - safe and well-scoped interface for gradient accumulation using XLA's while loop.
Backport PR link: #8608

@tengyifei
Collaborator Author

Original PR: #8562
Reason: low-risk perf optimization for the experimental scan feature
Backport PR link: #8611

@zpcore
Collaborator

zpcore commented Jan 23, 2025

Original PR: #8609
Reason: Needed for multi-pod runs with a temporary libtpu hack.
Backport PR link: #8618

@jeffhataws
Collaborator

Original PR: #8622
Reason: In earlier torch-xla versions, nprocs != 1 or None was ignored. In 2.6 this becomes a ValueError, but the error message was not clear. This change improves the messaging to be clearer and more actionable. It would be good to cherry-pick, since customers will start to see errors when nprocs is not 1 or None (whereas before 2.6 there was only a warning).
Backport PR link: #8623
