[BE] LR scheduler flatten #794
Conversation
@@ -183,9 +183,9 @@ def __init__(
        "model": ModelWrapper(model_parts),
        "optimizer": optimizers,
        "dataloader": dataloader,
        "lr_scheduler": lr_schedulers,
I think it won't be this simple. Both `OptimizersContainer` and `ModelWrapper` define `state_dict` and `load_state_dict` to handle flattening and unflattening. Since we don't have things like `get_model_state_dict` and `set_model_state_dict` for the lr scheduler in `torch.distributed.checkpoint.state_dict`, we will likely need to manually write something for the LambdaLR we are using. See #738 (comment).
Let's work with @fegin on this.
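For reference, here is a minimal sketch of what such a manually written, flattened container could look like. The class and attribute names are assumptions for illustration, not necessarily the exact torchtitan API:

```python
from typing import Any, Dict, List

from torch.optim.lr_scheduler import LambdaLR


class SchedulersContainer:
    """Wraps one LambdaLR per optimizer and exposes a flattened Stateful-style API."""

    def __init__(self, schedulers: List[LambdaLR]) -> None:
        self.schedulers = schedulers

    def step(self) -> None:
        # Advance every scheduler by one step.
        for scheduler in self.schedulers:
            scheduler.step()

    def state_dict(self) -> Dict[str, Any]:
        # All schedulers follow the same schedule, so saving one is enough.
        return self.schedulers[0].state_dict()

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Restore every scheduler from the single saved state_dict.
        for scheduler in self.schedulers:
            scheduler.load_state_dict(state_dict)
```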
Compared lr_schedulers before and after flattening, with and without checkpointing. The lr_scheduler values are consistent with the changes here.
Does it support DCP resharding? E.g., PP degree from 2 to 4 across two jobs.
I think this PR doesn't address the resharding issue, hence the [BE] prefix. Supporting lr resharding deserves a separate PR.
torchtitan/optimizer.py (outdated)
def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
    # Load the same state_dict for all schedulers
    for scheduler in self.schedulers:
        scheduler.load_state_dict(state_dict)
We may need to explicitly copy `state_dict` before loading; otherwise there could be silent errors. See details of the behavior here: https://github.com/pytorch/pytorch/blob/v2.6.0/torch/optim/lr_scheduler.py#L359
Please add a more detailed comment/NOTE here in the code.
Please add verified experiment results in the PR summary.
We should consider adding a unit test under the test folder to guard this behavior, but feel free to do that in a later PR.
I checked the LambdaLR scheduler code. It seems the only thing that matters in the state is the current step, which is an int, so `load_state_dict` will automatically make copies. See `last_epoch` in https://github.com/pytorch/pytorch/blob/v2.6.0/torch/optim/lr_scheduler.py#L122
Therefore, the behavior should be correct as long as we don't modify `training.steps` and `training.warmup_steps` when resuming from a checkpoint.
But for safety, let's still explicitly call `.copy()` on the `state_dict`, as the overhead is small.
Let's document this in the code here.
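Continuing the hypothetical `SchedulersContainer` sketch from earlier, the documented copy-on-load version could look roughly like this (a sketch of the idea, not necessarily the exact code merged in this PR):

```python
    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Load the same state_dict into every scheduler.
        # NOTE: the only LambdaLR state that matters here is last_epoch, an int
        # that is copied by value, so sharing one dict works as long as
        # training.steps and training.warmup_steps are unchanged on resume.
        # Still, LRScheduler.load_state_dict updates the scheduler's __dict__
        # with the entries of the dict it receives, so hand each scheduler its
        # own shallow copy for safety; the overhead is negligible.
        for scheduler in self.schedulers:
            scheduler.load_state_dict(state_dict.copy())
```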
Thanks for the comments. Added more detailed comments here with our discussion results.
Updated the summary with multi-optimizer results, showing the same lr values after checkpointing and resharding.
Looks awesome. Thanks for the effort!
Currently, lr_scheduler is stored differently from optimizer, model, and data_loader, with per-scheduler keys "lr_scheduler_0", "lr_scheduler_1", ... in the state.
This PR flattens lr_scheduler so that all the schedulers are stored under a single self.state['lr_scheduler'] entry, which is consistent with the optimizer.
We assume all the optimizers use the same lr schedule, so it is enough to save a single lr_scheduler's state_dict and load it into all the schedulers.
The lr_scheduler state_dict looks like:
{'base_lrs': [0.0003], 'last_epoch': 1, 'verbose': False, '_step_count': 2, '_get_lr_called_within_step': False, '_last_lr': [2.985074626865671e-06], 'lr_lambdas': [{}]}
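For context, here is a small, runnable sketch that produces a state_dict of this shape. The warmup-then-decay lambda and the hyperparameters below are illustrative stand-ins, not torchtitan's exact schedule:

```python
import functools

import torch
from torch.optim.lr_scheduler import LambdaLR


def warmup_then_decay(step: int, warmup_steps: int, decay_steps: int) -> float:
    # Linear warmup followed by linear decay (illustrative schedule only).
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, 1.0 - (step - warmup_steps) / decay_steps)


model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
lr_lambda = functools.partial(warmup_then_decay, warmup_steps=200, decay_steps=800)
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

optimizer.step()
scheduler.step()
print(scheduler.state_dict())
# Prints something like:
# {'base_lrs': [0.0003], 'last_epoch': 1, ..., '_last_lr': [...], 'lr_lambdas': [{}]}
```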
The PR is tested in 2 parts:
[dp=2, tp=4, pp=1] -> [dp=2, tp=1, pp=4]
[dp=2, tp=1, pp=4] -> [dp=2, tp=4, pp=1]
(data_loader does not support resharding right now.)
logs:
[dp=2, tp=4, pp=1]
step 5 before saving to checkpoint:
[{'lr': 8.955223880597014e-06, ...}]
step 10 after loading from checkpoint and reshard to [dp=2, tp=2, pp=2]:
[{'lr': 1.6417910447761194e-05, ...}, {'lr': 1.6417910447761194e-05, ...}]
[dp=8, tp=1, pp=1]
step 5 without checkpoint:
[{'lr': 8.955223880597014e-06, ...}]
step 10 without checkpoint:
[{'lr': 1.6417910447761194e-05, ...}]
Before the flatten, rerun llama3_8b.toml from step 5 to step 10:
After the flatten, rerun llama3_8b.toml from step 5 to step 10:
(Attached screenshot: "Screenshot 2025-01-16 at 2 40 21 PM")