add cautious option to RAdamScheduleFree #54
base: main
Conversation
Thanks for this pull request. I will look into merging it in the New Year.
    # These operations update y in-place,
    # without computing x explicitly.
    torch._foreach_lerp_(y, z, weight=ckp1)
    torch._foreach_sub_(y, grad, alpha=adaptive_y_lr)
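For reference, the two _foreach_ calls above are elementwise equivalent to the following single-tensor operations (an illustrative paraphrase with dummy tensors, not code from the repository; the names y, z, grad, ckp1 and adaptive_y_lr follow the snippet):

    import torch

    # Dummy stand-ins for one parameter, its z iterate, and its gradient.
    y, z, grad = torch.randn(3), torch.randn(3), torch.randn(3)
    ckp1, adaptive_y_lr = 0.1, 1e-3

    y.lerp_(z, ckp1)                   # y <- y + ckp1 * (z - y)
    y.sub_(grad, alpha=adaptive_y_lr)  # y <- y - adaptive_y_lr * grad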
Hey @nhamanasu, I might be missing something, but is the subtraction correct here (it also appears in the non-foreach and closure versions)? I'm wondering if this is an error that might have been introduced unintentionally.
Thank you for the comment!
You're exactly right. In my test branch, I reversed the sign of adaptive_y_lr and used sub functions, but I somehow forgot to reflect those changes in this c-radam branch. This may have led to completely opposite results. Thank you for catching this critical issue!
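(For clarity, the two formulations being discussed are interchangeable in PyTorch; a trivial illustrative check, where a stands in for the adaptive_y_lr coefficient:)

    import torch

    y, grad, a = torch.zeros(3), torch.ones(3), 0.01

    # Subtracting with coefficient a is the same as adding with coefficient -a,
    # so flipping the sign of the coefficient and switching add/sub cancel out.
    assert torch.allclose(y.clone().sub_(grad, alpha=a),
                          y.clone().add_(grad, alpha=-a))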
...Sorry, after considering the combination with the cautious update, I'm not sure which sign is correct for this part. Let me rethink this block!
In the end, I concluded your concern was right. Thank you again for the valuable comments!
Based on @LoganBooker's comment, I've fixed the gradient update part. With the default
As the difference is negligible, I set the default
Grams: Gradient Descent with Adaptive Momentum Scaling
Additionally, if we define u = (y - z).mul_(ckp1).add_(grad, alpha=adaptive_y_lr) ...
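(One way to read that suggestion, for reference: with u defined this way, the fused lerp-then-subtract update above collapses to a single subtraction y <- y - u. A small illustrative check under that interpretation, not code from the branch:)

    import torch

    y, z, grad = torch.randn(4), torch.randn(4), torch.randn(4)
    ckp1, adaptive_y_lr = 0.1, 1e-3

    # Fused update as written in the diff: lerp toward z, then subtract grad.
    y_fused = y.clone()
    y_fused.lerp_(z, ckp1)
    y_fused.sub_(grad, alpha=adaptive_y_lr)

    # Same update expressed through u: y <- y - u.
    u = (y - z).mul_(ckp1).add_(grad, alpha=adaptive_y_lr)
    assert torch.allclose(y_fused, y - u)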
@gesen2egee
Anyway, I think we should split these into separate PRs if we really want to implement these ideas (e.g., for AdEMAMix, we already have a related issue: #46).
By the way, I apologize for raising this after opening the PR myself, but I've recently started to wonder whether adding new experimental features to this repository is the best approach. On the one hand, we could continue expanding the schedule-free library with experimental features (including this cautious option) and leave the choice of using them up to the users. On the other hand, we could keep this repository limited to well-established optimizers like the three we've already implemented: those with theoretical guarantees and proven practical utility.
What does this PR do?
I added a cautious option to RAdamScheduleFree.
The cautious optimizer is proposed in https://arxiv.org/abs/2411.16085 and implemented in https://github.com/kyleliang919/C-Optim .
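For context, the core idea of the cautious optimizer is a per-coordinate mask that zeroes the parts of an update whose sign disagrees with the current gradient. A minimal sketch of that masking step, paraphrasing the idea from the paper/repository above (the reference implementation additionally rescales the mask to preserve the overall update magnitude):

    import torch

    def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
        """Zero out the coordinates of `update` whose sign disagrees with `grad`."""
        mask = (update * grad > 0).to(update.dtype)
        return update * mask

    # Only coordinates where update and grad share a sign survive.
    update = torch.tensor([0.5, -0.2, 0.1])
    grad = torch.tensor([1.0, 1.0, -1.0])
    print(cautious_mask(update, grad))  # tensor([0.5000, 0.0000, 0.0000])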
More details
As I wrote in the docstring, the combination of cautious and schedulefree is non-trivial.
In the cautious optimizer, aligning the momentum update with each gradient direction leads to faster convergence. But in schedulefree, the gradient update term in z doesn't contain momentum, which means the cautious mask would be meaningless there. So I chose to apply the cautious mask to the y update (after contracting x implicitly) instead, though I admit it's a bit tricky. But in some sense, I believe the training parameter y should be aligned in a cautious spirit.
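To make the idea concrete, here is a rough sketch of masking the y update in this spirit (an illustration of the idea only, with made-up tensors and names following the snippet discussed above; it is not the PR's final implementation, which may also rescale the mask):

    import torch

    y, z, grad = torch.randn(4), torch.randn(4), torch.randn(4)
    ckp1, adaptive_y_lr = 0.1, 1e-3

    # Effective step implied by the fused update: y <- y - delta.
    delta = ckp1 * (y - z) + adaptive_y_lr * grad

    # Cautious variant: apply only the components of delta whose sign
    # agrees with the current gradient.
    mask = (delta * grad > 0).to(grad.dtype)
    y.sub_(delta * mask)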
Experimental Results
The toy experiments show fast and promising convergence.
Below is the single-run result of example/mnist/main.py, which I think is superior to both the "Default PyTorch Implementation" and the AdamWScheduleFree results.