
add cautious option to RAdamScheduleFree #54

Open · wants to merge 6 commits into base: main

Conversation

@nhamanasu (Contributor) commented Dec 19, 2024

What does this PR do?

I added a cautious option to RAdamScheduleFree.

The cautious optimizer is proposed in https://arxiv.org/abs/2411.16085 and implemented in https://github.com/kyleliang919/C-Optim.

More details

As I wrote in the docstring, combining the cautious update with Schedule-Free is non-trivial.
The cautious optimizer aligns the momentum update with the current gradient direction, which leads to faster convergence.

In Schedule-Free, however, the gradient update term in z doesn't contain momentum, so applying the cautious mask there would be meaningless.
I therefore chose to apply the cautious mask to the y update (after contracting x implicitly) instead, though I admit it's a bit tricky.
Still, in some sense I believe the training parameter y should be updated in a cautious spirit.
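
A rough, non-authoritative sketch of the idea (not the exact code in this PR), assuming a C-Optim-style mask that zeroes out the coordinates of the combined y step that point uphill and rescales the rest; the function name is hypothetical, and adaptive_y_lr follows the convention where the plain gradient term is y.add_(grad, alpha=adaptive_y_lr):

import torch

def cautious_y_step(y: torch.Tensor, z: torch.Tensor, grad: torch.Tensor,
                    ckp1: float, adaptive_y_lr: float) -> None:
    # Sketch only: form the combined y step (interpolation toward z plus the
    # gradient term), then keep only the coordinates where that step moves
    # against the gradient (downhill), rescaled in C-Optim style so the kept
    # coordinates roughly preserve the overall step size.
    step = (z - y).mul(ckp1).add(grad, alpha=adaptive_y_lr)  # proposed y update
    mask = (step * grad < 0).to(grad.dtype)                  # keep downhill coordinates
    mask.div_(mask.mean().clamp_(min=1e-3))                  # C-Optim-style rescaling
    y.add_(step * mask)                                      # cautious in-place y update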

Experimental Results

The toy experiments show faster, promising convergence.
Below are single-run results from examples/mnist/main.py, which I think are superior to both the "Default PyTorch Implementation" and the AdamWScheduleFree results.

Test set: Average loss: 0.0394, Accuracy: 9871/10000 (98.71%)
Test set: Average loss: 0.0291, Accuracy: 9897/10000 (98.97%)
Test set: Average loss: 0.0268, Accuracy: 9920/10000 (99.20%)
Test set: Average loss: 0.0248, Accuracy: 9931/10000 (99.31%)
Test set: Average loss: 0.0241, Accuracy: 9934/10000 (99.34%)
Test set: Average loss: 0.0233, Accuracy: 9939/10000 (99.39%)
Test set: Average loss: 0.0254, Accuracy: 9938/10000 (99.38%)
Test set: Average loss: 0.0264, Accuracy: 9936/10000 (99.36%)
Test set: Average loss: 0.0269, Accuracy: 9935/10000 (99.35%)
Test set: Average loss: 0.0271, Accuracy: 9938/10000 (99.38%)
Test set: Average loss: 0.0278, Accuracy: 9938/10000 (99.38%)
Test set: Average loss: 0.0269, Accuracy: 9942/10000 (99.42%)
Test set: Average loss: 0.0280, Accuracy: 9942/10000 (99.42%)
Test set: Average loss: 0.0280, Accuracy: 9942/10000 (99.42%)

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Dec 19, 2024
@adefazio (Contributor) commented:

Thanks for this pull request. I will look into merging it in the New Year.

# These operations update y in-place,
# without computing x explicitly.
torch._foreach_lerp_(y, z, weight=ckp1)
torch._foreach_sub_(y, grad, alpha=adaptive_y_lr)


Hey @nhamanasu, I might be missing something, but is the subtraction correct here (it also appears in the non-foreach and closure versions)? I'm wondering if this is an error that might have been introduced unintentionally.

@nhamanasu (author) replied:

Thank you for the comment!

You're exactly right. In my test branch, I reversed the sign of adaptive_y_lr and used sub functions, but I somehow forgot to reflect those changes in this c-radam branch. This might have led to completely opposite results. Thank you for catching this critical issue!

@nhamanasu (author) replied:

...sorry, after considering the combination with the cautious update, I'm not sure which sign is correct for this part. Let me rethink this block!

@nhamanasu (author) replied:

In the end, I concluded your concern was right. Thank you again for the valuable comments!
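
For reference, a minimal sketch of what the corrected foreach block would look like, assuming the same convention as the AdamWScheduleFree foreach path, where adaptive_y_lr is negative and the gradient term is applied with an add (equivalently, keep the sub and flip the sign of adaptive_y_lr, as mentioned above); the authoritative version is the PR diff itself:

# These operations update y in-place, without computing x explicitly.
torch._foreach_lerp_(y, z, weight=ckp1)
torch._foreach_add_(y, grad, alpha=adaptive_y_lr)  # adaptive_y_lr < 0 under this convention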

@nhamanasu nhamanasu marked this pull request as draft December 27, 2024 01:56
@nhamanasu (author) commented:

Based on @LoganBooker 's comment, I've fixed the gradient update part.

With the default examples/mnist experimental settings, the final test results at the 14th epoch were:

  • [cautious=True] Test set: Average loss: 0.0329, Accuracy: 9935/10000 (99.35%)
  • [cautious=False] Test set: Average loss: 0.0288, Accuracy: 9928/10000 (99.28%)

As the difference is negligible, I set the default cautious value to False while keeping it as an option.
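
For anyone trying it out, a minimal usage sketch, assuming the flag is exposed as a constructor argument named cautious (default False, as described above) and that the class is exported at the package top level:

import torch
from schedulefree import RAdamScheduleFree

model = torch.nn.Linear(10, 2)
# `cautious=True` enables the option added by this PR; the flag name follows
# the PR description and its final form may differ.
optimizer = RAdamScheduleFree(model.parameters(), lr=1e-3, cautious=True)

optimizer.train()   # Schedule-Free optimizers need explicit train/eval switching
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.eval()    # switch to the evaluation (x) parameters before validation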

@nhamanasu nhamanasu marked this pull request as ready for review December 27, 2024 02:18
@gesen2egee commented Dec 31, 2024

Grams: Gradient Descent with Adaptive Momentum Scaling
Could you test this? It seems to converge faster than C-ADAM:
u.copy_(torch.sign(grad) * u.abs())  # keep each coordinate's update magnitude, take its sign from the gradient
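
Purely as an illustration of that one-liner (not part of this PR, and assuming u follows the usual convention of being the quantity subtracted from the parameters):

import torch

def grams_rescale(u: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Grams-style rescaling as suggested above: instead of zeroing the
    # coordinates that disagree with the gradient (cautious mask), keep every
    # coordinate's magnitude but take its sign from the current gradient.
    return torch.sign(grad) * u.abs()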

@gesen2egee commented:

Additionally, would defining u = (y - z).mul_(ckp1).add_(grad, alpha=adaptive_y_lr) allow storing an additional slow EMA u2 with its own decay beta3, in order to implement AdEMAMix?
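
A rough, non-authoritative sketch of that suggestion (not part of this PR; u2 and beta3 are hypothetical names for an extra state buffer and its decay). It only shows how the extra buffer would be maintained; how u2 would then feed back into the y update is left open, as discussed in the reply below.

import torch

def step_with_slow_ema(y: torch.Tensor, z: torch.Tensor, grad: torch.Tensor,
                       u2: torch.Tensor, ckp1: float, adaptive_y_lr: float,
                       beta3: float = 0.9999) -> torch.Tensor:
    # Combined step u, formed exactly as in the comment above.
    u = (y - z).mul(ckp1).add(grad, alpha=adaptive_y_lr)
    # Maintain a second, slowly-decaying EMA of u, loosely in the spirit of
    # AdEMAMix's extra momentum buffer (hypothetical; not implemented here).
    u2.mul_(beta3).add_(u, alpha=1.0 - beta3)
    return u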

@nhamanasu (author) commented Dec 31, 2024

@gesen2egee
Thank you for sharing these interesting research papers!

  • I wasn't familiar with Grams, so I'll take some time to read it in the next few days.
  • Regarding the application of Schedule-Free to AdEMAMix, I think it would be non-trivial since Schedule-Free doesn't directly incorporate a momentum term. However, exploring their combination might still be worthwhile.

Anyway, I think we should open separate PRs if we really want to implement these ideas (e.g., for AdEMAMix, we already have the related issue #46).

@nhamanasu (author) commented:

@adefazio

By the way, I apologize for raising this after opening the PR myself, but I've recently started to wonder whether adding new experimental features to this repository is the best approach.

On the one hand, we could continue expanding the schedule-free library with experimental features (including this cautious option) and leave the choice of using them up to the users.

On the other hand, we could keep this repository limited to well-established optimizers like the three we've already implemented, i.e. those with theoretical guarantees and proven practical utility.

LoganBooker added a commit to LoganBooker/prodigy-plus-schedule-free that referenced this pull request Jan 6, 2025