Support slice_p in Prodigy optimizer #550
base: master
Conversation
Enabling it by default makes sense. The only concern would be that it will break backups, because the optimizer state can't be loaded anymore. But maybe it's worth making that breaking change for the huge reduction in memory usage.
I agree that we should default to slice_p = 11. It's an INSANE memory saving. And upstream already verified that the training result with 11 is as good as with slice 1. They recommend 11 (or any other odd number) because it creates a rolling window that never skips the exact same parameters every time. I recommend that we change our tooltip to mention this, and also bring in the improved reasoning from their readme about why 11 is good. I also improved the grammar a bit; see the suggested tooltip in the review thread below.
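A minimal sketch of the rolling-window argument (illustrative only, not Prodigy's actual code; it assumes the statistics are taken from a strided view of the flattened tensor, as the docstring describes). Tensor dimensions are typically powers of two, so an even stride keeps landing on the same position within each row, while an odd stride like 11 cycles through all of them:

```python
import torch

# Illustration: sampling every p-th entry of a flattened (8, 4) tensor.
# An even stride that shares a factor with the row width always hits the
# same column; an odd stride like 11 rotates through the columns.
t = torch.arange(32).reshape(8, 4)
flat = t.ravel()
print(flat[::4] % 4)   # tensor([0, 0, 0, 0, 0, 0, 0, 0]) -> always column 0
print(flat[::11] % 4)  # tensor([0, 3, 2])                -> columns rotate
```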
While I was manually merging the new Prodigy and this PR to test it, I saw that a lot has changed upstream. I actually did a diff, and oh boy, they've gone crazy rewriting Prodigy.
I have verified this "bleeding edge" config now:
I didn't need offloading, but I wanted to test everything at once. Looking very good. PS: We can safely bump the prodigyopt requirement.
@Nerogar @dxqbYD I just spotted that Prodigy Optimizer 1.1 is out, so we can merge this and bump the prodigyopt requirement:

```
pip install prodigyopt
Collecting prodigyopt
Downloading prodigyopt-1.1-py3-none-any.whl.metadata (4.6 kB)
Downloading prodigyopt-1.1-py3-none-any.whl (7.3 kB)
Installing collected packages: prodigyopt
Successfully installed prodigyopt-1.1
```

Diff vs their latest master code (nothing important is missing from PyPI):

```diff
diff -u prodigy.py master/prodigy.py
--- prodigy.py 2024-12-18 11:45:55.418152077 +0100
+++ master/prodigy.py 2024-12-18 11:46:37.157393027 +0100
@@ -53,7 +53,7 @@
will attempt to auto-detect this, but if you're using an implementation other
than PyTorch's builtin version, the auto-detection won't work.
slice_p (int): Reduce memory usage by calculating LR adaptation statistics on only every
- pth entry of each tensor. For values greater than 1 this an an approximation to standard
+ pth entry of each tensor. For values greater than 1 this is an approximation to standard
Prodigy. Values ~11 are reasonable (default 1).
"""
     def __init__(self, params, lr=1.0,
```

https://pypi.org/project/prodigyopt/#history

But before we merge: the current tooltip isn't great.
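As a rough sketch of what the slicing buys (an assumption about internals, based on the docstring: the per-parameter statistics are kept only for the strided entries, so they shrink by roughly a factor of slice_p):

```python
import torch

# Back-of-the-envelope: with slice_p = 11, statistics are kept for only
# every 11th entry of each tensor, so the extra optimizer state is ~11x
# smaller than with standard Prodigy (slice_p = 1).
param = torch.empty(1_000_000)
slice_p = 11
sliced_stats = param.ravel()[::slice_p]
print(param.numel(), sliced_stats.numel())  # 1000000 vs 90910
```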
A more informative tooltip and changing requirements.txt to prodigyopt==1.1 would finish this PR.
modules/ui/OptimizerParamsWindow.py (outdated)

```diff
@@ -142,7 +142,7 @@ def create_dynamic_ui(
     'r': {'title': 'R', 'tooltip': 'EMA factor.', 'type': 'float'},
     'adanorm': {'title': 'AdaNorm', 'tooltip': 'Whether to use the AdaNorm variant', 'type': 'bool'},
     'adam_debias': {'title': 'Adam Debias', 'tooltip': 'Only correct the denominator to avoid inflating step sizes early in training.', 'type': 'bool'},
+    'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics on only every pth entry of each tensor. For values greater than 1 this an an approximation to standard Prodigy. Values ~11 are reasonable.', 'type': 'int'},
```
I recommend that we change our tooltip to mention that uneven values are the ONLY recommended values, because they create a rolling window that never skips the exact same parameters every time, and to also bring in the improved reasoning from their readme about why 11 is good. I also improved the grammar a bit:
```python
'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics only for the p-th values of each tensor. For values greater than 1, this is an approximation of standard Prodigy. Value should always be uneven to avoid skipping the same channels every time; 11 is a good trade-off between learning rate estimation and memory efficiency.', 'type': 'int'},
```
IMHO it is an expert opinion that uneven numbers might work better; we've never shown that they actually are. All my tests showing that it works were with slice_p == 10.
This could be something for expert users in the wiki.
Version 1.1.1 was also released to PyPI already, which brings in a (small) part of Adafactor: konstmish/prodigy#32
Support the new hyperparameter in the Prodigy optimizer that reduces VRAM usage by about half.

Wait to merge until the prodigyopt package is updated on PyPI.

In this PR I've chosen to enable it by default, while the package maintainers have chosen to disable it by default in order to not change the behaviour. This is debatable, but I think we can be a bit less careful: we haven't found a case yet in which Prodigy performs worse when this is enabled, not in my own tests, not in tests by a team of testers on Discord, and not in the original experiments of the Prodigy paper repeated by the original authors. Details here: konstmish/prodigy#22

I can disable it by default, though, if you prefer.
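For reference, a minimal usage sketch of the upstream optimizer with the new hyperparameter (assuming prodigyopt >= 1.1, where slice_p was introduced; the model here is just a placeholder):

```python
import torch
from prodigyopt import Prodigy  # slice_p requires prodigyopt >= 1.1

model = torch.nn.Linear(128, 128)  # placeholder model
# slice_p=11 keeps LR-adaptation statistics for only every 11th tensor
# entry; upstream recommends odd values, with 1 (no slicing) the default.
optimizer = Prodigy(model.parameters(), lr=1.0, slice_p=11)
```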