Support slice_p in Prodigy optimizer #550
base: master
Conversation
Enabling it by default makes sense. The only concern would be that it will break backups, because the optimizer state can't be loaded anymore. But maybe it's worth making that breaking change for the huge reduction in memory usage.
I agree that we should default to slice_p = 11. It's an INSANE memory saving. And upstream already verified that the training result with 11 is as good as with slice 1. They recommend 11 (or any other odd number) because it creates a rolling window that never skips the exact same parameters every time. I recommend that we change our tooltip to mention this, and also bring in the improved reasoning from their readme about why 11 is good. I also improved the grammar a bit; see the suggested tooltip in the review thread below.
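A minimal sketch of the rolling-window argument (illustrative only, not Prodigy's actual code; it assumes the statistics are taken from a strided view of the flattened tensor, as the docstring describes). Tensor dimensions are typically powers of two, so an even stride keeps landing on the same position within each row, while an odd stride like 11 cycles through all of them:

```python
import torch

# Illustration: sampling every p-th entry of a flattened (8, 4) tensor.
# An even stride that shares a factor with the row width always hits the
# same column; an odd stride like 11 rotates through the columns.
t = torch.arange(32).reshape(8, 4)
flat = t.ravel()
print(flat[::4] % 4)   # tensor([0, 0, 0, 0, 0, 0, 0, 0]) -> always column 0
print(flat[::11] % 4)  # tensor([0, 3, 2])                -> columns rotate
```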
While I was manually merging the new Prodigy and this PR to test it, I saw that a lot has changed upstream. I actually did a diff, and oh boy, they've gone crazy rewriting Prodigy.
I have verified this "bleeding edge" config now:
I didn't need offloading, but I wanted to test everything at once. Looking very good. PS: We can safely bump the prodigyopt requirement.
@Nerogar @dxqbYD I just spotted that Prodigy Optimizer 1.1 is out, so we can merge this and bump the prodigyopt requirement:

```
pip install prodigyopt
Collecting prodigyopt
Downloading prodigyopt-1.1-py3-none-any.whl.metadata (4.6 kB)
Downloading prodigyopt-1.1-py3-none-any.whl (7.3 kB)
Installing collected packages: prodigyopt
Successfully installed prodigyopt-1.1
```

Diff vs their latest master code (nothing important is missing from PyPI):

```diff
diff -u prodigy.py master/prodigy.py
--- prodigy.py 2024-12-18 11:45:55.418152077 +0100
+++ master/prodigy.py 2024-12-18 11:46:37.157393027 +0100
@@ -53,7 +53,7 @@
will attempt to auto-detect this, but if you're using an implementation other
than PyTorch's builtin version, the auto-detection won't work.
slice_p (int): Reduce memory usage by calculating LR adaptation statistics on only every
- pth entry of each tensor. For values greater than 1 this an an approximation to standard
+ pth entry of each tensor. For values greater than 1 this is an approximation to standard
Prodigy. Values ~11 are reasonable (default 1).
"""
     def __init__(self, params, lr=1.0,
```

https://pypi.org/project/prodigyopt/#history

But before we merge: the current tooltip isn't great.
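As a rough sketch of what the slicing buys (an assumption about internals, based on the docstring: the per-parameter statistics are kept only for the strided entries, so they shrink by roughly a factor of slice_p):

```python
import torch

# Back-of-the-envelope: with slice_p = 11, statistics are kept for only
# every 11th entry of each tensor, so the extra optimizer state is ~11x
# smaller than with standard Prodigy (slice_p = 1).
param = torch.empty(1_000_000)
slice_p = 11
sliced_stats = param.ravel()[::slice_p]
print(param.numel(), sliced_stats.numel())  # 1000000 vs 90910
```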
A more informative tooltip and changing requirements.txt to prodigyopt==1.1 would finish this PR.
modules/ui/OptimizerParamsWindow.py (outdated)

```diff
@@ -142,7 +142,7 @@ def create_dynamic_ui(
     'r': {'title': 'R', 'tooltip': 'EMA factor.', 'type': 'float'},
     'adanorm': {'title': 'AdaNorm', 'tooltip': 'Whether to use the AdaNorm variant', 'type': 'bool'},
     'adam_debias': {'title': 'Adam Debias', 'tooltip': 'Only correct the denominator to avoid inflating step sizes early in training.', 'type': 'bool'},
+    'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics on only every pth entry of each tensor. For values greater than 1 this an an approximation to standard Prodigy. Values ~11 are reasonable.', 'type': 'int'},
```
I recommend that we change our tooltip to mention that uneven values are the ONLY recommended values, because they create a rolling window that never skips the exact same parameters every time, and to also bring in the improved reasoning from their readme about why 11 is good. I also improved the grammar a bit:
```python
'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics only for the p-th values of each tensor. For values greater than 1, this is an approximation of standard Prodigy. Value should always be uneven to avoid skipping the same channels every time; 11 is a good trade-off between learning rate estimation and memory efficiency.', 'type': 'int'},
```
IMHO it is an expert opinion that uneven numbers might work better; we've never shown that they actually are. All my tests showing that it works were with slice_p == 10.
This could be something for expert users in the wiki.
Version 1.1.1 was also released to PyPI already, which brings in a (small) part of Adafactor: konstmish/prodigy#32
Support the new hyperparameter in the Prodigy optimizer that reduces VRAM usage by about half.

Wait to merge until the prodigyopt package is updated on PyPI.

In this PR I've chosen to enable it by default, while the package maintainers have chosen to disable it by default in order to not change the behaviour. This is debatable, but I think we can be a bit less careful: we haven't found a case yet in which Prodigy performs worse when this is enabled, not in my own tests, not in tests by a team of testers on Discord, and not in the original experiments of the Prodigy paper repeated by the original authors. Details here: konstmish/prodigy#22

I can disable it by default, though, if you prefer.
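For reference, a minimal usage sketch of the upstream optimizer with the new hyperparameter (assuming prodigyopt >= 1.1, where slice_p was introduced; the model here is just a placeholder):

```python
import torch
from prodigyopt import Prodigy  # slice_p requires prodigyopt >= 1.1

model = torch.nn.Linear(128, 128)  # placeholder model
# slice_p=11 keeps LR-adaptation statistics for only every 11th tensor
# entry; upstream recommends odd values, with 1 (no slicing) the default.
optimizer = Prodigy(model.parameters(), lr=1.0, slice_p=11)
```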