
Support slice_p in Prodigy optimizer #550

Open · wants to merge 3 commits into master

Conversation

dxqbYD (Contributor) commented Nov 8, 2024

Support a new hyperparameter in the Prodigy optimizer that reduces VRAM usage by about half.

Hold off on merging until the prodigyopt package is updated on PyPI.

In this PR I've chosen to enable it by default, while the package maintainers have chosen to disable it by default so as not to change existing behaviour.

This is debatable, but I think we can be a bit less cautious: we haven't yet found a case in which Prodigy performs worse with this enabled, neither in my own tests, nor in tests by a team of testers on Discord, nor in the original experiments of the Prodigy paper as repeated by the original authors. Details here: konstmish/prodigy#22

I can disable it by default, though, if you prefer.
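
For reference, here is a minimal sketch of how the new hyperparameter is used against the prodigyopt API (a standalone toy example, not OneTrainer's actual optimizer setup; the model and values are placeholders):

# Minimal sketch (not OneTrainer code): Prodigy from prodigyopt >= 1.1 with
# the new slice_p hyperparameter; slice_p=1 reproduces the previous behaviour.
import torch
from prodigyopt import Prodigy

model = torch.nn.Linear(128, 128)
optimizer = Prodigy(model.parameters(), lr=1.0, slice_p=11)

loss = model(torch.randn(4, 128)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()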

Nerogar (Owner) commented Nov 8, 2024

Enabling it by default makes sense. The only concern is that it will break backups, because the optimizer state can no longer be loaded. But it's probably worth making that breaking change for the huge reduction in memory usage.

@Nerogar Nerogar marked this pull request as draft November 9, 2024 19:33
Arcitec (Contributor) commented Nov 14, 2024

I agree that we should default to slice 11. It's an INSANE memory saving. And upstream has already verified that the training result with 11 is as good as with slice 1. They recommend slice 11 (or any other odd number) because it creates a rolling window that never skips the exact same parameters every time.

I recommend that we change our tooltip to mention this, and also bring in their improved reasoning from their readme about why 11 is good. I've also improved the grammar a bit:

  • 'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics only for the p-th values of each tensor. For values greater than 1, this is an approximation of standard Prodigy. Value should always be uneven to avoid skipping the same channels every time; 11 is a good trade-off between learning rate estimation and memory efficiency.', 'type': 'int'},
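
To make the mechanism behind that tooltip concrete, here is a small standalone illustration (plain PyTorch, not Prodigy's actual implementation) of what keeping statistics for only every p-th entry does to the state size:

# Standalone illustration (not Prodigy's actual code): the per-tensor statistics
# behind the learning-rate estimate are kept for only every p-th entry of the
# flattened tensor, shrinking those state tensors roughly p-fold.
import torch

slice_p = 11
weight = torch.randn(4096, 4096)

full_stats = torch.zeros_like(weight.flatten())               # slice_p = 1: one entry per weight
sliced_stats = torch.zeros_like(weight.flatten()[::slice_p])  # slice_p = 11: ~1/11th as many

print(full_stats.numel(), sliced_stats.numel())               # 16777216 vs 1525202

As I understand it, Prodigy keeps two such tensors per parameter (the accumulator s and the initial-weights copy p0) on top of the usual Adam moments, which is where the roughly-half saving in optimizer state comes from.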

While I was manually merging the new Prodigy and this PR to test it, I saw that pytorch_optimizer bundles its own copy of Prodigy, btw. O_o

conda_env> find . -iname "*prodigy.py*"
./lib/python3.12/site-packages/prodigyopt/prodigy.py
./lib/python3.12/site-packages/pytorch_optimizer/optimizer/prodigy.py

I actually did a diff, and oh boy, they've gone crazy rewriting Prodigy in pytorch_optimizer. It's almost unrecognizable; maybe only 30% of the lines of code are still the same. Added/removed parameters, different calculations, heh. Makes me curious whether any of their changes speed up the code or fix bugs, or whether they just wanted to act cool by changing every line.

Arcitec (Contributor) commented Nov 14, 2024

I have verified this "bleeding edge" config now:

  • Torch 2.5.1+cu124
  • Python 3.12 (fresh conda env)
  • SDXL Juggernaut checkpoint
  • Offloading 0.3
  • LoRA (DoRA)
  • Prodigyopt with their latest master code manually put into the conda_env/lib.
  • My any() optimization for Prodigy (there is a PR for it on their repo; it's just a small initialization speedup).
  • Prodigy slice_p=11.
  • VRAM usage is 10.8 GB of 24 GB. RAM usage is 14.1 GB of 64 GB.
  • Training went perfectly.
  • No more libnvrtc errors thanks to the new Torch version. So #490 ([Docs]: Documenting CUDA Toolkit installation on Linux) is finished.

I didn't need offloading but wanted to test everything at once.

Looking very good.

PS: We can safely bump lib.include.sh to OT_CONDA_USE_PYTHON_VERSION="3.12" if you want to. This will only affect fresh installations, and it would be a good way to prepare us for the future, where we will inevitably be using Python 3.11/3.12 features and libraries. I tried it since I saw your readme update about supporting 3.10/3.11/3.12, and can confirm that all dependencies exist and training went perfectly on 3.12. Might as well bump it unless you see a reason to keep 3.10 as the default. Python 3.10 also reaches end of life (EOL) in about 22 months.

Arcitec (Contributor) commented Dec 18, 2024

@Nerogar @dxqbYD I just spotted that Prodigy Optimizer 1.1 is out. So we can merge this and bump the prodigyopt requirement to ==1.1.

pip install prodigyopt
Collecting prodigyopt
  Downloading prodigyopt-1.1-py3-none-any.whl.metadata (4.6 kB)
Downloading prodigyopt-1.1-py3-none-any.whl (7.3 kB)
Installing collected packages: prodigyopt
Successfully installed prodigyopt-1.1
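
If anyone wants to double-check the installed version from Python, a quick standard-library snippet (nothing project-specific):

# Print the installed prodigyopt version via the standard library.
from importlib.metadata import version
print(version("prodigyopt"))  # expected: 1.1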

Diff vs their latest master code (nothing important missing from PyPI):

diff -u prodigy.py master/prodigy.py 
--- prodigy.py	2024-12-18 11:45:55.418152077 +0100
+++ master/prodigy.py	2024-12-18 11:46:37.157393027 +0100
@@ -53,7 +53,7 @@
             will attempt to auto-detect this, but if you're using an implementation other
             than PyTorch's builtin version, the auto-detection won't work.
         slice_p (int): Reduce memory usage by calculating LR adaptation statistics on only every 
-            pth entry of each tensor. For values greater than 1 this an an approximation to standard 
+            pth entry of each tensor. For values greater than 1 this is an approximation to standard 
             Prodigy. Values ~11 are reasonable (default 1).
     """
     def __init__(self, params, lr=1.0,

https://pypi.org/project/prodigyopt/#history


But before merging, the current tooltip isn't great.

Arcitec (Contributor) left a comment:
A more informative tooltip and changing requirements.txt to prodigyopt==1.1 would finish this PR.

@@ -142,7 +142,7 @@ def create_dynamic_ui(
'r': {'title': 'R', 'tooltip': 'EMA factor.', 'type': 'float'},
'adanorm': {'title': 'AdaNorm', 'tooltip': 'Whether to use the AdaNorm variant', 'type': 'bool'},
'adam_debias': {'title': 'Adam Debias', 'tooltip': 'Only correct the denominator to avoid inflating step sizes early in training.', 'type': 'bool'},

'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics on only every pth entry of each tensor. For values greater than 1 this an an approximation to standard Prodigy. Values ~11 are reasonable.', 'type': 'int'},
Arcitec (Contributor) commented Dec 18, 2024

I recommend that we change our tooltip to mention that uneven values are the ONLY recommended values, because they create a rolling window that never skips the exact same parameters every time. We should also bring in their improved reasoning from their readme about why 11 is good. I've also improved the grammar a bit:

  • 'slice_p': {'title': 'Slice parameters', 'tooltip': 'Reduce memory usage by calculating LR adaptation statistics only for the p-th values of each tensor. For values greater than 1, this is an approximation of standard Prodigy. Value should always be uneven to avoid skipping the same channels every time; 11 is a good trade-off between learning rate estimation and memory efficiency.', 'type': 'int'},

dxqbYD (Contributor Author) replied:

IMHO it is an expert opinion that uneven numbers might work better; we've never shown that they actually are better. All my tests showing that it works were done with slice_p == 10.
Could be something for expert users in the Wiki.

@dxqbYD dxqbYD marked this pull request as ready for review December 18, 2024 20:05
dxqbYD (Contributor Author) commented Dec 18, 2024

Version 1.1.1 has also already been released on PyPI; it brings in a (small) part of Adafactor: konstmish/prodigy#32
No change in this PR is necessary to support it, but we might want to document it (Wiki).
