Failed to compute eigendecomposition #13
Hi @aykamko, thanks for your question! This warning indicates that the eigendecomposition solver failed at the lower working precision. Are you seeing a subsequent error after this? This could be due to multiple reasons:
It would be helpful if you could share your current Shampoo configuration as well as the configuration of the optimizer you were previously using for your model. We can also help with setting the appropriate hyperparameters for your case.
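To give a rough idea of what the warning refers to, here is a hedged sketch (not the library's actual code) of the usual pattern: the eigendecomposition of a preconditioner factor matrix can fail to converge in the working precision, and a common mitigation is to retry in double precision.

```python
import torch

# Illustrative sketch only -- not Distributed Shampoo's actual implementation.
# Failed linalg calls surface as RuntimeError (torch.linalg.LinAlgError subclasses it).
def robust_eigh(matrix: torch.Tensor):
    try:
        # Attempt the eigendecomposition in the matrix's own dtype (e.g. float32).
        return torch.linalg.eigh(matrix)
    except RuntimeError:
        # Retry in float64: factor matrices that are ill-conditioned in float32
        # are often still decomposable in double precision.
        evals, evecs = torch.linalg.eigh(matrix.double())
        return evals.to(matrix.dtype), evecs.to(matrix.dtype)
```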
Thanks for the response! Previous config used AdamW:
Current Shampoo config:
In the meantime, I'll try setting a larger `start_preconditioning_step`. I also saw this warning in your README:
We have the CUDA 12.2 driver installed, but our PyTorch build targets CUDA 12.1 (installed from pip). Could that be the issue?
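For what it's worth, a quick way to check which CUDA toolkit the installed PyTorch wheel was built against versus what the runtime sees (a driver newer than the toolkit, e.g. 12.2 driver with a 12.1 build, is generally supported):

```python
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA toolkit:", torch.version.cuda)  # e.g. "12.1" for the pip wheel
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```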
@aykamko, the settings look right here. Let's see what happens with a larger preconditioning step. 😊
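A minimal sketch of what delaying preconditioning might look like; the import path, class names, and values below are assumptions based on this thread and may differ across Distributed Shampoo versions:

```python
import torch
# Import path and class names assumed; adjust for your installed version.
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo

model = torch.nn.Linear(512, 512)  # stand-in model for illustration

optimizer = DistributedShampoo(
    model.parameters(),
    lr=3e-4,                          # hypothetical values, not a recommendation
    betas=(0.9, 0.999),
    epsilon=1e-8,
    weight_decay=1e-2,
    max_preconditioner_dim=8192,
    precondition_frequency=25,
    start_preconditioning_step=300,   # delay the first eigendecomposition, as suggested above
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-8),
)
```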
Hi @aykamko, did increasing the `start_preconditioning_step` help?
Hi @vishaal27, did you encounter a similar issue to @aykamko's in your usage?
Running into the same issue; changing the `start_preconditioning_step` seems to alleviate the problem temporarily.
This test network has a dataset of about 13K samples and 92M parameters. I'm using an amp grad scaler and the forward pass is done in bfloat16, if that's relevant (roughly the setup sketched below). At first there are only one or two of these eigendecomposition warnings, but after 300 epochs they're being spammed in the console and greatly increase the running time. Are there any other mitigations you'd recommend? This optimizer is effective enough that I'm more than willing to go out of my way to architect around it; I think you've got something really good here. I'm working with a slightly altered version of this architecture, but it's simply a DiT and UNet combination: https://huggingface.co/Blackroot/SimplePixelDiffusion-AlphaDemo/blob/main/models/uvit.py

Cheers
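The mixed-precision setup referenced above, as a rough sketch (stand-in model, optimizer, and shapes; note that loss scaling via GradScaler is typically unnecessary with bfloat16, which has the same exponent range as float32):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()                   # stand-in for the DiT/UNet hybrid
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # stand-in; the run above uses Shampoo
scaler = torch.cuda.amp.GradScaler()                        # the amp grad scaler mentioned above
x = torch.randn(8, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()                         # forward pass runs in bfloat16

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```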
We're seeing this error message about 5 minutes into training.
Any ideas how we can fix or avoid this?