Failed to compute eigendecomposition #13

Open
aykamko opened this issue Nov 3, 2023 · 6 comments
Comments

@aykamko

aykamko commented Nov 3, 2023

We're seeing this error message about 5 minutes into training:

WARNING:distributed_shampoo.utils.matrix_functions:Failed to compute eigendecomposition in torch.float32 precision with exception linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 1).! Retrying in double precision...

Any ideas how we can fix this / avoid this?

@hjmshi
Contributor

hjmshi commented Nov 6, 2023

Hi @aykamko, thanks for your question! This is a warning related to the eigendecomposition solver failing with lower precision. Are you seeing a subsequent error after this?

This could be due to multiple reasons:

  1. If the learning rate is too large, inf or nan values can end up in the preconditioner matrix. In that case, we would not expect the retry in higher precision (which happens after this warning) to resolve the issue; you will likely begin to see nan values in the loss after the matrix root inverse is computed.
  2. Alternatively, we have found that the eigendecomposition solver can be unstable, especially for some low-rank matrices. One approach to avoid this is to set a larger start_preconditioning_step, which will ensure that the matrix is more well-behaved prior to applying the eigh solver.
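
To tell the two cases apart, one option is to look for the first non-finite loss after the warning shows up. The following is a minimal sketch in plain Python. `first_non_finite_step` is a hypothetical helper, not part of Shampoo, and it assumes the loss is recorded as a Python float at every step (for example via `loss.item()`).

```python
import math


def first_non_finite_step(losses):
    """Return the index of the first inf/nan loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical loss history, for illustration only.
history = [2.31, 1.87, 1.52, float("nan"), float("nan")]
print(first_non_finite_step(history))  # 3
```

If the loss stays finite after the warning, the first case is less likely and a larger start_preconditioning_step is the more promising thing to try.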

It would be helpful if you could provide your current Shampoo configuration as well as the optimizer configuration you were previously using for your model. We can also help with setting the appropriate hyperparameters for your case.

@aykamko
Author

aykamko commented Nov 11, 2023

Thanks for the response!

Previous config used AdamW:

lr = 1e-4
betas = (0.9, 0.999)
eps = 1e-8
weight_decay = 1e-2

Current Shampoo config:

        lr: 1e-4
        betas: [0.9, 0.999]
        epsilon: 1e-12
        weight_decay: 1e-02
        max_preconditioner_dim: 8192
        precondition_frequency: 100
        use_decoupled_weight_decay: True
        grafting_type: 4  # GraftingType.ADAM
        grafting_epsilon: 1e-08
        grafting_beta2: 0.999

In the meantime, I'll try to set a larger start_preconditioning_step.

I also saw this warning in your README:

Note: We have observed known instabilities with the torch.linalg.eigh operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear with using a small start_preconditioning_step. Please avoid these versions of CUDA if possible. See: pytorch/pytorch#94772.

We have the CUDA 12.2 driver installed, but our PyTorch is built for CUDA 12.1 (downloaded from pip). Could that be the issue?
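
A quick way to check this is to compare the CUDA version reported by the PyTorch build against the range in the README note. This is a sketch only: `cuda_in_affected_range` is a hypothetical helper, and it assumes the README's 11.6-12.1 range refers to the CUDA version PyTorch was built against (`torch.version.cuda`) rather than to the driver version.

```python
def cuda_in_affected_range(version, low=(11, 6), high=(12, 1)):
    """True if a 'major.minor' CUDA version lies in the README's 11.6-12.1 range."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return low <= (major, minor) <= high


try:
    import torch

    build_cuda = torch.version.cuda  # None for CPU-only builds
except ImportError:
    build_cuda = None  # PyTorch is not installed in this environment

if build_cuda is not None:
    print(build_cuda, cuda_in_affected_range(build_cuda))
```

Under that assumption, a build for CUDA 12.1 falls inside the range, while 12.2 would not.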

@hjmshi
Contributor

hjmshi commented Nov 15, 2023

@aykamko, the settings look right here. Let's see what happens with a larger preconditioning step. 😊

@vishaal27

Hi @aykamko, did increasing the start_preconditioning_step work for you?

@tsunghsienlee
Contributor

Hi @aykamko, did increasing the start_preconditioning_step work for you?

Hi @vishaal27, did you encounter an issue similar to @aykamko's in your usage?

@CoffeeVampir3

CoffeeVampir3 commented Jan 14, 2025

Running into the same issue; changing the start_preconditioning_step seems to alleviate the problem temporarily.

    optimizer = DistributedShampoo(
        model.parameters(),
        lr=0.0005,
        betas=(0.9, 0.999),
        epsilon=1e-10,
        weight_decay=1e-05,
        max_preconditioner_dim=2048,
        precondition_frequency=100,
        start_preconditioning_step=200,
        use_decoupled_weight_decay=False,
        grafting_config=AdamGraftingConfig(
            beta2=0.999,
            epsilon=1e-10,
        ),
    )

This is a test network: the dataset is of size 13K and the network has 92M params. I'm using an AMP grad scaler and the forward pass is done in bfloat16, if that's relevant. At first there are only one or two of these eigendecomposition warnings, but after 300 epochs the console is being spammed with them and the running time increases greatly. Are there any other mitigations you'd recommend? This optimizer is effective enough that I'm more than willing to go out of my way to architect around it. I think you've got something really good here.
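
For the console spam specifically, one stopgap is to rate-limit the warning with a standard `logging.Filter` while counting how often the fallback happens. This is only a sketch: `FallbackWarningLimiter` is a hypothetical class, and the logger name is taken from the `WARNING:distributed_shampoo.utils.matrix_functions:` prefix of the message above. It reduces the noise but does nothing about the cost of the double-precision retries.

```python
import logging


class FallbackWarningLimiter(logging.Filter):
    """Count eigendecomposition fallback warnings and pass only every Nth one."""

    def __init__(self, every=100):
        super().__init__()
        self.every = every
        self.count = 0

    def filter(self, record):
        if "Failed to compute eigendecomposition" not in record.getMessage():
            return True  # unrelated records pass through unchanged
        self.count += 1
        # Let the 1st, (every+1)th, (2*every+1)th, ... occurrence through.
        return (self.count - 1) % self.every == 0


limiter = FallbackWarningLimiter(every=100)
logging.getLogger("distributed_shampoo.utils.matrix_functions").addFilter(limiter)
```

Logging `limiter.count` at the end of each epoch then shows whether the fallbacks are becoming more frequent over time.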

I'm working with a slightly altered version of this architecture, but it's a simple DiT and UNet combination: https://huggingface.co/Blackroot/SimplePixelDiffusion-AlphaDemo/blob/main/models/uvit.py

Cheers
