Failed to compute eigendecomposition #13
Hi @aykamko, thanks for your question! This warning indicates that the eigendecomposition solver failed at the lower working precision. Are you seeing a subsequent error after this? This could be due to multiple reasons:
It would be helpful if you could share your current Shampoo configuration as well as the configuration of the optimizer you were previously using for your model. We can also help with setting the appropriate hyperparameters for your case.
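To give a rough idea of what the warning refers to, here is a hedged sketch (not the library's actual code) of the usual pattern: the eigendecomposition of a preconditioner factor matrix can fail to converge in the working precision, and a common mitigation is to retry in double precision.

```python
import torch

# Illustrative sketch only -- not Distributed Shampoo's actual implementation.
# Failed linalg calls surface as RuntimeError (torch.linalg.LinAlgError subclasses it).
def robust_eigh(matrix: torch.Tensor):
    try:
        # Attempt the eigendecomposition in the matrix's own dtype (e.g. float32).
        return torch.linalg.eigh(matrix)
    except RuntimeError:
        # Retry in float64: factor matrices that are ill-conditioned in float32
        # are often still decomposable in double precision.
        evals, evecs = torch.linalg.eigh(matrix.double())
        return evals.to(matrix.dtype), evecs.to(matrix.dtype)
```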
Thanks for the response! Previous config used AdamW:
Current Shampoo config:
In the meantime, I'll try setting a larger `start_preconditioning_step`. I also saw this warning in your README:
We have the CUDA 12.2 driver installed, but our PyTorch build targets CUDA 12.1 (installed from pip). Could that be the issue?
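For what it's worth, a quick way to check which CUDA toolkit the installed PyTorch wheel was built against versus what the runtime sees (a driver newer than the toolkit, e.g. 12.2 driver with a 12.1 build, is generally supported):

```python
import torch

print("PyTorch version:", torch.__version__)
print("Built against CUDA toolkit:", torch.version.cuda)  # e.g. "12.1" for the pip wheel
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```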
@aykamko, the settings look right here. Let's see what happens with a larger preconditioning step. 😊
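A minimal sketch of what delaying preconditioning might look like; the import path, class names, and values below are assumptions based on this thread and may differ across Distributed Shampoo versions:

```python
import torch
# Import path and class names assumed; adjust for your installed version.
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo

model = torch.nn.Linear(512, 512)  # stand-in model for illustration

optimizer = DistributedShampoo(
    model.parameters(),
    lr=3e-4,                          # hypothetical values, not a recommendation
    betas=(0.9, 0.999),
    epsilon=1e-8,
    weight_decay=1e-2,
    max_preconditioner_dim=8192,
    precondition_frequency=25,
    start_preconditioning_step=300,   # delay the first eigendecomposition, as suggested above
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-8),
)
```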
Hi @aykamko, did increasing the `start_preconditioning_step` help?
Hi @vishaal27, did you encounter a similar issue to @aykamko's in your usage?
Running into the same issue; changing the `start_preconditioning_step` seems to alleviate the problem temporarily.
This test network has a dataset of about 13K samples and 92M parameters. I'm using an amp grad scaler and the forward pass is done in bfloat16, if that's relevant (roughly the setup sketched below). At first there are only one or two of these eigendecomposition warnings, but after 300 epochs they're being spammed in the console and greatly increase the running time. Are there any other mitigations you'd recommend? This optimizer is effective enough that I'm more than willing to go out of my way to architect around it; I think you've got something really good here. I'm working with a slightly altered version of this architecture, but it's simply a DiT and UNet combination: https://huggingface.co/Blackroot/SimplePixelDiffusion-AlphaDemo/blob/main/models/uvit.py

Cheers
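The mixed-precision setup referenced above, as a rough sketch (stand-in model, optimizer, and shapes; note that loss scaling via GradScaler is typically unnecessary with bfloat16, which has the same exponent range as float32):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()                   # stand-in for the DiT/UNet hybrid
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # stand-in; the run above uses Shampoo
scaler = torch.cuda.amp.GradScaler()                        # the amp grad scaler mentioned above
x = torch.randn(8, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()                         # forward pass runs in bfloat16

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```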
We're seeing this error message about 5 minutes into training.
Any ideas how we can fix or avoid this?