
PyTorch 2 Upgrade: Instabilities in training losses. Is this fixed? (Thank you!) #501

Open
leiterenato opened this issue Nov 10, 2024 · 0 comments


Regarding the note from @jnwei on May 3rd, has this been addressed?
"A quick note on the pytorch 2 / CUDA 12 upgrade:

We've run into some technical issues with the pytorch 2 upgrade. Briefly, we observe large instabilities in our training losses in the pytorch 2 version relative to our pytorch 1 version.

For inference, we're also observing a slight difference between model outputs in pytorch 1 and pytorch 2. The difference in final output coordinates is about RMSD ~0.05 Å for the proteins I've looked at. While these differences might seem small, they may point to a larger issue that is also occurring in training; we're currently looking into it.

Until we find the root cause of the discrepancy, or a way around the training instability, we're not ready to update the main branch to pytorch 2.

Meanwhile, we will upgrade the main branch to use pytorch lightning 2, which has a few features that the team has found useful. I'll also push some changes to pl_upgrades that integrate some of the changes from the main branch and clean up the conda environment / Docker setup for CUDA 12 / pytorch 2.

We are actively working on debugging the instability, and we'll keep you posted as soon as we are ready to upgrade. Thank you all for your interest and your patience.

Originally posted by @jnwei in #403 (comment)"
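For context, the ~0.05 Å figure quoted above is a coordinate RMSD between the final outputs of the two runs. Below is a minimal sketch of such a comparison, not OpenFold's own code: the `.npy` file names and the assumption that both runs' coordinates are already in the same reference frame (no superposition) are illustrative only.

```python
# Minimal sketch: compare final coordinates produced by a PyTorch 1 run and a
# PyTorch 2 run of the same model on the same input.
# The file names and .npy dumps are hypothetical, for illustration only.
import numpy as np

def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Plain coordinate RMSD in Angstroms; assumes both coordinate sets share a frame."""
    assert a.shape == b.shape
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=-1))))

pt1 = np.load("coords_pytorch1.npy")  # shape (N_res, 3), dumped from the PyTorch 1 run
pt2 = np.load("coords_pytorch2.npy")  # shape (N_res, 3), dumped from the PyTorch 2 run
print(f"RMSD between PyTorch 1 and PyTorch 2 outputs: {rmsd(pt1, pt2):.3f} A")
```

If the two runs' outputs are not pre-aligned, the coordinates would first need to be superposed (e.g. with a Kabsch alignment) before the RMSD is meaningful.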
