Save/load from checkpoints leading to strange loss shifts #217
Closed
LWprogramming started this conversation in General
Replies: 2 comments · 5 replies
-
It shouldn't, as long as optimizer states are properly saved and loaded.
5 replies
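As a minimal illustration of that point (plain PyTorch, not the repo's actual trainer code, with a made-up file name), a checkpoint needs to carry the optimizer's state_dict alongside the model's; restoring only the weights resets Adam's running moment estimates, which is one common reason a resumed run's loss jumps:

```python
# Illustrative sketch only; the file name and model are placeholders.
import torch
from torch import nn

model = nn.Linear(32, 32)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# ... train for a while ...

# Save both state dicts; saving only model.state_dict() would drop
# Adam's first/second moment estimates.
torch.save({"model": model.state_dict(), "optim": optim.state_dict()}, "ckpt.pt")

# On resume, restore both so the update statistics continue where they left off.
ckpt = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optim.load_state_dict(ckpt["optim"])
```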
-
Actually, I realized the issue I mentioned in PR #222 should probably get a new discussion topic once I try a few more times. (Basically there was a problem before we even reach the point shown in the graph, because the model has gotten pretty much overtrained: the train loss here is well below the validation loss. So I'll post after I try a few more training runs to see if I can get non-overfitted results.)
0 replies
-
Not sure if this is a bug, so I think this fits as a discussion for now.
Here's a loss diagram for a segment of training steps from a recent run. I'm using encodec, so the only things I actually need to train are the semantic, coarse, and fine transformers. The loss trends consistently downward, but I'm finding that loading from a checkpoint makes the loss shift around (consistently across all the transformers).
I'm training on a cluster with pre-emptible compute, so my jobs periodically get kicked off and automatically restarted, which means I have to load from a checkpoint. Based on my logs, every load significantly shifts the loss, while the save itself doesn't cause any change (see the checkpoint at step 115200, which doesn't register a significant shift).
Is reloading from a checkpoint normally supposed to make the loss move around a bit? This behavior surprised me.
--
training code here, training on 8xA100 GPUs
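For context, here is a rough sketch of the periodic-save / resume-on-restart loop described above, written in plain PyTorch with made-up names, paths, and a dummy objective; the linked training code may structure this differently:

```python
# Hypothetical resume-on-preemption loop; names and paths are placeholders.
from pathlib import Path

import torch
from torch import nn

CKPT_PATH = Path("checkpoint.pt")
SAVE_EVERY = 100

model = nn.Linear(16, 16)  # stand-in for a semantic/coarse/fine transformer
optim = torch.optim.Adam(model.parameters(), lr=3e-4)

start_step = 0
if CKPT_PATH.exists():
    # The job was preempted and restarted: resume from the last checkpoint.
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optim.load_state_dict(ckpt["optim"])
    start_step = ckpt["step"]

for step in range(start_step, 10_000):
    x = torch.randn(8, 16)
    loss = (model(x) - x).pow(2).mean()  # dummy objective
    optim.zero_grad()
    loss.backward()
    optim.step()

    if (step + 1) % SAVE_EVERY == 0:
        # Save everything needed to resume: weights, optimizer state, and the step counter.
        torch.save(
            {"model": model.state_dict(), "optim": optim.state_dict(), "step": step + 1},
            CKPT_PATH,
        )
```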