Save/load from checkpoints leading to strange loss shifts #217
Closed
LWprogramming started this conversation in General
Replies: 2 comments · 5 replies
-
It shouldn't, as long as optimizer states are properly saved and loaded.
5 replies
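As a minimal illustration of that point (plain PyTorch, not the repo's actual trainer code, with a made-up file name), a checkpoint needs to carry the optimizer's state_dict alongside the model's; restoring only the weights resets Adam's running moment estimates, which is one common reason a resumed run's loss jumps:

```python
# Illustrative sketch only; the file name and model are placeholders.
import torch
from torch import nn

model = nn.Linear(32, 32)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# ... train for a while ...

# Save both state dicts; saving only model.state_dict() would drop
# Adam's first/second moment estimates.
torch.save({"model": model.state_dict(), "optim": optim.state_dict()}, "ckpt.pt")

# On resume, restore both so the update statistics continue where they left off.
ckpt = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optim.load_state_dict(ckpt["optim"])
```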
-
Actually, I realized the issue I mentioned in PR #222 should probably get a new discussion topic once I try a few more times. (Basically there was a problem before we even reach the point shown in the graph, because the model has gotten pretty much overtrained: the train loss here is well below the validation loss. So I'll post after I try a few more training runs to see if I can get non-overfitted results.)
0 replies
-
Not sure if this is a bug, so I think this fits as a discussion for now.
Here's a loss diagram for a segment of training steps from a recent run. I'm using encodec, so the only things I actually need to train are the semantic, coarse, and fine transformers. The loss trends consistently downward, but I'm finding that loading from a checkpoint makes the loss shift around (consistently across all the transformers).
I'm training on a cluster with pre-emptible compute, so my jobs periodically get kicked off and automatically restarted, which means I have to load from a checkpoint. Based on my logs, every load significantly shifts the loss, while the save itself doesn't cause any change (see the checkpoint at step 115200, which doesn't register a significant shift).
Is reloading from a checkpoint normally supposed to make the loss move around a bit? This behavior surprised me.
--
training code here, training on 8xA100 GPUs
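For context, here is a rough sketch of the periodic-save / resume-on-restart loop described above, written in plain PyTorch with made-up names, paths, and a dummy objective; the linked training code may structure this differently:

```python
# Hypothetical resume-on-preemption loop; names and paths are placeholders.
from pathlib import Path

import torch
from torch import nn

CKPT_PATH = Path("checkpoint.pt")
SAVE_EVERY = 100

model = nn.Linear(16, 16)  # stand-in for a semantic/coarse/fine transformer
optim = torch.optim.Adam(model.parameters(), lr=3e-4)

start_step = 0
if CKPT_PATH.exists():
    # The job was preempted and restarted: resume from the last checkpoint.
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optim.load_state_dict(ckpt["optim"])
    start_step = ckpt["step"]

for step in range(start_step, 10_000):
    x = torch.randn(8, 16)
    loss = (model(x) - x).pow(2).mean()  # dummy objective
    optim.zero_grad()
    loss.backward()
    optim.step()

    if (step + 1) % SAVE_EVERY == 0:
        # Save everything needed to resume: weights, optimizer state, and the step counter.
        torch.save(
            {"model": model.state_dict(), "optim": optim.state_dict(), "step": step + 1},
            CKPT_PATH,
        )
```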