Restarting is not the same as a single training run #283

Open
frostedoyster opened this issue Jul 5, 2024 · 1 comment

@frostedoyster (Collaborator)

Using SOAP-BPNN, restarting from a checkpoint does not yield exactly the same numbers as a single, longer training run (training is still good, and the numbers make sense). The epoch saved inside the final checkpoint is also wrong (it is correct for the other checkpoints).
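
For reference, the check being described can be phrased as a small self-contained PyTorch experiment (a toy model and an in-memory "checkpoint", not metatrain's actual training loop): train once for N epochs, and separately train N/2 epochs, checkpoint, restore, and finish the remaining N/2, then compare the parameters bitwise.

import copy

import torch


def make_model():
    torch.manual_seed(0)  # identical initialization for both runs
    return torch.nn.Linear(8, 1)


def train(model, optimizer, n_epochs, data):
    for _ in range(n_epochs):
        for x, y in data:
            optimizer.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            optimizer.step()


torch.manual_seed(0)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]

# run A: a single 4-epoch training run
model_a = make_model()
opt_a = torch.optim.Adam(model_a.parameters())
train(model_a, opt_a, 4, data)

# run B: 2 epochs, "checkpoint", restore, then 2 more epochs
model_b = make_model()
opt_b = torch.optim.Adam(model_b.parameters())
train(model_b, opt_b, 2, data)
ckpt = {
    "model": copy.deepcopy(model_b.state_dict()),
    "optimizer": copy.deepcopy(opt_b.state_dict()),
}
model_b.load_state_dict(ckpt["model"])
opt_b.load_state_dict(ckpt["optimizer"])
train(model_b, opt_b, 2, data)

# bitwise comparison; this toy passes because nothing stochastic happens between
# checkpoint and restart, which is exactly the part that breaks in practice
for (name, pa), (_, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
    print(name, torch.equal(pa, pb))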

frostedoyster added the SOAP BPNN (experimental architecture) and Priority: Medium labels on Jul 5, 2024
frostedoyster self-assigned this on Jul 5, 2024
frostedoyster added the NanoPET (experimental architecture) label on Dec 17, 2024
@frostedoyster (Collaborator, Author)

The issue might be partially due to the dataloader. Even the following (together with saving the state to the checkpoint) doesn't work:

import os
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed every RNG that the training loop may touch."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


def get_rng_state():
    """Collect the current state of every relevant RNG, e.g. to store in a checkpoint."""
    rng_state = {
        'torch': torch.get_rng_state(),
        'numpy': np.random.get_state(),
        'random': random.getstate(),
        'cuda': torch.cuda.get_rng_state_all(),  # GPU RNG state of every visible device
    }
    return rng_state


def set_rng_state(rng_state: dict):
    """Restore the RNG states collected by get_rng_state()."""
    torch.set_rng_state(rng_state['torch'])
    np.random.set_state(rng_state['numpy'])
    random.setstate(rng_state['random'])
    torch.cuda.set_rng_state_all(rng_state['cuda'])
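
For what it's worth, the RNG part of this does round-trip through a file on its own. A minimal check using the helpers above (the torch.save/torch.load calls and the file name are illustrative, not metatrain's checkpoint format):

set_seed(0)
_ = torch.rand(3)  # advance the global RNG a bit, as training would

torch.save({"rng_state": get_rng_state()}, "rng_checkpoint.pt")
expected = torch.rand(3)  # what an uninterrupted run would draw next

# "restart": reload the saved state and draw again
state = torch.load("rng_checkpoint.pt", weights_only=False)  # RNG states are not plain tensors
set_rng_state(state["rng_state"])
restarted = torch.rand(3)

print(torch.equal(expected, restarted))  # True: the RNGs themselves restore cleanly

So the remaining discrepancy presumably comes from elsewhere (dataloader order, worker seeding, nondeterministic ops).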

The workaround suggested here doesn't really work in distributed environments: https://stackoverflow.com/questions/60993677/how-can-i-save-pytorchs-dataloader-instance
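
One distributed-friendly direction (a suggestion, not something metatrain does today): if the shuffling order is a pure function of (seed, epoch), as with torch.utils.data.distributed.DistributedSampler and its set_epoch(), then resuming only needs the epoch number from the checkpoint rather than a pickled dataloader. A minimal sketch (num_replicas/rank are passed explicitly so it runs without init_process_group):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True, seed=0)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    # the order depends only on (seed, epoch), so resuming at epoch k with the
    # same seed replays exactly the batches a single run would have seen
    sampler.set_epoch(epoch)
    print(epoch, [batch[0].tolist() for batch in loader])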

However, even setting the dataloader's shuffling to False doesn't guarantee reproducibility, so there must be further issues.
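
One way to hunt for those further issues (a generic PyTorch debugging step, not verified against metatrain) is to force deterministic kernels so that any nondeterministic op fails loudly:

import os

import torch

# must be set before the first CUDA matmul for deterministic cuBLAS behaviour
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.benchmark = False  # autotuning may pick different kernels per run
torch.use_deterministic_algorithms(True)  # raise on ops without a deterministic implementation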
