Restarting is not the same as a single training run #283

Open
frostedoyster opened this issue Jul 5, 2024 · 1 comment

@frostedoyster (Collaborator)

Using SOAP-BPNN, restarting from a checkpoint does not yield exactly the same numbers as a single, longer training run (training is still good, and the numbers make sense). The epoch saved inside the final checkpoint is also wrong (it is correct for the other checkpoints).
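
For reference, the check being described can be phrased as a small self-contained PyTorch experiment (a toy model and an in-memory "checkpoint", not metatrain's actual training loop): train once for N epochs, and separately train N/2 epochs, checkpoint, restore, and finish the remaining N/2, then compare the parameters bitwise.

import copy

import torch


def make_model():
    torch.manual_seed(0)  # identical initialization for both runs
    return torch.nn.Linear(8, 1)


def train(model, optimizer, n_epochs, data):
    for _ in range(n_epochs):
        for x, y in data:
            optimizer.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            optimizer.step()


torch.manual_seed(0)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]

# run A: a single 4-epoch training run
model_a = make_model()
opt_a = torch.optim.Adam(model_a.parameters())
train(model_a, opt_a, 4, data)

# run B: 2 epochs, "checkpoint", restore, then 2 more epochs
model_b = make_model()
opt_b = torch.optim.Adam(model_b.parameters())
train(model_b, opt_b, 2, data)
ckpt = {
    "model": copy.deepcopy(model_b.state_dict()),
    "optimizer": copy.deepcopy(opt_b.state_dict()),
}
model_b.load_state_dict(ckpt["model"])
opt_b.load_state_dict(ckpt["optimizer"])
train(model_b, opt_b, 2, data)

# bitwise comparison; this toy passes because nothing stochastic happens between
# checkpoint and restart, which is exactly the part that breaks in practice
for (name, pa), (_, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
    print(name, torch.equal(pa, pb))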

frostedoyster added the SOAP BPNN (experimental architecture) and Priority: Medium labels on Jul 5, 2024
frostedoyster self-assigned this on Jul 5, 2024
frostedoyster added the NanoPET (experimental architecture) label on Dec 17, 2024
@frostedoyster (Collaborator, Author)

The issue might be partially due to the dataloader. Even the following (together with saving the state to the checkpoint) doesn't work:

import os
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed every RNG that the training loop may touch."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)


def get_rng_state():
    """Collect the current state of every relevant RNG, e.g. to store in a checkpoint."""
    rng_state = {
        'torch': torch.get_rng_state(),
        'numpy': np.random.get_state(),
        'random': random.getstate(),
        'cuda': torch.cuda.get_rng_state_all(),  # GPU RNG state of every visible device
    }
    return rng_state


def set_rng_state(rng_state: dict):
    """Restore the RNG states collected by get_rng_state()."""
    torch.set_rng_state(rng_state['torch'])
    np.random.set_state(rng_state['numpy'])
    random.setstate(rng_state['random'])
    torch.cuda.set_rng_state_all(rng_state['cuda'])
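
For what it's worth, the RNG part of this does round-trip through a file on its own. A minimal check using the helpers above (the torch.save/torch.load calls and the file name are illustrative, not metatrain's checkpoint format):

set_seed(0)
_ = torch.rand(3)  # advance the global RNG a bit, as training would

torch.save({"rng_state": get_rng_state()}, "rng_checkpoint.pt")
expected = torch.rand(3)  # what an uninterrupted run would draw next

# "restart": reload the saved state and draw again
state = torch.load("rng_checkpoint.pt", weights_only=False)  # RNG states are not plain tensors
set_rng_state(state["rng_state"])
restarted = torch.rand(3)

print(torch.equal(expected, restarted))  # True: the RNGs themselves restore cleanly

So the remaining discrepancy presumably comes from elsewhere (dataloader order, worker seeding, nondeterministic ops).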

The workaround suggested here doesn't really work in distributed environments: https://stackoverflow.com/questions/60993677/how-can-i-save-pytorchs-dataloader-instance
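
One distributed-friendly direction (a suggestion, not something metatrain does today): if the shuffling order is a pure function of (seed, epoch), as with torch.utils.data.distributed.DistributedSampler and its set_epoch(), then resuming only needs the epoch number from the checkpoint rather than a pickled dataloader. A minimal sketch (num_replicas/rank are passed explicitly so it runs without init_process_group):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True, seed=0)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(2):
    # the order depends only on (seed, epoch), so resuming at epoch k with the
    # same seed replays exactly the batches a single run would have seen
    sampler.set_epoch(epoch)
    print(epoch, [batch[0].tolist() for batch in loader])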

However, even setting the dataloader's shuffling to False doesn't guarantee reproducibility, so there must be further issues.
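
One way to hunt for those further issues (a generic PyTorch debugging step, not verified against metatrain) is to force deterministic kernels so that any nondeterministic op fails loudly:

import os

import torch

# must be set before the first CUDA matmul for deterministic cuBLAS behaviour
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.backends.cudnn.benchmark = False  # autotuning may pick different kernels per run
torch.use_deterministic_algorithms(True)  # raise on ops without a deterministic implementation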
