GST Checkpointing/Warmstarting #275

Closed
sserita opened this issue Nov 15, 2022 · 2 comments

sserita commented Nov 15, 2022

Fitting a GST protocol is currently done all at once. However, there are cases where the fit fails or is interrupted, e.g. by running out of memory or hitting wall-clock time limits in an HPC environment. It would be nice if GST fits could be restarted easily, which will require some sort of checkpointing.

The ideal procedure would look something like:

  1. PyGSTi dumps checkpoint files at the completion of each circuit list iteration, and also probably at each outer iteration of the optimizer.
  2. On an unexpected fit failure, the restarted fit can load the most recent checkpoint file and continue the fitting procedure.
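
A minimal sketch of that two-step procedure at the circuit-list level, in plain Python. Nothing here is pyGSTi API: `run_fit`, `fit_one_list`, `save_checkpoint`, `load_checkpoint`, and the JSON file layout are all hypothetical names chosen for illustration, and the parameter vector is assumed to be a plain list of floats.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as JSON via a temp file, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomically swap in the finished file


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run_fit(circuit_lists, fit_one_list, initial_params, checkpoint_path):
    """Fit each circuit list in turn, resuming after the last completed one."""
    saved = load_checkpoint(checkpoint_path)
    if saved is None:
        next_index, params = 0, list(initial_params)
    else:
        next_index, params = saved["next_index"], saved["params"]

    for index in range(next_index, len(circuit_lists)):
        # Warm start: each iteration begins from the previous iteration's result.
        params = list(fit_one_list(circuit_lists[index], params))
        save_checkpoint(checkpoint_path, {"next_index": index + 1, "params": params})
    return params
```

Calling `run_fit` again with the same `checkpoint_path` after a failure skips the circuit lists that already completed. This sketch only covers step 1 at circuit-list granularity; checkpointing inside the optimizer would need a hook in the optimizer loop itself.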

Something like this is ALMOST possible currently for circuit list iterations, but it is not straightforward to do. Users can run each iteration themselves, dump the results as a "checkpoint", and then manually set the starting point of the next iteration as the "warmstart". But the user would be pretty hard-pressed to restart partway through a circuit list iteration.

We would probably need to checkpoint the model (once, likely already done), the parameter vector (at both circuit list iterations and outer optimization iterations), and any state information in the optimizer (at each outer optimization iteration). The new serialization code should actually make this pretty easy, but we should check that things are being serialized at the right time with all the needed info.
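
As a rough illustration of what such a checkpoint record might hold, here is a sketch using a dataclass. The class name `FitCheckpoint` and every field name are assumptions made for this example, not part of the existing serialization code; the model is assumed to be saved once, separately, and so is not included.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FitCheckpoint:
    """Hypothetical checkpoint record; all field names are illustrative."""

    circuit_list_index: int  # circuit list iteration currently in progress
    outer_iteration: int     # outer optimizer iterations completed within it
    params: list             # current model parameter vector
    optimizer_state: dict = field(default_factory=dict)  # e.g. a damping factor

    def to_json(self):
        """Serialize the record to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        """Rebuild a record from the output of to_json."""
        return cls(**json.loads(text))
```

Writing the same record type at both circuit list boundaries and outer optimizer iterations would let a restart pick up from whichever was written last.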

Critically, we do not want to checkpoint the entire CircuitOutcomeProbabilityArrayLayout (COPALayout): this would be relatively expensive, and it is also specific to the hardware configuration. Better to reconstruct it on the fly from the critical model parameter information.
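
One generic way to keep a derived object out of a serialized checkpoint is to drop it in `__getstate__` and rebuild it lazily after loading. The sketch below uses `pickle` and a stand-in class; `FitState`, `_layout`, and `_build_layout` are hypothetical and do not reflect how COPALayout is actually constructed.

```python
import pickle


class FitState:
    """Hypothetical fit state; `_layout` stands in for an expensive derived object."""

    def __init__(self, params):
        self.params = list(params)
        self._layout = None  # built lazily, never serialized

    def _build_layout(self):
        # Placeholder for an expensive, machine-specific computation.
        return {"n_params": len(self.params)}

    @property
    def layout(self):
        """Return the derived object, building it on first access."""
        if self._layout is None:
            self._layout = self._build_layout()
        return self._layout

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_layout"] = None  # drop the derived object before pickling
        return state
```

After `pickle.loads(pickle.dumps(state))`, the restored object carries only the parameters, and the layout is rebuilt on the machine doing the restart the first time `layout` is accessed.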

sserita added the enhancement label (request for a new feature or a change to an existing feature) Nov 15, 2022
sserita self-assigned this Nov 15, 2022

sserita commented Oct 18, 2023

A first pass at this was merged with #347. We will probably want more in-depth checkpointing in the future, but we will reopen an issue to examine that when it becomes a priority again.

sserita added the fixed-but-not-in-release-yet label (bug has been fixed, but isn't in an official release yet; just exists on a development branch) Oct 18, 2023

sserita commented Nov 29, 2023

Closed with release of 0.9.12.

sserita closed this as completed Nov 29, 2023
sserita removed the fixed-but-not-in-release-yet label Dec 6, 2023