GST Checkpointing/Warmstarting #275

sserita · 2022-11-15T18:14:36Z

Fitting a GST protocol is currently done all at once. However, there are cases where the fit fails or is interrupted, e.g. by running out of memory or hitting wall-clock time limits in an HPC environment. It would be nice if GST fits could be restarted easily, which will require some sort of checkpointing.

The ideal procedure would look something like:

1. pyGSTi dumps checkpoint files at the completion of each circuit list iteration, and probably also at each outer iteration of the optimizer.
2. On an unexpected fit failure, the restarted fit loads the most recent checkpoint file and continues the fitting procedure.

Something like this is ALMOST possible currently for circuit list iterations, but it is not straightforward. Users can run each iteration themselves, dump the results as a "checkpoint", and then manually set the starting point of the next iteration as the "warmstart". But the user would be pretty hard-pressed to restart partway through a circuit list iteration.

We would probably need to checkpoint the model (once, likely already done), the parameter vector (at both circuit list iterations and outer optimization iterations), and any state information in the optimizer (at each outer optimization iteration). The new serialization code should actually make this pretty easy, but we should check that things are being serialized at the right time with all the needed info.
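To make the proposal concrete, here is a minimal sketch of the checkpoint/resume loop in plain Python. It does not use the pyGSTi API: `run_fit`, `toy_optimizer_step`, the JSON file layout, and the `fail_at` switch are all hypothetical names invented for illustration, and the "optimizer" is a toy update rule. In a real fit, any optimizer state (e.g. a damping parameter) would be stored in the same dictionary alongside the parameter vector.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interrupted write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def toy_optimizer_step(params, target):
    """Hypothetical stand-in for one outer optimizer iteration."""
    return [p + 0.5 * (t - p) for p, t in zip(params, target)]


def run_fit(stage_targets, initial_params, path, outer_iters=3, fail_at=None):
    """Run a staged fit, resuming from `path` if a checkpoint is present.

    Each entry of `stage_targets` plays the role of one circuit list iteration.
    `fail_at` simulates an interruption after that many optimizer steps.
    """
    state = load_checkpoint(path) or {
        "stage": 0,
        "outer_iter": 0,
        "params": list(initial_params),
    }
    steps = 0
    while state["stage"] < len(stage_targets):
        target = stage_targets[state["stage"]]
        while state["outer_iter"] < outer_iters:
            if fail_at is not None and steps == fail_at:
                raise RuntimeError("simulated interruption")
            state["params"] = toy_optimizer_step(state["params"], target)
            state["outer_iter"] += 1
            steps += 1
            save_checkpoint(path, state)  # checkpoint each outer iteration
        state["stage"] += 1
        state["outer_iter"] = 0
        save_checkpoint(path, state)  # checkpoint each completed stage
    return state["params"]
```

Calling `run_fit` again with the same `path` after a failure picks up at the saved stage and outer iteration instead of starting over, which is the restart behavior described above.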

Critically, we do not want to checkpoint the entire CircuitOutcomeProbabilityArrayLayout (COPALayout): this would be relatively expensive, and also specific to the hardware configuration. Better to reconstruct it on the fly from the critical model parameter information.
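As a rough illustration of that split, the sketch below serializes only the parameter vector and rebuilds the layout on demand for whatever hardware the restarted job has. `FitState`, `build_layout`, and the `num_workers` argument are hypothetical stand-ins, not pyGSTi classes; the toy "layout" just divides parameter indices among workers.

```python
import json


def build_layout(num_params, num_workers):
    """Toy stand-in for layout construction: split indices across workers."""
    return [list(range(w, num_params, num_workers)) for w in range(num_workers)]


class FitState:
    """Hypothetical container: parameters are saved, the layout is derived."""

    def __init__(self, params):
        self.params = list(params)
        self._layout = None          # derived cache, never serialized
        self._layout_workers = None  # worker count the cache was built for

    def layout(self, num_workers):
        # Rebuild when missing or when the hardware configuration changed.
        if self._layout is None or self._layout_workers != num_workers:
            self._layout = build_layout(len(self.params), num_workers)
            self._layout_workers = num_workers
        return self._layout

    def to_checkpoint(self):
        # Only the parameter vector goes into the checkpoint.
        return json.dumps({"params": self.params})

    @classmethod
    def from_checkpoint(cls, text):
        return cls(json.loads(text)["params"])
```

A state saved on a 2-worker run can then be restored with `FitState.from_checkpoint(...)` and asked for `layout(4)` on a 4-worker run, without the checkpoint ever containing layout data.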

sserita · 2023-10-18T15:41:48Z

A first pass at this was merged with #347. We will probably want more in-depth checkpointing in the future, but we will reopen an issue to examine that when it becomes a priority again.

sserita · 2023-11-29T03:10:05Z

Closed with release of 0.9.12.

sserita added the enhancement label (Request for a new feature or a change to an existing feature) Nov 15, 2022

sserita self-assigned this Nov 15, 2022

sserita added the planned-for-next-release label May 17, 2023

sserita removed the planned-for-next-release label Jun 5, 2023

sserita added the fixed-but-not-in-release-yet label (Bug has been fixed, but isn't in an official release yet; just exists on a development branch) Oct 18, 2023

sserita closed this as completed Nov 29, 2023

sserita removed the fixed-but-not-in-release-yet label Dec 6, 2023
