A GST protocol fit currently runs in a single pass from start to finish. However, there are cases where the fit fails or is interrupted, e.g. by running out of memory or hitting wall clock time limits in an HPC environment. It would be nice if GST fits could be restarted easily, which will require some sort of checkpointing.
The ideal procedure would look something like:
PyGSTi dumps checkpoint files at the completion of each circuit list iteration, and also probably at each outer iteration of the optimizer.
On an unexpected fit failure, the restarted fit can load the most recent checkpoint file and continue the fitting procedure.
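The procedure above could be sketched roughly as follows. This is a minimal illustration in plain Python, not pyGSTi code: `save_checkpoint`, `load_checkpoint`, `run_fit`, the `fit_one_iteration` callable, and the JSON checkpoint format are all hypothetical names and choices made for the example. It only covers checkpoints at circuit list iteration boundaries, not at optimizer outer iterations.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interrupted write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run_fit(circuit_lists, initial_params, fit_one_iteration, checkpoint_path):
    """Run every circuit list iteration, resuming from a checkpoint if present.

    fit_one_iteration is a hypothetical callable (circuits, params) -> params
    standing in for a single GST circuit list iteration.
    """
    state = load_checkpoint(checkpoint_path)
    if state is None:
        state = {"next_iteration": 0, "params": list(initial_params)}
    for i in range(state["next_iteration"], len(circuit_lists)):
        # Warm-start each iteration from the previous iteration's result.
        params = fit_one_iteration(circuit_lists[i], state["params"])
        state = {"next_iteration": i + 1, "params": list(params)}
        save_checkpoint(checkpoint_path, state)
    return state["params"]
```

If the job dies partway through, calling `run_fit` again with the same `checkpoint_path` skips the completed iterations and warm-starts from the last saved parameter vector. Work done inside the interrupted iteration is still lost in this sketch.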
Something like this is ALMOST possible currently for circuit list iterations but is not straightforward to do. Users can run each iteration themselves and dump the results as a "checkpoint", and then manually set the starting point of the next iteration as the "warmstart". But the user would be pretty hard-pressed to restart partway through a circuit list iteration.
We would probably need to checkpoint the model (once, likely already done), the parameter vector (at both circuit list iterations and outer optimization iterations), and any state information in the optimizer (at each outer optimization iteration). The new serialization code should actually make this pretty easy, but we should check that things are being serialized at the right time with all the needed info.
Critically, we do not want to checkpoint the entire CircuitOutcomeProbabilityArrayLayout (COPALayout): this would be relatively expensive, and it is also specific to the hardware configuration. Better to reconstruct it on the fly from the critical model parameter information.
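To make the split between "saved" and "rebuilt" state concrete, here is a small sketch of what a checkpoint record might hold. Everything in it is an assumption for illustration: `FitCheckpoint`, its field names, and `build_layout` are hypothetical and are not part of pyGSTi, and the round-robin split is only a toy stand-in for real layout construction.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FitCheckpoint:
    """Hypothetical checkpoint record; field names are illustrative only."""
    circuit_list_index: int   # which circuit list iteration was running
    outer_iteration: int      # optimizer outer iterations completed so far
    params: list              # current model parameter vector
    optimizer_state: dict = field(default_factory=dict)  # e.g. a damping factor

    def to_json(self):
        # Only small, hardware-independent data is serialized; no layout.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


def build_layout(circuits, num_workers):
    """Toy stand-in for layout construction: split circuits across workers.

    The split is recomputed on every restart, so the checkpoint stays valid
    even if the restarted job runs with a different number of workers.
    """
    return [circuits[i::num_workers] for i in range(num_workers)]
```

Keeping the layout out of the record means a fit checkpointed on one allocation could, in principle, resume on a different one, at the cost of rebuilding the layout once per restart.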
A first pass at this was merged with #347. We will probably want more in-depth checkpointing in the future, but we will reopen an issue to examine that when it becomes a priority again.