Checkpoint batch job on cloud implementations #433

Open
nmerket opened this issue Feb 21, 2024 · 0 comments
Labels
aws enhancement New feature or request

nmerket commented Feb 21, 2024

We use the spot market to save money on batch simulations. The problem is that spot jobs can be interrupted. Right now we just restart each job from the beginning when that happens, but that causes problems in ComStock with larger building models and longer-running jobs. There is a way to get a warning and checkpoint our work within a job. AWS has a blog post about it; the "inside a container on ECS" section is the most relevant. Basically, we catch the SIGTERM signal using Python's signal module and save our progress to S3. Then, when the job is retried, it checks for that progress and picks up where it left off.
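A rough sketch of that catch-SIGTERM-then-resume flow, to make the idea concrete. Everything named here is hypothetical and not existing project code: `CheckpointedJob`, `install_sigterm_handler`, and the `store` argument, which is a plain dict-like stand-in for S3 so the sketch runs without AWS credentials. It also assumes the job can be split into independent, integer-indexed work items (for example, one simulation each).

```python
import json
import signal


class CheckpointedJob:
    """Runs work items in order and records which ones have finished."""

    def __init__(self, store, key):
        self.store = store  # hypothetical dict-like stand-in for an S3 bucket
        self.key = key      # hypothetical checkpoint key for this job
        self.stop_requested = False
        # On a retry, pick up the checkpoint left by the previous attempt.
        saved = store.get(key)
        self.done = set(json.loads(saved)) if saved else set()

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag; the main loop does the saving.
        self.stop_requested = True

    def save(self):
        self.store[self.key] = json.dumps(sorted(self.done))

    def run(self, items, work):
        """Return True if every item finished, False if interrupted."""
        for item in items:
            if item in self.done:
                continue  # finished in an earlier attempt
            if self.stop_requested:
                self.save()
                return False
            work(item)
            self.done.add(item)
        self.save()
        return True


def install_sigterm_handler(job):
    # Must be called from the main thread of the container process.
    signal.signal(signal.SIGTERM, job.request_stop)
```

In a real job, `store` would presumably be replaced by reads and writes against S3, and the checkpoint would also need to cover the results produced so far, not just the list of finished items. The stop flag is only checked between items, so a single long-running simulation could still be lost if it does not finish within the warning window.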

cc @asparke2

nmerket added the enhancement and aws labels Feb 21, 2024