Job get canceled due to time limit #439

Open
yingli-NREL opened this issue Mar 5, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@yingli-NREL

Describe the bug
When running ResStock on Kestrel, some of my jobs failed with a time limit error. A few weeks ago, only a few jobs (around 5 of 50) failed. This week the problem has become more serious: 2 of 3 jobs, or all 3, fail.
The error message in job.out-* is:

DEBUG:2024-03-04 16:35:42:buildstockbatch.base:Using OpenStudio version: 3.7.0 with SHA: d5269793f1
DEBUG:2024-03-04 16:35:42:__main__:Output directory = /kfs2/projects/redlineres/tcm/summer_phoenix_tcm1_0304
slurmstepd: error: *** JOB 2828658 ON x3001c0s33b0n0 CANCELLED AT 2024-03-04T16:45:29 DUE TO TIME LIMIT ***

Even among the jobs that finished successfully, some took about 10 minutes to complete the run:

DEBUG:2024-03-05 12:48:35:buildstockbatch.base:Using OpenStudio version: 3.7.0 with SHA: d5269793f1
DEBUG:2024-03-05 12:48:35:__main__:Output directory = /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305
DEBUG:2024-03-05 12:58:07:__main__:Trimming buildstock.csv
DEBUG:2024-03-05 12:58:07:__main__:Buildstock.csv trimmed to 168 rows.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 104 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of 208 | elapsed:   16.2s remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  49 out of 208 | elapsed:   17.9s remaining:   58.2s
[Parallel(n_jobs=-1)]: Done  73 out of 208 | elapsed:   20.8s remaining:   38.5s
[Parallel(n_jobs=-1)]: Done  97 out of 208 | elapsed:   23.7s remaining:   27.2s
[Parallel(n_jobs=-1)]: Done 121 out of 208 | elapsed:   28.4s remaining:   20.4s
[Parallel(n_jobs=-1)]: Done 145 out of 208 | elapsed:   30.7s remaining:   13.4s
[Parallel(n_jobs=-1)]: Done 169 out of 208 | elapsed:   32.5s remaining:    7.5s
[Parallel(n_jobs=-1)]: Done 193 out of 208 | elapsed:   34.0s remaining:    2.6s
[Parallel(n_jobs=-1)]: Done 208 out of 208 | elapsed:   37.5s finished
INFO:2024-03-05 12:58:45:__main__:Simulation time: 0.63 minutes
INFO:2024-03-05 12:58:45:__main__:Writing results to /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305/results/simulation_output/results_job1.json.gz
INFO:2024-03-05 12:58:45:__main__:Compressing simulation outputs to /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305/results/simulation_output/simulations_job1.tar.gz
INFO:2024-03-05 12:58:46:__main__:batch complete
INFO:2024-03-05 12:58:46:__main__:Cleaning up /tmp/scratch
DEBUG:2024-03-05 12:58:46:__main__:Removing /tmp/scratch/buildstock
DEBUG:2024-03-05 12:58:46:__main__:Removing /tmp/scratch/weather
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/output
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/housing_characteristics
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/openstudio.simg

real    10m31.240s
user    39m14.246s
sys     11m14.233s
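
In the log above, almost all of the wall time falls between the "Output directory" line (12:48:35) and the "Trimming buildstock.csv" line (12:58:07), while the simulations themselves took 0.63 minutes. A minimal sketch for measuring that gap in other job.out-* files, assuming they use the same `LEVEL:timestamp:logger:message` layout; the function name `startup_gap_seconds` is hypothetical and not part of buildstockbatch:

```python
from datetime import datetime


def startup_gap_seconds(log_lines):
    """Return seconds between the first 'Output directory' log line and the
    first 'Trimming buildstock.csv' log line, or None if either is missing."""
    stamps = {}
    for line in log_lines:
        # Expected layout: LEVEL:YYYY-MM-DD HH:MM:SS:logger:message
        parts = line.strip().split(":", 1)
        if len(parts) != 2:
            continue
        rest = parts[1]
        try:
            ts = datetime.strptime(rest[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log line (e.g. joblib or slurmstepd output)
        message = rest[19:]
        if "Output directory" in message:
            stamps.setdefault("start", ts)
        elif "Trimming buildstock.csv" in message:
            stamps.setdefault("trim", ts)
    if "start" in stamps and "trim" in stamps:
        return (stamps["trim"] - stamps["start"]).total_seconds()
    return None


if __name__ == "__main__":
    sample = [
        "DEBUG:2024-03-05 12:48:35:__main__:Output directory = /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305",
        "DEBUG:2024-03-05 12:58:07:__main__:Trimming buildstock.csv",
    ]
    print(startup_gap_seconds(sample))  # 572.0 seconds, about 9.5 minutes
```

A job that was cancelled before reaching the trimming step, like the failed one above, returns `None`.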

Platform:

  • Simulation platform: Kestrel
  • BuildStockBatch version: buildstock-2023.11.0
  • resstock branch: develop

Workaround method
Increase minutes_per_sim to a larger value. For example, I use minutes_per_sim=6 for a simulation with 600 models.
If only a few jobs failed, rerun the failed jobs: https://buildstockbatch.readthedocs.io/en/stable/run_sims.html#re-running-failed-array-jobs
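
As a rough illustration of why a larger minutes_per_sim helps, the sketch below assumes the job time limit scales as ceil(simulations / cores) × minutes_per_sim. That formula, the function names, and the minutes_per_sim=3 comparison value are assumptions for illustration and are not taken from this thread; check how your buildstockbatch version actually derives the time limit. The 208 simulations, 104 workers, and 0.63 minute simulation time come from the successful log above, and 9.5 minutes is the startup gap visible in that log.

```python
import math


def estimated_time_limit_minutes(n_sims, n_cores, minutes_per_sim):
    """ASSUMED formula: each core runs ceil(n_sims / n_cores) simulations
    back to back, each budgeted at minutes_per_sim minutes."""
    rounds = math.ceil(n_sims / n_cores)
    return math.ceil(rounds * minutes_per_sim)


def fits_time_limit(n_sims, n_cores, minutes_per_sim, overhead_minutes, sim_minutes):
    """True if startup overhead plus simulation time fits in the estimated limit."""
    limit = estimated_time_limit_minutes(n_sims, n_cores, minutes_per_sim)
    return overhead_minutes + sim_minutes <= limit


if __name__ == "__main__":
    # 208 simulations on 104 workers, ~9.5 min startup gap, 0.63 min simulating.
    for mps in (3, 6):  # 3 is a hypothetical smaller value; 6 is the workaround value
        limit = estimated_time_limit_minutes(208, 104, mps)
        ok = fits_time_limit(208, 104, mps, overhead_minutes=9.5, sim_minutes=0.63)
        print(f"minutes_per_sim={mps}: limit={limit} min, fits={ok}")
```

Under these assumptions the startup overhead, not the simulation time, decides whether the job fits, so raising minutes_per_sim only pads the limit and does not address the slow startup itself.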

yingli-NREL added the bug label Mar 5, 2024
@afontani
Collaborator

afontani commented Mar 6, 2024

Discussed in the development meeting:

  • There might be issues with the local hard drives that increase the overhead time before the simulations start.
  • We could consider increasing the minimum number of simulations on each node to increase the workload of individual nodes.

@yingli-NREL
Author

For jobs that finished successfully, it's pretty common that Trimming buildstock.csv takes around 8-10 minutes (14 of 21 jobs).

@nmerket
Member

nmerket commented Mar 15, 2024

For jobs that finished successfully, it's pretty common that Trimming buildstock.csv takes around 8-10 minutes (14 of 21 jobs).

Yeah, that shouldn't take that long. Something weird is going on there.

@nmerket
Member

nmerket commented Apr 16, 2024

This might be fixed in #438.
