Eagle Workflow Troubleshooting

Nathan Moore edited this page Mar 17, 2021 · 23 revisions

Optimizing YML for simulation time

Wall Time thresholds:

  • <= 4 hours: Short nodes, which may get scheduled sooner (depending on how busy Eagle's short nodes are relative to the regular nodes) and therefore finish earlier.
  • <= 48 hours: Standard nodes
  • <= 240 hours: Long nodes.

No run is allowed to take more than 240 hours Wall Time. Visit the HPC website for canonical info, such as max nodes/user.
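The thresholds above can be sketched as a small Python helper. The function name and the queue labels "short"/"standard"/"long" just mirror the wording on this page; check the HPC documentation for the actual partition names:

```python
def eagle_partition(walltime_hours):
    """Map a requested wall time (hours) to the queue tiers described above."""
    if walltime_hours <= 4:
        return "short"
    if walltime_hours <= 48:
        return "standard"
    if walltime_hours <= 240:
        return "long"
    # No run is allowed to take more than 240 hours of wall time.
    raise ValueError("wall time exceeds the 240-hour limit")
```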

  • n_datapoints is number of buildings
    • if resample: false, downselect during sampling will reduce this number
    • if resample: true, this is roughly the number of buildings after downselect
  • n_jobs is number of Eagle nodes
  • 36 is the number of processor cores in each node
  • Multifamily (MF) homes take much longer to simulate than single-family (SF) homes. A minutes_per_sim of 2 is appropriate for SF; 30 is appropriate for MF.
    • TimeOut errors can be caused by the sample unexpectedly including MF homes. Downselecting in your yml file may avoid this.

Calculating AU usage:

An upper limit is: AUs = 3 * ((n_datapoints * n_upgrades * minutes_per_sim) / cores_per_node + sampling.time + postprocessing.time * (postprocessing.n_workers + 1)) / minutes_per_hour
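The formula above translates directly to Python. The function name and the example numbers here are illustrative only (they are not from the page); the 3 AU/node-hour factor and 36 cores per node are as stated above, and all times are in minutes:

```python
def au_upper_limit(n_datapoints, n_upgrades, minutes_per_sim,
                   sampling_minutes, postprocessing_minutes, n_workers,
                   cores_per_node=36):
    """Upper-limit AU estimate: 3 AUs per node-hour of total elapsed time."""
    sim_minutes = n_datapoints * n_upgrades * minutes_per_sim / cores_per_node
    pp_minutes = postprocessing_minutes * (n_workers + 1)
    return 3 * (sim_minutes + sampling_minutes + pp_minutes) / 60

# Hypothetical run: 10,000 SF datapoints, 2 upgrades, 2 min/sim,
# 30 min of sampling, 60 min of postprocessing on 2 workers.
aus = au_upper_limit(10_000, 2, 2, 30, 60, 2)  # roughly 66 AUs
```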

To get more accurate estimates, try the following:

  • Look at run results from similar, successful runs. The job.out, sampling.out, and postprocessing.out files will have elapsed time in the last few lines (in minutes).
  • job.out files are split fairly evenly, so looking at 1-2 of them and scaling up, then adding the sampling and postprocessing times, should work fine
  • grep -E "^real\s+[0-9]+m" job.out-* in the directory with the output files will return the elapsed time from all job files
  • For a different sized run of similar complexity, scale results by the total number of simulations.
  • AUs used = 3 * (total elapsed time in hours)
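Putting the steps above together, a minimal sketch that parses the `real  12m34.5s` lines out of job.out contents and applies the 3 AU/hour rate (the function name and the sample strings are hypothetical):

```python
import re

def total_elapsed_hours(job_out_texts):
    """Sum the elapsed minutes reported on each file's `real NNmSS.Ss` line."""
    total_minutes = 0.0
    for text in job_out_texts:
        m = re.search(r"^real\s+(\d+)m([\d.]+)s", text, re.MULTILINE)
        if m:
            total_minutes += int(m.group(1)) + float(m.group(2)) / 60
    return total_minutes / 60

# Hypothetical tails of three job.out-* files:
outputs = ["...\nreal\t45m12.0s", "...\nreal\t47m30.0s", "...\nreal\t44m18.0s"]
aus = 3 * total_elapsed_hours(outputs)  # AUs used = 3 * elapsed hours
```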

Calculating WallTime:

walltime(hours) = math.ceil((n_datapoints * n_upgrades * minutes_per_sim) / (n_jobs * minutes_per_hour * cores_per_node))
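As runnable Python, with the 36 cores/node and 60 minutes/hour constants filled in (the function name and example numbers are illustrative):

```python
import math

def walltime_hours(n_datapoints, n_upgrades, minutes_per_sim, n_jobs,
                   cores_per_node=36):
    """Wall time needed when simulations are spread across n_jobs nodes."""
    return math.ceil(n_datapoints * n_upgrades * minutes_per_sim
                     / (n_jobs * 60 * cores_per_node))

# Hypothetical run: 10,000 datapoints, 2 upgrades, 2 min/sim on 5 nodes:
hours = walltime_hours(10_000, 2, 2, 5)  # 4 hours
```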

Largest n_datapoints to not exceed a given WallTime:

n_datapoints = math.floor((walltime(hours) * minutes_per_hour * cores_per_node * n_jobs) / (minutes_per_sim * n_upgrades))
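The same inversion in Python (function name and example numbers are illustrative; note it is consistent with the walltime formula above):

```python
import math

def max_n_datapoints(walltime_hours, n_jobs, minutes_per_sim, n_upgrades,
                     cores_per_node=36):
    """Largest n_datapoints that fits in the given wall time on n_jobs nodes."""
    return math.floor(walltime_hours * 60 * cores_per_node * n_jobs
                      / (minutes_per_sim * n_upgrades))

# Hypothetical: 4-hour wall time, 5 nodes, 2 min/sim, 2 upgrades:
n = max_n_datapoints(4, 5, 2, 2)  # 10800 datapoints
```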

Smallest n_jobs to not exceed a given WallTime (adding nodes only shortens the run, so this bound is a minimum and rounds up):

n_jobs = math.ceil((n_datapoints * n_upgrades * minutes_per_sim) / (walltime(hours) * minutes_per_hour * cores_per_node))
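In Python (function name and example numbers are illustrative). Because adding nodes only shortens the run, the node count that hits a given wall time is a lower bound and rounds up; the result below is consistent with the walltime sketch earlier, where 5 nodes yield a 4-hour run:

```python
import math

def min_n_jobs(walltime_hours, n_datapoints, n_upgrades, minutes_per_sim,
               cores_per_node=36):
    """Fewest nodes that keep the run within the given wall time."""
    return math.ceil(n_datapoints * n_upgrades * minutes_per_sim
                     / (walltime_hours * 60 * cores_per_node))

# Hypothetical: 10,000 datapoints, 2 upgrades, 2 min/sim, 4-hour wall time:
nodes = min_n_jobs(4, 10_000, 2, 2)  # 5 nodes
```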