UFS is stuck running run_fcst task #22
-
Hi, I'm trying to run a simulation on a 1 km regional grid. I create the grid and prepare the data successfully using the SRWeatherApp. When I get to the run_fcst task, the model creates the initial files dynf000.nc and phyf000.nc but doesn't produce any other output. The logs don't receive any updates. The job stays in the queue with running status "R" until the end of the wallclock.

My grid size is 599 x 570, and I'm running it on Cheyenne (Intel) with 72 tasks (on 2 nodes). I'm using the latest develop branch (26b3ac89), but I have also tried it on an older version. If that's helpful, my simulation files are here:

I don't know what to look for. Can I set it up to write more log messages and get a clue about what's happening? How can I check if there's any issue with the 1 km grid? Thanks!
Replies: 7 comments 1 reply
-
It looks like the simulation actually worked. It just took a little while (~1 hour) to write the forecast files. Is there a way to get more logs that will tell me the model is still running?
-
Hi @mefrediani! In your experiment directory, the … There is also a … If you've already checked these files, I can look into whether there are other options for determining the state of the workflow. Just let me know!

For future reference, I'd recommend moving questions about specific applications/repositories to the GitHub Discussions section of those repositories (here for the SRW App). GitHub has better notification options on repositories than on the UFS organizational space, so we see/respond more quickly to questions posted in the repositories we manage (i.e., SRW, WM, UPP, Land DA).

Best,
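One generic way to tell whether a run is still alive, independent of which log file the workflow writes: check for recently modified files under the task's log directory. A minimal sketch — `LOGDIR` is a placeholder you'd point at your experiment's `log/` directory:

```shell
# Sketch: count files modified in the last 5 minutes under a log directory.
# LOGDIR is a placeholder -- point it at your experiment's log/ directory.
LOGDIR=${LOGDIR:-.}
recent=$(find "$LOGDIR" -type f -mmin -5 | wc -l)
echo "files updated in the last 5 minutes: $recent"
```

If the count stays at zero for a long time while the job shows "R", that's a hint the model is stuck (or that its output is being buffered somewhere upstream of the log file).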
-
@gspetro-NOAA thanks for your reply. I'm not entirely sure to whom I should direct my question... I suppose the primary question is whether I can increase the messages from the model. In my simulation, the model ran for an entire hour without any update to the logs. I had no way of knowing whether any calculations were being done or whether it was completely stuck.

With a previous SRW App version, the run_fcst task was terminating when it reached the wallclock limit. The only information among the files created was related to the model initialization. It looks like the file logfile.000000.out is created at initialization, and the other log files for each of the forecast hours are printed after the forecast is integrated. It seems that the model was taking more than 2 hours (the old default wallclock time) to integrate 1 forecast hour, and I had no way of knowing whether the model was actually running or just stuck the entire time.

With the latest SRW App (and the corresponding FV3 branch specified in Externals.cfg), the model now takes about 1 hour to integrate 1 forecast hour. Yet I still have to wait 1 hour to know the model is running correctly. I would like to see more information about what happens in the time between the log.atm.f00* files. Does that make sense?
-
Actually, I just realized the issue. It may indeed be related to the workflow. It turns out the model logs are saved in the log file for the run_fcst_* task. However, I don't have this file available before the end of the task. Is there a way to force an update of this file from time to time, or to have it printed as it is written?
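If the delay comes from userspace output buffering, one common trick is GNU coreutils `stdbuf`, which forces line-buffered output for the command it wraps. Whether this helps here depends on how the model executable itself buffers its I/O (Fortran runtimes often do their own buffering) and on whether the batch system spools stdout, so treat this as a sketch rather than a guaranteed fix — `./model_exe` below is a stand-in for the actual forecast executable:

```shell
# stdbuf -oL / -eL force line-buffered stdout/stderr for the wrapped command,
# so each line reaches the log as soon as it is printed rather than in large
# buffered chunks. "./model_exe" is a hypothetical stand-in for the real
# forecast executable launched by the run_fcst task.
stdbuf -oL -eL ./model_exe >> run_fcst.log 2>&1
```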
-
@gspetro-NOAA I currently have dt_atmos set to 36, and the SRW App workflow log is not written before the completion of the job. Only when the job terminates is the information for all timesteps available in the log file. Could this be specific to Cheyenne?

You can see in the screenshot below that this job has been running for 18 minutes, but the log file log/run_fcst_mem000_2012062518.log has not been created:

The command you suggested doesn't work because the file has not been created:
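A side note on following a log that doesn't exist yet: `tail -F` (capital F) keeps retrying until the file appears, whereas `tail -f` fails immediately when the path is missing. A small self-contained demo, assuming GNU `timeout` and `mktemp` are available (as on most Linux systems):

```shell
# Demo: tail -F waits for a file that does not exist yet and prints new
# lines once it appears; timeout stops the demo after 3 seconds.
logfile=$(mktemp -u)                        # a path with no file behind it yet
( sleep 1; echo "step 1 done" >> "$logfile" ) &
timeout 3 tail -F -n +1 "$logfile" 2>/dev/null || true
rm -f "$logfile"
```

This only helps once the scheduler actually creates and flushes the file, of course; if the batch system spools the task's stdout until job completion, no amount of tailing will show it earlier.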
-
@mefrediani - Do you still have any unresolved issues? There is no logs directory anymore under your /glade/work/frediani/Projects/UFS-fire/UFS/develop-ufs-srw/expt_dirs/LastChance_RAP_dev_RRFS/. |
-
@natalie-perlin thanks for the tips. This discussion was helpful and I understand it a bit better now. I'll reach out again if I have other questions in the future. Please feel free to close this thread :) |
@mefrediani The default workflow should be printing info to the run_fcst_* task log for each forecast time step. (Each forecast hour is broken up into 3600/DT_ATMOS timesteps.) For example, you'd see:
If your DT_ATMOS is 3600, then it would only update once per forecast hour, but for anything smaller, it should update more regularly. It is also possible that, if output is being sent into the log file faster than the machine can handle it, the log file is buffered until the issue clears up (after each forecast hour). So you can check your value of DT_ATMOS and/or…