UFS is stuck running run_fcst task #22
-
Hi, I'm trying to run a simulation on a 1 km regional grid. I create the grid and prepare the data successfully using the SRWeatherApp. When I get to the run_fcst task, the model creates the initial files dynf000.nc and phyf000.nc but doesn't produce any other output. The logs don't receive any updates. The job stays in the queue with running status "R" until the end of the wallclock.

My grid size is 599 x 570, and I'm running it on Cheyenne (Intel) with 72 tasks (on 2 nodes). I'm using the latest develop branch (26b3ac89), but I have also tried it on an older version. If that's helpful, my simulation files are here:

I don't know what to look for. Can I set it up to write more log messages and get a clue about what's happening? How can I check if there's any issue with the 1 km grid? Thanks!
Replies: 7 comments 1 reply
-
It looks like the simulation actually worked. It just took a little while (~1 hour) to write the forecast files. Is there a way to get more logs that will tell me the model is still running?
-
Hi @mefrediani! In your experiment directory, the … There is also a … If you've already checked these files, I can look into whether there are other options for determining the state of the workflow. Just let me know!

For future reference, I'd recommend moving questions about specific applications/repositories to the GitHub Discussions section of those repositories (here for the SRW App). GitHub has better notification options on repositories than on the UFS organizational space, so we see/respond more quickly to questions posted in the repositories we manage (i.e., SRW, WM, UPP, Land DA).

Best,
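One generic way to tell whether a run is still alive, independent of which log file the workflow writes: check for recently modified files under the task's log directory. A minimal sketch — `LOGDIR` is a placeholder you'd point at your experiment's `log/` directory:

```shell
# Sketch: count files modified in the last 5 minutes under a log directory.
# LOGDIR is a placeholder -- point it at your experiment's log/ directory.
LOGDIR=${LOGDIR:-.}
recent=$(find "$LOGDIR" -type f -mmin -5 | wc -l)
echo "files updated in the last 5 minutes: $recent"
```

If the count stays at zero for a long time while the job shows "R", that's a hint the model is stuck (or that its output is being buffered somewhere upstream of the log file).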
-
@gspetro-NOAA thanks for your reply. I'm not entirely sure to whom I should direct my question... I suppose the primary question is whether I can increase the messages from the model. In my simulation, the model ran for an entire hour without any update to the logs. I had no way of knowing whether any calculations were being done or whether it was completely stuck.

With a previous SRW App version, the run_fcst task was terminating when it reached the wallclock limit. The only information among the files created was related to the model initialization. It looks like the file logfile.000000.out is created at initialization, and the other log files for each of the forecast hours are printed after the forecast is integrated. It seems that the model was taking more than 2 hours (the old default wallclock time) to integrate 1 forecast hour, and I had no way of knowing whether the model was actually running or just stuck the entire time.

With the latest SRW App (and the corresponding FV3 branch specified in Externals.cfg), the model now takes about 1 hour to integrate 1 forecast hour. Yet I still have to wait 1 hour to know the model is running correctly. I would like to see more information about what happens in the time between the log.atm.f00* files. Does that make sense?
-
Actually, I just realized the issue. It may indeed be related to the workflow. It turns out the model logs are saved in the log file for the run_fcst_* task. However, I don't have this file available before the end of the task. Is there a way to force an update of this file from time to time, or to have it printed as it is written?
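If the delay comes from userspace output buffering, one common trick is GNU coreutils `stdbuf`, which forces line-buffered output for the command it wraps. Whether this helps here depends on how the model executable itself buffers its I/O (Fortran runtimes often do their own buffering) and on whether the batch system spools stdout, so treat this as a sketch rather than a guaranteed fix — `./model_exe` below is a stand-in for the actual forecast executable:

```shell
# stdbuf -oL / -eL force line-buffered stdout/stderr for the wrapped command,
# so each line reaches the log as soon as it is printed rather than in large
# buffered chunks. "./model_exe" is a hypothetical stand-in for the real
# forecast executable launched by the run_fcst task.
stdbuf -oL -eL ./model_exe >> run_fcst.log 2>&1
```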
-
@gspetro-NOAA I currently have dt_atmos set to 36, and the SRW App workflow log is not written before the completion of the job. Only when the job terminates is the information for all timesteps available in the log file. Could this be specific to Cheyenne?

You can see in the screenshot below that this job has been running for 18 minutes, but the log file log/run_fcst_mem000_2012062518.log has not been created:

The command you suggested doesn't work because the file has not been created:
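A side note on following a log that doesn't exist yet: `tail -F` (capital F) keeps retrying until the file appears, whereas `tail -f` fails immediately when the path is missing. A small self-contained demo, assuming GNU `timeout` and `mktemp` are available (as on most Linux systems):

```shell
# Demo: tail -F waits for a file that does not exist yet and prints new
# lines once it appears; timeout stops the demo after 3 seconds.
logfile=$(mktemp -u)                        # a path with no file behind it yet
( sleep 1; echo "step 1 done" >> "$logfile" ) &
timeout 3 tail -F -n +1 "$logfile" 2>/dev/null || true
rm -f "$logfile"
```

This only helps once the scheduler actually creates and flushes the file, of course; if the batch system spools the task's stdout until job completion, no amount of tailing will show it earlier.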
-
@mefrediani - Do you still have any unresolved issues? There is no logs directory anymore under your /glade/work/frediani/Projects/UFS-fire/UFS/develop-ufs-srw/expt_dirs/LastChance_RAP_dev_RRFS/. |
-
@natalie-perlin thanks for the tips. This discussion was helpful and I understand it a bit better now. I'll reach out again if I have other questions in the future. Please feel free to close this thread :) |
@mefrediani The default workflow should be printing info to the run_fcst_* task log for each forecast time step. (Each forecast hour is broken up into 3600/DT_ATMOS timesteps.) For example, you'd see:
If your DT_ATMOS is 3600, then it would only update once per forecast hour, but for anything smaller, it should update more regularly. It is also possible that, if output is being sent into the log file faster than the machine can handle it, the log file is buffered until the issue clears up (after each forecast hour). So you can check your value of DT_ATMOS and/or…