You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
SRW contains the config variable OMP_NUM_THREADS_RUN_FCST which should activate threading when set greater than 1 (default in SRW is 2, so should be every run unless specified otherwise).
Current behavior
Threading is not activated for the weather model, no matter what OMP_NUM_THREADS_RUN_FCST is set to. You can observe this by opening the run_fcst log file for any SRW run; here is just one example from the WE2E test grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0:
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
STARTING DATE-TIME OCT 14,2024 17:32:13.156 288 MON 2460598
...
...
atmosphere_init: current_time_seconds = 0.0
Using n_split from the namelist: 005
Off center implicit scheme param= 1.000000
p_fac= 0.1000000
Using n_sponge : 024
Using non_ortho : T
Starting PEs : 25
Starting Threads : 1
Cubed-sphere, single face domain decomposition
whalo = 3, ehalo = 3, shalo = 3, nhalo = 3
...
Note the line "Starting Threads : 1", which indicates only a single thread is being used, despite us requesting 2.
Machines affected
Confirmed on Orion, Hera, and Derecho, but this should affect all machines.
Steps To Reproduce
Run any SRW test that contains a run_fcst task
View the run_fcst log file, and confirm that threading is not being used, despite OMP_NUM_THREADS_RUN_FCST=2 being the default
Detailed Description of Fix
I honestly don't know how to fix this. I am confused by the latest instructions on how to set OMP_NUM_THREADS for the weather model. Currently, the app sets the threading value in the model_configure file (atmos_nthreads: 1, but this variable has been deprecated for years (as a side note, so has PE_MEMBER01, which is also still being set) so the setting has no effect. However, if I follow the instructions in the weather model users guide (setting ATM_omp_num_threads: 2 in ufs.configure), I get behavior that I don't understand from the model: it seems to be using way fewer MPI tasks than expected, and so fails with a corresponding error message:
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
STARTING DATE-TIME OCT 14,2024 17:23:00.856 288 MON 2460598
...
...
atmosphere_init: current_time_seconds = 0.0
Using n_split from the namelist: 005
Off center implicit scheme param= 1.000000
p_fac= 0.1000000
Using n_sponge : 024
Using non_ortho : T
Starting PEs : 11
Starting Threads : 2
FATAL from PE 3: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 1: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 2: mpp_domains_define.inc: not all the pe_end are in the pelist
FATAL from PE 4: mpp_domains_define.inc: not all the pe_end are in the pelist
...
...
Note the entry Starting PEs : 11, which doesn't make any sense to me with the settings provided (identical to the first test except for setting ATM_omp_num_threads: 2 in ufs.configure) and with 30 cores allocated.
Clearly there is something I am missing, but I don't have the time to dig deeper into this. However, this is something that should be solved sooner rather than later, as it likely represents a huge waste of resources due to not using threads that we expect to use (and likely impacts any external groups using recent SRW code to run the weather model).
Additional Information (optional)
I believe this bug was introduced in #1050, but I can not confirm that that is the case.
The text was updated successfully, but these errors were encountered:
After seeing this issue, I checked the results of my fundamental tests and saw the same behavior, atmos_nthreads: 2 is in the settings, but the model only uses a single thread, Starting Threads : 1. Changing to use the pelist and atm_omp_num_threads, I encountered the same FATAL from PE 3: mpp_domains_define.inc: not all the pe_end are in the pelist error messages that you had. Looking in weather model documentation, in order to use threading, you need to set the total number of tasks to OMP_NUM_THREADS_RUN_FCST * (LAYOUT_Y * LAYOUT_X + WRTCMP_write_groups * WRTCMP_write_tasks_per_group) (in other words, you need to multiply the number of tasks by the number of threads). With this change, I was able to successfully run the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 WE2E test with 2 threads.
The erroneous single thread test ran with the following settings:
Starting PEs : 10
Starting Threads : 1
while the corrected 2 thread test ran with:
Starting PEs : 10
Starting Threads : 2
I will go ahead and get this work merged into my current feature/hash_update branch so that it can be included in PR #1136.
Working with Rusty and Dusan, I do have an answer to the reason why the starting PEs section isn't using your prescribed 30 cores. Currently, the SRW is using ESMF managed threading. While ESMF managed threading is turned on, the starting PEs will be set to the product of LAYOUT_X and LAYOUT_Y. The various mpp_domains_define.inc:-based error messages are due to the the value not being the product of LAYOUT_X and LAYOUT_Y.
Expected behavior
SRW contains the config variable
OMP_NUM_THREADS_RUN_FCST
which should activate threading when set greater than 1 (default in SRW is 2, so should be every run unless specified otherwise).Current behavior
Threading is not activated for the weather model, no matter what
OMP_NUM_THREADS_RUN_FCST
is set to. You can observe this by opening the run_fcst log file for any SRW run; here is just one example from the WE2E testgrid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
:Note the line "
Starting Threads : 1
", which indicates only a single thread is being used, despite us requesting 2.Machines affected
Confirmed on Orion, Hera, and Derecho, but this should affect all machines.
Steps To Reproduce
OMP_NUM_THREADS_RUN_FCST=2
being the defaultDetailed Description of Fix
I honestly don't know how to fix this. I am confused by the latest instructions on how to set OMP_NUM_THREADS for the weather model. Currently, the app sets the threading value in the
model_configure
file (atmos_nthreads: 1
, but this variable has been deprecated for years (as a side note, so hasPE_MEMBER01
, which is also still being set) so the setting has no effect. However, if I follow the instructions in the weather model users guide (settingATM_omp_num_threads: 2
inufs.configure
), I get behavior that I don't understand from the model: it seems to be using way fewer MPI tasks than expected, and so fails with a corresponding error message:Note the entry
Starting PEs : 11
, which doesn't make any sense to me with the settings provided (identical to the first test except for settingATM_omp_num_threads: 2
inufs.configure
) and with 30 cores allocated.Clearly there is something I am missing, but I don't have the time to dig deeper into this. However, this is something that should be solved sooner rather than later, as it likely represents a huge waste of resources due to not using threads that we expect to use (and likely impacts any external groups using recent SRW code to run the weather model).
Additional Information (optional)
I believe this bug was introduced in #1050, but I can not confirm that that is the case.
The text was updated successfully, but these errors were encountered: