HRv4 hangs on orion and hercules #2486
This happens at high ATM resolution C1152.
I made an HRv4 test run on orion as well. As reported previously, it hung at the beginning of the run. The log file is at /work2/noaa/stmp/rsun/ROTDIRS/HRv4, HOMEgfs=/work/noaa/global/rsun/git/global-workflow.hr.v4 (source).
@RuiyuSun Denise reports that the privacy settings on your directories are preventing her from accessing them. Could you check on that and report back when it's fixed so others can look at your forecast?
@DeniseWorthen I made the changes. Please try again.
I've made a few test runs on my end and here are some observations:
Consistently, all runs I have made (the same as @RuiyuSun's runs) stall out here:
With high-resolution runs (C768 & C1152) on various machines we've had to use different numbers of write grid tasks. I've tried a few and all are stalling though. This is using ESMF managed threading, so one thing to try might be moving away from that. To run a high-res test case:
Change C1152 to C768 to run that resolution, and also change your HPC_ACCOUNT and pslot as desired. If you want to turn off waves, change that in C1152_S2SW.yaml. If you want to change resources, look in global-workflow/parm/config/gfs/config.ufs in the C768/C1152 section. If you want to run S2S only, change the app in global-workflow/ci/cases/hires/C1152_S2SW.yaml. My latest run log files can be found at:
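For reference, the resource knobs mentioned above live in the C768/C1152 section of global-workflow/parm/config/gfs/config.ufs, which is a sourced shell config. A minimal sketch of that stanza follows; the layout values echo the 24x16 decomposition quoted later in this thread, and the write-component numbers are only illustrative, not the settings from these particular runs:

case "${fv3_res}" in
  "C1152")
    # per-tile decomposition: layout_x * layout_y ranks on each of the 6 cubed-sphere tiles
    export layout_x_gfs=24
    export layout_y_gfs=16
    # write-component sizing: the knobs this thread has been experimenting with
    export WRITE_GROUP_GFS=2
    export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10
    ;;
esac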
@GeorgeVandenberghe-NOAA suggested trying 2 write groups with 240 tasks in them. I meant to try that but unintentionally tried 2 write groups with 360 tasks per group; I did turn on all PET files, as @LarissaReames-NOAA thought that might have helpful info. The run directory is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800 The log file is here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t06/COMROOT/C1152t06/logs/2019120300/gfs_fcst_seg0.log The PET logs also point, to me, to write group issues. Any help with this would be greatly appreciated. Tagging @aerorahul for awareness.
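For context, the write-group sizing being varied here comes from model_configure, and the per-PET ESMF logs are controlled from ufs.configure. A rough sketch of the relevant lines, where the counts are the ones quoted above and logKindFlag is my assumption for how the PET files were switched on:

# model_configure -- write component sizing for this test
write_groups:          2
write_tasks_per_group: 360

# ufs.configure -- one ESMF log file per PET instead of a single log
logKindFlag: ESMF_LOGKIND_MULTI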
Thanks to everyone for the work on this. Has anyone tried this configuration with the write component off? That might help isolate where the problem is (hopefully) and then we can direct this accordingly for further debugging.
I have not tried this without the write component.
@JessicaMeixner-NOAA and others, I grabbed the run directory from the last experiment you ran (/work2/noaa/marine/jmeixner/wavesforhr5/test01/STMP/RUNDIRS/C1152t06/gfs.2019120300/gfsfcst.2019120300/fcst.272800), changed it to run just the ATM component, and converted it to run with traditional threading. It is currently running in /work2/noaa/stmp/djovic/stmp/fcst.272800, and it passed the initialization phase and finished writing the 000 and 003 hour outputs successfully. I submitted the job with just a 30 min wall-clock time limit, so it will fail soon. I suggest you try running the full coupled version with traditional threading if it's easy to reconfigure.
Some good news: my 48-hr run finished.
@DusanJovic-NOAA I tried running without ESMF threading, but am struggling to get it set up correctly and go through. @aerorahul is it expected that turning off ESMF managed threading in the workflow should work? I'm also trying on hercules to replicate @jiandewang's success, but with S2SW.
I also launched one S2SW, but it's still in pending status.
WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 with S2S did not work on orion: /work2/noaa/marine/jmeixner/wavesforhr5/test01/C1152t03/COMROOT/C1152t03/logs/2019120300/gfs_fcst_seg0.log
Mine is on hercules.
@JessicaMeixner-NOAA my gut feeling is the issue is related to memory per node; hercules has more than orion. Maybe you can try 5 on orion.
Traditional threading is not yet supported in the global-workflow as an option. We have the toggle for it, but it requires a different set of ufs_configure files, and I think we are waiting for that kind of work to be in the ufs-weather-model repo. @DusanJovic-NOAA
I only changed ufs.configure:

1. removed all components except ATM
2. changed globalResourceControl: from true to false
3. changed ATM_petlist_bounds: to be 0 3023 - these numbers are the lower and upper bounds of the MPI ranks used by the ATM model, in this case 24*16*6 + 2*360, where 24 and 16 are the layout values from input.nml and 2*360 are the write component values from model_configure

And I added a job_card by copying one of the job_card files from a regression test run and changed:

1. export OMP_NUM_THREADS=4 - where 4 is the number of OMP threads
2. srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x - here 3024 is the number of MPI ranks and 4 is the number of threads
3. #SBATCH --nodes=152 and #SBATCH --ntasks-per-node=80 - 80 is the number of cores on hercules compute nodes, and 152 is the minimal number of nodes such that 152*80 >= 3024*4
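Pulling those edits together, a compact sketch of the two file fragments described above; the rank arithmetic follows the description (24*16*6 = 2304 compute ranks plus 2*360 = 720 write ranks gives 3024 ranks, and 3024 ranks * 4 threads at 80 cores per node rounds up to 152 nodes), while the surrounding keys not named in this thread are assumptions:

# ufs.configure (ATM-only, traditional threading) -- sketch
globalResourceControl: false          # disable ESMF-managed threading
EARTH_component_list:  ATM
ATM_model:             fv3
ATM_petlist_bounds:    0 3023         # 24*16*6 compute + 2*360 write = 3024 MPI ranks
ATM_omp_num_threads:   1              # likely ignored when globalResourceControl is false

# job_card (Hercules, 80 cores per node) -- sketch
#SBATCH --nodes=152                   # ceil(3024 ranks * 4 threads / 80 cores per node)
#SBATCH --ntasks-per-node=80
export OMP_NUM_THREADS=4
srun --label -n 3024 --cpus-per-task=4 ./ufs_model.x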
Ok. Yes. That makes sense for the atm-only. Does your ufs.configure have a line for ATM_omp_num_threads: @[atm_omp_num_threads]? @[atm_omp_num_threads] would have been 4. Did you remove it? Or does it not matter since globalResourceControl is set to false? The original value for ATM_petlist_bounds must have been 0 755, which you changed to 0 3023, I am assuming.
OMP_NUM_THREADS performance is inconsistent and generally poor if ATM_omp_num_threads: @[atm_omp_num_threads] is not removed when ESMF managed threading is set to false.
I just fixed my comment about ATM_omp_num_threads:. I set it to 1, from 4; I'm not sure if it's ignored when globalResourceControl is set to false. The original value for ATM_petlist_bounds was something like 12 thousand or so, because it included MPI ranks times 4 threads.
Yes, ESMF managed threading requires several times more ranks, and ESMF fails when the rank count goes above 21000 or so. This is a VERY serious issue for resolution increases unless it is fixed; it was reported in February.
@JessicaMeixner-NOAA I think the global-workflow is coded to use the correct ufs_configure template and set the appropriate values for PETLIST_BOUNDS and OMP_NUM_THREADS in the ufs_configure file. The default in the global-workflow is to use ESMF_THREADING = YES. I am pretty sure one could use traditional threading as well, but that is unconfirmed, as there was still work being done to confirm traditional threading will work on WCOSS2 with the Slingshot updates and whatnot. Details on that are fuzzy to me at the moment. BLUF, you/someone from the applications team could try traditional threading and we could gain some insight on performance at those resolutions. Thanks~
I have MANY test cases that use traditional threading and have converted others from managed to traditional threading. It's generally needed at high resolution to get decent run rates.
Ok, @GeorgeVandenberghe-NOAA. Where do we employ traditional threading at C768 and up? If we do, we can set a flag in the global-workflow for those resolutions to use traditional threading. It should be easy enough to set that up.
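If such a switch were added, one hypothetical way to gate it on resolution inside the workflow's shell configs; the flag name reuses the ESMF_THREADING wording from earlier in the thread, but the variable, values, and file placement here are purely illustrative, not existing global-workflow settings:

# hypothetical snippet for a shell config such as config.ufs
case "${fv3_res}" in
  "C768" | "C1152")
    export ESMF_THREADING="NO"    # fall back to traditional (OpenMP) threading at high resolution
    ;;
  *)
    export ESMF_THREADING="YES"   # keep ESMF-managed threading as the default elsewhere
    ;;
esac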
I don't know, because I usually get CWD test cases from others and work from there, but yes, that's an excellent idea. We probably should also use a multi-stanza MPI launcher for the different components to minimize core wastage for components that don't thread, particularly WAVE.
Unfortunately I was unable to replicate @jiandewang's hercules success for the HR4 tag with the top of develop. Moreover, 10 write tasks per group was not a lucky number for orion either.
Note this was with waves added, so this might have also failed for @jiandewang if he had used waves.
Summary of more tests I did on HERCULES:
Does anyone know WHY the coupled configurations are hanging? Does anyone have the knowledge to drill into all components and find where, and in which routine, one is stuck? Our older forecast models had this property: the MPI "timbers" were clearly exposed and we could see which line of code was stuck or silently failing (silent failures of one rank can also cause hangs).
On orion, turning off the write grid component means that we're now hanging during wave initialization. The log file can be found here: /work2/noaa/marine/jmeixner/wavesforhr5/test01/nowritegridt2/COMROOT/test01/logs/2019120300/gfs_fcst_seg0.log The instructions from @aerorahul on running the workflow without the write grid component are: in config.base, set QUILTING=.false.
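For anyone repeating this test, that change amounts to a single setting in the workflow's config.base (shown here as a sketch; config.base is a shell-style config, so the exact quoting may differ in your checkout):

# config.base -- disable the write grid (quilting) component for this experiment
export QUILTING=".false."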
Thanks @aerorahul for the instructions and @JessicaMeixner-NOAA for running the test! This is helpful. Give me a moment and I'll figure out next steps.
At the risk of muddying the waters, I copied the WW3 mod_def and mesh from Jessica's run directory and was able to start up and run the model on hercules using the DATM-S2SW configuration I've built. This is WW3 at the top of the current dev/ufs-weather-model, and WAV is on 592 tasks.
So should the only difference between your experiment and Jessica's be Hercules vs. Orion? Or is the compile/run config different? Trying to narrow down potential sources of difference.
My case uses a DATM with MOM6+CICE6 on a 1/4 deg tripole grid and a given WW3 configuration (structured, unstructured, etc.), so it essentially eliminates any fv3atm-related issues.
If someone can provide me a canned run directory on either Hercules or Orion, I can see if I can figure out what is going on. But I'll need to pause work on issue #2466 to do so.
@DeniseWorthen @JacobCarley-NOAA A canned case was created on Orion; please see below for the information. This is my first time creating a canned case, so please let me know if anything is missing.
On Hercules, I have done my best to set up 2 canned cases, C768 and C1152. Both cases are set to run out to 12 hours on Hercules. The cases have job cards, module files, and environment variables set as they would have been via the workflow. For the C1152 case:

cp -R /work/noaa/stmp/rmahajan/HERCULES/RUNDIRS/sandbox/c1152s2sw ./c1152s2sw.run
cd ./c1152s2sw.run
sbatch c1152s2sw_gfsfcst.sh

This should work as long as the user has access. My sample run directories are in:
@aerorahul Thanks. One question I did have: which options are used to compile? I'm assuming what we are compiling is 32-bit?
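On the 32-bit question: the ufs-weather-model build exposes this through CMake, so a hand build of the coupled app would look roughly like the sketch below. The app name, physics suite, and flag combination are my assumptions for illustration, not confirmation of how these canned cases were actually built:

# sketch: configuring a 32-bit-dynamics S2SW build of ufs-weather-model by hand
cmake /path/to/ufs-weather-model \
  -DAPP=S2SW \
  -D32BIT=ON \
  -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 \
  -DCMAKE_BUILD_TYPE=Release
make -j 8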
I grabbed @aerorahul's canned case. Output was being printed only every second or few seconds, very slowly, so I just removed almost all the lines. My run directory is:

@aerorahul, the climatological fixed files (specified in input.nml) are still pointing to your directory, so strictly speaking this is not a self-contained (canned-case) run directory. If you remove your gwWork directory, people will not be able to run this. Consider moving all the fixed files into, for example, a fix subdirectory and updating the paths in input.nml to have a truly self-contained run directory.
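To illustrate what "pointing to your directory" means here: the surface climatology entries in the &namsfc block of input.nml carry absolute paths, so making the case self-contained would mean staging those files in the run directory and switching to relative paths. A sketch, where the specific file name and the gwWork path are illustrative assumptions:

&namsfc
  ! absolute path into the original owner's workspace -- breaks if gwWork is removed
  FNALBC = '/path/to/aerorahul/gwWork/fix/global_snowfree_albedo.grb'
  ! self-contained alternative: stage the file under ./fix and point to it relatively
  ! FNALBC = './fix/global_snowfree_albedo.grb'
/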
Interestingly, C768 uses more cores than C1152:
Is this intended?
George V. noticed that HRv4 does not work on Hercules or Orion. It hangs sometime after WW3 starts, with no relevant message in the log files about the hang.
To Reproduce: Run an HRv4 experiment on Hercules or Orion
Additional context