-
Hi Lucy, WW3 does not scale well once your number of processors approaches the number of spectral bins. In your case, 16 nodes * 64 procs = 1024 processors.
-
That makes sense, thank you. I am using:
SPECTRUM%NK = 30
SPECTRUM%NTH = 36
So, as you say, that sounds like it could be hitting a limit.
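As a quick check of the arithmetic, the sketch below computes the number of spectral bins from the two namelist values above and compares it with the two job sizes mentioned in this thread (8 and 16 nodes at 64 tasks per node). The helper name `rank_to_bin_ratio` is only illustrative and is not part of WW3.

```python
# Spectral settings quoted above.
NK = 30    # SPECTRUM%NK, number of frequencies
NTH = 36   # SPECTRUM%NTH, number of directions
TASKS_PER_NODE = 64

nspec = NK * NTH  # total number of spectral bins


def rank_to_bin_ratio(nodes, tasks_per_node, n_bins):
    """Return MPI ranks divided by spectral bins (hypothetical helper)."""
    return nodes * tasks_per_node / n_bins


for nodes in (8, 16):
    ranks = nodes * TASKS_PER_NODE
    ratio = rank_to_bin_ratio(nodes, TASKS_PER_NODE, nspec)
    print(f"{nodes} nodes -> {ranks} ranks, {ratio:.2f} ranks per spectral bin")
```

With these settings there are 1080 bins, so the 1024-rank job sits just below one rank per bin, which is the regime where scaling is expected to be poor.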
From: Chris Bunney
Sent: 08 March 2024 10:41
To: NOAA-EMC/WW3
Cc: Lucy Bricheno; Author
Subject: Re: [NOAA-EMC/WW3] Running parallel WaveWatch over higher numbers of processors. (Discussion #1200)
Hi Lucy,
What is your spectral resolution?
WW3 does not scale well once your number of processors approaches the number of spectral bins.
This is due to the way the propagation routines parallelise over MPI: they use a spectral decomposition rather than a domain decomposition, so your scaling is limited by the number of spectral bins (not the number of sea points).
In your case, 16 nodes * 64 procs = 1024 processors.
Assuming a spectral resolution of 30 frequencies * 30 directions, this exceeds the number of spectral bins (900), so I would not expect it to scale well.
It might also be the reason for the MPI_ABORT (not 100% sure of this, but it is a bit suspicious!)
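To illustrate why a spectral decomposition stops scaling, here is a toy model that deals spectral bins out to MPI ranks round-robin. This is an assumption made purely for illustration: the thread does not describe how WW3 actually assigns bins to ranks, and `bins_per_rank` and `idle_ranks` are hypothetical names.

```python
def bins_per_rank(n_bins, n_ranks):
    """Deal n_bins out to n_ranks round-robin; return the count per rank."""
    base, extra = divmod(n_bins, n_ranks)
    # The first `extra` ranks receive one more bin than the rest.
    return [base + (1 if rank < extra else 0) for rank in range(n_ranks)]


def idle_ranks(n_bins, n_ranks):
    """Number of ranks left with no spectral bin in this toy model."""
    return sum(1 for count in bins_per_rank(n_bins, n_ranks) if count == 0)


n_bins = 30 * 30  # the assumed 30 x 30 spectral resolution
for n_ranks in (512, 1024):
    counts = bins_per_rank(n_bins, n_ranks)
    print(n_ranks, "ranks:", "max bins/rank =", max(counts),
          "idle ranks =", idle_ranks(n_bins, n_ranks))
```

In this toy model, once the rank count exceeds the bin count, the extra ranks have nothing to propagate, so adding processors cannot speed up the propagation step.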
-
As an addition to what @ukmo-ccbunney said, the limit only applies to the structured-grid (non-unstructured) parts of WW3. The unstructured WW3 uses domain decomposition, so you can go beyond NSPEC. The developers verified its scaling up to 10 times NSPEC, where NSPEC = number of frequency bands * number of directional bands.
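As a worked example, taking the NK = 30 and NTH = 36 values quoted elsewhere in this thread (and assuming those are the settings of interest), the two figures come out as:

```latex
\[
\mathrm{NSPEC} = \mathrm{NK} \times \mathrm{NTH} = 30 \times 36 = 1080,
\qquad
10 \times \mathrm{NSPEC} = 10\,800 .
\]
```

So the 1024-processor job is close to NSPEC for a structured grid, while the scaling reported as verified for the unstructured code would extend to roughly 10,800 processors for the same spectrum.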
-
Hello,
I'm running WW3 on the UK national research supercomputer ARCHER2. I'm doing some scaling tests, but seem to be hitting something of a wall at higher processor counts.
It runs fine on 216 / 512 processors (8 nodes, 64 tasks per node).
Initially I had some problems with memory per node, but there is a separate queue with 'high-mem' processors.
But I can't seem to get it to scale further, e.g. with 16 nodes and 64 tasks per node it fails.
There is not a lot of information in the error message, just lots of MPICH errors:
Wave model ...
EXTCDE MPI_ABORT, IEXIT= 829
MPICH ERROR [Rank 890] [job id 5864657.0] [Fri Mar 8 10:01:45 2024] [nid006489] - Abort(829) (rank 890 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 829) - process 890
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 829) - process 890
... etc.
I also tried 32 x 32, and it similarly gets to:
Wave model ...
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III ERROR IN W3INIT :
then fails. Any ideas on how to fix this, or how I can get a more verbose error message?
Thanks for any input!
Lucy
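When a job prints one abort line per rank, it can help to condense the log before reading it. The sketch below is a minimal example that assumes the log lines look exactly like the excerpts above. It collects the rank, node name and exit code from each `MPICH ERROR` line and counts the `WAVEWATCH III` warning and error banners. The function name `summarise_log` is hypothetical, and the sample text combines lines from the two separate runs shown above purely for illustration. It does not make WW3 itself more verbose.

```python
import re
from collections import Counter

# Patterns based only on the log excerpts shown in this post.
MPICH_ABORT = re.compile(
    r"MPICH ERROR \[Rank (\d+)\].*\[([^\]]+)\] - Abort\((\d+)\)"
)
WW3_MESSAGE = re.compile(r"\*\*\* WAVEWATCH III (WARNING|ERROR) IN (\w+)")


def summarise_log(text):
    """Return (aborts, messages) extracted from a WW3/MPICH log string."""
    aborts = []           # list of (rank, node, exit_code)
    messages = Counter()  # (severity, routine) -> number of occurrences
    for line in text.splitlines():
        match = MPICH_ABORT.search(line)
        if match:
            aborts.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
            continue
        match = WW3_MESSAGE.search(line)
        if match:
            messages[(match.group(1), match.group(2))] += 1
    return aborts, messages


sample = """\
EXTCDE MPI_ABORT, IEXIT= 829
MPICH ERROR [Rank 890] [job id 5864657.0] [Fri Mar 8 10:01:45 2024] [nid006489] - Abort(829) (rank 890 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 829) - process 890
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III WARNING IN W3INIT :
*** WAVEWATCH III ERROR IN W3INIT :
"""

aborts, messages = summarise_log(sample)
print("aborting ranks:", aborts)
for (severity, routine), count in messages.items():
    print(f"{severity} in {routine}: {count}")
```

Grouping the aborts by exit code and by routine makes it easier to see whether every rank fails in the same place, here `W3INIT`, or whether only a few ranks trigger the abort.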