Overwhelmed storage with too many read requests when running on > 14,000 MPI processes #781
-
I cannot comment on this admin's policy, but I have never seen the number of reads be an issue. SMILEI is regularly run with more than 14000 MPI processes on other systems, and when I/O does cause trouble there, it is because of the allocated space in the scratch partition. Overall, it is good practice to reduce I/O as much as you can to improve performance. If you want to reduce the number of MPI processes, one possibility is to use OMP_NUM_THREADS (for instance, 256 x 14 tasks on 256 x 56 cores with OMP_NUM_THREADS=4). Note: this issue has been converted into a discussion because it is not a bug (SMILEI was running fine; the admin killed it).
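To make the arithmetic behind that suggestion explicit, here is a minimal sketch in Python. The helper name `hybrid_layout` is made up for illustration and is not part of SMILEI or any scheduler; it only assumes that every core on a node is used and that each MPI process gets the same number of OpenMP threads.

```python
def hybrid_layout(nodes, cores_per_node, omp_threads):
    """Return (ranks_per_node, total_ranks) for a hybrid MPI/OpenMP run.

    Hypothetical helper: assumes all cores on each node are used and that
    every MPI process runs the same number of OpenMP threads.
    """
    if cores_per_node % omp_threads != 0:
        raise ValueError("omp_threads must divide cores_per_node evenly")
    ranks_per_node = cores_per_node // omp_threads
    return ranks_per_node, nodes * ranks_per_node


# Pure-MPI layout from the original report: 56 ranks per node, 14336 in total.
print(hybrid_layout(256, 56, 1))

# Hybrid layout suggested above: 14 ranks per node, 3584 in total.
print(hybrid_layout(256, 56, 4))
```

With OMP_NUM_THREADS=4 the same 14336 cores are driven by a quarter as many MPI processes, which is the reduction the suggestion is after.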
-
It worked by using OMP_NUM_THREADS>1. Thanks again for your help! @charlesprouveur
-
Hello, I am trying some large-scale runs with 14336 MPI processes (256 nodes x 56 cores/node) on the Frontera machine at TACC (https://tacc.utexas.edu/systems/frontera/). My jobs were cancelled mid-run by an admin, who complained that they were overwhelming the storage targets with too many read requests and asked us to reduce I/O. Is this problem normal when running SMILEI at this scale? I halved the problem size and ran on 128 nodes (7168 MPI processes) with no issues (no complaints and no cancelled jobs).
Here is my job script and input file:
input.txt
job.txt
These are 2D Cartesian simulations, and the diagnostics include simple probes (at selected points), fields on grids, and particle binning.
Any solutions or comments would be appreciated.