
Simply Parallel Parflow implementation #43

Open
ian-bertolacci opened this issue Jul 22, 2019 · 2 comments

Get a very simple parallel parflow implementations using some form of OpenMP.

ian-bertolacci self-assigned this Jul 22, 2019

ian-bertolacci commented Jul 22, 2019

Attempted implementation:
Created GrGeomInLoopParallel, with parallelism enabled inside a new GrGeomOctreeInteriorNodeLoopParallel that parallelizes the 3D box loops using Kokkos::OpenMP.
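
For reference, a minimal, hypothetical sketch of what dispatching one box's 3D loop through Kokkos::OpenMP with an MDRangePolicy could look like. The function name, the strides (sx, sy, sz), and the loop body are placeholder assumptions, not the actual ParFlow macros:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

// Hypothetical stand-in for one box's 3D loop; the real ParFlow macros
// generate the loop nest around an arbitrary per-cell statement.
void box_loop_parallel(int ix, int iy, int iz, int nx, int ny, int nz,
                       double* data, int sx, int sy, int sz) {
  using Policy = Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<3>>;
  Kokkos::parallel_for(
      "GrGeomBoxLoop",
      Policy({iz, iy, ix}, {iz + nz, iy + ny, ix + nx}),
      KOKKOS_LAMBDA(const int k, const int j, const int i) {
        data[k * sz + j * sy + i * sx] += 1.0;  // placeholder per-cell work
      });
}

int main() {
  Kokkos::ScopeGuard guard;  // initializes and finalizes Kokkos
  std::vector<double> grid(16 * 16 * 16, 0.0);
  box_loop_parallel(0, 0, 0, 16, 16, 16, grid.data(), 1, 16, 16 * 16);
  return 0;
}
```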

Performance: Not good.
baseline_72hrx3hr_1de045:
Unmodified Parflow at commit 1de045

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6:
Parflow modified to split NFE loop 551 into forall and reduction loops (see pull #13); an illustrative sketch of this kind of split appears after the table below.
(A small performance degradation was already known in this implementation.)

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |
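
Purely for illustration, a hedged sketch of a forall/reduction split on a stand-in loop; the actual NFE change lives in pull #13 and will differ:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;
  std::vector<double> in(n, 1.0), out(n);
  double sum = 0.0;

  // Forall part: fully independent iterations, parallelized on their own.
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    out[i] = 2.0 * in[i];

  // Reduction part: accumulate with an OpenMP reduction clause.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < n; i++)
    sum += out[i];

  std::printf("sum = %f\n", sum);
  return 0;
}
```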

basic_kokkos_omp_72hrx3hr_fe7c39:
Experimental Kokkos implementation using the Kokkos::OpenMP execution space

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 368.4230 ± 1.1382 |
| user | 367.9963 ± 1.1425 |
| sys | 0.3733 ± 0.0362 |

basic_kokkos_serial_72hrx3hr_based_fe7c39:
Sanity-check implementation using the Kokkos::Serial execution space

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 367.7456 ± 1.4505 |
| user | 367.4793 ± 1.4537 |
| sys | 0.2154 ± 0.0207 |

An informal (non-statistical) scaling experiment shows a negative scaling pattern.
Analysis of individual loop runtimes on a different machine (my laptop) indicates a 10x slowdown in some loops.

Likely cause: parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.

Tasks:

  • Move parallelism to be over boxes (see the sketch after this list).
  • Get running with Steve's Octree changes.
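
A rough sketch of the first task, under the assumption that it means hoisting the Kokkos parallel_for up to the box list so each thread gets whole boxes and the per-box 3D loop stays serial. Box, its fields, and the strides are invented placeholders for ParFlow's box extents:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

struct Box { int ix, iy, iz, nx, ny, nz; };  // placeholder box extents

// Assumes Kokkos::initialize() has already been called.
void process_boxes(const std::vector<Box>& boxes, double* data,
                   int sx, int sy, int sz) {
  Kokkos::parallel_for(
      "GrGeomBoxes",
      Kokkos::RangePolicy<Kokkos::OpenMP>(0, static_cast<int>(boxes.size())),
      [&](const int b) {  // host lambda is fine for the OpenMP backend
        const Box& box = boxes[b];
        for (int k = box.iz; k < box.iz + box.nz; k++)
          for (int j = box.iy; j < box.iy + box.ny; j++)
            for (int i = box.ix; i < box.ix + box.nx; i++)
              data[k * sz + j * sy + i * sx] += 1.0;  // placeholder body
      });
}
```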

ian-bertolacci (Author) commented:

Attempted implementation:
Using Michael's OpenMP pragma-in-macros solution (a true blessing), do the same as before (parallelize the 3D loop nest) with native OpenMP.
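
Presumably something along these lines; this is a guess at the shape of the macro, not Michael's actual code. OpenMP directives cannot appear directly in a #define, but the C99/C++11 _Pragma operator can, so the macro expands to a parallel collapsed 3D loop nest:

```cpp
#include <omp.h>
#include <vector>

// Hypothetical pragma-in-macro box loop (not the actual ParFlow macro).
// The collapse(3) clause fuses the three perfectly nested loops into one
// parallel iteration space.
#define BoxLoopParallel(i, j, k, ix, iy, iz, nx, ny, nz, body)  \
  do {                                                          \
    _Pragma("omp parallel for collapse(3)")                     \
    for (int k = (iz); k < (iz) + (nz); k++)                    \
      for (int j = (iy); j < (iy) + (ny); j++)                  \
        for (int i = (ix); i < (ix) + (nx); i++) {              \
          body;                                                 \
        }                                                       \
  } while (0)

int main() {
  std::vector<double> data(8 * 8 * 8, 0.0);
  // Fill an 8x8x8 box; the indexing assumes a contiguous layout.
  BoxLoopParallel(i, j, k, 0, 0, 0, 8, 8, 8,
                  data[(k * 8 + j) * 8 + i] = 1.0);
  return 0;
}
```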

Performance: Again, not great.
baseline_72hrx3hr_1de045:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |

simple_omp_threads-1_72hrx3hr_48f278:
OMP_NUM_THREADS=1

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 343.8862 ± 0.9246 |
| user | 343.4594 ± 0.9295 |
| sys | 0.3714 ± 0.0357 |

simple_omp_threads-2_72hrx3hr_48f278:
OMP_NUM_THREADS=2

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 347.6891 ± 0.9961 |
| user | 345.0150 ± 0.9959 |
| sys | 2.6195 ± 0.1100 |

simple_omp_threads-4_72hrx3hr_48f278:
OMP_NUM_THREADS=4

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 353.5161 ± 1.0830 |
| user | 348.1278 ± 1.0844 |
| sys | 5.3322 ± 0.1635 |

simple_omp_threads-8_72hrx3hr_48f278:
OMP_NUM_THREADS=8

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 362.1221 ± 0.9454 |
| user | 350.9784 ± 1.0036 |
| sys | 11.0904 ± 0.1666 |

This again exhibits negative scaling.
The system-call overhead also grows sharply with thread count (sys time rises from ~0.37 s at one thread to ~11.09 s at eight).

Likely cause: as before, parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.

Tasks:

  • Move parallelism to be over boxes (see the sketch after this list).
  • Get running with Steve's Octree changes.
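
Again assuming the first task means moving the #pragma from the 3D loop nest up to the box loop, a rough sketch with native OpenMP; Box and the strides are invented placeholders:

```cpp
#include <omp.h>
#include <vector>

struct Box { int ix, iy, iz, nx, ny, nz; };  // placeholder box extents

// Threads take whole boxes (dynamic schedule helps when box sizes vary);
// the per-box 3D loop runs serially. Not the actual ParFlow code.
void process_boxes(const std::vector<Box>& boxes, double* data,
                   int sx, int sy, int sz) {
  #pragma omp parallel for schedule(dynamic)
  for (int b = 0; b < static_cast<int>(boxes.size()); b++) {
    const Box& box = boxes[b];
    for (int k = box.iz; k < box.iz + box.nz; k++)
      for (int j = box.iy; j < box.iy + box.ny; j++)
        for (int i = box.ix; i < box.ix + box.nx; i++)
          data[k * sz + j * sy + i * sx] += 1.0;  // placeholder body
  }
}
```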
