
Simply Parallel Parflow implementation #43

Open
ian-bertolacci opened this issue Jul 22, 2019 · 2 comments

Get a very simple parallel parflow implementations using some form of OpenMP.

ian-bertolacci self-assigned this Jul 22, 2019

ian-bertolacci commented Jul 22, 2019

Attempted implementation:
Created GrGeomInLoopParallel, with parallelism enabled inside a new GrGeomOctreeInteriorNodeLoopParallel that parallelizes the 3D box loops using Kokkos::OpenMP.
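
For reference, a minimal, hypothetical sketch of what dispatching one box's 3D loop through Kokkos::OpenMP with an MDRangePolicy could look like. The function name, the strides (sx, sy, sz), and the loop body are placeholder assumptions, not the actual ParFlow macros:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

// Hypothetical stand-in for one box's 3D loop; the real ParFlow macros
// generate the loop nest around an arbitrary per-cell statement.
void box_loop_parallel(int ix, int iy, int iz, int nx, int ny, int nz,
                       double* data, int sx, int sy, int sz) {
  using Policy = Kokkos::MDRangePolicy<Kokkos::OpenMP, Kokkos::Rank<3>>;
  Kokkos::parallel_for(
      "GrGeomBoxLoop",
      Policy({iz, iy, ix}, {iz + nz, iy + ny, ix + nx}),
      KOKKOS_LAMBDA(const int k, const int j, const int i) {
        data[k * sz + j * sy + i * sx] += 1.0;  // placeholder per-cell work
      });
}

int main() {
  Kokkos::ScopeGuard guard;  // initializes and finalizes Kokkos
  std::vector<double> grid(16 * 16 * 16, 0.0);
  box_loop_parallel(0, 0, 0, 16, 16, 16, grid.data(), 1, 16, 16 * 16);
  return 0;
}
```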

Performance: Not good.
baseline_72hrx3hr_1de045:
Unmodified Parflow at commit 1de045

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6:
Parflow modified to split NFE loop 551 into forall and reduction loops (see pull #13); an illustrative sketch of this kind of split appears after the table below.
(A small performance degradation was already known in this implementation.)

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |
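
Purely for illustration, a hedged sketch of a forall/reduction split on a stand-in loop; the actual NFE change lives in pull #13 and will differ:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 20;
  std::vector<double> in(n, 1.0), out(n);
  double sum = 0.0;

  // Forall part: fully independent iterations, parallelized on their own.
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    out[i] = 2.0 * in[i];

  // Reduction part: accumulate with an OpenMP reduction clause.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < n; i++)
    sum += out[i];

  std::printf("sum = %f\n", sum);
  return 0;
}
```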

basic_kokkos_omp_72hrx3hr_fe7c39:
Experimental Kokkos implementation using the Kokkos::OpenMP execution space

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 368.4230 ± 1.1382 |
| user | 367.9963 ± 1.1425 |
| sys | 0.3733 ± 0.0362 |

basic_kokkos_serial_72hrx3hr_based_fe7c39:
Sanity-check implementation using the Kokkos::Serial execution space

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 367.7456 ± 1.4505 |
| user | 367.4793 ± 1.4537 |
| sys | 0.2154 ± 0.0207 |

An informal (non-statistical) scaling experiment shows a negative scaling pattern.
Analysis of individual loop runtimes on a different machine (my laptop) indicates a 10x slowdown in some loops.

Likely cause: parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.

Tasks:

  • Move parallelism to be over boxes (see the sketch after this list).
  • Get running with Steve's Octree changes.
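
A rough sketch of the first task, under the assumption that it means hoisting the Kokkos parallel_for up to the box list so each thread gets whole boxes and the per-box 3D loop stays serial. Box, its fields, and the strides are invented placeholders for ParFlow's box extents:

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

struct Box { int ix, iy, iz, nx, ny, nz; };  // placeholder box extents

// Assumes Kokkos::initialize() has already been called.
void process_boxes(const std::vector<Box>& boxes, double* data,
                   int sx, int sy, int sz) {
  Kokkos::parallel_for(
      "GrGeomBoxes",
      Kokkos::RangePolicy<Kokkos::OpenMP>(0, static_cast<int>(boxes.size())),
      [&](const int b) {  // host lambda is fine for the OpenMP backend
        const Box& box = boxes[b];
        for (int k = box.iz; k < box.iz + box.nz; k++)
          for (int j = box.iy; j < box.iy + box.ny; j++)
            for (int i = box.ix; i < box.ix + box.nx; i++)
              data[k * sz + j * sy + i * sx] += 1.0;  // placeholder body
      });
}
```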

ian-bertolacci (Author) commented:

Attempted implementation:
Using Michael's OpenMP pragma-in-macros solution (a true blessing), do the same as before (parallelize the 3D loop nest) with native OpenMP.
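
Presumably something along these lines; this is a guess at the shape of the macro, not Michael's actual code. OpenMP directives cannot appear directly in a #define, but the C99/C++11 _Pragma operator can, so the macro expands to a parallel collapsed 3D loop nest:

```cpp
#include <omp.h>
#include <vector>

// Hypothetical pragma-in-macro box loop (not the actual ParFlow macro).
// The collapse(3) clause fuses the three perfectly nested loops into one
// parallel iteration space.
#define BoxLoopParallel(i, j, k, ix, iy, iz, nx, ny, nz, body)  \
  do {                                                          \
    _Pragma("omp parallel for collapse(3)")                     \
    for (int k = (iz); k < (iz) + (nz); k++)                    \
      for (int j = (iy); j < (iy) + (ny); j++)                  \
        for (int i = (ix); i < (ix) + (nx); i++) {              \
          body;                                                 \
        }                                                       \
  } while (0)

int main() {
  std::vector<double> data(8 * 8 * 8, 0.0);
  // Fill an 8x8x8 box; the indexing assumes a contiguous layout.
  BoxLoopParallel(i, j, k, 0, 0, 0, 8, 8, 8,
                  data[(k * 8 + j) * 8 + i] = 1.0);
  return 0;
}
```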

Performance: Again, not great.
baseline_72hrx3hr_1de045:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 332.3808 ± 1.7961 |
| user | 331.4917 ± 1.7961 |
| sys | 0.7596 ± 0.0166 |

split_loop_72hrx3hr_f2f7d6:

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 334.5446 ± 1.3524 |
| user | 333.6424 ± 1.3506 |
| sys | 0.7657 ± 0.0225 |

simple_omp_threads-1_72hrx3hr_48f278:
OMP_NUM_THREADS=1

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 343.8862 ± 0.9246 |
| user | 343.4594 ± 0.9295 |
| sys | 0.3714 ± 0.0357 |

simple_omp_threads-2_72hrx3hr_48f278:
OMP_NUM_THREADS=2

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 347.6891 ± 0.9961 |
| user | 345.0150 ± 0.9959 |
| sys | 2.6195 ± 0.1100 |

simple_omp_threads-4_72hrx3hr_48f278:
OMP_NUM_THREADS=4

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 353.5161 ± 1.0830 |
| user | 348.1278 ± 1.0844 |
| sys | 5.3322 ± 0.1635 |

simple_omp_threads-8_72hrx3hr_48f278:
OMP_NUM_THREADS=8

| Time Class | Runtime (seconds) ± 1 standard deviation |
| --- | --- |
| real | 362.1221 ± 0.9454 |
| user | 350.9784 ± 1.0036 |
| sys | 11.0904 ± 0.1666 |

This again exhibits negative scaling.
The system-call overhead also grows sharply with thread count (sys time rises from ~0.37 s at one thread to ~11.09 s at eight).

Likely cause: as before, parallelism within boxes is too fine-grained. The boxes are small, so there is too little work for too many threads.

Tasks:

  • Move parallelism to be over boxes (see the sketch after this list).
  • Get running with Steve's Octree changes.
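
Again assuming the first task means moving the #pragma from the 3D loop nest up to the box loop, a rough sketch with native OpenMP; Box and the strides are invented placeholders:

```cpp
#include <omp.h>
#include <vector>

struct Box { int ix, iy, iz, nx, ny, nz; };  // placeholder box extents

// Threads take whole boxes (dynamic schedule helps when box sizes vary);
// the per-box 3D loop runs serially. Not the actual ParFlow code.
void process_boxes(const std::vector<Box>& boxes, double* data,
                   int sx, int sy, int sz) {
  #pragma omp parallel for schedule(dynamic)
  for (int b = 0; b < static_cast<int>(boxes.size()); b++) {
    const Box& box = boxes[b];
    for (int k = box.iz; k < box.iz + box.nz; k++)
      for (int j = box.iy; j < box.iy + box.ny; j++)
        for (int i = box.ix; i < box.ix + box.nx; i++)
          data[k * sz + j * sy + i * sx] += 1.0;  // placeholder body
  }
}
```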
