This repository has been archived by the owner on May 20, 2024. It is now read-only.

exploring performance #54

Open
cathieO opened this issue Sep 6, 2019 · 3 comments

@cathieO

cathieO commented Sep 6, 2019

Should parallelism be done over boxes or within?

This depends on the configuration of the domain.
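To make the two options concrete, here is a minimal sketch of the two decompositions. It is not ParFlow code: the box layout `(x0, x1, y0, y1, z0, z1)`, the stand-in per-cell work, and the function names are all assumptions made for illustration. "Over boxes" hands whole boxes to workers in a single parallel region; "within boxes" visits boxes one at a time and splits each box's z range among the workers.

```python
from concurrent.futures import ThreadPoolExecutor

# A box is (x0, x1, y0, y1, z0, z1) with half-open ranges (hypothetical layout).
def box_sum(box):
    """Sum a stand-in per-cell value over one box, serially."""
    x0, x1, y0, y1, z0, z1 = box
    return sum(x + y + z
               for z in range(z0, z1)
               for y in range(y0, y1)
               for x in range(x0, x1))

def parallel_over_boxes(boxes, threads):
    """One parallel region: each worker takes whole boxes."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(box_sum, boxes))

def parallel_in_boxes(boxes, threads):
    """One parallel region per box: each box's z range is split into slabs."""
    total = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for x0, x1, y0, y1, z0, z1 in boxes:
            # Split [z0, z1) into at most `threads` contiguous slabs.
            step = max(1, -(-(z1 - z0) // threads))  # ceiling division
            slabs = [(x0, x1, y0, y1, z, min(z + step, z1))
                     for z in range(z0, z1, step)]
            total += sum(pool.map(box_sum, slabs))
    return total
```

Both functions return the same result as a serial loop over the boxes. In CPython the threads will not actually run this arithmetic concurrently, so the sketch only shows how the work is divided, not how it performs.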

@ian-bertolacci

Status:

  1. Implemented iteration over whole subgrid with tiling to simulate boxes (See https://github.com/ian-bertolacci/parflow/tree/experimental_TotalDomain)

To do:

  1. Working with test case(s) from Stefan to get scaling results across box sizes
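As a rough illustration of what "iteration over the whole subgrid with tiling to simulate boxes" could look like, here is a small sketch. It assumes cubic tiles clipped at the subgrid boundary and a row-major cell index; neither detail is taken from the experimental branch, and `iter_tiles` and `visit_counts` are hypothetical names.

```python
def iter_tiles(nx, ny, nz, tile):
    """Yield half-open tiles (x0, x1, y0, y1, z0, z1) covering an nx*ny*nz subgrid."""
    for z0 in range(0, nz, tile):
        for y0 in range(0, ny, tile):
            for x0 in range(0, nx, tile):
                # Clip the last tile in each direction to the subgrid extent.
                yield (x0, min(x0 + tile, nx),
                       y0, min(y0 + tile, ny),
                       z0, min(z0 + tile, nz))

def visit_counts(nx, ny, nz, tile):
    """Count how often the tiled traversal visits each cell."""
    counts = [0] * (nx * ny * nz)
    for x0, x1, y0, y1, z0, z1 in iter_tiles(nx, ny, nz, tile):
        for z in range(z0, z1):
            for y in range(y0, y1):
                for x in range(x0, x1):
                    counts[(z * ny + y) * nx + x] += 1
    return counts
```

A tiling is only a valid stand-in for boxes if every cell is visited exactly once, which is what `visit_counts` lets you check on a small subgrid.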

@ian-bertolacci

Initial performance testing on a rectilinear domain (sinusoidal) on rose.
Domain size: 300x300x250
Tested tile sizes: 2, 4, 8, 16, 32, 64, 128, 256, 512
Tested thread counts (when applicable): 1, 2, 4, 8
Each result is the average of 20 trials.

The baseline is the same domain using the default GrGeomInLoopBoxes macro.
GrGeomInLoopBoxes time statistics:

| Time metric | Mean (s) | ± 1 st. dev. (s) |
|---|---|---|
| real | 187.5037 | 0.4308 |
| user | 170.6810 | 0.4656 |
| sys | 16.8000 | 0.1567 |

No parallelism (exploring overhead purely from tiled looping)

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiled_2 | 0.9033 | 0.9104 | 0.9068 | -19.2625 | 0.0923 |
| GrGeomInLoopBoxesTotalDomainTiled_4 | 0.9452 | 0.9544 | 0.9498 | -9.9124 | -0.0691 |
| GrGeomInLoopBoxesTotalDomainTiled_8 | 0.9561 | 0.9665 | 0.9613 | -7.5503 | -0.1722 |
| GrGeomInLoopBoxesTotalDomainTiled_16 | 0.9626 | 0.9725 | 0.9675 | -6.2886 | -0.1137 |
| GrGeomInLoopBoxesTotalDomainTiled_32 | 0.9595 | 0.9689 | 0.9642 | -6.9658 | -0.0640 |
| GrGeomInLoopBoxesTotalDomainTiled_64 | 0.9608 | 0.9679 | 0.9644 | -6.9310 | 0.1579 |
| GrGeomInLoopBoxesTotalDomainTiled_128 | 0.9729 | 0.9832 | 0.9781 | -4.2064 | -0.1421 |
| GrGeomInLoopBoxesTotalDomainTiled_256 | 0.9815 | 0.9938 | 0.9876 | -2.3461 | -0.3104 |
| GrGeomInLoopBoxesTotalDomainTiled_512 | 0.9948 | 1.0035 | 0.9992 | -0.1583 | 0.0424 |
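For reading the table columns: the reported values are consistent with the definitions sketched below, computed from the baseline's real time (mean 187.5037 s, st. dev. 0.4308 s). These formulas are inferred from the numbers, not stated in the thread, so treat them as an assumption. The case mean and st. dev. used in the example (206.7662 s, 0.3385 s) are back-computed from the Tiled_2 row and are not reported directly.

```python
def speedup_stats(base_mean, base_sd, case_mean, case_sd):
    """Derive the table columns from baseline and case timing statistics (inferred definitions)."""
    return {
        "min speedup": (base_mean - base_sd) / (case_mean + case_sd),
        "max speedup": (base_mean + base_sd) / (case_mean - case_sd),
        "average speedup": base_mean / case_mean,
        "reduction in time": base_mean - case_mean,
        "reduction in stddev": base_sd - case_sd,
    }

BASE_MEAN, BASE_SD = 187.5037, 0.4308  # real time from the baseline table

# Case statistics back-computed from the Tiled_2 row (assumption, see lead-in).
stats = speedup_stats(BASE_MEAN, BASE_SD, 206.7662, 0.3385)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions a speedup below 1 and a negative reduction in time both mean the case is slower than the baseline, and a negative reduction in stddev means the case's timings are more variable.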

Parallelism in tiles
Tile size 2

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-1 | 0.7353 | 0.7448 | 0.7400 | -65.8830 | -0.6141 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-2 | 0.6642 | 0.6745 | 0.6693 | -92.6416 | -1.0762 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-4 | 0.6606 | 0.6695 | 0.6650 | -94.4418 | -0.8256 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-8 | 0.5723 | 0.5782 | 0.5752 | -138.4593 | -0.4704 |

Parallelism in tiles
Tile size 4

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-1 | 0.9019 | 0.9121 | 0.9070 | -19.2238 | -0.2523 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-2 | 1.0313 | 1.0422 | 1.0367 | 6.6463 | -0.0999 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-4 | 1.1205 | 1.1320 | 1.1263 | 21.0198 | -0.0395 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-8 | 1.1393 | 1.1503 | 1.1448 | 23.7142 | 0.0222 |

Parallelism in tiles
Tile size 8

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-1 | 0.9342 | 0.9434 | 0.9388 | -12.2230 | -0.0865 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-2 | 1.1093 | 1.1200 | 1.1146 | 19.2844 | 0.0031 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-4 | 1.2242 | 1.2383 | 1.2312 | 35.2125 | -0.0934 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-8 | 1.2850 | 1.3037 | 1.2943 | 42.6340 | -0.2864 |

Parallelism in tiles
Tile size 16

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-1 | 0.9402 | 0.9502 | 0.9452 | -10.8767 | -0.1591 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-2 | 1.1244 | 1.1353 | 1.1298 | 21.5446 | 0.0122 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-4 | 1.2479 | 1.2613 | 1.2546 | 38.0464 | -0.0239 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-8 | 1.3202 | 1.3331 | 1.3266 | 46.1635 | 0.0709 |

Parallelism in tiles
Tile size 32

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-1 | 0.9389 | 0.9478 | 0.9433 | -11.2632 | -0.0598 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-2 | 1.1208 | 1.1348 | 1.1278 | 21.2432 | -0.2215 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-4 | 1.2466 | 1.2611 | 1.2538 | 37.9572 | -0.0878 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-8 | 1.3173 | 1.3326 | 1.3249 | 45.9847 | -0.0631 |

Parallelism in tiles
Tile size 64

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-1 | 0.9360 | 0.9449 | 0.9405 | -11.8725 | -0.0480 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-2 | 1.1215 | 1.1312 | 1.1263 | 21.0317 | 0.0914 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-4 | 1.2419 | 1.2559 | 1.2489 | 37.3652 | -0.0661 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-8 | 1.3096 | 1.3256 | 1.3176 | 45.1935 | -0.1052 |

Parallelism in tiles
Tile size 128

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-1 | 0.9532 | 0.9609 | 0.9571 | -8.4121 | 0.0976 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-2 | 1.1337 | 1.1444 | 1.1391 | 22.8939 | 0.0353 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-4 | 1.2549 | 1.2660 | 1.2604 | 38.7386 | 0.1161 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-8 | 1.3208 | 1.3334 | 1.3271 | 46.2155 | 0.0865 |

Parallelism in tiles
Tile size 256

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-1 | 0.9623 | 0.9715 | 0.9669 | -6.4231 | -0.0446 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-2 | 1.1406 | 1.1527 | 1.1466 | 23.9787 | -0.0518 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-4 | 1.2588 | 1.2716 | 1.2652 | 39.3022 | 0.0234 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-8 | 1.3239 | 1.3396 | 1.3317 | 46.7055 | -0.0715 |

Parallelism in tiles
Tile size 512

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-1 | 0.9680 | 0.9797 | 0.9739 | -5.0336 | -0.2865 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-2 | 1.1486 | 1.1583 | 1.1534 | 24.9424 | 0.1206 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-4 | 1.2646 | 1.2767 | 1.2706 | 39.9362 | 0.0655 |
| GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-8 | 1.3272 | 1.3440 | 1.3356 | 47.1131 | -0.1275 |

Parallelism over tiles
Tile size 2

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-1 | 0.8932 | 0.9018 | 0.8975 | -21.4199 | -0.0843 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-2 | 1.0896 | 1.1015 | 1.0955 | 16.3506 | -0.0991 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-4 | 1.2222 | 1.2390 | 1.2306 | 35.1308 | -0.2563 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-8 | 1.3030 | 1.3212 | 1.3121 | 44.5981 | -0.2333 |

Parallelism over tiles
Tile size 4

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-1 | 0.9335 | 0.9442 | 0.9388 | -12.2132 | -0.2523 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-2 | 1.1198 | 1.1284 | 1.1241 | 20.7010 | 0.1729 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-4 | 1.2427 | 1.2572 | 1.2499 | 37.4871 | -0.0947 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-8 | 1.3103 | 1.3290 | 1.3196 | 45.4125 | -0.2507 |

Parallelism over tiles
Tile size 8

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-1 | 0.9465 | 0.9560 | 0.9513 | -9.6060 | -0.1043 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-2 | 1.1214 | 1.1370 | 1.1292 | 21.4489 | -0.3321 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-4 | 1.2423 | 1.2599 | 1.2510 | 37.6256 | -0.2816 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-8 | 1.3156 | 1.3305 | 1.3230 | 45.7819 | -0.0445 |

Parallelism over tiles
Tile size 16

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-1 | 0.9566 | 0.9654 | 0.9610 | -7.6164 | -0.0112 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-2 | 1.1292 | 1.1425 | 1.1359 | 22.4283 | -0.1568 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-4 | 1.2451 | 1.2617 | 1.2534 | 37.9051 | -0.2177 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-8 | 1.3157 | 1.3252 | 1.3204 | 45.5006 | 0.2445 |

Parallelism over tiles
Tile size 32

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-1 | 0.9519 | 0.9615 | 0.9567 | -8.4842 | -0.1007 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-2 | 1.1185 | 1.1320 | 1.1252 | 20.8675 | -0.1902 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-4 | 1.2379 | 1.2541 | 1.2460 | 37.0168 | -0.1988 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-8 | 1.3089 | 1.3254 | 1.3171 | 45.1446 | -0.1333 |

Parallelism over tiles
Tile size 64

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-1 | 0.9487 | 0.9625 | 0.9556 | -8.7184 | -0.5336 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-2 | 1.1304 | 1.1382 | 1.1343 | 22.1994 | 0.2461 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-4 | 1.2445 | 1.2607 | 1.2526 | 37.8064 | -0.1892 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-8 | 1.3083 | 1.3175 | 1.3129 | 44.6905 | 0.2574 |

Parallelism over tiles
Tile size 128

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-1 | 0.9675 | 0.9778 | 0.9727 | -5.2719 | -0.1424 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-2 | 1.1400 | 1.1509 | 1.1454 | 23.8074 | 0.0343 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-4 | 1.1798 | 1.2049 | 1.1923 | 30.2350 | -0.8657 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-8 | 1.2723 | 1.2850 | 1.2787 | 40.8636 | 0.0410 |

Parallelism over tiles
Tile size 256

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-1 | 0.9791 | 0.9873 | 0.9832 | -3.2074 | 0.0721 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-2 | 1.0206 | 1.0302 | 1.0254 | 4.6405 | 0.0006 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-4 | 1.0833 | 1.0937 | 1.0885 | 15.2403 | 0.0071 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-8 | 1.0792 | 1.0953 | 1.0872 | 15.0403 | -0.4576 |

Parallelism over tiles
Tile size 512

| case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
|---|---|---|---|---|---|
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-1 | 0.9864 | 0.9980 | 0.9922 | -1.4791 | -0.2385 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-2 | 0.9848 | 0.9955 | 0.9901 | -1.8662 | -0.1588 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-4 | 0.9850 | 0.9949 | 0.9899 | -1.9137 | -0.0813 |
| GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-8 | 0.9863 | 0.9944 | 0.9903 | -1.8304 | 0.0839 |
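One possible reading of the over-tiles results at sizes 256 and 512 is simply that there are too few tiles to distribute. The sketch below counts tiles for the 300x300x250 domain, assuming cubic tiles of edge T clipped to the domain and the whole domain living in one subgrid. Both are assumptions; the thread does not state the tile shape or the process layout, and the function names are hypothetical.

```python
from math import ceil

DOMAIN = (300, 300, 250)
TILE_SIZES = [2, 4, 8, 16, 32, 64, 128, 256, 512]

def tile_count(domain, tile):
    """Number of cubic tiles of edge `tile` needed to cover the domain (assumed layout)."""
    nx, ny, nz = domain
    return ceil(nx / tile) * ceil(ny / tile) * ceil(nz / tile)

def largest_tile_fraction(domain, tile):
    """Fraction of all cells that fall in the largest (unclipped-most) tile."""
    nx, ny, nz = domain
    cells = min(tile, nx) * min(tile, ny) * min(tile, nz)
    return cells / (nx * ny * nz)

for t in TILE_SIZES:
    print(t, tile_count(DOMAIN, t), round(largest_tile_fraction(DOMAIN, t), 4))
```

Under these assumptions size 512 yields a single tile, so parallelism over tiles has nothing to split, and size 256 yields four tiles with the largest holding roughly 73% of the cells. That would be consistent with the flat ~0.99 speedups at 512 and the limited gains at 256, but it is an interpretation, not something measured here.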

@ian-bertolacci

Main Observations

  1. Generally, parallelization over boxes performs better with small box sizes
  2. Generally, parallelization in boxes performs better with large box sizes

The idea is to choose a single box size that gives good speedup for both strategies. (Alternatively, could we choose a box size that favors one case, if the gains in that case far outweigh the losses in the other? E.g., if at size 16 both in-box and over-box parallelism achieve 1.32x speedup, but at size 32 in-box achieves 2.0x while over-box achieves only 1.1x.)
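One way to make "a single size that works for both" concrete is to pick the tile size whose worse strategy is best (a maximin choice). The trade-off in the parenthetical can be expressed as a weighted score instead. Both selection rules are my own framing, not something decided in this thread; the numbers are the 8-thread average speedups from the tables above.

```python
# 8-thread average speedups by tile size, copied from the result tables.
IN_TILES = {2: 0.5752, 4: 1.1448, 8: 1.2943, 16: 1.3266, 32: 1.3249,
            64: 1.3176, 128: 1.3271, 256: 1.3317, 512: 1.3356}
OVER_TILES = {2: 1.3121, 4: 1.3196, 8: 1.3230, 16: 1.3204, 32: 1.3171,
              64: 1.3129, 128: 1.2787, 256: 1.0872, 512: 0.9903}

def best_common_size(in_tiles, over_tiles):
    """Tile size maximizing the worse of the two speedups (maximin)."""
    return max(in_tiles, key=lambda s: min(in_tiles[s], over_tiles[s]))

def best_weighted_size(in_tiles, over_tiles, weight_in):
    """Tile size maximizing a weighted mix; weight_in is the share given to in-tile runs."""
    return max(in_tiles,
               key=lambda s: weight_in * in_tiles[s] + (1 - weight_in) * over_tiles[s])

print(best_common_size(IN_TILES, OVER_TILES))
```

The weight in `best_weighted_size` is a hypothetical knob: it would have to come from how often real domains end up in the in-box versus the over-box regime, which is exactly what the real-domain analysis below would need to establish.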

Future work:
Immediately:
Plot the speedup for 8 threads across tile size for the two cases.

Soon:
Box size analysis using...

  1. Real domains...
  2. Parallel in boxes...
  3. Parallel over boxes...
  4. split-size strategy.

Test on Ocelote?
