
Notes on synch_p2p kernel


We've been exploring ideas for implementing the synch_p2p kernel from the Intel Parallel Research Kernels (http://github.com/ParRes) on Grappa. We're interested in three questions:

  • What can we do that gets closest to native performance?
  • What's the prettiest implementation we can come up with?
  • What needs to improve to make the pretty one fast?

Our experiments are all run on our cluster of 12 X5650 Westmere Xeon nodes, each with 24 GB of RAM, connected by Mellanox QDR InfiniBand.
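
For context, synch_p2p sweeps a 2D array in which every interior cell depends on its left, upper, and upper-left neighbors, so updates propagate as a diagonal wavefront from the top-left corner. A minimal serial sketch (names are ours, not the reference code's):

```cpp
// Serial form of the synch_p2p sweep (illustrative only).
// Each interior cell depends on its left, upper, and upper-left neighbors,
// so the computation proceeds as a diagonal wavefront from the corner.
#include <cstddef>
#include <vector>

void sweep(std::vector<std::vector<double>>& grid) {
  for (std::size_t i = 1; i < grid.size(); ++i)
    for (std::size_t j = 1; j < grid[i].size(); ++j)
      grid[i][j] = grid[i - 1][j] + grid[i][j - 1] - grid[i - 1][j - 1];
}
```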

Implementations

MPI-like version

This version uses a strategy similar to the reference code: it partitions each row across all the cores/processes in the job (that is, the array's columns are block-distributed), and adds a column of ghost cells (full-bit-wrapped doubles) on each core to communicate values and synchronize between cores. Cores iterate over their strips row by row. When a core finishes its segment of a row, it writes the last value it computed into the corresponding ghost full bit on the next core. The next core blocks on the left and diagonal ghost full bits for its current row; once they are filled, it computes its segment.

Since the full-bit write at the end of the segment is asynchronous, the first core can continue iterating over its rows. This allows multiple rows to be computed simultaneously in a software-pipelined fashion. Grappa's aggregation helps coarsen this communication automatically compared to the reference MPI version.

The iteration is here: https://github.com/uwsampa/grappa/blob/nelson%2BGlobalArray/applications/intelParRes/synch_p2p/p2p-border.cpp#L129
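
A rough shared-memory analogue of the scheme, with std::thread standing in for a core and a condition-variable-backed GhostCell standing in for a full-bit-wrapped double (all names are illustrative; none of this is Grappa's API or the actual benchmark code):

```cpp
// Shared-memory analogue of the strip/ghost-column scheme (illustrative only).
#include <algorithm>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct GhostCell {
  std::mutex m;
  std::condition_variable cv;
  bool full = false;
  double v = 0.0;
  void write(double x) {            // fill the cell and wake the waiter
    { std::lock_guard<std::mutex> g(m); v = x; full = true; }
    cv.notify_one();
  }
  double read() {                   // block until the cell has been filled
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return full; });
    return v;
  }
};

int main() {
  const int m = 1024, n = 1024, P = 4;              // rows, columns, "cores"
  std::vector<std::vector<double>> grid(m, std::vector<double>(n, 0.0));
  for (int j = 0; j < n; ++j) grid[0][j] = j;       // top boundary
  for (int i = 0; i < m; ++i) grid[i][0] = i;       // left boundary

  // ghost[p][i]: last value of strip p-1 in row i, written by the left core.
  std::vector<std::vector<GhostCell>> ghost;
  ghost.reserve(P);
  for (int p = 0; p < P; ++p) ghost.emplace_back(m);

  auto worker = [&](int p) {
    const int lo = std::max(1, p * (n / P));
    const int hi = (p == P - 1) ? n : (p + 1) * (n / P);
    double diag = grid[0][lo - 1];                  // diagonal ghost for row 1
    for (int i = 1; i < m; ++i) {
      // Block until the left neighbor has finished this row's segment.
      double left = (p == 0) ? grid[i][lo - 1] : ghost[p][i].read();
      for (int j = lo; j < hi; ++j) {
        double l  = (j == lo) ? left : grid[i][j - 1];
        double ul = (j == lo) ? diag : grid[i - 1][j - 1];
        grid[i][j] = grid[i - 1][j] + l - ul;
      }
      diag = left;                                  // next row's diagonal ghost
      // The writer keeps going after this, so successive rows pipeline.
      if (p + 1 < P) ghost[p + 1][i].write(grid[i][hi - 1]);
    }
  };

  std::vector<std::thread> ts;
  for (int p = 0; p < P; ++p) ts.emplace_back(worker, p);
  for (auto& t : ts) t.join();
  // With these boundaries grid[i][j] == i + j, so the corner checks the sweep.
  std::printf("corner = %g (expected %d)\n", grid[m - 1][n - 1], m + n - 2);
}
```

Because the ghost write at the end of each segment does not block the writer, the left strip races ahead to later rows while the right strip catches up, which is the software pipelining described above.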

Blocking version

Supporting this version well would be ideal: it wraps every cell in the array with a full bit, and then does blocking reads and blocking writes. Unfortunately, the multiple round trips, locality tests, and array indexing kill performance right now; the array indexing is currently the largest contributor.

https://github.com/uwsampa/grappa/blob/nelson%2BGlobalArray/applications/intelParRes/synch_p2p/p2p.cpp#L78
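
For comparison, here is a shared-memory sketch of the blocking scheme, where every cell (rather than just a ghost column) carries a full bit and each read blocks until the producing cell has been written. FECell is an illustrative stand-in, not Grappa's actual full-bit type:

```cpp
// Shared-memory sketch of the per-cell blocking scheme (illustrative only).
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct FECell {
  std::mutex m;
  std::condition_variable cv;
  bool full = false;
  double v = 0.0;
  void write(double x) {                 // fill the cell exactly once
    { std::lock_guard<std::mutex> g(m); v = x; full = true; }
    cv.notify_all();                     // several dependents may be waiting
  }
  double read() {                        // block until full; do not empty
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return full; });
    return v;
  }
};

int main() {
  const int m = 512, n = 512, P = 4;
  std::vector<FECell> grid(m * n);
  auto cell = [&](int i, int j) -> FECell& { return grid[i * n + j]; };
  for (int j = 0; j < n; ++j) cell(0, j).write(j);   // boundary row
  for (int i = 1; i < m; ++i) cell(i, 0).write(i);   // boundary column

  auto worker = [&](int p) {             // thread p owns a strip of columns
    const int lo = std::max(1, p * (n / P));
    const int hi = (p == P - 1) ? n : (p + 1) * (n / P);
    for (int i = 1; i < m; ++i)
      for (int j = lo; j < hi; ++j)
        // The three blocking reads stall until the producing cells have been
        // written, so no explicit ghost column or message is needed.
        cell(i, j).write(cell(i - 1, j).read() + cell(i, j - 1).read()
                         - cell(i - 1, j - 1).read());
  };

  std::vector<std::thread> ts;
  for (int p = 0; p < P; ++p) ts.emplace_back(worker, p);
  for (auto& t : ts) t.join();
  // With these boundaries cell(i, j) == i + j.
  return cell(m - 1, n - 1).read() == double(m + n - 2) ? 0 : 1;
}
```

In the Grappa version, each of those three reads can turn into a remote round trip, which is where the blocking version loses its performance.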

Feed-forward version

In this version, a cell forwards its value to its dependents as soon as it is computed. Each array cell contains a semaphore along with the data, which tracks when the cell has received all of the values it depends on. Again, performance is bad right now due to locality tests and array indexing.

https://github.com/uwsampa/grappa/blob/nelson%2BGlobalArray/applications/intelParRes/synch_p2p/p2p-forward.cpp#L88
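
A shared-memory sketch of this push-based scheme, where each cell counts its outstanding contributions and a finished cell delivers its value (with the appropriate sign) to its dependents. Again illustrative only; Cell and its operations are not Grappa's API:

```cpp
// Shared-memory sketch of the feed-forward scheme (illustrative only).
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct Cell {
  std::mutex m;
  std::condition_variable cv;
  int remaining = 3;          // contributions still expected (0 for boundary)
  double acc = 0.0;           // running sum of delivered contributions
  void deliver(double contribution) {
    { std::lock_guard<std::mutex> g(m); acc += contribution; --remaining; }
    cv.notify_one();
  }
  double await() {            // block until every contribution has arrived
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return remaining == 0; });
    return acc;
  }
};

int main() {
  const int m = 512, n = 512, P = 4;
  std::vector<Cell> grid(m * n);
  auto cell = [&](int i, int j) -> Cell& { return grid[i * n + j]; };
  for (int j = 0; j < n; ++j) { cell(0, j).remaining = 0; cell(0, j).acc = j; }
  for (int i = 0; i < m; ++i) { cell(i, 0).remaining = 0; cell(i, 0).acc = i; }

  auto worker = [&](int p) {  // thread p owns a strip of columns, row 0 included
    const int lo = (p == 0) ? 0 : p * (n / P);
    const int hi = (p == P - 1) ? n : (p + 1) * (n / P);
    for (int i = 0; i < m; ++i)
      for (int j = lo; j < hi; ++j) {
        double v = cell(i, j).await();            // wait for up/left/upleft
        // Forward v to dependents: + as their "up", + as their "left",
        // - as their diagonal term. Boundary cells never receive deliveries.
        if (i + 1 < m && j >= 1)    cell(i + 1, j).deliver(+v);
        if (j + 1 < n && i >= 1)    cell(i, j + 1).deliver(+v);
        if (i + 1 < m && j + 1 < n) cell(i + 1, j + 1).deliver(-v);
      }
  };

  std::vector<std::thread> ts;
  for (int p = 0; p < P; ++p) ts.emplace_back(worker, p);
  for (auto& t : ts) t.join();
  // With these boundaries cell(i, j) == i + j.
  return cell(m - 1, n - 1).await() == double(m + n - 2) ? 0 : 1;
}
```

Since the kernel is a sum, the delivery here just accumulates signed contributions into the target cell; the per-cell counter plays the role of the semaphore described above.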

Combining futures

(to be explained)

Performance

Single-core performance

  • Reference OpenMP version (5Kx5K problem): 407 MFlops/s
  • Reference MPI version (20Kx20K problem): 879 MFlops/s
  • Grappa MPI-like version (20Kx20K problem): 871 MFlops/s

The following versions are currently limited primarily by our unoptimized, proof-of-concept array indexing:

  • Grappa blocking version (20Kx20K problem): ~5 MFlops/s
  • Grappa feed-forward version (20Kx20K problem): ~5 MFlops/s

Multi-core performance

  • Reference OpenMP version (6Kx6K problem, 8 cores): 2040 MFlops/s
  • Reference MPI version (20Kx20K problem, 8 cores): 1941 MFlops/s
  • Grappa MPI-like version (20Kx20K problem, 8 cores): 6802 MFlops/s

Multi-node performance

  • Reference MPI version (40Kx40K problem, 8 nodes x 8 cores): 6608 MFlops/s
  • Grappa MPI-like version (40Kx40K problem, 8 nodes x 8 cores): 52594 MFlops/s