AMT Sync&Trans

Adrian edited this page Sep 13, 2021 · 1 revision

Synchronization and Data Transfers

The generation of synchronization and data transfers is currently handled by the AMT. In order to generate these additional nodes, a basic version of the AMT needs to be present. The generation of the synchronization and data transfers takes this basic AMT as input and creates a synchronized version of it.

Generation

Currently, any parallel computation is performed within the main function of the program (the root of the APT). As such, it is sufficient to consider only the direct descendants of the root when generating potential barriers and data transfers. The generation happens in seven different steps:

  1. Parallel Groups: In the first step, the parallel pattern calls within the AMT are grouped according to their former APT node. These parallel groups are used to avoid early synchronization and subsequent data races or deadlocks. The parallel group associated with a parallel call node is always known by the node. For each parallel group it is also known whether a member is the first to be handled or the last. Therefore, data transfers and synchronization can be generated for each parallel group: the data needed by all members of the group can be transferred before the first member begins computation, and synchronization steps affecting all members are only performed once each member has started its computation.
  2. Definition of the Initial Flow: The second step creates initial values for the current placement of all variables available in the scope of the root node. The initial placement for each variable is defined to be in the main memory of the default device.
  3. Node Traversal: The following steps are repeated for each child node of the root. The order of the traversal is defined by the order of the nodes within the basic AMT.
  4. Dynamic Programming Handling: If the current node is a parallel pattern call referencing a dynamic programming pattern, data transfers and synchronization within the time step loop are generated. The synchronization specifies all execution units taking part in the dynamic programming recursion. The data transfer sends all written data to all participating execution units. The generated nodes are added to the specification of the parallel call node.
  5. Synchronization Point: The current data placements are compared with the data flow of the current node. If the current node is a parallel pattern call and all other members of the same group have already been traversed, all execution units used by the parallel group are synchronized. This step creates a barrier node after the current node.
  6. Data Transfers: The current data placements and the data flow of the next child node are compared. Comparing the incoming data flow of the next node with the current data placements detects overlapping data slices on different devices. In this case, a data movement is generated after the current node, specifying the data to be transferred. Since data placements can be arbitrarily small, they are maximized in order to reduce the overall latency.
  7. Update Placements: The current data placements are updated according to the data transfers and write accesses discovered in the previous steps. The new data placements are maximized in order to reduce the runtime of the algorithm. Data placements written to in the current step are erased from all other devices.
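The steps above can be condensed into a small sketch of the traversal. The node model, all field names, and the `generate_sync_and_transfers` helper below are illustrative assumptions rather than the thesis implementation; dynamic programming handling (step 4) and placement maximization (steps 6 and 7) are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical, simplified node model for the flat child list of the AMT
# root; only the fields needed for this sketch are present.
@dataclass
class Node:
    name: str
    group: Optional[str] = None             # former APT node (parallel group), None if serial
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    device: str = "default"                 # device executing this node


def generate_sync_and_transfers(children, variables):
    """Return the child sequence with barrier/transfer markers inserted."""
    # Step 2: every variable initially resides in main memory of the default device.
    placement = {v: {"default"} for v in variables}

    # Step 1: count group members so the last member of each group is known.
    remaining = {}
    for n in children:
        if n.group is not None:
            remaining[n.group] = remaining.get(n.group, 0) + 1

    out = []
    for i, node in enumerate(children):     # Step 3: traverse in AMT order
        out.append(node.name)

        # Step 5: synchronize the group's execution units after its last member.
        if node.group is not None:
            remaining[node.group] -= 1
            if remaining[node.group] == 0:
                out.append(f"barrier({node.group})")

        # Step 7: a write makes this device the only valid placement.
        for v in node.writes:
            placement[v] = {node.device}

        # Step 6: transfer data the next node reads but holds on another device.
        if i + 1 < len(children):
            nxt = children[i + 1]
            for v in sorted(nxt.reads):
                if nxt.device not in placement[v]:
                    out.append(f"transfer({v} -> {nxt.device})")
                    placement[v].add(nxt.device)
    return out
```

For two grouped calls writing on a GPU followed by a serial reduction on the default device, the sketch emits the barrier directly after the last group member and the transfers directly before the reading node, matching the ordering constraints described in steps 5 and 6.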

Extracted from: (Master thesis at RWTH) HPC Code Generation for Parallel Pattern Based Algorithms on Heterogeneous Architectures by Adrian Schmitz, 2021
