AMT Generation

The generation of the AMT is implemented using two different strategies:

Normal Mode

This transformation uses the output provided by the optimization package to generate a corresponding AMT. It is implemented as part of the optimization package.

The starting point for this transformation is given by the results of the optimization package, stored as the FlatAPT and the mapping table. Both data structures are based on the APT root and assume that no further descendants are to be computed in parallel. To expose the best parallel potential, all parallel calls are inlined into the APT root.

FlatAPT

The FlatAPT defines an ordered list of steps. These steps contain data-flow-independent APT nodes, so nodes within the same step can be computed regardless of each other. Each node within a step has a data dependency on a node from the preceding step. Parallel nodes can be split into smaller tasks in order to optimize the load balance.
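
A minimal sketch of this structure could look as follows; the class and node names are illustrative assumptions, not the project's actual types:

from dataclasses import dataclass, field
from typing import List

@dataclass
class AptNode:
    name: str
    parallel: bool = False          # parallel nodes may be split into smaller tasks
    children: List["AptNode"] = field(default_factory=list)

@dataclass
class FlatApt:
    # steps[i+1] only contains nodes that depend on data produced in steps[i]
    steps: List[List[AptNode]] = field(default_factory=list)

flat_apt = FlatApt(steps=[
    [AptNode("load_input")],                   # step 0
    [AptNode("map_kernel", parallel=True)],    # step 1: depends on step 0
    [AptNode("reduce_sum", parallel=True)],    # step 2: depends on step 1
])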

Mapping Table

The mapping table is used to assign parallel splits to a specific processor. Currently, the work is not assigned to specific cores or even threads due to the complexity of the task; thus, it can be assumed that the corresponding processor utilizes all of its threads. The mapping table also assigns the parallel splits to parallel steps, similar to the steps within the FlatAPT. Since the mapping table does not include the sequential nodes, the mapping table and the FlatAPT are not aligned.
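
The mapping table can be pictured as follows; the structure and names below are assumptions for illustration only:

from dataclasses import dataclass
from typing import List

@dataclass
class ParallelSplit:
    node_name: str        # the parallel APT node this split belongs to
    index_range: range    # the chunk of the iteration space handled by this split
    processor: int        # target processor; all of its threads are assumed to be used

mapping_table: List[List[ParallelSplit]] = [
    # parallel step 0: two splits of "map_kernel" on processors 0 and 1
    [ParallelSplit("map_kernel", range(0, 512), 0),
     ParallelSplit("map_kernel", range(512, 1024), 1)],
    # parallel step 1: two splits of "reduce_sum"
    [ParallelSplit("reduce_sum", range(0, 512), 0),
     ParallelSplit("reduce_sum", range(512, 1024), 1)],
]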

Generation

The AMT is constructed by iterating over the FlatAPT while tracking the mapping table. The mapping table follows the same data-flow-dependent structure as the FlatAPT; thus, each parallel step can only start when all preceding steps are finished. The mapping table and the FlatAPT can therefore be aligned by testing whether the parallel splits of the current mapping table step are the same as the parallel splits of the current FlatAPT step. Then the parallel and serial steps can be generated.
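
The alignment loop could look roughly like the sketch below, reusing the illustrative FlatApt and ParallelSplit structures from above; generate_serial_step and generate_parallel_step are placeholders for the actual step generation:

def generate_serial_step(nodes):
    return ("serial", [n.name for n in nodes])

def generate_parallel_step(splits):
    return ("parallel", [(s.node_name, s.processor) for s in splits])

def generate_amt(flat_apt, mapping_table):
    amt = []
    parallel_idx = 0
    for step in flat_apt.steps:
        serial_nodes = [n for n in step if not n.parallel]
        parallel_nodes = [n for n in step if n.parallel]
        if serial_nodes:
            amt.append(generate_serial_step(serial_nodes))
        if parallel_nodes and parallel_idx < len(mapping_table):
            splits = mapping_table[parallel_idx]
            # aligned if the splits of the current mapping table step stem
            # from the parallel nodes of the current FlatAPT step
            if {s.node_name for s in splits} == {n.name for n in parallel_nodes}:
                amt.append(generate_parallel_step(splits))
                parallel_idx += 1
    return amt

print(generate_amt(flat_apt, mapping_table))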

Serial Step

The generation of a serial step is defined as the transformation of all serial splits within one serial step into AMT nodes. Since all nodes within the same step are independent, the order of their computation is arbitrary. To generate a single serial split, a new AMT node is generated, which recursively transforms all of its children into AMT nodes as well.
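
A hedged sketch of this recursive transformation, again with made-up node types:

from dataclasses import dataclass, field
from typing import List

@dataclass
class AmtNode:
    name: str
    children: List["AmtNode"] = field(default_factory=list)

def transform_serial(apt_node) -> AmtNode:
    # the order within a step is arbitrary, so every node is handled on its own;
    # children are transformed recursively into AMT nodes as well
    return AmtNode(apt_node.name,
                   [transform_serial(child) for child in apt_node.children])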

Parallel Step

For the generation of a parallel step, it is important how and where a parallel split is to be executed; the sketch after the list illustrates the different cases:

  • CPU: Splits targeting the CPU can be transformed similarly to serial splits; the information on the target device and the index range is added.
  • Fused Split: The fused split represents a pipeline/loop fusion optimization. This parallel split contains another list of parallel splits, which need to be considered for the generation.
  • GPU: Splits targeting the GPU only assign a single streaming multiprocessor within the mapping table. To better utilize the architecture of GPUs, connected splits are fused; the number of fused splits then defines the grid size. For fused splits on the GPU, all children of the fused node need to be fused. Due to the linearity of the data accesses, this horizontal fusion (within the same step of the fused split) of fused splits is possible.
  • Reductions: Reductions need to optimize the recombination of the partial results. As such, reduction calls need to be generated independent of the target architecture.
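
The dispatch over these split kinds might be sketched as follows; the kind attribute and the concrete handling are assumptions, not the actual implementation:

def transform_parallel_split(split):
    if split.kind == "cpu":
        # like a serial split, but with target device and index range attached
        return ("cpu_task", split.node_name, split.processor, split.index_range)
    if split.kind == "fused":
        # pipeline/loop fusion: the split carries a further list of parallel splits
        return ("fused", [transform_parallel_split(s) for s in split.inner_splits])
    if split.kind == "gpu":
        # connected GPU splits are fused; their number defines the grid size
        return ("gpu_task", split.node_name, {"grid_size": split.fused_count})
    if split.kind == "reduction":
        # a reduction call recombines the partial results, regardless of the target
        return ("reduction", split.node_name)
    raise ValueError("unknown split kind: " + split.kind)

from types import SimpleNamespace
cpu_split = SimpleNamespace(kind="cpu", node_name="map_kernel",
                            processor=0, index_range=range(0, 512))
print(transform_parallel_split(cpu_split))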

Debug Mode

The debug mode is based on annotations in the front-end in order to generate an AMT without depending on the optimization. As the name suggests, this implementation is used to debug the code generator.

Front-End Annotations

res = some_parallel_call<<<[Ni, Di, Pi], [Ni+1, Di+1, Pi+1]...>>>(parameters...)

The code example above shows what these annotations look like in the current front-end. Within the <<<>>> notation the user can specify an arbitrary number of target architectures. These architectures are defined as arrays with three elements:

  1. N: The first element defines the compute node which should be used.
  2. D: The second element defines the device on the previously chosen compute node on which the code shall be executed.
  3. P: The third element defines the processor on a given device to compute the parallel call.

These three parameters reference the corresponding hardware defined in the HL. An example of this can be seen below, where the increment function should run on processor 0 of device 0 on node 0 and on processor 0 of device 0 on node 1:

res = increment<<<[0,0,0], [1,0,0]>>>(input)

APT Depiction

The annotations from the front-end are stored as additional arguments of the MetaList type within the APT. They are always the last additional arguments, since the annotations can vary in number.
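
A hypothetical depiction of how such a call node might carry its annotations, purely for illustration (the actual APT classes and the MetaList representation differ):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ParallelCallNode:
    name: str
    arguments: List[str]
    # trailing annotation triples (node, device, processor); they come last
    # because their number can vary between calls
    targets: List[Tuple[int, int, int]] = field(default_factory=list)

call = ParallelCallNode("increment", ["input"], targets=[(0, 0, 0), (1, 0, 0)])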

AMT Transformation

The debug mode directly transforms most of the APT nodes into AMT nodes. When encountering annotated parallel calls, the parallel pattern is split into equally sized chunks to be computed on the different target processors. Due to the limitations of such annotations, it is not possible to artificially create fused nodes. Furthermore, the horizontal fusion of GPU nodes is not supported in the debug mode.
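
The equal chunking could be sketched like this; the function name and the dictionary layout are illustrative assumptions:

def split_equally(total_size, targets):
    # one equally sized chunk per annotated (node, device, processor) target
    chunk = total_size // len(targets)
    splits = []
    for i, (node, device, processor) in enumerate(targets):
        start = i * chunk
        # the last chunk absorbs the remainder of an uneven division
        end = total_size if i == len(targets) - 1 else start + chunk
        splits.append({"node": node, "device": device,
                       "processor": processor, "range": (start, end)})
    return splits

# e.g. increment<<<[0,0,0], [1,0,0]>>>(input) over 1000 elements
print(split_equally(1000, [(0, 0, 0), (1, 0, 0)]))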