Commit 6820199: convolution documentation fixes
singagan committed Apr 25, 2024
1 parent c36b06f
Showing 10 changed files with 107 additions and 64 deletions.

39 changes: 25 additions & 14 deletions programming_examples/ml/bottleneck/README.md

# <ins>Bottleneck Block</ins>
## Introduction
The bottleneck block is a key component in deep neural network architectures like ResNet. It is designed to help address the challenge of training deep networks by reducing computational costs while maintaining or improving performance. This README provides an overview of the process and considerations for accelerating a bottleneck block on a single NPU column.


## Bottleneck Block Overview
The components and functionality of a standard bottleneck block:
</h3>
</p>

## Source Files Overview

```
.
+-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
+-- bottleneck_block.png # Figure describing the layers in the bottleneck block after fusing ReLU and batch norm into the convolution layer.
+-- bottleneck_pipeline.png # Figure describing our implementation bottleneck block on a single NPU Column.
+-- Makefile # Contains instructions for building and compiling software projects.
+-- README.md # This file.
+-- run.lit # For LLVM Integrated Tester (LIT) of the design.
+-- test.py # Python code testbench for the design example.
```

## NPU Implementation

We map the bottleneck block onto a single NPU column in a depth-first manner, where the output of one convolutional operation on an AIE core is sent directly to another convolutional operation on a separate AIE core, without transferring intermediate results off-chip.
In our bottleneck pipeline implementation, every adjacent ReLU operation is fused into the convolution operation using the approach described in [conv2d_fused_relu](../conv2d_fused_relu). Fusing adjacent convolution and batch norm layers is another inference-time optimization, which folds the batch norm parameters into the weight and bias of the convolution layer (a folding sketch is shown after the figure below). The remaining layers of the bottleneck block are mapped onto a single NPU column with one `Shim Tile (0,0)` and one `Mem Tile (0,1)`, along with four AIE compute tiles spanning from `AIE (0,2)` to `AIE (0,5)`, as illustrated in the figure below.

<p align="center">
  <img src="bottleneck_pipeline.png">
  <h3 align="center">Depth-first implementation of the bottleneck block pipeline on a single NPU column.
  </h3>
</p>
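
As background for the convolution/batch-norm fusion mentioned above, here is a minimal NumPy sketch of folding an inference-time batch norm into the preceding convolution's weight and bias. Parameter names and shapes are illustrative assumptions, not taken from this repository.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm into the preceding conv's weight/bias.

    w: (Cout, Cin, K, K) conv weights, b: (Cout,) conv bias,
    gamma/beta/mean/var: (Cout,) BatchNorm parameters.
    """
    scale = gamma / np.sqrt(var + eps)            # per output channel
    w_fused = w * scale[:, None, None, None]      # scale every filter
    b_fused = beta + (b - mean) * scale           # shift the bias
    return w_fused, b_fused

# Quick equivalence check using a 1x1 convolution expressed as an einsum.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))                          # (Cin, H, W)
w = rng.standard_normal((16, 8, 1, 1)); b = rng.standard_normal(16)
gamma, beta = rng.standard_normal(16), rng.standard_normal(16)
mean, var = rng.standard_normal(16), rng.random(16) + 0.5

conv = lambda w_, b_: np.einsum("oikl,ihw->ohw", w_, x) + b_[:, None, None]
ref = gamma[:, None, None] * (conv(w, b) - mean[:, None, None]) \
      / np.sqrt(var + 1e-5)[:, None, None] + beta[:, None, None]
wf, bf = fold_batchnorm(w, b, gamma, beta, mean, var)
assert np.allclose(conv(wf, bf), ref)             # fused conv == conv followed by BN
```
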
The data movement within this pipeline is orchestrated using the ObjectFifo (OF) primitive. Initially, the input activation is brought into the array via `Shim Tile (0,0)`. We broadcast the data to both `AIE (0,2)` and `AIE (0,4)` via `Mem Tile (0,1)` to perform the first convolution and the skip addition of the bottleneck block, respectively. Since `AIE (0,4)` must wait for additional data from the other kernels before it can execute, buffering its copy of the data in `Mem Tile (0,1)` is essential to prevent stalls in the broadcast. Due to the data's size, buffering it directly in the smaller L1 memory module of `AIE (0,4)` is impractical. Therefore, we require two OFs: one that broadcasts to `AIE (0,2)` and the Mem tile, and another that moves data from the Mem tile to `AIE (0,4)`. These two OFs are linked so that data from the first OF is implicitly copied to the second OF through the Mem tile's DMA; a small host-side model of why this extra buffering matters follows below.
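
The sketch below is a plain-Python, host-side model of that broadcast. It does not use the IRON/MLIR-AIE bindings (the real ObjectFifo declarations are in `aie2.py`), and the buffer depths and consumption pattern are illustrative assumptions. It shows why a shallow skip-path buffer stalls the broadcast, while the deeper Mem-tile buffer does not:

```python
from collections import deque

class BoundedFifo:
    """Stand-in for one ObjectFifo endpoint with a fixed number of buffers."""
    def __init__(self, depth):
        self.depth, self.q = depth, deque()
    def can_push(self):
        return len(self.q) < self.depth
    def push(self, item):
        self.q.append(item)
    def pop(self):
        return self.q.popleft()

def broadcast_stalls(skip_depth, num_tiles=8):
    """Count how often the Shim broadcast must wait, for a given depth of the
    skip-path buffer that feeds AIE (0,4)."""
    conv_in = BoundedFifo(2)           # Shim -> AIE (0,2): consumed every step
    skip_in = BoundedFifo(skip_depth)  # Shim -> (Mem tile) -> AIE (0,4)
    stalls = 0
    for t in range(num_tiles):
        # The broadcast can only proceed when *both* destinations have room.
        while not (conv_in.can_push() and skip_in.can_push()):
            stalls += 1                # wait until AIE (0,4) finally frees a slot
            skip_in.pop()
        conv_in.push(t)
        skip_in.push(t)
        conv_in.pop()                  # AIE (0,2) consumes immediately (1x1 conv + ReLU)
        # AIE (0,4) consumes nothing yet: it still waits on the 3x3 conv results.
    return stalls

print(broadcast_stalls(skip_depth=2))  # shallow L1 buffer  -> broadcast stalls
print(broadcast_stalls(skip_depth=8))  # Mem-tile buffering -> no stalls
```
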

Starting from `AIE (0,2)`, data is processed by each compute tile, with the intermediate activations forwarded to the subsequent tile. `AIE (0,2)` handles the 1x1 convolution with the fused ReLU operation. Based on our hand analysis, we partition the 3x3 convolution across two cores, `AIE (0,3)` and `AIE (0,5)`, to balance computation and to distribute the weights across two cores effectively. The feature map from the 1x1 convolution is therefore broadcast to `AIE (0,3)` and `AIE (0,5)` so that all required input channels are available for generating the output feature maps of the 3x3 convolution. We split the output feature map across these cores, with each core computing half of the total output channels. The outputs from `AIE (0,3)` and `AIE (0,5)` are then merged in `AIE (0,4)` to perform the final 1x1 convolution. This final convolution also integrates the skip addition, using the initial input to the bottleneck block and the output of the 1x1 convolution, after which the final ReLU activation is applied. The resulting output feature map is transmitted from `AIE (0,4)` back out of the array via `Shim Tile (0,0)`. Although not shown in the figure, weights are transferred separately through a `Shim Tile (0,0)` channel into `Mem Tile (0,1)`, which distributes them to the appropriate AIE cores in parallel, leveraging the Mem tile's large number of DMA channels. A NumPy reference of this per-tile dataflow follows below.
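
As a functional reference for this mapping (not the AIE kernel code; the layer sizes are illustrative assumptions and the layout is a plain `(C, H, W)` float layout rather than the tiled int8 layout used on the NPU), the per-tile dataflow can be sketched in NumPy as:

```python
import numpy as np

def conv2d(x, w, pad):
    """x: (Cin, H, W), w: (Cout, Cin, K, K) -> (Cout, H, W) with zero padding."""
    cout, cin, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wdt = x.shape[1], x.shape[2]
    out = np.zeros((cout, h, wdt), dtype=np.float32)
    for i in range(h):
        for j in range(wdt):
            patch = xp[:, i:i + k, j:j + k]              # (Cin, K, K)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

relu = lambda t: np.maximum(t, 0)

cin, cmid, h, wdth = 64, 16, 8, 8                       # illustrative sizes
x = np.random.rand(cin, h, wdth).astype(np.float32)     # input activation (also the skip path)
w1 = np.random.rand(cmid, cin, 1, 1).astype(np.float32)
w3 = np.random.rand(cmid, cmid, 3, 3).astype(np.float32)
w2 = np.random.rand(cin, cmid, 1, 1).astype(np.float32)

a = relu(conv2d(x, w1, pad=0))                # AIE (0,2): 1x1 conv + fused ReLU
top = conv2d(a, w3[: cmid // 2], pad=1)       # AIE (0,3): first half of output channels
bot = conv2d(a, w3[cmid // 2 :], pad=1)       # AIE (0,5): second half of output channels
b = np.concatenate([top, bot], axis=0)        # merged at AIE (0,4)
y = relu(conv2d(b, w2, pad=0) + x)            # AIE (0,4): 1x1 conv + skip add + ReLU
print(y.shape)                                # (64, 8, 8)
```
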

We use the following architectural techniques to implement our bottleneck pipeline:

1. Depth-First Implementation: Spatial architectures provide coarse-grained flexibility that allows for tailoring of the data flow to optimize data movement. By tailoring the dataflow, we implement a depth-first schedule for a bottleneck block where the output of one convolutional operation on an AIE core is sent directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip. This approach effectively minimizes the memory footprint associated with intermediate data, mitigating the overhead of costly off-chip accesses and increasing the overall performance.

2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations. Please refer to our [conv2d](../conv2d) design for details on the data layout.

3. Kernel Optimization: Please refer to our [conv2d](../conv2d) design for details on vectorizing convolution 2D.

4. Quantization: We use int8 precision for activations and weights. At int8 precision, the AIE offers its highest compute density with 256 MAC/cycle. A requantization sketch, combined with the fused ReLU of the next item, follows this list.

5. Layer Fusion: We employ the AIE's SRS (shift-round-saturate) capabilities to fuse the ReLU directly into the convolution operation. This integration eliminates separate ReLU computations and streamlines the convolution pipeline. Please refer to our [conv2d_fused_relu](../conv2d_fused_relu) design for details on fusing ReLU into the convolution layer.
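
As a rough illustration of items 4 and 5, the sketch below shows int8 MAC accumulation in int32 followed by a shift-round-saturate step that folds in the ReLU by clamping at zero. The shift value, rounding mode, and vector size are illustrative assumptions, not the settings used by the actual kernels in [conv2d](../conv2d) and [conv2d_fused_relu](../conv2d_fused_relu).

```python
import numpy as np

def srs_relu(acc_int32, shift):
    """Requantize int32 accumulators to int8, fusing ReLU by clamping at 0."""
    scaled = np.floor(acc_int32 / float(1 << shift) + 0.5)   # shift with rounding
    return np.clip(scaled, 0, 127).astype(np.int8)           # saturate; lower bound 0 acts as ReLU

rng = np.random.default_rng(0)
act = rng.integers(-128, 128, size=(8,), dtype=np.int8)   # 8 int8 activations
wts = rng.integers(-128, 128, size=(8,), dtype=np.int8)   # 8 int8 weights
acc = np.dot(act.astype(np.int32), wts.astype(np.int32))  # int32 MAC accumulation
print(srs_relu(np.array([acc]), shift=8))                 # requantized, ReLU-fused int8 result
```
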

## Compilation
To compile the design:
1 change: 0 additions & 1 deletion programming_examples/ml/bottleneck/requirements.txt

This file was deleted.
