diff --git a/programming_examples/ml/bottleneck/README.md b/programming_examples/ml/bottleneck/README.md
index a408e3f768..eca14aaa6b 100644
--- a/programming_examples/ml/bottleneck/README.md
+++ b/programming_examples/ml/bottleneck/README.md
@@ -10,7 +10,7 @@
# Bottleneck Block
## Introduction
-The bottleneck block is a key component in deep neural network architectures like ResNet. It is designed to help address the challenge of training deep networks by reducing computational costs while maintaining or improving performance. This README provides an overview of the process and considerations for accelerating a single bottleneck block.
+The bottleneck block is a key component in deep neural network architectures like ResNet. It is designed to help address the challenge of training deep networks by reducing computational costs while maintaining or improving performance. This README provides an overview of the process and considerations for accelerating a bottleneck block on a single NPU column.
## Bottleneck Block Overview
@@ -36,35 +36,46 @@ The components and functionality of a standard bottleneck block:
+## Source Files Overview
+
+```
+.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- bottleneck_block.png # Figure describing the layers in the bottleneck block after fusing ReLU and batch norm into the convolution layer.
++-- bottleneck_pipeline.png # Figure describing our implementation of the bottleneck block on a single NPU column.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
+
## NPU Implementation
-In our bottleneck pipeline implementation, every adjacent ReLU operation is fused into the convolution operation using the approach described in [conv2d_fused_relu](../conv2d_fused_relu). Fusing adjacent convolution and batch norm layers is another inference-time optimization, which involves updating the weight and bias of the convolution layer. The remaining layers of the bottleneck block are mapped onto a single column of NPU with one Shim Tile (0,0) and one Mem Tile (0,1), along with four AIE computer tiles spanning from (0,2) to (0,5), as illustrated in the figure below.
+We map the bottleneck block onto a single NPU column in a depth-first manner, where the output of one convolutional operation on an AIE core is sent directly to the next convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip.
+In our bottleneck pipeline implementation, every adjacent ReLU operation is fused into the convolution operation using the approach described in [conv2d_fused_relu](../conv2d_fused_relu). Fusing adjacent convolution and batch norm layers is another inference-time optimization, which involves updating the weight and bias of the convolution layer. The remaining layers of the bottleneck block are mapped onto a single NPU column with one `Shim Tile (0,0)` and one `Mem Tile (0,1)`, along with four AIE compute tiles spanning from (0,2) to (0,5), as illustrated in the figure below.
-
-
Depth-first implementation of bottleneck block pipeline on a single column of NPU.
+
+
Depth-first implementation of bottleneck block pipeline on a single column of NPU.
-The data movement within this pipeline is orchestrated using the ObjectFifo (OF) primitive. Initially, input activation is brought into the array via the `Shim Tile (0,0)`. We broadcast the data to both `AIE (0,2)` and `AIE (0,4)` to perform the very first convolution and skip addition operation in the bottleneck block, respectively. Since tile (0,4) must await additional data from other kernels before proceeding with its execution, buffering the data for tile (0,4) within the Mem tile is imperative to prevent any stalls in the broadcast process. Due to the data's size, direct buffering in the smaller L1 memory module of tile (0,5) is impractical. Therefore, we require two OFs: one for broadcasting to tile (0,2) and the Mem tile and another for data movement between the Mem tile and tile (0,5). These two OFs are interconnected to indicate that data from the first OF should be implicitly copied to the second OF through the Mem tile's DMA.
-Starting from the tile (0,2), data is processed by each compute tile, with the intermediate activations being forwarded to the subsequent tile. AIE (0,2) handles 1x1 convolution with fused ReLU operation. Based on our hand analysis, we partition the 3x3 convolution across two cores to balance computation and accommodate weight distribution across two cores effectively. The resulting feature map from the 1x1 convolution is broadcasted to AIE (0,3) and AIE (0,5) to ensure all required input channels are available for generating output feature maps in the subsequent 3x3 convolution. We split the output feature map processing across AIE (0,3) and AIE (0,5), with each core computing half of the total channels. The outputs from AIE (0,3) and AIE (0,5) are then merged in AIE (0,4) to perform the final 1x1 convolution. This final convolution operation also integrates skip addition, utilizing the initial input and the output of the 1x1 convolution. Following this, the final ReLU activation is applied. Finally, the output is transmitted from the tile (0,4) back to the output via the Shim tile.
+The data movement within this pipeline is orchestrated using the ObjectFifo (OF) primitive. Initially, the input activation is brought into the array via `Shim Tile (0,0)`. We broadcast the data to both `AIE (0,2)` and `AIE (0,4)` via `Mem Tile (0,1)` to perform the very first convolution and the skip addition operation in the bottleneck block, respectively. Since `AIE (0,4)` must await additional data from other kernels before proceeding with its execution, buffering the data for `AIE (0,4)` within `Mem Tile (0,1)` is imperative to prevent any stalls in the broadcast process. Due to the data's size, direct buffering in the smaller L1 memory module of `AIE (0,4)` is impractical. Therefore, we require two OFs: one for broadcasting to `AIE (0,2)` and the Mem tile, and another for data movement between the Mem tile and `AIE (0,4)`. These two OFs are linked to indicate that data from the first OF should be implicitly copied to the second OF through the Mem tile's DMA.
+
+Starting from `AIE (0,2)`, data is processed by each compute tile, with the intermediate activations being forwarded to the subsequent tile. `AIE (0,2)` handles the 1x1 convolution with a fused ReLU operation. Based on our hand analysis, we partition the 3x3 convolution across two cores, `AIE (0,3)` and `AIE (0,5)`, to balance computation and distribute the weights across the two cores effectively. Therefore, the feature map from the 1x1 convolution is broadcast to `AIE (0,3)` and `AIE (0,5)` to ensure all required input channels are available for generating output feature maps in the subsequent 3x3 convolution. We split the output feature map processing across these cores, with each core computing half of the total output channels. The outputs from `AIE (0,3)` and `AIE (0,5)` are then merged in `AIE (0,4)` to perform the final 1x1 convolution. This final convolution also integrates the skip addition, utilizing the initial input to the bottleneck block and the output of the 1x1 convolution. The final ReLU activation is then applied to obtain the final output feature map, which is transmitted from `AIE (0,4)` back to the output via `Shim Tile (0,0)`. Although not shown in the figure, weights are transferred separately through a `Shim Tile (0,0)` channel into `Mem Tile (0,1)`, which distributes them to the appropriate AIE cores in parallel, leveraging the large number of Mem Tile channels.
We use the following architectural techniques to implement our bottleneck pipeline:
1. Depth-First Implementation: Spatial architectures provide coarse-grained flexibility that allows for tailoring of the data flow to optimize data movement. By tailoring the dataflow, we implement a depth-first schedule for a bottleneck block where the output of one convolutional operation on an AIE core is sent directly to another convolutional operation on a separate AIE core, all without the need to transfer intermediate results off-chip. This approach effectively minimizes the memory footprint associated with intermediate data, mitigating the overhead of costly off-chip accesses and increasing the overall performance.
-2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
+2. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations. Please refer to our [conv2d](../conv2d) design for details on the data layout.
-3. Kernel Optimization: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing it for enhanced computational efficiency. We implement zero-padding to handle boundary conditions and ensure accurate convolution results, particularly at the edges of feature maps. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is a 4x8 matrix corresponding to 4 elements of a row and 8 input channels.
+3. Kernel Optimization: Please refer to our [conv2d](../conv2d) design for details on vectorizing the 2D convolution.
4. Quantization: We use int8 precision for activation and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
-5. Layer Fused: Initially, we employ AIE's SRS capabilities to fuse ReLU directly into the convolution operation. This integration optimizes performance by eliminating separate ReLU computations, streamlining the convolution process.
-
-Another inference-time optimization is merging BatchNorm directly into the convolution weights. This strategy reduces redundant operations, ensuring more streamlined processing and improved overall performance.
+5. Layer Fused: Initially, we employ AIE's SRS capabilities to fuse ReLU directly into the convolution operation. This integration optimizes performance by eliminating separate ReLU computations, streamlining the convolution process. Please refer to our [conv2d_fused_relu](../conv2d_fused_relu) design for details on fusing ReLU into the convolution layer.
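+
+The ReLU fusion above happens inside the kernel via SRS; the batch norm fusion mentioned earlier is an offline, host-side rewrite of the convolution parameters. The following numpy sketch illustrates that folding; the shapes and random values are purely illustrative and are not the design's actual parameters.
+
+```python
+import numpy as np
+
+# Illustrative shapes: 64 output channels, 256 input channels, 1x1 kernel.
+O, I, kH, kW = 64, 256, 1, 1
+rng = np.random.default_rng(0)
+W = rng.standard_normal((O, I, kH, kW)).astype(np.float32)  # conv weight
+b = rng.standard_normal(O).astype(np.float32)               # conv bias
+
+# Batch norm parameters learned for the same 64 output channels.
+gamma, beta = rng.standard_normal(O), rng.standard_normal(O)
+mean, var, eps = rng.standard_normal(O), rng.random(O) + 0.1, 1e-5
+
+# Batch norm after conv computes: gamma * (conv(x) + b - mean) / sqrt(var + eps) + beta.
+# Folding it into the conv scales each output-channel filter and adjusts the bias.
+scale = gamma / np.sqrt(var + eps)
+W_fused = W * scale[:, None, None, None]
+b_fused = (b - mean) * scale + beta
+
+# At inference time, the fused (W_fused, b_fused) convolution replaces conv + batch norm.
+```
+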
## Compilation
To compile the design:
diff --git a/programming_examples/ml/bottleneck/requirements.txt b/programming_examples/ml/bottleneck/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/bottleneck/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/conv2d/README.md b/programming_examples/ml/conv2d/README.md
index eefbd804ed..e2fd236567 100644
--- a/programming_examples/ml/conv2d/README.md
+++ b/programming_examples/ml/conv2d/README.md
@@ -10,23 +10,36 @@
# Convolution 2D
## Introduction
-Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. This README provides instructions for implementing convolution on AI Engine.
+Convolution is a crucial part of various machine learning and computer vision tasks, such as image recognition, object detection, and image segmentation. This README provides instructions for implementing convolution on AI Engine with 8-bit precision.
-At its core, it is a mathematical operation that combines an input tensor and a filter to produce an output tensor. The input data is represented as a multi-dimensional matrix. The filter is also represented as a multi-dimensional matrix with filter height, width, input and output channels (the same number of channels as the input data). The filter is systematically applied to different regions of the input data. At each step, the filter is element-wise multiplied by the overlapping region of the input data. The element-wise products are summed up to produce a single value, which represents the result of the convolution operation for that region. This process is repeated for all possible regions of the input data, producing an output matrix called the feature map.
+At its core, it is a mathematical operation that combines an input tensor and a filter to produce an output tensor. The input tensor is a multi-dimensional matrix with width, height, and channel dimensions. The filter is also represented as a multi-dimensional matrix with filter height, width, input, and output channels (the same number of input channels as the input tensor). The filter is systematically applied to different regions of the input data. At each step, the filter is element-wise multiplied by the overlapping region of the input tensor. The element-wise products are summed up to produce a single value, representing the result of the convolution operation for that region. This process is repeated for all possible regions of the input tensor, producing an output tensor called the feature map.
-The process of applying the filter to different regions of the input data is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it moves at each step). The convolution operation consists of seven nested loops, iterating over the input height, input lenght, input channel, output channel, filter height, filter length, and the batch size, each loop corresponding to different aspect of the operation. This systematic process extracts features from the input tensor, yielding the output feature map and illustrating the computational intricacies of convolution.
+The process of applying the filter to different regions of the input tensor is often visualized as a sliding window moving across the input data. The size of the sliding window corresponds to the size of the filter, and it moves with a certain stride (the number of pixels it moves at each step). The convolution operation consists of seven nested loops, iterating over the input height, input length, input channel, output channel, filter height, filter length, and batch size, each loop corresponding to a different aspect of the operation. This systematic process extracts features from the input tensor, yielding the output feature map. In this design, we vectorize a two-dimensional convolution with a 1x1 filter size.
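+
+To make the loop structure concrete, the snippet below is a small, unoptimized reference implementation in plain Python/numpy. The dimension names mirror the seven loops above; the sizes in the usage example are illustrative only.
+
+```python
+import numpy as np
+
+def conv2d_reference(x, w):
+    """Naive 2D convolution: x is (N, H, W, Cin), w is (Cout, Fh, Fw, Cin)."""
+    N, H, W_, Cin = x.shape
+    Cout, Fh, Fw, _ = w.shape
+    out = np.zeros((N, H - Fh + 1, W_ - Fw + 1, Cout), dtype=np.float32)
+    for n in range(N):                                 # batch size
+        for oy in range(H - Fh + 1):                   # input height
+            for ox in range(W_ - Fw + 1):              # input length (width)
+                for oc in range(Cout):                 # output channel
+                    for fy in range(Fh):               # filter height
+                        for fx in range(Fw):           # filter length (width)
+                            for ic in range(Cin):      # input channel
+                                out[n, oy, ox, oc] += (
+                                    x[n, oy + fy, ox + fx, ic] * w[oc, fy, fx, ic]
+                                )
+    return out
+
+# Example with a 1x1 filter, the case vectorized in this design:
+x = np.random.rand(1, 8, 8, 16).astype(np.float32)
+w = np.random.rand(16, 1, 1, 16).astype(np.float32)
+print(conv2d_reference(x, w).shape)  # (1, 8, 8, 16)
+```
+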
-## Acceleration Techniques
-1. Kernel Optimization: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using vector load intrinsic. We apply the convolution operation on this loaded data, utilizing it for enhanced computational efficiency. We implement zero-padding to handle boundary conditions to ensure accurate convolution results, particularly at the edges of feature maps. This comprehensive approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. Input is a 4x8 matrix corresponding to 4 elements of a row and 8 input channels.
-2. Quantization: We use int8 precision for activation and weights. At int8 precision, AIE offers the highest compute density with 256 MAC/cycle.
+## Source Files Overview
-3. Data Layout: Optimize activation and weight layout to enhance memory access patterns and enables effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
+```
+.
++-- act_layout.png # Figure describing input/output data layout.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
+
+## NPU Implementation
+1. Kernel Optimization: To optimize convolution operations on AIE, we vectorize the code using AIE vector intrinsics. We load 8 elements of the input channel into vector registers using the vector load intrinsic. We perform the convolution operation using vector MAC/MUL on this loaded data. We implement zero-padding to handle boundary conditions and ensure accurate convolution results, particularly at the edges of feature maps. This approach optimizes convolution processing on AIE, facilitating efficient and accurate feature extraction in neural network applications. The input is a 4x8 matrix corresponding to 4 elements of a row and 8 input channels (see the simplified numpy model after this list).
+
+2. Quantization: We use `int8` precision for activation and weights. At `int8` precision, AIE offers the highest compute density with 256 MAC/cycle.
+
+3. Data Layout: We optimize activation and weight layout to enhance memory access patterns and enable effective utilization of AIE parallel processing units, ultimately improving the performance of 2D convolution operations.
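+
+As a rough scalar model of the vector MAC step described in item 1, one step consumes a 4x8 input tile (4 row elements x 8 input channels) and an 8x8 weight tile (8 input x 8 output channels), accumulating partial sums for 4 pixels x 8 output channels. The numpy sketch below only illustrates the arithmetic; it does not use the actual AIE intrinsics, and the tile sizes are taken from the description above.
+
+```python
+import numpy as np
+
+# One MAC step of a 1x1 convolution, modeled with numpy.
+x_tile = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)  # 4 pixels x 8 input channels
+w_tile = np.random.randint(-128, 128, size=(8, 8), dtype=np.int8)  # 8 input x 8 output channels
+
+acc = np.zeros((4, 8), dtype=np.int32)                      # accumulator registers
+acc += x_tile.astype(np.int32) @ w_tile.astype(np.int32)    # MAC over the 8 input channels
+
+# Subsequent groups of 8 input channels accumulate into the same `acc` before the
+# result is shifted, rounded, and saturated back to 8-bit output values.
+```
+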
## Data Layout
-We must ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. We adopt a channels-last memory ordering, denoted as Y{C/8}XC8, to exploit output channel parallelism by ensuring channels become the densest dimension. Operating on 8 elements simultaneously, we process 8 channels simultaneously with the same width. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: NYCXC8. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
+We must ensure that the data layout is compatible with efficient SIMD processing and rearrange the input data into a format where contiguous elements represent consecutive X-dimension values for each channel. We adopt a channels-last memory ordering, denoted as Y{C/8}X{C8}, to exploit output channel parallelism by ensuring channels become the densest dimension. Operating on 8 elements at a time, we process 8 channels for the same width position simultaneously. Subsequently, we traverse the entire width dimension, handling the remaining channels in batches of 8. This process continues row-wise, resulting in our final data layout pattern: Y{C/8}X{C8}. This optimized layout enhances memory access patterns and enables effective utilization of parallel processing units, ultimately improving the performance of 2D convolution operations. This transformation ensures that data can be efficiently loaded into SIMD registers and processed in parallel.
-The below figure shows our channle parallel data layout (Y{C/8}XC8) for a tensor dimension 16x16x16:
+The figure below shows our channel-parallel data layout (Y{C/8}X{C8}) for a tensor of dimension 8x8x16:
-In the Y{C/8}XC8 (with N=1) data layout, the data is organized in memory as follows:
+In the Y{C/8}X{C8} (with N=1) data layout, the data is organized in memory as follows:
+
+* C8: Indicates that 8 elements of the input channel are processed together (the densest dimension).
+* X: Represents the feature map width.
+* C/8: Denotes the number of groups of 8 channels.
+* Y: Represents the feature map height.
+
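+A host-side numpy sketch of this rearrangement, from a plain YXC tensor into the Y{C/8}X{C8} ordering for the 8x8x16 example above (this only illustrates the element ordering; it is not the code used in the design):
+
+```python
+import numpy as np
+
+Y, X, C = 8, 8, 16                                           # the 8x8x16 example above
+act = np.arange(Y * X * C, dtype=np.int32).reshape(Y, X, C)  # plain YXC ordering
+
+# YXC -> (Y, X, C/8, C8) -> (Y, C/8, X, C8)
+act_ycxc8 = act.reshape(Y, X, C // 8, 8).transpose(0, 2, 1, 3)
+print(act_ycxc8.shape)  # (8, 2, 8, 8) == Y, C/8, X, C8
+
+# When flattened, channels 0..7 of a row are contiguous across the whole width,
+# followed by channels 8..15 of that same row.
+flat = act_ycxc8.reshape(-1)
+```
+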
-* Y: Represents the output feature map dimension.
-* C/8: Denotes the number of channels.
-* X: Represents the input feature map dimension.
-* C8: Indicates that 8 elements of the input channel are processed together.
+{O/8}{I/8}YX{I8}{O8} Weight Layout:
-{O/8}{I/8}YXI8O8 Weight Layout:
+We align the weight layout as specified: O/8, I/8, Y, X, I8, O8, to match the input tensor processing. We first load the weight tensor and organize it to match this layout, where dimensions represent output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
-We align the weight layout as specified: O/8, I/8, Y, X, I8, O8, to match the input tensor processing. We first load the weight tensor, organizing it to match this layout, where dimensions represent: output channels, input channels, kernel height, kernel width, input channel groups of 8, and output channel groups of 8. By aligning the weight layout in this manner, we enable seamless integration with the input data layout, maximizing parallelism and minimizing memory access overhead.
+In the {O/8}{I/8}YX{I8}{O8} data layout, the data is organized in memory as follows:
-In the {O/8}{I/8}YXI8O8 data layout, the data is organized in memory as follows:
+* O8: Indicates that 8 elements of the output channel are processed together.
+* I8: Indicates that 8 elements of the input channel are processed together.
+* X: Represents the kernel width.
+* Y: Represents the kernel height.
+* I/8: Denotes the number of groups of 8 input channels.
+* O/8: Denotes the number of groups of 8 output channels.
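+
+Similarly, a host-side numpy sketch of rearranging an OIYX weight tensor into the {O/8}{I/8}YX{I8}{O8} ordering (the sizes are illustrative, and this is not the code used in the design):
+
+```python
+import numpy as np
+
+O, I, Fy, Fx = 16, 16, 1, 1                                            # illustrative sizes
+w = np.random.randint(-128, 128, size=(O, I, Fy, Fx), dtype=np.int8)   # OIYX ordering
+
+# OIYX -> (O/8, O8, I/8, I8, Y, X) -> (O/8, I/8, Y, X, I8, O8)
+w_tiled = w.reshape(O // 8, 8, I // 8, 8, Fy, Fx).transpose(0, 2, 4, 5, 3, 1)
+print(w_tiled.shape)  # (2, 2, 1, 1, 8, 8) == O/8, I/8, Y, X, I8, O8
+```
+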
-* O/8: Denotes the number of output channels.
-* I/8: Denotes the number of input channels.
-* Y: Represents the kernel height.
-* X: Represents the kernel weight.
-* I8: Indicates that 8 elements of the input channel are processed together.
-* O8: Indicates that 8 elements of the output channel are processed together.
## Compilation
To compile the design:
diff --git a/programming_examples/ml/conv2d/act_layout.png b/programming_examples/ml/conv2d/act_layout.png
index 1c79a7f7ab..7630e06ea0 100644
Binary files a/programming_examples/ml/conv2d/act_layout.png and b/programming_examples/ml/conv2d/act_layout.png differ
diff --git a/programming_examples/ml/conv2d/requirements.txt b/programming_examples/ml/conv2d/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/conv2d/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/conv2d_fused_relu/README.md b/programming_examples/ml/conv2d_fused_relu/README.md
index 0cf93bd356..ea737099d8 100644
--- a/programming_examples/ml/conv2d_fused_relu/README.md
+++ b/programming_examples/ml/conv2d_fused_relu/README.md
@@ -12,22 +12,34 @@
## Introduction
-In [conv2d](../conv2d), we describe how to implement a convolution kernel on AIE. While [relu](../relu) describes the implementation of the ReLU activation function on AIE. This README provides instructions for fusing convolution with the ReLU activation function on AI Engine (AIE).
+In [conv2d](../conv2d), we describe how to implement a two-dimensional convolution kernel on AIE, while [relu](../relu) describes the implementation of the Rectified Linear Unit (ReLU) activation function on AIE. This README provides instructions for fusing convolution with the ReLU activation function on AI Engine (AIE).
+
+
+## Source Files Overview
+
+```
+.
++-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
++-- Makefile # Contains instructions for building and compiling software projects.
++-- README.md # This file.
++-- run.lit # For LLVM Integrated Tester (LIT) of the design.
++-- test.py # Python code testbench for the design example.
+```
## Fusing ReLU
-Fusing ReLU into the convolution operation can optimize the performance by reducing unnecessary data movement, leading to lower external memory bandwidth requirements and computational overhead. The ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. We utilize the SRS capability of AIE to efficiently compute ReLU activation while the data is in the accumulation registers. Such an implementation completely eliminates any need for data movement by fusing at the vector register level.
+Fusing ReLU into the convolution operation can optimize performance by reducing unnecessary data movement, leading to lower external memory bandwidth requirements and computational overhead. The ReLU activation function introduces non-linearity by setting negative values to zero and leaving positive values unchanged. For fixed-point arithmetic, we can utilize the Shift-Round-Saturate (SRS) capability of AIE, which shifts out lower-order bits, rounds, and saturates the result. Using the SRS intrinsics, we can efficiently implement the ReLU activation while the data is still in the accumulation registers. Such an implementation completely eliminates any need for data movement by fusing at the vector register level.
-After performing the convolution operation, we use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Setting round mode `postitive_inf` rounds halfway towards positive infinity while setting saturation to `aie::saturation_mode::saturate` saturation rounds an uint8 range (0, 255).
+After performing the convolution operation, we use `aie::set_rounding()` and `aie::set_saturation()` to set the rounding and saturation modes for the computed results in the accumulator. Setting the rounding mode to `positive_inf` rounds halfway values towards positive infinity, while setting the saturation mode to `aie::saturation_mode::saturate` clamps results to the uint8 range (0, 255).
```
-::aie::set_saturation(
- aie::saturation_mode::saturate); // Needed to saturate properly to uint8
::aie::set_rounding(
- aie::rounding_mode::positive_inf); // Needed to saturate properly to uint8
+    aie::rounding_mode::positive_inf); // Needed to round properly to uint8
+::aie::set_saturation(
+    aie::saturation_mode::saturate); // Needed to saturate properly to uint8
```
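+
+Conceptually, the combined shift-round-saturate step behaves like the numpy sketch below. The shift amount is an illustrative scaling factor, and the key point is that clamping the accumulator to the uint8 range (0, 255) also applies ReLU for free, since negative values saturate to zero.
+
+```python
+import numpy as np
+
+def srs_to_uint8(acc, shift):
+    """Model of shift-round-saturate: round halfway values toward +inf, clamp to [0, 255]."""
+    scaled = np.floor(acc / float(1 << shift) + 0.5)   # shift out lower bits, round half up
+    return np.clip(scaled, 0, 255).astype(np.uint8)    # saturation doubles as ReLU
+
+acc = np.array([-900, -1, 0, 515, 70000], dtype=np.int32)  # example accumulator values
+print(srs_to_uint8(acc, shift=2))                          # [  0   0   0 129 255]
+```
+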
-The output data is generated in Y{C/8}XC8 layout. Please refer to our [conv2d](../conv2d) design for details on the data layout.
+The output data is generated in Y{C/8}X{C8} layout. Please refer to our [conv2d](../conv2d) design for details on the data layout.
### Benefits of Fusing Convolution and ReLU:
@@ -38,7 +50,7 @@ Fusing ReLU into the convolution operation eliminates unnecessary memory accesse
Fusing ReLU reduces the number of instructions executed per element, resulting in improved computational efficiency and overall performance of the convolution operation.
3. Enhanced Resource Utilization:
-Combining convolution and ReLU operations allows computational resources such as CPU cores or SIMD units to be utilized more efficiently, maximizing throughput and achieving better resource utilization.
+Combining convolution and ReLU operations allows computational resources to be utilized more efficiently, maximizing throughput and improving overall performance.
## Compilation
diff --git a/programming_examples/ml/conv2d_fused_relu/requirements.txt b/programming_examples/ml/conv2d_fused_relu/requirements.txt
deleted file mode 100644
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/conv2d_fused_relu/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_examples/ml/resnet/README.md b/programming_examples/ml/resnet/README.md
index 402b10639a..e872af6403 100755
--- a/programming_examples/ml/resnet/README.md
+++ b/programming_examples/ml/resnet/README.md
@@ -8,16 +8,13 @@
//
//===----------------------------------------------------------------------===//-->
-# ResNet with Offloaded Conv2_x Bottleneck Blocks
+# ResNet with Offloaded Conv2_x Layers
## Introduction
-ResNet [[1]](#1) is a convolutional neural network architecture that has gained significant popularity for various computer vision tasks, including image classification, object detection, and image segmentation. It is renowned for its depth and efficiency in training very deep networks.
-
-This README focuses on a specific optimization technique applied to ResNet, specifically targeting the offloading of the conv2_x part of the bottleneck blocks. By offloading computations to dedicated hardware accelerators or specialized processors, we aim to improve the overall efficiency and speed of the network, especially when deploying it on resource-constrained devices or in scenarios where real-time processing is critical.
-
+ResNet [[1]](#1) is a convolutional neural network architecture that has gained significant popularity for various computer vision tasks, including image classification, object detection, and image segmentation. It is renowned for its depth and efficiency in training very deep networks. This README focuses on our implementation of the conv2_x layers of the ResNet architecture using three NPU columns.
## ResNet Architecture Overview
-ResNet consists of several key components:
+ResNet consists of the following key components:
1. Input Layer: This layer accepts input image data with dimensions typically set to 224x224x3 (width, height, and RGB channels).
2. Convolutional Layers: The initial layers perform convolution operations to extract basic features from the input image.
@@ -28,29 +25,41 @@ ResNet consists of several key components:
4. Pooling Layers: Max pooling layers reduce the spatial dimensions of the feature maps.
5. Fully Connected Layer: Produces the final output predictions, typically followed by a softmax activation for classification tasks.
+## Source Files Overview
+
+```
+.
++-- layers_conv2_x # Implementation of ResNet conv2_x layers on NPU
+| +-- aie2.py # A Python script that defines the AIE array structural design using MLIR-AIE operations.
+| +-- Makefile # Contains instructions for building and compiling software projects.
+| +-- resnet_conv2x_pipeline.png # Figure describing our implementation of conv2_x layers on NPU.
+| +-- run.lit # For LLVM Integrated Tester (LIT) of the design.
+| +-- test.py # Python code testbench for the design example.
++-- README.md # This file.
+
+```
## NPU Implementation
The conv2_x stage of ResNet comprises a series of bottleneck blocks, each containing convolutional, batch norm, and ReLU layers responsible for learning more complex features from the input data. By offloading the computations within these blocks to AI Engine, we aim to:
* Reduce the computational burden on the main processing unit (e.g., CPU or GPU).
-* Improve overall inference speed and efficiency, especially in scenarios where real-time processing is crucial.
-* Enable deployment on resource-constrained devices with limited computational resources.
-
-We adopt the [bottleneck design](../../bottleneck) design approach to execute a depth-first implementation of conv2_x layers, seamlessly connecting the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to [bottleneck design](../../bottleneck) design, the initial bottleneck block requires an additional 1x1 convolution on the AIE (0,4) tile to handle channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred from the skip path possesses fewer input channels compared to the output on the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path. After the initial processing in the first bottleneck block, the output is relayed to the second bottleneck block on a separate NPU column. The output activation is broadcasted to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in [bottleneck design](../../bottleneck) design. Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon the completion of processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim tile (2,0)`, completing the seamless flow of computation within the NPU architecture.
+* Improve overall inference speed and efficiency.
+The figure below shows our implementation of the conv2_x layers of the ResNet architecture using three NPU columns.
-
ResNet conv2_x bottleneck blocks are stacked in depth-first fashion to avoid unnecessary off-chip data movement.
+
ResNet conv2_x stage's bottleneck blocks are stacked depth-first to avoid unnecessary off-chip data movement.
+We adopt the [bottleneck design](../../bottleneck) approach to execute a depth-first implementation of conv2_x layers, seamlessly connecting the output of one bottleneck block on an NPU column to another on a separate column, all without the necessity of transferring intermediate results off-chip. Compared to the [bottleneck design](../../bottleneck), the first bottleneck block in the conv2_x stage requires an additional 1x1 convolution on the `AIE (0,4)` tile to handle the channel mismatch for the skip addition between the input from the skip path and the input from the non-skip path. This mismatch arises because the initial input activation transferred along the skip path has fewer channels than the output of the non-skip path. To overcome this issue, an additional 1x1 convolution is introduced in the skip path. After the initial processing in the first bottleneck block, the output is sent directly to the second bottleneck block on a separate NPU column. The output activation is broadcast to both `AIE (1,5)` and `AIE (1,3)` via `Mem Tile (1,1)`. The second bottleneck's processing proceeds as described in the [bottleneck design](../../bottleneck).
+
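+A 1x1 convolution used this way is simply a per-pixel projection across channels. The numpy sketch below illustrates why it resolves the mismatch; the 56x56 spatial size and the 64-to-256 channel projection are illustrative values, not necessarily the design's actual dimensions.
+
+```python
+import numpy as np
+
+H, W, C_in, C_out = 56, 56, 64, 256                       # illustrative conv2_x sizes
+x_skip = np.random.rand(H, W, C_in).astype(np.float32)    # skip-path activation (64 channels)
+w_proj = np.random.rand(C_out, C_in).astype(np.float32)   # 1x1 conv weights for the skip path
+y_main = np.random.rand(H, W, C_out).astype(np.float32)   # non-skip path output (256 channels)
+
+# A 1x1 convolution is an independent channel projection at every pixel.
+x_proj = np.einsum("hwc,oc->hwo", x_skip, w_proj)
+
+out = np.maximum(x_proj + y_main, 0)   # skip addition followed by ReLU
+print(out.shape)                       # (56, 56, 256)
+```
+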
+Similarly, the subsequent bottleneck block requires the output from the second bottleneck, avoiding any need to send intermediate activations off-chip. Upon completion of processing in the third bottleneck block, the final output is transmitted from tile `AIE (2,4)` back to the output via `Shim Tile (2,0)`, completing the seamless flow of computation within the NPU architecture. Thus, our depth-first implementation avoids any unnecessary off-chip data movement for intermediate tensors.
+
-## Usage and Deployment
-To leverage the optimized ResNet with offloaded conv2_x bottleneck blocks:
-* [IRON Programming](https://github.com/Xilinx/mlir-aie/tree/gagan_asplos_resnet/programming_examples/ml/resnet/layers_conv2_x): Demonstrates the IRON flow for offloading conv2_x to AIE.
## Compilation
diff --git a/programming_examples/ml/resnet/layers_conv2_x/requirements.txt b/programming_examples/ml/resnet/layers_conv2_x/requirements.txt
deleted file mode 100755
index 08ed5eeb4b..0000000000
--- a/programming_examples/ml/resnet/layers_conv2_x/requirements.txt
+++ /dev/null
@@ -1 +0,0 @@
-torch
\ No newline at end of file
diff --git a/programming_guide/section-6/README.md b/programming_guide/section-6/README.md
index 57e02207f0..714b8b6249 100644
--- a/programming_guide/section-6/README.md
+++ b/programming_guide/section-6/README.md
@@ -31,7 +31,7 @@ There are a number of example designs available [here](../../programming_example
## Exercises
-1. In [bottleneck](../../programming_examples/ml/bottleneck/) design following a dataflow approach, how many elements does the 3x3 convolution operation require to proceed with its computation?
+1. In the [bottleneck](../../programming_examples/ml/bottleneck/) design, which follows a dataflow approach, how many rows of input data does the 3x3 convolution operation require before it can proceed with its computation?
2. Suppose you have a bottleneck block with input dimensions of 32x32x256. After passing through the 1x1 convolutional layer, the output dimensions become 32x32x64. What would be the output dimensions after the subsequent 3x3 convolutional layer, assuming a stride of 1 and no padding and output channel of 64?
-----