diff --git a/.github/workflows/buildAndTestRyzenAI.yml b/.github/workflows/buildAndTestRyzenAI.yml
index acf2262fa2..bc3988e002 100644
--- a/.github/workflows/buildAndTestRyzenAI.yml
+++ b/.github/workflows/buildAndTestRyzenAI.yml
@@ -127,6 +127,7 @@ jobs:
           python -m venv aie-venv
           source aie-venv/bin/activate
           pip install -r python/requirements.txt
+          pip install -r python/requirements_ml.txt
           pip install jupyter
           sed -i.bak 's/OUTPUT_TIMEOUT = 10/OUTPUT_TIMEOUT = 100/g' \
             $(python -c 'import site; print(site.getsitepackages()[0])')/jupyter_client/runapp.py
diff --git a/aie_kernels/README.md b/aie_kernels/README.md
new file mode 100644
index 0000000000..acc9ddcd8d
--- /dev/null
+++ b/aie_kernels/README.md
@@ -0,0 +1,57 @@
+
+
+# AIE Kernels
+
+These kernels are provided as example building blocks for larger designs, and as illustrations of how to write single-core programs for AIEs that can then be duplicated or mixed into multi-core designs using the structural IRON API.
+
+In some cases, the kernels are just generic C code and will run on any family of AI Engines, with varying performance. Other kernels are optimized for the AIE1 and AIE2 architectures. Finally, some kernels use the AIE API, a C++ header-only library providing types and operations that get translated into efficient low-level intrinsics, whose documentation can be found [here](https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_api/aie_api/doc/index.html), while others use the architecture-specific low-level intrinsics directly.
+
+> **NOTE:** this set of AIE kernels is meant for demonstration along with the programming examples. The goal is not to be 100% performant; there may be room for further improvement. The kernels are provided as-is, with no guarantees of support from AMD or AMD Research and Advanced Development.
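+
+To give a flavor of the AIE API coding style used by many of the kernels below, here is a minimal, illustrative sketch of a vectorized element-wise copy loop, in the spirit of passThrough.cc. The function name, element type, and vector width are assumptions for demonstration only; see the listed kernels for the real implementations.
+
+```c++
+// Illustrative sketch only: copy n elements using AIE API vector loads/stores.
+// Assumes n is a multiple of VecLen and that the aie_api headers are on the include path.
+#include <aie_api/aie.hpp>
+
+template <typename T, unsigned VecLen>
+void passthrough_sketch(const T *in, T *out, int n) {
+  for (int i = 0; i < n; i += VecLen) {
+    aie::vector<T, VecLen> v = aie::load_v<VecLen>(in + i); // one vector-wide load
+    aie::store_v(out + i, v);                               // matching vector store
+  }
+}
+```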
+
+## Generic
+| Class | Name | Coding style | Purpose | Datatypes |
+|-|-|-|-|-|
+| basic | [passThrough.cc](./generic/passThrough.cc) | AIE API | A simple memcpy operation | `uint8_t`, `int16_t`, `int32_t` |
+
+## AIE1
+| Name | Coding style | Purpose |
+|-|-|-|
+
+## AIE2
+| Class | Name | Coding style | Purpose | Datatypes |
+|-|-|-|-|-|
+| basic | [zero.cc](../../aie_kernels/aie2/zero.cc) | AIE API | Fill a tensor with zeroes | template |
+| basic | [add.cc](../../aie_kernels/aie2/add.cc) | AIE API | Pointwise addition of 2 tensors | `bfloat16` |
+| basic | [mul.cc](../../aie_kernels/aie2/mul.cc) | AIE API | Pointwise multiplication of 2 tensors | `bfloat16` |
+| basic | [scale.cc](../../aie_kernels/aie2/scale.cc) | AIE API | Scale all elements of a tensor with a scale factor | `int32_t` |
+| basic | [bitwiseOR.cc](../../aie_kernels/aie2/bitwiseOR.cc) | AIE API | Bitwise OR of fixed point tensors | `uint8_t`, `int16_t`, `int32_t` |
+| basic | [bitwiseAND.cc](../../aie_kernels/aie2/bitwiseAND.cc) | AIE API | Bitwise AND of fixed point tensors | `uint8_t`, `int16_t`, `int32_t` |
+| gemm | [mm.cc](../../aie_kernels/aie2/mm.cc) | AIE API | Matrix/Matrix multiplication | `int16_t`, `bfloat16_t` |
+| gemm | [mv.cc](../../aie_kernels/aie2/mv.cc) | AIE API | Matrix/Vector multiplication | `bfloat16_t` |
+| |
+| reduction | [reduce_add.cc](../../aie_kernels/aie2/reduce_add.cc) | Intrinsics | Find the sum of elements in a tensor | `int32_t` |
+| reduction | [reduce_max.cc](../../aie_kernels/aie2/reduce_max.cc) | Intrinsics | Find max value across a tensor | `int32_t` |
+| reduction | [reduce_min.cc](../../aie_kernels/aie2/reduce_min.cc) | Intrinsics | Find min value across a tensor | `int32_t` |
+| |
+| ml | [conv2dk1_i8.cc](../../aie_kernels/aie2/conv2dk1_i8.cc) | AIE API | 1x1 Conv2D | `int8_t` |
+| ml | [conv2dk1.cc](../../aie_kernels/aie2/conv2dk1.cc) | AIE API | 1x1 Conv2D with fused ReLU | `int8_t`, `uint8_t` |
+| ml | [conv2dk3.cc](../../aie_kernels/aie2/conv2dk3.cc) | AIE API | 3x3 Conv2D with fused ReLU | `int8_t`, `uint8_t` |
+| ml | [conv2dk1_skip.cc](../../aie_kernels/aie2/conv2dk1_skip.cc) | AIE API | 1x1 Conv2D with fused skip addition | `int8_t`, `uint8_t` |
+| ml | [conv2dk1_skip_init.cc](../../aie_kernels/aie2/conv2dk1_skip_init.cc) | AIE API | 1x1 Conv2D with fused 1x1 Conv2D skip addition | `int8_t`, `uint8_t` |
+| ml | [relu.cc](../../aie_kernels/aie2/relu.cc) | Intrinsics | ReLU activation function | `bfloat16_t` |
+| ml | [bf16_exp.cc](../../aie_kernels/aie2/bf16_exp.cc) | AIE API | Raise all elements in a `bfloat` tensor to $e^x$ | `bfloat16_t` |
+| |
+| vision | [gray2rgba.cc](../../aie_kernels/aie2/gray2rgba.cc) | AIE API | Convert from grayscale to RGBA format | `uint8_t` |
+| vision | [rgba2gray.cc](../../aie_kernels/aie2/rgba2gray.cc) | AIE API | Convert from RGBA format to grayscale | `uint8_t` |
+| vision | [rgba2hue.cc](../../aie_kernels/aie2/rgba2hue.cc) | AIE API | Convert from RGBA to hue | `uint8_t` |
+| vision | [addWeighted.cc](../../aie_kernels/aie2/addWeighted.cc) | AIE API | Fixed point weighted sum of two tensors | `uint8_t` |
+| vision | [threshold.cc](../../aie_kernels/aie2/threshold.cc) | AIE API | Clipping | `uint8_t` |
+| vision | [filter2d.cc](../../aie_kernels/aie2/filter2d.cc) | AIE API | Fixed point 2D image processing filter | `uint8_t` |
diff --git a/docs/conferenceDescriptions/asplos24TutorialDescription.md b/docs/conferenceDescriptions/asplos24TutorialDescription.md
index 31208e6e38..f7f26d5b8a 100644
--- a/docs/conferenceDescriptions/asplos24TutorialDescription.md
+++ b/docs/conferenceDescriptions/asplos24TutorialDescription.md
@@ -22,18 +22,19 @@ Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI en
 | Time | Topic | Presenter | Slides or Code |
 |------|-------|-----------|----------------|
-| 08:30am | Intro to spatial compute and explicit data movement | Kristof | tbd |
-| 08:45am | "Hello World" from Ryzen AI | Jack | tbd |
-| 09:00am | Data movement on Ryzen AI with objectFIFOs | Joe | tbd |
-| 09:30am | Exersise 1: Build and run your first program | All | tbd |
-| 09:45am | Exersise 2: Vector-scalar | All |tbd |
+| 08:30am | Intro to spatial compute and explicit data movement | Kristof | [Programming Guide](../../programming_guide/) |
+| 08:45am | "Hello World" from Ryzen AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) |
+| 09:00am | Data movement on Ryzen AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) |
+| 09:30am | Your First Program | Kristof | [My First Program](../../programming_guide/section-3) |
+| 09:50am | Exercise 1: Build and run your first program | All | [Passthrough](../../programming_examples/basic/passthrough_kernel/) |
 | 10:00am | Break | | |
-| 11:00am | Tracing and performance analysis | Jack | tbd |
-| 11:10am | Exercise 3: Tracing vector-scalar | All | tbd |
-| 11:30am | Vectorizing on AIE | Kristof | tbd |
-| 11:40am | Exercise 4: Vectorized vector-scalar | All | tbd |
-| 12:00pm | Dataflow and larger designs | Joe | tbd |
-| 12:15pm | Exercises | All | |
+| 10:30am | Exercise 2: Vector-Scalar Mul | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
+| 10:40am | Tracing and performance analysis | Jack | [Timers](../../programming_guide/section-4/section-4a/) and [Tracing](../../programming_guide/section-4/section-4b/) |
+| 11:10am | Exercise 3: Tracing vector-scalar | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
+| 11:30am | Vectorizing on AIE | Jack | [Kernel Vectorization](../../programming_guide/section-4/section-4c/) |
+| 11:40am | Exercise 4: Vectorized vector-scalar | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
+| 12:00pm | Dataflow and larger designs | Joe | [Example Vector Designs](../../programming_guide/section-5/) and [Large Example Designs](../../programming_guide/section-6/) |
+| 12:15pm | Exercises | All | [Programming Examples](../../programming_examples/) |
 | 12:30pm | Close Tutorial | All | |
@@ -46,3 +47,5 @@ Prerequisite: please bring your laptop, so that you can ssh into our Ryzen AI en
 *Kristof Denolf* is a Fellow in AMD's Research and Advanced Development group where he is working on energy efficient computer vision and video processing applications to shape future AMD devices. He earned an M.Eng. in electronics from the Katholieke Hogeschool Brugge-Oostende (1998), now part of KULeuven, an M.Sc. in electronic system design from Leeds Beckett University (2000) and a Ph.D. from the Technical University Eindhoven (2007). He has over 25 years of combined research and industry experience at IMEC, Philips, Barco, Apple, Xilinx and AMD. His main research interests are all aspects of the cost-efficient and dataflow oriented design of video, vision and graphics systems.

 *Phil James-Roxby* is a Senior Fellow in AMD’s Research and Advanced Development group, working on compilers and runtimes to support current and future AMD devices, particularly in the domain of AI processing. In the past, he has been responsible for a number of software enablement activities for hardware devices, including SDNet and SDAccel at Xilinx, and the original development environment for the AI Engines. He holds a PhD from the University of Manchester on hardware acceleration of embedded machine learning applications, and his main research interest continues to be how to enable users to efficiently use diverse hardware in heterogeneous systems.
+
+*Samuel Bayliss* is a Fellow in the Research and Advanced Development group at AMD. His academic experience includes formative study at Imperial College London, for which he earned MEng and PhD degrees in 2006 and 2012 respectively. He is energized by his current work in advancing compiler tooling using MLIR, developing programming abstractions for parallel compute and evolving hardware architectures for efficient machine learning.
\ No newline at end of file
diff --git a/programming_examples/basic/README.md b/programming_examples/basic/README.md
index 1f9a8a9477..9d9a57169f 100644
--- a/programming_examples/basic/README.md
+++ b/programming_examples/basic/README.md
@@ -20,5 +20,4 @@ These programming examples provide a good starting point to illustrate how to bu
 * [Vector Reduce Max](./vector_reduce_max) - Single tile performs a reduction of a vector to return the `max` of the elements.
 * [Vector Reduce Min](./vector_reduce_min) - Single tile performs a reduction of a vector to return the `min` of the elements.
 * [Vector Exp](./vector_exp) - A simple element wise exponent function, using the look up table capabilities of the AI Engine.
-* [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking.
-* [Hello World (printf log)](./log_hello_world) - Single tile performs a self-query and `printf` function where printed data is moved from local buffers to external memory to be read by the host processor.
+* [Matrix Multiplication](./matrix_multiplication) - This directory contains multiple designs spanning: single core and multi-core (whole array) matrix-matrix multiplication, and matrix-vector multiplication designs. It also contains sweep infrastructure for benchmarking.
\ No newline at end of file
diff --git a/programming_examples/basic/log_hello_world/CMakeLists.txt b/programming_examples/basic/log_hello_world/CMakeLists.txt
deleted file mode 100755
index c4ca0825d4..0000000000
--- a/programming_examples/basic/log_hello_world/CMakeLists.txt
+++ /dev/null
@@ -1,75 +0,0 @@
-# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
-# See https://llvm.org/LICENSE.txt for license information.
-# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-#
-# (c) Copyright 2023 Advanced Micro Devices, Inc.
- -# parameters -# -DBOOST_ROOT: Path to Boost install -# -DXRT_INC_DIR: Full path to src/runtime_src/core/include in XRT cloned repo -# -DXRT_LIB_DIR: Path to xrt_coreutil.lib -# -DTARGET_NAME: Target name to be built - -# cmake needs this line -cmake_minimum_required(VERSION 3.1) - -set(CMAKE_CXX_STANDARD 23) -set(CMAKE_CXX_STANDARD_REQUIRED YES) - -find_program(WSL NAMES powershell.exe) - -if (NOT WSL) - set(CMAKE_C_COMPILER gcc-13) - set(CMAKE_CXX_COMPILER g++-13) - set(BOOST_ROOT /usr/include/boost CACHE STRING "Path to Boost install") - set(XRT_INC_DIR /opt/xilinx/xrt/include CACHE STRING "Path to XRT cloned repo") - set(XRT_LIB_DIR /opt/xilinx/xrt/lib CACHE STRING "Path to xrt_coreutil.lib") -else() - set(BOOST_ROOT C:/Technical/thirdParty/boost_1_83_0 CACHE STRING "Path to Boost install") - set(XRT_INC_DIR C:/Technical/XRT/src/runtime_src/core/include CACHE STRING "Path to XRT cloned repo") - set(XRT_LIB_DIR C:/Technical/xrtIPUfromDLL CACHE STRING "Path to xrt_coreutil.lib") -endif() - -set(TARGET_NAME test CACHE STRING "Target to be built") - -SET (ProjectName ${TARGET_NAME}) -SET (currentTarget ${TARGET_NAME}) - -if ( WSL ) - set(CMAKE_RUNTIME_OUTPUT_DIRECTORY_RELEASE ${CMAKE_BINARY_DIR}) -endif () - -project(${ProjectName}) - -# Find packages -find_package(Boost REQUIRED) - -add_executable(${currentTarget} - ${CMAKE_CURRENT_SOURCE_DIR}/../../../runtime_lib/test_lib/test_utils.cpp - test.cpp -) - -target_compile_definitions(${currentTarget} PUBLIC DISABLE_ABI_CHECK=1) - -target_include_directories (${currentTarget} PUBLIC - ${XRT_INC_DIR} - ${Boost_INCLUDE_DIRS} - ${CMAKE_CURRENT_SOURCE_DIR}/../../../runtime_lib/test_lib -) - -target_link_directories(${currentTarget} PUBLIC - ${XRT_LIB_DIR} - ${Boost_LIBRARY_DIRS} -) - -if (NOT WSL) - target_link_libraries(${currentTarget} PUBLIC - xrt_coreutil - boost_program_options - boost_filesystem - ) -else() - target_link_libraries(${currentTarget} PUBLIC - xrt_coreutil - ) -endif() diff --git a/programming_examples/basic/log_hello_world/Makefile b/programming_examples/basic/log_hello_world/Makefile deleted file mode 100755 index c5bcd8d5c3..0000000000 --- a/programming_examples/basic/log_hello_world/Makefile +++ /dev/null @@ -1,48 +0,0 @@ -##===- Makefile -----------------------------------------------------------===## -# -# This file licensed under the Apache License v2.0 with LLVM Exceptions. -# See https://llvm.org/LICENSE.txt for license information. -# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -# -##===----------------------------------------------------------------------===## - -include ../../makefile-common - -all: hello_world_elfstrings.csv - -targetname = helloWorld - -build/%.o: %.cc - mkdir -p ${@D} - cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $(<:%=../%) -o ${@F} - -build/hello_world.mlir: hello_world.py - mkdir -p ${@D} - python3 $< > $@ - -build/hello_world.xclbin: build/hello_world.mlir build/kernel.o - mkdir -p ${@D} - cd ${@D} && aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host \ - --xclbin-name=${@F} --ipu-insts-name=insts.txt $(<:%=../%) - -hello_world_elfstrings.csv: build/hello_world.xclbin - python3 elfStringParser.py --input ./build --output $@ - -${targetname}.exe: test.cpp - mkdir -p ${@D} - rm -rf _build - mkdir -p _build - cd _build && ${powershell} cmake .. -DTARGET_NAME=${targetname} - cd _build && ${powershell} cmake --build . 
--config Release -ifeq "${powershell}" "powershell.exe" - cp _build/${targetname}.exe $@ -else - cp _build/${targetname} $@ -endif - -run: ${targetname}.exe hello_world_elfstrings.csv - ${powershell} ./$< -x build/hello_world.xclbin -i build/insts.txt \ - -k MLIR_AIE -e $(word 2,$^) - -clean: - rm -rf build _build *.csv ${powershell}.exe diff --git a/programming_examples/basic/log_hello_world/README.md b/programming_examples/basic/log_hello_world/README.md deleted file mode 100644 index cee1dafe81..0000000000 --- a/programming_examples/basic/log_hello_world/README.md +++ /dev/null @@ -1,46 +0,0 @@ -## Simple Log Hello World - -This reference design demonstrates a simple, low overhead, printf-style log message from AIE tiles. - -Features: -* Low instruction memory overhead (based on variadic templates) -* Efficient transfers. - + Format string addresses are parsed from the compiled elfs host side. - + Data transfers from the AIE tile are just string addresses and parameters; no strings are sent. - + Host-side string addresses are used to look up format strings and populated with parameters. - -### Building and executing (on a phx laptop) -Type the following to build and run the design in a wsl terminal. -``` -make run -``` - -### Logging from the kernel code -Below is a simple example of how to use `npulog.h` in a kernel. -```c++ -#include "npulog.h" - -void kernel(uint32_t *logbuffer) { - NPULogger log(logbuffer, 2048); // buffer to use, and length of buffer - log.write("Hello!"); -} -``` - -### Extracting format string addresses at compile time -After building the `.xclbin` in the directory where the AIE Tile elfs are, call the following to create the mappings from the format strings to addresses. -```bash -python3 elfStringParser.py --input --output formatStrings.csv -``` - -### Decoding the log at runtime -At runtime we can run the NPU and then run a decoder on the output buffer to render all the strings. - -```c++ - #include "decodelog.hpp" - // ... - NPULogDecoder log_decoder("formatString.csv"); - for (const std::string &str : log_decoder.decode(logbuffer)) { - std::cout << str << std::endl; - } -``` - diff --git a/programming_examples/basic/log_hello_world/decodelog.hpp b/programming_examples/basic/log_hello_world/decodelog.hpp deleted file mode 100755 index dc73ad2d48..0000000000 --- a/programming_examples/basic/log_hello_world/decodelog.hpp +++ /dev/null @@ -1,138 +0,0 @@ -//===- decodelog.hpp ---------------------------------------000---*- C++ -//-*-===// -// -// This file is licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -// Copyright (C) 2023, Advanced Micro Devices, Inc. -// -//===----------------------------------------------------------------------===// - -#include "xrt/xrt_bo.h" -#include -#include -#include -#include -#include -#include -#include -#include -#include - -class NPULogDecoder { - // Parses the strings file to provide a decoder for parameterising the - // string messages from the AIE tiles - -private: - std::string _elfstrings_file; - std::map _str_map; - - void parse_str_map() { - std::ifstream file(_elfstrings_file); - if (!file.is_open()) { - std::cerr << "Error! 
unable to open the elfstrings file (" - << _elfstrings_file << ")\n"; - } - std::string line; - while (std::getline(file, line)) { - boost::char_separator sep(","); - boost::tokenizer> tokens(line, sep); - - auto it = tokens.begin(); - if (it == tokens.end()) { - // case where there are no tokens on the line - continue; - } - int address; - if (!(std::istringstream(*it) >> address)) { - // Handle the case where the first token cannot be converted to an int - continue; - } - ++it; - if (it == tokens.end()) { - // Handle the case where there's no second token - continue; - } - std::string format_str = *it; - _str_map[address] = format_str; - } - } - -public: - NPULogDecoder(std::string elfstrings_file) - : _elfstrings_file(elfstrings_file) { - parse_str_map(); - } - - // When given a string address return true if we have - // the format string for it - bool format_str_exists(uint32_t addr) { - return _str_map.find(addr) != _str_map.end(); - } - - // Peel off a message payload from the start of the buffer - // and render a format string with the parameters - uint32_t *renderNextStr(std::vector &log, uint32_t *buffer) { - uint32_t straddr = *buffer; - buffer++; - if (format_str_exists(straddr)) { - // Construct the string - std::string frmt = _str_map[straddr]; - - std::string out; - for (std::string::size_type i = 0; i < frmt.size(); ++i) { - if (frmt[i] == '%') { - // We need to replace this and the next - // char with an appropriately converted parameter - i++; - switch (frmt[i]) { - case 'd': { // int type - int intparam = *((int *)(buffer)); - out += std::to_string(intparam); - buffer++; - break; - } - case 'f': { // float type - float floatparam = *((float *)(buffer)); - out += std::to_string(floatparam); - buffer++; - break; - } - case 'u': { // unsigned type - unsigned unsignedparam = *((unsigned *)(buffer)); - out += std::to_string(unsignedparam); - buffer++; - break; - } - case 'x': { // hexadecimal type - unsigned hexparam = *((unsigned *)(buffer)); - std::stringstream stream; - stream << std::hex << hexparam; - out += stream.str(); - buffer++; - break; - } - } - } else { - out += frmt[i]; - } - } - - log.emplace_back(out); - } - return buffer; - } - - std::vector decode(xrt::bo buffer) { - uint32_t buffer_size = buffer.size(); - uint32_t *buffer_ptr = buffer.map(); - uint32_t *end_of_buffer = buffer_ptr + (buffer_size / sizeof(uint32_t)); - - std::vector rendered_log; - while (buffer_ptr < end_of_buffer) { - buffer_ptr = renderNextStr(rendered_log, buffer_ptr); - } - return rendered_log; - } -}; diff --git a/programming_examples/basic/log_hello_world/elfStringParser.py b/programming_examples/basic/log_hello_world/elfStringParser.py deleted file mode 100755 index c50415cc1e..0000000000 --- a/programming_examples/basic/log_hello_world/elfStringParser.py +++ /dev/null @@ -1,88 +0,0 @@ -# -# This file is licensed under the Apache License v2.0 with LLVM Exceptions. -# See https://llvm.org/LICENSE.txt for license information. -# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -# -# (c) Copyright 2023 AMD Inc. - -import argparse -import os -import re -import subprocess -from typing import Dict - - -def call_unix_proc(cmd: str) -> str: - cmdlist = cmd.split(" ") - try: - output = subprocess.check_output(cmdlist, stderr=subprocess.STDOUT) - return output.decode() - except subprocess.CalledProcessError as e: - print(f"ERROR! 
{cmd} failed \n\n{e.output.decode()}") - raise e - - -def _get_ro_offset(ofile: str) -> int: - s = f"readelf -S {ofile}" - out = call_unix_proc(s) - pattern = r"\s*\[\s*[0-9]+\]\s*\.rodata\.DMb\.1\s*PROGBITS\s*([0-9a-z]+)" - match = re.search(pattern, out) - if match: - return int(match.group(1), 16) - return int("70A00", 16) - - -def _gen_string_dict(stringsoutput: str, rooffset: int = 0) -> Dict[int, str]: - lines = stringsoutput.split("\n") - result = {} - first = True - first_val = 0 - for line in lines: - l = line.lstrip() - try: - hex_num, text = l.split(" ", 1) - if first: - first_val = int(hex_num, 16) - result[rooffset] = text - first = False - else: - result[(int(hex_num, 16) - first_val) + rooffset] = text - except: - pass - return result - - -def main(): - parser = argparse.ArgumentParser( - description="A utility to extract a json file of all the format strings and corresponding addresses/locations in an AIE design" - ) - parser.add_argument( - "--input", - required=True, - help="Path to the directory where the project was constructed", - ) - parser.add_argument("--output", default="elfstrings.csv") - args = parser.parse_args() - - # Collect all the elfs - ofiles = [] - for filename in os.listdir(args.input): - if filename.endswith(".elf"): - filepath = os.path.join(args.input, filename) - ofiles.append(filepath) - print(ofiles) - - res = {} - for ofile in ofiles: - strings_cmd = f"strings --radix x -a {ofile}" - object_strings_str = call_unix_proc(strings_cmd) - ro_offset = _get_ro_offset(ofile) - d = _gen_string_dict(object_strings_str, ro_offset) - res = {**res, **d} - with open(args.output, "w") as fp: - for addr, s in res.items(): - fp.write(f"{addr},{s}\n") - - -if __name__ == "__main__": - main() diff --git a/programming_examples/basic/log_hello_world/hello_world.py b/programming_examples/basic/log_hello_world/hello_world.py deleted file mode 100644 index b017d110b7..0000000000 --- a/programming_examples/basic/log_hello_world/hello_world.py +++ /dev/null @@ -1,64 +0,0 @@ -# This file is licensed under the Apache License v2.0 with LLVM Exceptions. -# See https://llvm.org/LICENSE.txt for license information. -# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -# -# (c) Copyright 2023 AMD Inc. 
- -from aie.dialects.aie import * -from aie.dialects.aiex import * -from aie.dialects.scf import * -from aie.extras.context import mlir_mod_ctx - - -def printf(): - N = 512 - - with mlir_mod_ctx() as ctx: - - @device(AIEDevice.ipu) - def device_body(): - memRef_ty = T.memref(N, T.i32()) - - # AIE Core Function declarations - kernel = external_func("kernel", inputs=[memRef_ty, memRef_ty, memRef_ty]) - - # Tile declarations - ShimTile = tile(0, 0) - ComputeTile2 = tile(0, 2) - - # AIE-array data movement with object fifos - inOF = object_fifo("inOF", ShimTile, ComputeTile2, 2, memRef_ty) - outOF = object_fifo("outOF", ComputeTile2, ShimTile, 2, memRef_ty) - logoutOF = object_fifo("logoutOF", ComputeTile2, ShimTile, 2, memRef_ty) - - # Set up compute tiles - - # Compute tile 2 - @core(ComputeTile2, "kernel.o") - def core_body(): - elemOut = outOF.acquire(ObjectFifoPort.Produce, 1) - elemIn = inOF.acquire(ObjectFifoPort.Consume, 1) - elemLogout = logoutOF.acquire(ObjectFifoPort.Produce, 1) - call(kernel, [elemIn, elemOut, elemLogout]) - inOF.release(ObjectFifoPort.Consume, 1) - outOF.release(ObjectFifoPort.Produce, 1) - logoutOF.release(ObjectFifoPort.Produce, 1) - - # To/from AIE-array data movement - @FuncOp.from_py_func(memRef_ty, memRef_ty, memRef_ty) - def sequence(in_mem, out_mem, logout): - ipu_dma_memcpy_nd( - metadata="outOF", bd_id=0, mem=out_mem, sizes=[1, 1, 1, N] - ) - ipu_dma_memcpy_nd( - metadata="inOF", bd_id=1, mem=in_mem, sizes=[1, 1, 1, N] - ) - ipu_dma_memcpy_nd( - metadata="logoutOF", bd_id=2, mem=logout, sizes=[1, 1, 1, N] - ) - ipu_sync(column=0, row=0, direction=0, channel=0) - - print(ctx.module) - - -printf() diff --git a/programming_examples/basic/log_hello_world/kernel.cc b/programming_examples/basic/log_hello_world/kernel.cc deleted file mode 100755 index f0dc962ba9..0000000000 --- a/programming_examples/basic/log_hello_world/kernel.cc +++ /dev/null @@ -1,41 +0,0 @@ -//===- kernel.cc -------------------------------------------000---*- C++ -//-*-===// -// -// This file is licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -// Copyright (C) 2023, Advanced Micro Devices, Inc. -// -//===----------------------------------------------------------------------===// -#define NOCPP - -#include -#include -#include - -#include - -#include "npulog.h" - -extern "C" { - -void kernel(uint32_t *in_buffer, uint32_t *out_buffer, uint8_t *logbuffer) { - - NPULogger log(logbuffer, 2048); - log.write("Starting kernel execution!\n"); - - uint32_t col = (get_coreid() >> 16) & 0x0000FFFF; - uint32_t row = get_coreid() & 0x0000FFFF; - - aie::tile tile = aie::tile::current(); - uint64_t Tstart = tile.cycles(); - log.write("Core Location col=%u row=%u\n", col, row); - - memcpy(out_buffer, in_buffer, 2048); - - uint64_t Tend = tile.cycles(); - uint64_t cycles = Tend - Tstart; - log.write("Completed executing. cycles=%u\n", cycles); -} -} diff --git a/programming_examples/basic/log_hello_world/npulog.h b/programming_examples/basic/log_hello_world/npulog.h deleted file mode 100755 index 1dab3c56eb..0000000000 --- a/programming_examples/basic/log_hello_world/npulog.h +++ /dev/null @@ -1,84 +0,0 @@ -//===- npulog.h --------------------------------------------000---*- C++ -//-*-===// -// -// This file is licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. 
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -// Copyright (C) 2023, Advanced Micro Devices, Inc. -// -//===----------------------------------------------------------------------===// - -#include -#include - -class NPULogger { -private: - uint8_t *_buffer; - uint32_t _maxlen; - uint32_t _count; - uint32_t _dropped_msgs; - -public: - NPULogger(uint8_t *buffer, uint32_t len) : _buffer(buffer), _maxlen(len) { - _count = 0; - _dropped_msgs = 0; - } - - ~NPULogger() {} - - //---------------------------------- - // peel parameters of write - //---------------------------------- - // zeroth case - template - void log_peel_params(uint32_t *hmsg, uint32_t *cnt) { - { return; } - } - - // general case: peels the parameters off the argument list and appends them - // to the message - template - void log_peel_params(uint32_t *hmsg, uint32_t *cnt, const P1 &p1, - Param &...param) { - hmsg[*cnt + 1] = *( - (uint32_t *)&p1); // prune type and place raw bytes in the host message - *cnt = *cnt + 1; - log_peel_params(hmsg, cnt, param...); // keep on peeling - return; - } - //---------------------------------- - - template - void write(const char *msg, const Param &...param) { - if (almost_full(16)) { - _write("Log buffer is full -- we have dropped %u messages", - _dropped_msgs++); - _buffer -= 8; - _count -= 8; - return; - } - _write(msg, param...); - } - - template - void _write(const char *msg, const Param &...param) { - // create a message - uint32_t hmsg[40]; - - // assign the constant string addr in memory - hmsg[0] = (uint32_t)((uint32_t *)((void *)(msg))); - - // recursively peel off the parameters and assign - uint32_t param_cnt = 0; - log_peel_params(&hmsg[0], ¶m_cnt, param...); - - memcpy(_buffer, &hmsg, (param_cnt + 1) * 4); - _buffer += (param_cnt + 1) * 4; - _count += (param_cnt + 1) * 4; - - return; - } - - bool almost_full(uint32_t amount) { return _count >= (_maxlen - amount); } -}; diff --git a/programming_examples/basic/log_hello_world/run.lit b/programming_examples/basic/log_hello_world/run.lit deleted file mode 100644 index 096df253c7..0000000000 --- a/programming_examples/basic/log_hello_world/run.lit +++ /dev/null @@ -1,14 +0,0 @@ -// (c) Copyright 2023 Advanced Micro Devices, Inc. -// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -// REQUIRES: ryzen_ai, chess -// -// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/kernel.cc -o ./kernel.o -// RUN: %python %S/hello_world.py > ./aie.mlir -// RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir -// RUN: clang %S/test.cpp -o test.exe -std=c++11 -Wall %xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem -// RUN: %python %S/elfStringParser.py --input . --output elf_string.csv -// RUN: %run_on_ipu ./test.exe -x aie.xclbin -k MLIR_AIE -i insts.txt -e elf_string.csv | FileCheck %s -// CHECK: Starting kernel execution -// CHECK: Core Location col=1 row=2 -// CHECK: Completed executing. cycles= diff --git a/programming_examples/basic/log_hello_world/test.cpp b/programming_examples/basic/log_hello_world/test.cpp deleted file mode 100755 index face8d0757..0000000000 --- a/programming_examples/basic/log_hello_world/test.cpp +++ /dev/null @@ -1,191 +0,0 @@ -//===- test.cpp -------------------------------------------000---*- C++ -*-===// -// -// This file is licensed under the Apache License v2.0 with LLVM Exceptions. -// See https://llvm.org/LICENSE.txt for license information. 
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception -// -// Copyright (C) 2023, Advanced Micro Devices, Inc. -// -//===----------------------------------------------------------------------===// - -#include -#include -#include -#include -#include -#include -#include - -#include "decodelog.hpp" - -#include "xrt/xrt_bo.h" -#include "xrt/xrt_device.h" -#include "xrt/xrt_kernel.h" - -namespace po = boost::program_options; - -void check_arg_file_exists(po::variables_map &vm_in, std::string name) { - if (!vm_in.count(name)) { - throw std::runtime_error("Error: no " + name + " file was provided\n"); - } else { - std::ifstream test(vm_in[name].as()); - if (!test) { - throw std::runtime_error("The " + name + " file " + - vm_in[name].as() + - " does not exist.\n"); - } - } -} - -std::vector load_instr_sequence(std::string instr_path) { - std::ifstream instr_file(instr_path); - std::string line; - std::vector instr_v; - while (std::getline(instr_file, line)) { - std::istringstream iss(line); - uint32_t a; - if (!(iss >> std::hex >> a)) { - throw std::runtime_error("Unable to parse instruction file\n"); - } - instr_v.push_back(a); - } - return instr_v; -} - -namespace po = boost::program_options; - -int main(int argc, const char *argv[]) { - - // Program arguments parsing - po::options_description desc("Allowed options"); - desc.add_options()("help,h", "produce help message")( - "xclbin,x", po::value()->required(), - "the input xclbin path")( - "kernel,k", po::value()->required(), - "the kernel name in the XCLBIN (for instance PP_PRE_FD)")( - "verbosity,v", po::value()->default_value(0), - "the verbosity of the output")( - "elfstrings,e", po::value()->required(), - "CSV file of format strings and addresses")( - "instr,i", po::value()->required(), - "path of file containing userspace instructions to be sent to the LX6"); - po::variables_map vm; - - try { - po::store(po::parse_command_line(argc, argv, desc), vm); - po::notify(vm); - - if (vm.count("help")) { - std::cout << desc << "\n"; - return 1; - } - } catch (const std::exception &ex) { - std::cerr << ex.what() << "\n\n"; - std::cerr << "Usage:\n" << desc << "\n"; - return 1; - } - - check_arg_file_exists(vm, "xclbin"); - check_arg_file_exists(vm, "instr"); - check_arg_file_exists(vm, "elfstrings"); - - // Load instruction sequence - std::vector instr_v = - load_instr_sequence(vm["instr"].as()); - - int verbosity = vm["verbosity"].as(); - if (verbosity >= 1) - std::cout << "Sequence instr count: " << instr_v.size() << "\n"; - - // Start the XRT test code - // Get a device handle - unsigned int device_index = 0; - auto device = xrt::device(device_index); - - // Load the xclbin - if (verbosity >= 1) - std::cout << "Loading xclbin: " << vm["xclbin"].as() << "\n"; - auto xclbin = xrt::xclbin(vm["xclbin"].as()); - - if (verbosity >= 1) - std::cout << "Kernel opcode: " << vm["kernel"].as() << "\n"; - std::string Node = vm["kernel"].as(); - - // Get the kernel from the xclbin - auto xkernels = xclbin.get_kernels(); - auto xkernel = *std::find_if(xkernels.begin(), xkernels.end(), - [Node](xrt::xclbin::kernel &k) { - auto name = k.get_name(); - return name.rfind(Node, 0) == 0; - }); - auto kernelName = xkernel.get_name(); - - if (verbosity >= 1) - std::cout << "Registering xclbin: " << vm["xclbin"].as() - << "\n"; - - device.register_xclbin(xclbin); - - // get a hardware context - if (verbosity >= 1) - std::cout << "Getting hardware context.\n"; - xrt::hw_context context(device, xclbin.get_uuid()); - - // get a kernel handle - if (verbosity >= 1) - 
std::cout << "Getting handle to kernel:" << kernelName << "\n"; - auto kernel = xrt::kernel(context, kernelName); - - // set up the buffer objects - auto bo_instr = xrt::bo(device, instr_v.size() * sizeof(int), - XCL_BO_FLAGS_CACHEABLE, kernel.group_id(0)); - auto bo_in = - xrt::bo(device, 2048, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(2)); - auto bo_out = - xrt::bo(device, 2048, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3)); - auto bo_logout = - xrt::bo(device, 2048, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4)); - - if (verbosity >= 1) - std::cout << "Writing data into buffer objects.\n"; - - uint32_t *bufIn = bo_in.map(); - std::vector srcVecA; - for (int i = 0; i < 256; i++) { - srcVecA.push_back(42); - } - memcpy(bufIn, srcVecA.data(), (srcVecA.size() * sizeof(uint32_t))); - - // Copy instruction stream to xrt buffer object - void *bufInstr = bo_instr.map(); - memcpy(bufInstr, instr_v.data(), instr_v.size() * sizeof(int)); - - // sync host to device memories - bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE); - bo_in.sync(XCL_BO_SYNC_BO_TO_DEVICE); - - // Execute the kernel and wait to finish - if (verbosity >= 1) - std::cout << "Running Kernel.\n"; - auto run = kernel(bo_instr, instr_v.size(), bo_in, bo_logout, bo_out); - run.wait(); - - if (verbosity >= 1) - std::cout << "Status after run: " << run.state() << "\n"; - - // Sync device to host memories - bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE); - bo_logout.sync(XCL_BO_SYNC_BO_FROM_DEVICE); - - // Store result in cv::Mat - uint32_t *bufOut = bo_out.map(); - uint32_t *bufLogOut = bo_logout.map(); - - NPULogDecoder logdecode(vm["elfstrings"].as()); - std::vector logout = logdecode.decode(bo_out); - for (const std::string &str : logout) { - std::cout << str << std::endl; - } - - return 0; -} diff --git a/programming_examples/basic/vector_exp/aie2.py b/programming_examples/basic/vector_exp/aie2.py index c1675db1d9..83b8f326ac 100644 --- a/programming_examples/basic/vector_exp/aie2.py +++ b/programming_examples/basic/vector_exp/aie2.py @@ -31,99 +31,94 @@ def my_eltwise_exp(): tiles = N_div_n // n_cores buffer_depth = 2 - # ctx wrapper - to convert python to mlir - with mlir_mod_ctx() as ctx: + # Device declaration - aie2 device NPU (aka Ryzen AI) + @device(AIEDevice.ipu) + def device_body(): - # Device declaration - aie2 device NPU (aka Ryzen AI) - @device(AIEDevice.ipu) - def device_body(): + memRef_ty = T.memref(n, T.bf16()) - memRef_ty = T.memref(n, T.bf16()) + # Type used in the tile memory + memRef_A_ty = T.memref(n, T.bf16()) + memRef_C_ty = T.memref(n, T.bf16()) - # Type used in the tile memory - memRef_A_ty = T.memref(n, T.bf16()) - memRef_C_ty = T.memref(n, T.bf16()) + # Type used in the memory tile which aggregates across the 4 cores + memRef_A_MT_ty = T.memref(n * n_cores, T.bf16()) + memRef_C_MT_ty = T.memref(n * n_cores, T.bf16()) - # Type used in the memory tile which aggregates across the 4 cores - memRef_A_MT_ty = T.memref(n * n_cores, T.bf16()) - memRef_C_MT_ty = T.memref(n * n_cores, T.bf16()) + # AIE Core Function declarations - # AIE Core Function declarations + exp_bf16_1024 = external_func("exp_bf16_1024", inputs=[memRef_ty, memRef_ty]) - exp_bf16_1024 = external_func( - "exp_bf16_1024", inputs=[memRef_ty, memRef_ty] + # Tile declarations + ShimTile = tile(0, 0) + + MemTile = tile(0, 1) + cores = [tile(0, 2 + i) for i in range(n_cores)] + + inA_fifo_names = [f"memA{i}" for i in range(n_cores)] + outC_fifo_names = [f"memC{i}" for i in range(n_cores)] + + inA_fifos = {} + outC_fifos = {} + + # AIE-array data movement with object 
fifos + # Input A + inA = object_fifo("inA", ShimTile, MemTile, buffer_depth, memRef_A_MT_ty) + for i in range(n_cores): + inA_fifos[inA_fifo_names[i]] = object_fifo( + inA_fifo_names[i], MemTile, cores[i], buffer_depth, memRef_A_ty ) + object_fifo_link(inA, inA_fifo_names) - # Tile declarations - ShimTile = tile(0, 0) - - MemTile = tile(0, 1) - cores = [tile(0, 2 + i) for i in range(n_cores)] - - inA_fifo_names = [f"memA{i}" for i in range(n_cores)] - outC_fifo_names = [f"memC{i}" for i in range(n_cores)] - - inA_fifos = {} - outC_fifos = {} - - # AIE-array data movement with object fifos - # Input A - inA = object_fifo("inA", ShimTile, MemTile, buffer_depth, memRef_A_MT_ty) - for i in range(n_cores): - inA_fifos[inA_fifo_names[i]] = object_fifo( - inA_fifo_names[i], MemTile, cores[i], buffer_depth, memRef_A_ty - ) - object_fifo_link(inA, inA_fifo_names) - - # Output C - for i in range(n_cores): - outC_fifos[outC_fifo_names[i]] = object_fifo( - outC_fifo_names[i], cores[i], MemTile, buffer_depth, memRef_C_ty - ) - outC = object_fifo("outC", MemTile, ShimTile, buffer_depth, memRef_C_MT_ty) - object_fifo_link(outC_fifo_names[0:n_cores], outC) - - # Compute tile bodies - for i in range(n_cores): - # Compute tile i - @core(cores[i], "kernels.a") - def core_body(): - for _ in for_(0xFFFFFFFF): - for _ in for_(tiles): - elem_out = outC_fifos[outC_fifo_names[i]].acquire( - ObjectFifoPort.Produce, 1 - ) - elem_in_a = inA_fifos[inA_fifo_names[i]].acquire( - ObjectFifoPort.Consume, 1 - ) - - call(exp_bf16_1024, [elem_in_a, elem_out]) - - inA_fifos[inA_fifo_names[i]].release( - ObjectFifoPort.Consume, 1 - ) - outC_fifos[outC_fifo_names[i]].release( - ObjectFifoPort.Produce, 1 - ) - yield_([]) + # Output C + for i in range(n_cores): + outC_fifos[outC_fifo_names[i]] = object_fifo( + outC_fifo_names[i], cores[i], MemTile, buffer_depth, memRef_C_ty + ) + outC = object_fifo("outC", MemTile, ShimTile, buffer_depth, memRef_C_MT_ty) + object_fifo_link(outC_fifo_names[0:n_cores], outC) + + # Compute tile bodies + for i in range(n_cores): + # Compute tile i + @core(cores[i], "kernels.a") + def core_body(): + for _ in for_(0xFFFFFFFF): + for _ in for_(tiles): + elem_out = outC_fifos[outC_fifo_names[i]].acquire( + ObjectFifoPort.Produce, 1 + ) + elem_in_a = inA_fifos[inA_fifo_names[i]].acquire( + ObjectFifoPort.Consume, 1 + ) + + call(exp_bf16_1024, [elem_in_a, elem_out]) + + inA_fifos[inA_fifo_names[i]].release(ObjectFifoPort.Consume, 1) + outC_fifos[outC_fifo_names[i]].release( + ObjectFifoPort.Produce, 1 + ) yield_([]) + yield_([]) - # To/from AIE-array data movement - tensor_ty = T.memref(N, T.i32()) - - @FuncOp.from_py_func(tensor_ty, tensor_ty) - def sequence(A, C): - ipu_dma_memcpy_nd( - metadata="outC", bd_id=0, mem=C, sizes=[1, 1, 1, C_sz_in_i32s] - ) - ipu_dma_memcpy_nd( - metadata="inA", bd_id=1, mem=A, sizes=[1, 1, 1, A_sz_in_i32s] - ) - ipu_sync(column=0, row=0, direction=0, channel=0) + # To/from AIE-array data movement + tensor_ty = T.memref(N, T.i32()) - # Print the mlir conversion - print(ctx.module) + @FuncOp.from_py_func(tensor_ty, tensor_ty) + def sequence(A, C): + ipu_dma_memcpy_nd( + metadata="outC", bd_id=0, mem=C, sizes=[1, 1, 1, C_sz_in_i32s] + ) + ipu_dma_memcpy_nd( + metadata="inA", bd_id=1, mem=A, sizes=[1, 1, 1, A_sz_in_i32s] + ) + ipu_sync(column=0, row=0, direction=0, channel=0) -# Call design function to generate mlir code to stdout -my_eltwise_exp() +with mlir_mod_ctx() as ctx: + my_eltwise_exp() + res = ctx.module.operation.verify() + if res == True: + print(ctx.module) + else: 
+ print(res) diff --git a/programming_examples/basic/vector_reduce_add/Makefile b/programming_examples/basic/vector_reduce_add/Makefile index ad4724dc45..ea201d5753 100644 --- a/programming_examples/basic/vector_reduce_add/Makefile +++ b/programming_examples/basic/vector_reduce_add/Makefile @@ -19,7 +19,7 @@ all: build/final.xclbin build/insts.txt VPATH := ../../../aie_kernels/aie2 -build/%.o: %.cc +build/%.cc.o: %.cc mkdir -p ${@D} cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $(<:%=../%) -o ${@F} @@ -27,7 +27,7 @@ build/aie.mlir: aie2.py mkdir -p ${@D} python3 $< ${devicename} ${col} > $@ -build/final.xclbin: build/aie.mlir build/reduce_add.o +build/final.xclbin: build/aie.mlir build/reduce_add.cc.o mkdir -p ${@D} cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \ --aie-generate-ipu --ipu-insts-name=insts.txt $(<:%=../%) diff --git a/programming_examples/basic/vector_reduce_add/aie2.py b/programming_examples/basic/vector_reduce_add/aie2.py index ba689a428f..fe035bfc96 100644 --- a/programming_examples/basic/vector_reduce_add/aie2.py +++ b/programming_examples/basic/vector_reduce_add/aie2.py @@ -21,61 +21,61 @@ def my_reduce_add(): buffer_depth = 2 - with mlir_mod_ctx() as ctx: - - if len(sys.argv) != 3: - raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") - - if sys.argv[1] == "ipu": - dev = AIEDevice.ipu - elif sys.argv[1] == "xcvc1902": - dev = AIEDevice.xcvc1902 - else: - raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) - - @device(dev) - def device_body(): - memRef_I_ty = T.memref(N, T.i32()) - memRef_O_ty = T.memref(1, T.i32()) - - # AIE Core Function declarations - reduce_add_vector = external_func( - "reduce_add_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] - ) - - # Tile declarations - ShimTile = tile(int(sys.argv[2]), 0) - ComputeTile2 = tile(int(sys.argv[2]), 2) - - # AIE-array data movement with object fifos - of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) - of_out = object_fifo( - "out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty - ) - - # Set up compute tiles - - # Compute tile 2 - @core(ComputeTile2, "reduce_add.o") - def core_body(): - for _ in for_(0xFFFFFFFF): - elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) - elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) - call(reduce_add_vector, [elem_in, elem_out, N]) - of_in.release(ObjectFifoPort.Consume, 1) - of_out.release(ObjectFifoPort.Produce, 1) - yield_([]) - - # To/from AIE-array data movement - tensor_ty = T.memref(N, T.i32()) - - @FuncOp.from_py_func(tensor_ty, tensor_ty) - def sequence(A, C): - ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) - ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) - ipu_sync(column=0, row=0, direction=0, channel=0) - - print(ctx.module) - - -my_reduce_add() + if len(sys.argv) != 3: + raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") + + if sys.argv[1] == "ipu": + dev = AIEDevice.ipu + elif sys.argv[1] == "xcvc1902": + dev = AIEDevice.xcvc1902 + else: + raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) + + @device(dev) + def device_body(): + memRef_I_ty = T.memref(N, T.i32()) + memRef_O_ty = T.memref(1, T.i32()) + + # AIE Core Function declarations + reduce_add_vector = external_func( + "reduce_add_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] + ) + + # Tile declarations + ShimTile = tile(int(sys.argv[2]), 0) + ComputeTile2 = tile(int(sys.argv[2]), 2) + + # 
AIE-array data movement with object fifos + of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) + of_out = object_fifo("out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty) + + # Set up compute tiles + + # Compute tile 2 + @core(ComputeTile2, "reduce_add.cc.o") + def core_body(): + for _ in for_(0xFFFFFFFF): + elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) + elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) + call(reduce_add_vector, [elem_in, elem_out, N]) + of_in.release(ObjectFifoPort.Consume, 1) + of_out.release(ObjectFifoPort.Produce, 1) + yield_([]) + + # To/from AIE-array data movement + tensor_ty = T.memref(N, T.i32()) + + @FuncOp.from_py_func(tensor_ty, tensor_ty) + def sequence(A, C): + ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) + ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) + ipu_sync(column=0, row=0, direction=0, channel=0) + + +with mlir_mod_ctx() as ctx: + my_reduce_add() + res = ctx.module.operation.verify() + if res == True: + print(ctx.module) + else: + print(res) diff --git a/programming_examples/basic/vector_reduce_add/run.lit b/programming_examples/basic/vector_reduce_add/run.lit index d80c03d3f5..5a29dca2e8 100644 --- a/programming_examples/basic/vector_reduce_add/run.lit +++ b/programming_examples/basic/vector_reduce_add/run.lit @@ -1,9 +1,9 @@ // (c) Copyright 2023 Advanced Micro Devices, Inc. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception // -// REQUIRES: ryzen_ai +// REQUIRES: ryzen_ai, chess // -// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_add.cc -o reduce_add.o +// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_add.cc -o reduce_add.cc.o // RUN: %python %S/aie2.py ipu 0 | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir // RUN: g++ %S/test.cpp -o test.exe -std=c++23 -Wall -I%S/../../../runtime_lib/test_lib %S/../../../runtime_lib/test_lib/test_utils.cpp %xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem diff --git a/programming_examples/basic/vector_reduce_max/Makefile b/programming_examples/basic/vector_reduce_max/Makefile index 3ca11ea293..3ef597e472 100755 --- a/programming_examples/basic/vector_reduce_max/Makefile +++ b/programming_examples/basic/vector_reduce_max/Makefile @@ -19,7 +19,7 @@ all: build/final.xclbin build/insts.txt VPATH := ../../../aie_kernels/aie2 -build/%.o: %.cc +build/%.cc.o: %.cc mkdir -p ${@D} cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $(<:%=../%) -o ${@F} @@ -27,7 +27,7 @@ build/aie.mlir: aie2.py mkdir -p ${@D} python3 $< ${devicename} ${col} > $@ -build/final.xclbin: build/aie.mlir build/reduce_max.o +build/final.xclbin: build/aie.mlir build/reduce_max.cc.o mkdir -p ${@D} cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \ --aie-generate-ipu --ipu-insts-name=insts.txt $(<:%=../%) diff --git a/programming_examples/basic/vector_reduce_max/aie2.py b/programming_examples/basic/vector_reduce_max/aie2.py index b275b76302..c081cf7659 100755 --- a/programming_examples/basic/vector_reduce_max/aie2.py +++ b/programming_examples/basic/vector_reduce_max/aie2.py @@ -21,61 +21,61 @@ def my_reduce_max(): buffer_depth = 2 - with mlir_mod_ctx() as ctx: - - if len(sys.argv) != 3: - raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") - - if sys.argv[1] == "ipu": 
- dev = AIEDevice.ipu - elif sys.argv[1] == "xcvc1902": - dev = AIEDevice.xcvc1902 - else: - raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) - - @device(dev) - def device_body(): - memRef_I_ty = T.memref(N, T.i32()) - memRef_O_ty = T.memref(1, T.i32()) - - # AIE Core Function declarations - reduce_max_vector = external_func( - "reduce_max_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] - ) - - # Tile declarations - ShimTile = tile(int(sys.argv[2]), 0) - ComputeTile2 = tile(int(sys.argv[2]), 2) - - # AIE-array data movement with object fifos - of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) - of_out = object_fifo( - "out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty - ) - - # Set up compute tiles - - # Compute tile 2 - @core(ComputeTile2, "reduce_max.o") - def core_body(): - for _ in for_(0xFFFFFFFF): - elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) - elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) - call(reduce_max_vector, [elem_in, elem_out, N]) - of_in.release(ObjectFifoPort.Consume, 1) - of_out.release(ObjectFifoPort.Produce, 1) - yield_([]) - - # To/from AIE-array data movement - tensor_ty = T.memref(N, T.i32()) - - @FuncOp.from_py_func(tensor_ty, tensor_ty) - def sequence(A, C): - ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) - ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) - ipu_sync(column=0, row=0, direction=0, channel=0) - - print(ctx.module) - - -my_reduce_max() + if len(sys.argv) != 3: + raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") + + if sys.argv[1] == "ipu": + dev = AIEDevice.ipu + elif sys.argv[1] == "xcvc1902": + dev = AIEDevice.xcvc1902 + else: + raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) + + @device(dev) + def device_body(): + memRef_I_ty = T.memref(N, T.i32()) + memRef_O_ty = T.memref(1, T.i32()) + + # AIE Core Function declarations + reduce_max_vector = external_func( + "reduce_max_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] + ) + + # Tile declarations + ShimTile = tile(int(sys.argv[2]), 0) + ComputeTile2 = tile(int(sys.argv[2]), 2) + + # AIE-array data movement with object fifos + of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) + of_out = object_fifo("out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty) + + # Set up compute tiles + + # Compute tile 2 + @core(ComputeTile2, "reduce_max.cc.o") + def core_body(): + for _ in for_(0xFFFFFFFF): + elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) + elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) + call(reduce_max_vector, [elem_in, elem_out, N]) + of_in.release(ObjectFifoPort.Consume, 1) + of_out.release(ObjectFifoPort.Produce, 1) + yield_([]) + + # To/from AIE-array data movement + tensor_ty = T.memref(N, T.i32()) + + @FuncOp.from_py_func(tensor_ty, tensor_ty) + def sequence(A, C): + ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) + ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) + ipu_sync(column=0, row=0, direction=0, channel=0) + + +with mlir_mod_ctx() as ctx: + my_reduce_max() + res = ctx.module.operation.verify() + if res == True: + print(ctx.module) + else: + print(res) diff --git a/programming_examples/basic/vector_reduce_max/run.lit b/programming_examples/basic/vector_reduce_max/run.lit index 0b0d385ce7..6c3233183c 100644 --- a/programming_examples/basic/vector_reduce_max/run.lit +++ b/programming_examples/basic/vector_reduce_max/run.lit @@ -3,7 +3,7 
@@ // // REQUIRES: ryzen_ai, chess // -// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_max.cc -o reduce_max.o +// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_max.cc -o reduce_max.cc.o // RUN: %python %S/aie2.py ipu 0 | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir // RUN: g++ %S/test.cpp -o test.exe -std=c++23 -Wall -I%S/../../../runtime_lib/test_lib %S/../../../runtime_lib/test_lib/test_utils.cpp %xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem diff --git a/programming_examples/basic/vector_reduce_min/Makefile b/programming_examples/basic/vector_reduce_min/Makefile index 0ade6ed0fd..b0b724e4a3 100755 --- a/programming_examples/basic/vector_reduce_min/Makefile +++ b/programming_examples/basic/vector_reduce_min/Makefile @@ -19,7 +19,7 @@ all: build/final.xclbin build/insts.txt VPATH := ../../../aie_kernels/aie2 -build/%.o: %.cc +build/%.cc.o: %.cc mkdir -p ${@D} cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $(<:%=../%) -o ${@F} @@ -27,7 +27,7 @@ build/aie.mlir: aie2.py mkdir -p ${@D} python3 $< ${devicename} ${col} > $@ -build/final.xclbin: build/aie.mlir build/reduce_min.o +build/final.xclbin: build/aie.mlir build/reduce_min.cc.o mkdir -p ${@D} cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \ --aie-generate-ipu --ipu-insts-name=insts.txt $(<:%=../%) diff --git a/programming_examples/basic/vector_reduce_min/aie2.py b/programming_examples/basic/vector_reduce_min/aie2.py index af14cc5a56..a8ef279a13 100755 --- a/programming_examples/basic/vector_reduce_min/aie2.py +++ b/programming_examples/basic/vector_reduce_min/aie2.py @@ -21,61 +21,61 @@ def my_reduce_min(): buffer_depth = 2 - with mlir_mod_ctx() as ctx: - - if len(sys.argv) != 3: - raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") - - if sys.argv[1] == "ipu": - dev = AIEDevice.ipu - elif sys.argv[1] == "xcvc1902": - dev = AIEDevice.xcvc1902 - else: - raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) - - @device(dev) - def device_body(): - memRef_I_ty = T.memref(N, T.i32()) - memRef_O_ty = T.memref(1, T.i32()) - - # AIE Core Function declarations - reduce_min_vector = external_func( - "reduce_min_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] - ) - - # Tile declarations - ShimTile = tile(int(sys.argv[2]), 0) - ComputeTile2 = tile(int(sys.argv[2]), 2) - - # AIE-array data movement with object fifos - of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) - of_out = object_fifo( - "out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty - ) - - # Set up compute tiles - - # Compute tile 2 - @core(ComputeTile2, "reduce_min.o") - def core_body(): - for _ in for_(0xFFFFFFFF): - elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) - elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) - call(reduce_min_vector, [elem_in, elem_out, N]) - of_in.release(ObjectFifoPort.Consume, 1) - of_out.release(ObjectFifoPort.Produce, 1) - yield_([]) - - # To/from AIE-array data movement - tensor_ty = T.memref(N, T.i32()) - - @FuncOp.from_py_func(tensor_ty, tensor_ty) - def sequence(A, C): - ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) - ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) - ipu_sync(column=0, row=0, direction=0, channel=0) - - print(ctx.module) - - 
-my_reduce_min() + if len(sys.argv) != 3: + raise ValueError("[ERROR] Need 2 command line arguments (Device name, Col)") + + if sys.argv[1] == "ipu": + dev = AIEDevice.ipu + elif sys.argv[1] == "xcvc1902": + dev = AIEDevice.xcvc1902 + else: + raise ValueError("[ERROR] Device name {} is unknown".format(sys.argv[1])) + + @device(dev) + def device_body(): + memRef_I_ty = T.memref(N, T.i32()) + memRef_O_ty = T.memref(1, T.i32()) + + # AIE Core Function declarations + reduce_min_vector = external_func( + "reduce_min_vector", inputs=[memRef_I_ty, memRef_O_ty, T.i32()] + ) + + # Tile declarations + ShimTile = tile(int(sys.argv[2]), 0) + ComputeTile2 = tile(int(sys.argv[2]), 2) + + # AIE-array data movement with object fifos + of_in = object_fifo("in", ShimTile, ComputeTile2, buffer_depth, memRef_I_ty) + of_out = object_fifo("out", ComputeTile2, ShimTile, buffer_depth, memRef_O_ty) + + # Set up compute tiles + + # Compute tile 2 + @core(ComputeTile2, "reduce_min.cc.o") + def core_body(): + for _ in for_(0xFFFFFFFF): + elem_out = of_out.acquire(ObjectFifoPort.Produce, 1) + elem_in = of_in.acquire(ObjectFifoPort.Consume, 1) + call(reduce_min_vector, [elem_in, elem_out, N]) + of_in.release(ObjectFifoPort.Consume, 1) + of_out.release(ObjectFifoPort.Produce, 1) + yield_([]) + + # To/from AIE-array data movement + tensor_ty = T.memref(N, T.i32()) + + @FuncOp.from_py_func(tensor_ty, tensor_ty) + def sequence(A, C): + ipu_dma_memcpy_nd(metadata="out", bd_id=0, mem=C, sizes=[1, 1, 1, 1]) + ipu_dma_memcpy_nd(metadata="in", bd_id=1, mem=A, sizes=[1, 1, 1, N]) + ipu_sync(column=0, row=0, direction=0, channel=0) + + +with mlir_mod_ctx() as ctx: + my_reduce_min() + res = ctx.module.operation.verify() + if res == True: + print(ctx.module) + else: + print(res) diff --git a/programming_examples/basic/vector_reduce_min/run.lit b/programming_examples/basic/vector_reduce_min/run.lit index 23ce28e79d..95ecbd533a 100644 --- a/programming_examples/basic/vector_reduce_min/run.lit +++ b/programming_examples/basic/vector_reduce_min/run.lit @@ -3,7 +3,7 @@ // // REQUIRES: ryzen_ai, chess // -// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_min.cc -o reduce_min.o +// RUN: xchesscc_wrapper aie2 -I %aietools/include -c %S/../../../aie_kernels/aie2/reduce_min.cc -o reduce_min.cc.o // RUN: %python %S/aie2.py ipu 0 | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir // RUN: g++ %S/test.cpp -o test.exe -std=c++23 -Wall -I%S/../../../runtime_lib/test_lib %S/../../../runtime_lib/test_lib/test_utils.cpp %xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem diff --git a/programming_examples/lit.cfg.py b/programming_examples/lit.cfg.py index d5ff22c85e..b28803cb43 100755 --- a/programming_examples/lit.cfg.py +++ b/programming_examples/lit.cfg.py @@ -104,6 +104,14 @@ opencv_flags = "" config.substitutions.append(("%opencv_flags", opencv_flags)) +try: + import torch + + config.available_features.add("torch") +except ImportError: + print("torch not found", file=sys.stderr) + pass + VitisSysrootFlag = "" if config.aieHostTarget == "x86_64": config.substitutions.append(("%aieHostTargetTriplet%", "x86_64-unknown-linux-gnu")) diff --git a/programming_examples/ml/bottleneck/Makefile b/programming_examples/ml/bottleneck/Makefile index f5c6e4561f..47ca6a78f7 100755 --- a/programming_examples/ml/bottleneck/Makefile +++ 
b/programming_examples/ml/bottleneck/Makefile @@ -37,4 +37,4 @@ clean: *.log aie_partition.json *.bin BOOT.BIN _x test.exe run_py: - ${powershell} python3 test.py + ${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE \ No newline at end of file diff --git a/programming_examples/ml/bottleneck/README.md b/programming_examples/ml/bottleneck/README.md index 144b8e36f2..40a69e8576 100644 --- a/programming_examples/ml/bottleneck/README.md +++ b/programming_examples/ml/bottleneck/README.md @@ -115,11 +115,4 @@ make To run the design: ``` make run_py -``` - -### Prerequisites -To install the dependencies, run the following command: -``` -pip install -r requirements.txt - ``` \ No newline at end of file diff --git a/programming_examples/ml/bottleneck/run.lit b/programming_examples/ml/bottleneck/run.lit index ec30002c97..8a6024d66e 100644 --- a/programming_examples/ml/bottleneck/run.lit +++ b/programming_examples/ml/bottleneck/run.lit @@ -8,5 +8,5 @@ // RUN: xchesscc_wrapper aie2 -I %aietools/include -DBIT_WIDTH=8 -DINT8_ACT -c %S/../../../aie_kernels/aie2/conv2dk1_skip.cc -o conv2dk1_skip.o // RUN: %python %S/aie2.py | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir -// RUN: %run_on_ipu %python %S/test.py | FileCheck %s +// RUN: %run_on_ipu %python %S/test.py -x aie.xclbin -i insts.txt -k MLIR_AIE | FileCheck %s // CHECK: PASS! \ No newline at end of file diff --git a/programming_examples/ml/bottleneck/test.py b/programming_examples/ml/bottleneck/test.py index 34f6347175..48a9a8929c 100644 --- a/programming_examples/ml/bottleneck/test.py +++ b/programming_examples/ml/bottleneck/test.py @@ -14,177 +14,192 @@ import os import numpy as np from aie.utils.xrt import setup_aie, extract_trace, write_out_trace, execute +import aie.utils.test as test_utils torch.use_deterministic_algorithms(True) torch.manual_seed(0) -design = "bottleneck_int8" -xclbin_path = os.path.abspath("build/final.xclbin") -insts_path = os.path.abspath("build/insts.txt") - -log_folder = "log/" -if not os.path.exists(log_folder): - os.makedirs(log_folder) - -num_iter = 1 -npu_time_total = 0 -npu_time_min = 9999999 -npu_time_max = 0 -trace_size = 16384 -enable_trace = False -trace_file = "log/trace_" + design + ".txt" -# ------------------------------------------------------ -# Configure this to match your design's buffer size -# ------------------------------------------------------ -dtype_in = np.dtype("int8") -dtype_wts = np.dtype("int8") -dtype_out = np.dtype("uint8") - -shape_in_act = (32, 32, 32, 8) -shape_in_wts1 = (8, 32, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 -shape_in_wts2 = (8, 8, 3, 3, 8, 8) # out,in,ky,kx,in8,out8 -shape_in_wts3 = (32, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 -shape_total_wts = (69632, 1) -shape_out = (32, 32, 32, 8) - -# ------------------------------------------------------ -# Initialize activation, weights, scaling factor for int8 model -# ------------------------------------------------------ -int_inp = torch.randint(1, 100, (1, 256, 32, 32)).type(torch.FloatTensor) -int_weight1 = torch.randint(50, 100, (64, 256, 1, 1)).type(torch.FloatTensor) -int_weight2 = torch.randint(50, 100, (64, 64, 3, 3)).type(torch.FloatTensor) -int_weight3 = torch.randint(50, 100, (256, 64, 1, 1)).type(torch.FloatTensor) - -inp_scale1 = 0.5 -inp_scale2 = 0.5 -inp_scale3 = 0.5 -inp_scale4 = 0.5 - -weight_scale1 = 0.5 -weight_scale2 = 0.5 -weight_scale3 = 
0.5 - -combined_scale1 = -math.log2(inp_scale1 * weight_scale1 / inp_scale2) -combined_scale2 = -math.log2(inp_scale2 * weight_scale2 / inp_scale3) -combined_scale3 = -math.log2(inp_scale3 * weight_scale3 / inp_scale1) -combined_scale4 = -math.log2(inp_scale1 / inp_scale4) -conv_scale = 0.0039 # scale to convert int8 output to floating point -relu_scale = 0.0078 # scale to convert int8 output to floating point -min = 0 -max = 255 - -# ------------------------------------------------------ -# Get device, load the xclbin & kernel and register them -# ------------------------------------------------------ -app = setup_aie( - xclbin_path, - insts_path, - shape_in_act, - dtype_in, - shape_total_wts, - dtype_wts, - shape_out, - dtype_out, - enable_trace=enable_trace, - trace_size=trace_size, -) - - -# ------------------------------------------------------ -# Define your golden reference -# ------------------------------------------------------ -class bottleneck_int8(nn.Module): - def __init__(self, in_planes=256, planes=64): - super(bottleneck_int8, self).__init__() - self.conv1 = nn.Conv2d(256, 64, kernel_size=1, bias=False) - self.conv2 = nn.Conv2d( - 64, 64, kernel_size=3, padding=1, padding_mode="zeros", bias=False - ) - self.conv3 = nn.Conv2d(64, 256, kernel_size=1, bias=False) - - self.relu1 = nn.ReLU() - self.relu2 = nn.ReLU() - self.relu3 = nn.ReLU() - - def forward(self, x): - conv1_out = self.conv1(x) * inp_scale1 * weight_scale1 - relu1_out = torch.clamp( - torch.round(self.relu1(conv1_out) / inp_scale2), min, max - ) # convert to int and apply relu - conv2_out = self.conv2(relu1_out) * inp_scale2 * weight_scale2 - relu2_out = torch.clamp( - torch.round(self.relu2(conv2_out) / inp_scale3), min, max - ) - conv3_out = self.conv3(relu2_out) * inp_scale3 * weight_scale3 - same_scale_init = torch.clamp(torch.round(conv3_out / inp_scale1), -128, 127) - skip_add = inp_scale1 * (same_scale_init + int_inp) - final_out = inp_scale4 * ( - torch.clamp(torch.round(skip_add / inp_scale4), min, max) - ) - return final_out - - -# ------------------------------------------------------ -# Pytorch baseline -# ------------------------------------------------------ -model = bottleneck_int8() -model.eval() -model.conv1.weight.data.copy_(int_weight1) -model.conv2.weight.data.copy_(int_weight2) -model.conv3.weight.data.copy_(int_weight3) - -golden_output = model(int_inp) - -# ------------------------------------------------------ -# Reorder input data-layout -# ------------------------------------------------------ -ds = DataShaper() -before_input = int_inp.squeeze().data.numpy().astype(dtype_in) -before_input.tofile(log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d") -ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") -ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") - -wts1 = ds.reorder_mat(int_weight1.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") -wts2 = ds.reorder_mat(int_weight2.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") -wts3 = ds.reorder_mat(int_weight3.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") - -total_wts = np.concatenate((wts1, wts2, wts3), axis=None) -total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") - -# ------------------------------------------------------ -# Main run loop -# ------------------------------------------------------ -for i in range(num_iter): - start = time.time_ns() - aie_output = execute(app, ifm_mem_fmt, total_wts) * inp_scale4 - stop = time.time_ns() - - if 
enable_trace: - aie_output, trace = extract_trace(aie_output, shape_out, dtype_out, trace_size) - write_out_trace(trace, trace_file) - - npu_time = stop - start - npu_time_total = npu_time_total + npu_time - -# ------------------------------------------------------ -# Reorder output data-layout -# ------------------------------------------------------ -temp_out = aie_output.reshape(32, 32, 32, 8) -temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") -ofm_mem_fmt = temp_out.reshape(256, 32, 32) -ofm_mem_fmt.tofile(log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d") -ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) - -# ------------------------------------------------------ -# Compare the AIE output and the golden reference -# ------------------------------------------------------ -print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) - -assert np.allclose( - ofm_mem_fmt_out.detach().numpy(), - golden_output.detach().numpy(), - rtol=0, - atol=inp_scale4, -) - -print("\nPASS!\n") + +def main(opts): + design = "bottleneck_int8" + xclbin_path = opts.xclbin + insts_path = opts.instr + + log_folder = "log/" + if not os.path.exists(log_folder): + os.makedirs(log_folder) + + num_iter = 1 + npu_time_total = 0 + npu_time_min = 9999999 + npu_time_max = 0 + trace_size = 16384 + enable_trace = False + trace_file = "log/trace_" + design + ".txt" + # ------------------------------------------------------ + # Configure this to match your design's buffer size + # ------------------------------------------------------ + dtype_in = np.dtype("int8") + dtype_wts = np.dtype("int8") + dtype_out = np.dtype("uint8") + + shape_in_act = (32, 32, 32, 8) + shape_in_wts1 = (8, 32, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 + shape_in_wts2 = (8, 8, 3, 3, 8, 8) # out,in,ky,kx,in8,out8 + shape_in_wts3 = (32, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 + shape_total_wts = (69632, 1) + shape_out = (32, 32, 32, 8) + + # ------------------------------------------------------ + # Initialize activation, weights, scaling factor for int8 model + # ------------------------------------------------------ + int_inp = torch.randint(1, 100, (1, 256, 32, 32)).type(torch.FloatTensor) + int_weight1 = torch.randint(50, 100, (64, 256, 1, 1)).type(torch.FloatTensor) + int_weight2 = torch.randint(50, 100, (64, 64, 3, 3)).type(torch.FloatTensor) + int_weight3 = torch.randint(50, 100, (256, 64, 1, 1)).type(torch.FloatTensor) + + inp_scale1 = 0.5 + inp_scale2 = 0.5 + inp_scale3 = 0.5 + inp_scale4 = 0.5 + + weight_scale1 = 0.5 + weight_scale2 = 0.5 + weight_scale3 = 0.5 + + combined_scale1 = -math.log2(inp_scale1 * weight_scale1 / inp_scale2) + combined_scale2 = -math.log2(inp_scale2 * weight_scale2 / inp_scale3) + combined_scale3 = -math.log2(inp_scale3 * weight_scale3 / inp_scale1) + combined_scale4 = -math.log2(inp_scale1 / inp_scale4) + conv_scale = 0.0039 # scale to convert int8 output to floating point + relu_scale = 0.0078 # scale to convert int8 output to floating point + min = 0 + max = 255 + + # ------------------------------------------------------ + # Get device, load the xclbin & kernel and register them + # ------------------------------------------------------ + app = setup_aie( + xclbin_path, + insts_path, + shape_in_act, + dtype_in, + shape_total_wts, + dtype_wts, + shape_out, + dtype_out, + enable_trace=enable_trace, + trace_size=trace_size, + ) + + # ------------------------------------------------------ + # Define your golden reference + # 
------------------------------------------------------ + class bottleneck_int8(nn.Module): + def __init__(self, in_planes=256, planes=64): + super(bottleneck_int8, self).__init__() + self.conv1 = nn.Conv2d(256, 64, kernel_size=1, bias=False) + self.conv2 = nn.Conv2d( + 64, 64, kernel_size=3, padding=1, padding_mode="zeros", bias=False + ) + self.conv3 = nn.Conv2d(64, 256, kernel_size=1, bias=False) + + self.relu1 = nn.ReLU() + self.relu2 = nn.ReLU() + self.relu3 = nn.ReLU() + + def forward(self, x): + conv1_out = self.conv1(x) * inp_scale1 * weight_scale1 + relu1_out = torch.clamp( + torch.round(self.relu1(conv1_out) / inp_scale2), min, max + ) # convert to int and apply relu + conv2_out = self.conv2(relu1_out) * inp_scale2 * weight_scale2 + relu2_out = torch.clamp( + torch.round(self.relu2(conv2_out) / inp_scale3), min, max + ) + conv3_out = self.conv3(relu2_out) * inp_scale3 * weight_scale3 + same_scale_init = torch.clamp( + torch.round(conv3_out / inp_scale1), -128, 127 + ) + skip_add = inp_scale1 * (same_scale_init + int_inp) + final_out = inp_scale4 * ( + torch.clamp(torch.round(skip_add / inp_scale4), min, max) + ) + return final_out + + # ------------------------------------------------------ + # Pytorch baseline + # ------------------------------------------------------ + model = bottleneck_int8() + model.eval() + model.conv1.weight.data.copy_(int_weight1) + model.conv2.weight.data.copy_(int_weight2) + model.conv3.weight.data.copy_(int_weight3) + + golden_output = model(int_inp) + + # ------------------------------------------------------ + # Reorder input data-layout + # ------------------------------------------------------ + ds = DataShaper() + before_input = int_inp.squeeze().data.numpy().astype(dtype_in) + before_input.tofile( + log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d" + ) + ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") + ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") + + wts1 = ds.reorder_mat(int_weight1.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") + wts2 = ds.reorder_mat(int_weight2.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") + wts3 = ds.reorder_mat(int_weight3.data.numpy().astype(dtype_in), "OIYXI8O8", "OIYX") + + total_wts = np.concatenate((wts1, wts2, wts3), axis=None) + total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") + + # ------------------------------------------------------ + # Main run loop + # ------------------------------------------------------ + for i in range(num_iter): + start = time.time_ns() + aie_output = execute(app, ifm_mem_fmt, total_wts) * inp_scale4 + stop = time.time_ns() + + if enable_trace: + aie_output, trace = extract_trace( + aie_output, shape_out, dtype_out, trace_size + ) + write_out_trace(trace, trace_file) + + npu_time = stop - start + npu_time_total = npu_time_total + npu_time + + # ------------------------------------------------------ + # Reorder output data-layout + # ------------------------------------------------------ + temp_out = aie_output.reshape(32, 32, 32, 8) + temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") + ofm_mem_fmt = temp_out.reshape(256, 32, 32) + ofm_mem_fmt.tofile( + log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d" + ) + ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) + + # ------------------------------------------------------ + # Compare the AIE output and the golden reference + # ------------------------------------------------------ + print("\nAvg NPU time: 
{}us.".format(int((npu_time_total / num_iter) / 1000))) + + assert np.allclose( + ofm_mem_fmt_out.detach().numpy(), + golden_output.detach().numpy(), + rtol=0, + atol=inp_scale4, + ) + + print("\nPASS!\n") + + +if __name__ == "__main__": + p = test_utils.create_default_argparser() + opts = p.parse_args(sys.argv[1:]) + main(opts) diff --git a/programming_examples/ml/conv2d/Makefile b/programming_examples/ml/conv2d/Makefile index 2bba6ea11c..0a89ce4bf0 100755 --- a/programming_examples/ml/conv2d/Makefile +++ b/programming_examples/ml/conv2d/Makefile @@ -34,4 +34,4 @@ clean: chess* *.o insts.txt \ *.log aie_partition.json *.bin BOOT.BIN _x test.exe run_py: - ${powershell} python3 test.py \ No newline at end of file + ${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE \ No newline at end of file diff --git a/programming_examples/ml/conv2d/README.md b/programming_examples/ml/conv2d/README.md index 81b25f3e52..b2d93f066d 100644 --- a/programming_examples/ml/conv2d/README.md +++ b/programming_examples/ml/conv2d/README.md @@ -56,12 +56,5 @@ make To run the design: ``` -make run -``` - -### Prerequisites -To install the dependencies, run the following command: -``` -pip install -r requirements.txt - +make run_py ``` \ No newline at end of file diff --git a/programming_examples/ml/conv2d/run.lit b/programming_examples/ml/conv2d/run.lit index 1eeef90b94..349e45f9bc 100644 --- a/programming_examples/ml/conv2d/run.lit +++ b/programming_examples/ml/conv2d/run.lit @@ -1,4 +1,4 @@ -// (c) Copyright 2023 Advanced Micro Devices, Inc. +// (c) Copyright 2024 Advanced Micro Devices, Inc. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception // // REQUIRES: ryzen_ai, chess, torch @@ -6,5 +6,5 @@ // RUN: xchesscc_wrapper aie2 -I %aietools/include -DBIT_WIDTH=8 -DINT8_ACT -c %S/../../../aie_kernels/aie2/conv2dk1_i8.cc -o conv2dk1_i8.o // RUN: %python %S/aie2.py | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir -// RUN: %run_on_ipu %python %S/test.py | FileCheck %s +// RUN: %run_on_ipu %python %S/test.py -x aie.xclbin -i insts.txt -k MLIR_AIE | FileCheck %s // CHECK: PASS! 
\ No newline at end of file diff --git a/programming_examples/ml/conv2d/test.py b/programming_examples/ml/conv2d/test.py index 1dc847d8fe..1a8d2e7712 100644 --- a/programming_examples/ml/conv2d/test.py +++ b/programming_examples/ml/conv2d/test.py @@ -14,136 +14,149 @@ import os import numpy as np from aie.utils.xrt import setup_aie, extract_trace, write_out_trace, execute +import aie.utils.test as test_utils torch.use_deterministic_algorithms(True) torch.manual_seed(0) -design = "conv2d" -xclbin_path = os.path.abspath("build/final.xclbin") -insts_path = os.path.abspath("build/insts.txt") - -log_folder = "log/" -if not os.path.exists(log_folder): - os.makedirs(log_folder) - -num_iter = 1 -npu_time_total = 0 -npu_time_min = 9999999 -npu_time_max = 0 -trace_size = 16384 -enable_trace = False -trace_file = "log/trace_" + design + ".txt" -# ------------------------------------------------------ -# Configure this to match your design's buffer size -# ------------------------------------------------------ -dtype_in = np.dtype("int8") -dtype_wts = np.dtype("int8") -dtype_out = np.dtype("int8") - -shape_total_wts = (4096, 1) -shape_in_act = (32, 8, 32, 8) #'YCXC8' , 'CYX' -shape_in_wts1 = (8, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 -shape_out = (32, 8, 32, 8) - -# ------------------------------------------------------ -# Initialize activation, weights, scaling factor for int8 model -# ------------------------------------------------------ -int_inp = torch.randint(1, 20, (1, 64, 32, 32)).type(torch.FloatTensor) -int_weight = torch.randint(50, 80, (64, 64, 1, 1)).type(torch.FloatTensor) -conv_scale = 7.6294e-06 # scale to convert int8 output to floating point -int8_scale = 0.0078 # scale to convert int8 output to floating point -min = -128 -max = 127 -# ------------------------------------------------------ -# Get device, load the xclbin & kernel and register them -# ------------------------------------------------------ -app = setup_aie( - xclbin_path, - insts_path, - shape_in_act, - dtype_in, - shape_total_wts, - dtype_wts, - shape_out, - dtype_out, - enable_trace=enable_trace, - trace_size=trace_size, -) - - -# ------------------------------------------------------ -# Define your golden reference -# ------------------------------------------------------ -class conv2d_int_model(nn.Module): - def __init__(self, in_planes=64, planes=64): - super(conv2d_int_model, self).__init__() - self.conv = nn.Conv2d(64, 64, kernel_size=1, bias=False) - - def forward(self, x): - out_int = self.conv(x) - out_quant = out_int * conv_scale # int8 x int8 leads to int32 output - out_float = int8_scale * torch.clamp( - torch.round(out_quant / int8_scale), min, max - ) # converting to int8 range - return out_float - - -# ------------------------------------------------------ -# Pytorch baseline -# ------------------------------------------------------ -model = conv2d_int_model() -model.eval() -model.conv.weight.data.copy_(int_weight) - -golden_output = model(int_inp) - -# ------------------------------------------------------ -# Reorder input data-layout -# ------------------------------------------------------ -ds = DataShaper() -before_input = int_inp.squeeze().data.numpy().astype(dtype_in) -before_input.tofile(log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d") -ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") -ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") - -wts1 = ds.reorder_mat(int_weight.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX") -total_wts = 
np.concatenate((wts1), axis=None) -total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") - -# ------------------------------------------------------ -# Main run loop -# ------------------------------------------------------ -for i in range(num_iter): - start = time.time_ns() - aie_output = execute(app, ifm_mem_fmt, total_wts) * int8_scale - stop = time.time_ns() - - if enable_trace: - aie_output, trace = extract_trace(aie_output, shape_out, dtype_out, trace_size) - write_out_trace(trace, trace_file) - - npu_time = stop - start - npu_time_total = npu_time_total + npu_time - -# ------------------------------------------------------ -# Reorder output data-layout -# ------------------------------------------------------ -temp_out = aie_output.reshape(32, 8, 32, 8) -temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") -ofm_mem_fmt = temp_out.reshape(64, 32, 32) -ofm_mem_fmt.tofile(log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d") -ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) - -# ------------------------------------------------------ -# Compare the AIE output and the golden reference -# ------------------------------------------------------ - -print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) - -assert np.allclose( - ofm_mem_fmt_out.detach().numpy(), - golden_output.detach().numpy(), - rtol=0, - atol=2 * int8_scale, -) -print("\nPASS!\n") + +def main(opts): + design = "conv2d" + xclbin_path = opts.xclbin + insts_path = opts.instr + + log_folder = "log/" + if not os.path.exists(log_folder): + os.makedirs(log_folder) + + num_iter = 1 + npu_time_total = 0 + npu_time_min = 9999999 + npu_time_max = 0 + trace_size = 16384 + enable_trace = False + trace_file = "log/trace_" + design + ".txt" + # ------------------------------------------------------ + # Configure this to match your design's buffer size + # ------------------------------------------------------ + dtype_in = np.dtype("int8") + dtype_wts = np.dtype("int8") + dtype_out = np.dtype("int8") + + shape_total_wts = (4096, 1) + shape_in_act = (32, 8, 32, 8) #'YCXC8' , 'CYX' + shape_in_wts1 = (8, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 + shape_out = (32, 8, 32, 8) + + # ------------------------------------------------------ + # Initialize activation, weights, scaling factor for int8 model + # ------------------------------------------------------ + int_inp = torch.randint(1, 20, (1, 64, 32, 32)).type(torch.FloatTensor) + int_weight = torch.randint(50, 80, (64, 64, 1, 1)).type(torch.FloatTensor) + conv_scale = 7.6294e-06 # scale to convert int8 output to floating point + int8_scale = 0.0078 # scale to convert int8 output to floating point + min = -128 + max = 127 + # ------------------------------------------------------ + # Get device, load the xclbin & kernel and register them + # ------------------------------------------------------ + app = setup_aie( + xclbin_path, + insts_path, + shape_in_act, + dtype_in, + shape_total_wts, + dtype_wts, + shape_out, + dtype_out, + enable_trace=enable_trace, + trace_size=trace_size, + ) + + # ------------------------------------------------------ + # Define your golden reference + # ------------------------------------------------------ + class conv2d_int_model(nn.Module): + def __init__(self, in_planes=64, planes=64): + super(conv2d_int_model, self).__init__() + self.conv = nn.Conv2d(64, 64, kernel_size=1, bias=False) + + def forward(self, x): + out_int = self.conv(x) + out_quant = out_int * conv_scale # int8 x int8 leads 
to int32 output + out_float = int8_scale * torch.clamp( + torch.round(out_quant / int8_scale), min, max + ) # converting to int8 range + return out_float + + # ------------------------------------------------------ + # Pytorch baseline + # ------------------------------------------------------ + model = conv2d_int_model() + model.eval() + model.conv.weight.data.copy_(int_weight) + + golden_output = model(int_inp) + + # ------------------------------------------------------ + # Reorder input data-layout + # ------------------------------------------------------ + ds = DataShaper() + before_input = int_inp.squeeze().data.numpy().astype(dtype_in) + before_input.tofile( + log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d" + ) + ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") + ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") + + wts1 = ds.reorder_mat(int_weight.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX") + total_wts = np.concatenate((wts1), axis=None) + total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") + + # ------------------------------------------------------ + # Main run loop + # ------------------------------------------------------ + for i in range(num_iter): + start = time.time_ns() + aie_output = execute(app, ifm_mem_fmt, total_wts) * int8_scale + stop = time.time_ns() + + if enable_trace: + aie_output, trace = extract_trace( + aie_output, shape_out, dtype_out, trace_size + ) + write_out_trace(trace, trace_file) + + npu_time = stop - start + npu_time_total = npu_time_total + npu_time + + # ------------------------------------------------------ + # Reorder output data-layout + # ------------------------------------------------------ + temp_out = aie_output.reshape(32, 8, 32, 8) + temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") + ofm_mem_fmt = temp_out.reshape(64, 32, 32) + ofm_mem_fmt.tofile( + log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d" + ) + ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) + + # ------------------------------------------------------ + # Compare the AIE output and the golden reference + # ------------------------------------------------------ + + print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) + + assert np.allclose( + ofm_mem_fmt_out.detach().numpy(), + golden_output.detach().numpy(), + rtol=0, + atol=2 * int8_scale, + ) + print("\nPASS!\n") + + +if __name__ == "__main__": + p = test_utils.create_default_argparser() + opts = p.parse_args(sys.argv[1:]) + main(opts) diff --git a/programming_examples/ml/conv2d_fused_relu/Makefile b/programming_examples/ml/conv2d_fused_relu/Makefile index f804bdd842..7c59ae4877 100755 --- a/programming_examples/ml/conv2d_fused_relu/Makefile +++ b/programming_examples/ml/conv2d_fused_relu/Makefile @@ -34,4 +34,4 @@ clean: *.log aie_partition.json *.bin BOOT.BIN _x test.exe run_py: - ${powershell} python3 test.py + ${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE diff --git a/programming_examples/ml/conv2d_fused_relu/README.md b/programming_examples/ml/conv2d_fused_relu/README.md index 68e7e9b8cf..3f4a2264cd 100644 --- a/programming_examples/ml/conv2d_fused_relu/README.md +++ b/programming_examples/ml/conv2d_fused_relu/README.md @@ -88,12 +88,5 @@ make To run the design: ``` -make run -``` - -### Prerequisites -To install the dependencies, run the following command: -``` -pip install -r requirements.txt - +make run_py ``` \ No newline at end 
of file diff --git a/programming_examples/ml/conv2d_fused_relu/run.lit b/programming_examples/ml/conv2d_fused_relu/run.lit index 0c122f451e..cfddde9013 100644 --- a/programming_examples/ml/conv2d_fused_relu/run.lit +++ b/programming_examples/ml/conv2d_fused_relu/run.lit @@ -6,5 +6,5 @@ // RUN: xchesscc_wrapper aie2 -I %aietools/include -DINT8_ACT -DBIT_WIDTH=8 -c %S/../../../aie_kernels/aie2/conv2dk1.cc -o conv2dk1.o // RUN: %python %S/aie2.py | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir -// RUN: %run_on_ipu %python %S/test.py | FileCheck %s +// RUN: %run_on_ipu %python %S/test.py -x aie.xclbin -i insts.txt -k MLIR_AIE | FileCheck %s // CHECK: PASS! \ No newline at end of file diff --git a/programming_examples/ml/conv2d_fused_relu/test.py b/programming_examples/ml/conv2d_fused_relu/test.py index 5bfe139112..6fe407faaa 100644 --- a/programming_examples/ml/conv2d_fused_relu/test.py +++ b/programming_examples/ml/conv2d_fused_relu/test.py @@ -14,138 +14,151 @@ import os import numpy as np from aie.utils.xrt import setup_aie, extract_trace, write_out_trace, execute +import aie.utils.test as test_utils torch.use_deterministic_algorithms(True) torch.manual_seed(0) -design = "conv2d_with_relu" -xclbin_path = os.path.abspath("build/final.xclbin") -insts_path = os.path.abspath("build/insts.txt") - -log_folder = "log/" -if not os.path.exists(log_folder): - os.makedirs(log_folder) - -num_iter = 1 -npu_time_total = 0 -npu_time_min = 9999999 -npu_time_max = 0 -trace_size = 16384 -enable_trace = False -trace_file = "log/trace_" + design + ".txt" -# ------------------------------------------------------ -# Configure this to match your design's buffer size -# ------------------------------------------------------ -dtype_in = np.dtype("int8") -dtype_wts = np.dtype("int8") -dtype_out = np.dtype("uint8") - -shape_total_wts = (4096, 1) -shape_in_act = (32, 8, 32, 8) #'YCXC8' , 'CYX' -shape_in_wts1 = (8, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 -shape_out = (32, 8, 32, 8) - -# ------------------------------------------------------ -# Initialize activation, weights, scaling factor for int8 model -# ------------------------------------------------------ -int_inp = torch.randint(1, 100, (1, 64, 32, 32)).type(torch.FloatTensor) -int_weight = torch.randint(50, 100, (64, 64, 1, 1)).type(torch.FloatTensor) -conv_scale = 0.0039 # scale to convert int8 output to floating point -relu_scale = 0.0078 # scale to convert int8 output to floating point -min = 0 -max = 255 - -# ------------------------------------------------------ -# Get device, load the xclbin & kernel and register them -# ------------------------------------------------------ -app = setup_aie( - xclbin_path, - insts_path, - shape_in_act, - dtype_in, - shape_total_wts, - dtype_wts, - shape_out, - dtype_out, - enable_trace=enable_trace, - trace_size=trace_size, -) - - -# ------------------------------------------------------ -# Define your golden reference -# ------------------------------------------------------ -class conv2d_relu_int_model(nn.Module): - def __init__(self, in_planes=64, planes=64): - super(conv2d_relu_int_model, self).__init__() - self.conv = nn.Conv2d(64, 64, kernel_size=1, bias=False) - self.relu = nn.ReLU() - - def forward(self, x): - out_int = self.conv(x) - out_float = out_int * conv_scale - out_int = self.relu(out_float) - out_float = relu_scale * torch.clamp( - torch.round(out_int / 
relu_scale), min, max - ) # converting to int to do proper clipping - return out_float - - -# ------------------------------------------------------ -# Pytorch baseline -# ------------------------------------------------------ -model = conv2d_relu_int_model() -model.eval() -model.conv.weight.data.copy_(int_weight) -golden_output = model(int_inp) - -# ------------------------------------------------------ -# Reorder input data-layout -# ------------------------------------------------------ -ds = DataShaper() -before_input = int_inp.squeeze().data.numpy().astype(dtype_in) -before_input.tofile(log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d") -ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") -ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") - -wts1 = ds.reorder_mat(int_weight.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX") -total_wts = np.concatenate((wts1), axis=None) -total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") - -# ------------------------------------------------------ -# Main run loop -# ------------------------------------------------------ -for i in range(num_iter): - start = time.time_ns() - aie_output = execute(app, ifm_mem_fmt, total_wts) * relu_scale - stop = time.time_ns() - - if enable_trace: - aie_output, trace = extract_trace(aie_output, shape_out, dtype_out, trace_size) - write_out_trace(trace, trace_file) - - npu_time = stop - start - npu_time_total = npu_time_total + npu_time - -# ------------------------------------------------------ -# Reorder output data-layout -# ------------------------------------------------------ -temp_out = aie_output.reshape(32, 8, 32, 8) -temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") -ofm_mem_fmt = temp_out.reshape(64, 32, 32) -ofm_mem_fmt.tofile(log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d") -ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) - -# ------------------------------------------------------ -# Compare the AIE output and the golden reference -# ------------------------------------------------------ -print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) - -assert np.allclose( - ofm_mem_fmt_out.detach().numpy(), - golden_output.detach().numpy(), - rtol=0, - atol=2 * relu_scale, -) - -print("\nPASS!\n") + +def main(opts): + design = "conv2d_with_relu" + xclbin_path = opts.xclbin + insts_path = opts.instr + + log_folder = "log/" + if not os.path.exists(log_folder): + os.makedirs(log_folder) + + num_iter = 1 + npu_time_total = 0 + npu_time_min = 9999999 + npu_time_max = 0 + trace_size = 16384 + enable_trace = False + trace_file = "log/trace_" + design + ".txt" + # ------------------------------------------------------ + # Configure this to match your design's buffer size + # ------------------------------------------------------ + dtype_in = np.dtype("int8") + dtype_wts = np.dtype("int8") + dtype_out = np.dtype("uint8") + + shape_total_wts = (4096, 1) + shape_in_act = (32, 8, 32, 8) #'YCXC8' , 'CYX' + shape_in_wts1 = (8, 8, 1, 1, 8, 8) # out,in,ky,kx,in8,out8 + shape_out = (32, 8, 32, 8) + + # ------------------------------------------------------ + # Initialize activation, weights, scaling factor for int8 model + # ------------------------------------------------------ + int_inp = torch.randint(1, 100, (1, 64, 32, 32)).type(torch.FloatTensor) + int_weight = torch.randint(50, 100, (64, 64, 1, 1)).type(torch.FloatTensor) + conv_scale = 0.0039 # scale to convert int8 output to 
floating point + relu_scale = 0.0078 # scale to convert int8 output to floating point + min = 0 + max = 255 + + # ------------------------------------------------------ + # Get device, load the xclbin & kernel and register them + # ------------------------------------------------------ + app = setup_aie( + xclbin_path, + insts_path, + shape_in_act, + dtype_in, + shape_total_wts, + dtype_wts, + shape_out, + dtype_out, + enable_trace=enable_trace, + trace_size=trace_size, + ) + + # ------------------------------------------------------ + # Define your golden reference + # ------------------------------------------------------ + class conv2d_relu_int_model(nn.Module): + def __init__(self, in_planes=64, planes=64): + super(conv2d_relu_int_model, self).__init__() + self.conv = nn.Conv2d(64, 64, kernel_size=1, bias=False) + self.relu = nn.ReLU() + + def forward(self, x): + out_int = self.conv(x) + out_float = out_int * conv_scale + out_int = self.relu(out_float) + out_float = relu_scale * torch.clamp( + torch.round(out_int / relu_scale), min, max + ) # converting to int to do proper clipping + return out_float + + # ------------------------------------------------------ + # Pytorch baseline + # ------------------------------------------------------ + model = conv2d_relu_int_model() + model.eval() + model.conv.weight.data.copy_(int_weight) + golden_output = model(int_inp) + + # ------------------------------------------------------ + # Reorder input data-layout + # ------------------------------------------------------ + ds = DataShaper() + before_input = int_inp.squeeze().data.numpy().astype(dtype_in) + before_input.tofile( + log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d" + ) + ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") + ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") + + wts1 = ds.reorder_mat(int_weight.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX") + total_wts = np.concatenate((wts1), axis=None) + total_wts.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") + + # ------------------------------------------------------ + # Main run loop + # ------------------------------------------------------ + for i in range(num_iter): + start = time.time_ns() + aie_output = execute(app, ifm_mem_fmt, total_wts) * relu_scale + stop = time.time_ns() + + if enable_trace: + aie_output, trace = extract_trace( + aie_output, shape_out, dtype_out, trace_size + ) + write_out_trace(trace, trace_file) + + npu_time = stop - start + npu_time_total = npu_time_total + npu_time + + # ------------------------------------------------------ + # Reorder output data-layout + # ------------------------------------------------------ + temp_out = aie_output.reshape(32, 8, 32, 8) + temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") + ofm_mem_fmt = temp_out.reshape(64, 32, 32) + ofm_mem_fmt.tofile( + log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d" + ) + ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) + + # ------------------------------------------------------ + # Compare the AIE output and the golden reference + # ------------------------------------------------------ + print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) + + assert np.allclose( + ofm_mem_fmt_out.detach().numpy(), + golden_output.detach().numpy(), + rtol=0, + atol=2 * relu_scale, + ) + + print("\nPASS!\n") + + +if __name__ == "__main__": + p = test_utils.create_default_argparser() + opts = 
p.parse_args(sys.argv[1:]) + main(opts) diff --git a/programming_examples/ml/eltwise_add/Makefile b/programming_examples/ml/eltwise_add/Makefile index 294fc902d8..38b1eef61f 100644 --- a/programming_examples/ml/eltwise_add/Makefile +++ b/programming_examples/ml/eltwise_add/Makefile @@ -13,7 +13,6 @@ all: build/final.xclbin targetname = myEltwiseAdd trace_size = 8192 - VPATH := ../../../aie_kernels/aie2 build/%.o: %.cc @@ -29,6 +28,10 @@ build/aie_trace.mlir: aie2.py python3 $< ${trace_size} > $@ +build/aie_trace.mlir: aie2.py + mkdir -p ${@D} + python3 $< ${trace_size} > $@ + build/final.xclbin: build/aie.mlir build/add.o mkdir -p ${@D} cd ${@D} && aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host \ diff --git a/programming_examples/ml/eltwise_mul/aie2.py b/programming_examples/ml/eltwise_mul/aie2.py index b18994e180..7a0a0670a9 100644 --- a/programming_examples/ml/eltwise_mul/aie2.py +++ b/programming_examples/ml/eltwise_mul/aie2.py @@ -160,6 +160,7 @@ def sequence(A, B, C): trace_size = 0 if (len(sys.argv) < 2) else int(sys.argv[1]) except ValueError: print("Argument is not an integer") + with mlir_mod_ctx() as ctx: my_eltwise_mul(trace_size) res = ctx.module.operation.verify() diff --git a/programming_examples/ml/relu/aie2.py b/programming_examples/ml/relu/aie2.py index bb7d1e16d9..0580a0113d 100644 --- a/programming_examples/ml/relu/aie2.py +++ b/programming_examples/ml/relu/aie2.py @@ -131,6 +131,7 @@ def sequence(A, C): trace_size = 0 if (len(sys.argv) != 2) else int(sys.argv[1]) except ValueError: print("Argument is not an integer") + with mlir_mod_ctx() as ctx: my_relu(trace_size) res = ctx.module.operation.verify() diff --git a/programming_examples/ml/resnet/README.md b/programming_examples/ml/resnet/README.md index 6382079c62..de4cc92535 100755 --- a/programming_examples/ml/resnet/README.md +++ b/programming_examples/ml/resnet/README.md @@ -107,14 +107,6 @@ To run the design: make run_py ``` -### Prerequisites - -To install the dependencies, run the following command: -``` -pip install -r requirements.txt - -``` - ## References [1] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). 
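The aie2.py hunks above all converge on the same driver shape: build the design inside `mlir_mod_ctx()`, verify the resulting module, and print either the MLIR or the diagnostic. As a reading aid only (not part of the patch), here is a minimal sketch of that pattern; `my_design` is a hypothetical stand-in for `my_reduce_min`, `my_eltwise_mul`, or `my_relu`, and the import assumes the mlir-aie Python bindings are on PYTHONPATH:

```python
# Sketch of the construct/verify/print driver used by these designs.
# `my_design` is hypothetical; requires the mlir-aie Python bindings.
from aie.extras.context import mlir_mod_ctx

def emit(my_design):
    with mlir_mod_ctx() as ctx:
        my_design()  # builds IR into the active context
        res = ctx.module.operation.verify()
        if res == True:  # verify() returns True or a diagnostic object
            print(ctx.module)  # piped into aie-opt / aiecc.py by the Makefiles
        else:
            print(res)
```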
diff --git a/programming_examples/ml/resnet/layers_conv2_x/Makefile b/programming_examples/ml/resnet/layers_conv2_x/Makefile index d8f1b7261a..6218e61fb5 100755 --- a/programming_examples/ml/resnet/layers_conv2_x/Makefile +++ b/programming_examples/ml/resnet/layers_conv2_x/Makefile @@ -44,4 +44,4 @@ clean: *.log aie_partition.json *.bin BOOT.BIN _x test.exe run_py: - ${powershell} python3 test.py + ${powershell} python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE \ No newline at end of file diff --git a/programming_examples/ml/resnet/layers_conv2_x/aie2.py b/programming_examples/ml/resnet/layers_conv2_x/aie2.py index 235b5c5308..f5243070d9 100755 --- a/programming_examples/ml/resnet/layers_conv2_x/aie2.py +++ b/programming_examples/ml/resnet/layers_conv2_x/aie2.py @@ -7,8 +7,8 @@ from aie.dialects.aie import * from aie.dialects.aiex import * +from aie.dialects.scf import * from aie.extras.dialects.ext import memref, arith -from aie.dialects.scf import for_, yield_ from aie.extras.context import mlir_mod_ctx from aie.ir import MemRefType, TypeAttr diff --git a/programming_examples/ml/resnet/layers_conv2_x/run.lit b/programming_examples/ml/resnet/layers_conv2_x/run.lit index 61f43e45e6..6496daafe7 100755 --- a/programming_examples/ml/resnet/layers_conv2_x/run.lit +++ b/programming_examples/ml/resnet/layers_conv2_x/run.lit @@ -1,7 +1,7 @@ // (c) Copyright 2024 Advanced Micro Devices, Inc. // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception // -// REQUIRES: ryzen_ai, chess, torch +// REQUIRES: ryzen_ai, chess, torch, dontrun // // RUN: xchesscc_wrapper aie2 -I %aietools/include -DBIT_WIDTH=8 -DINT8_ACT -c %S/../../../../aie_kernels/aie2/conv2dk1.cc -o conv2dk1_i8.o // RUN: xchesscc_wrapper aie2 -I %aietools/include -DBIT_WIDTH=8 -DUINT8_ACT -c %S/../../../../aie_kernels/aie2/conv2dk3.cc -o conv2dk3.o @@ -10,5 +10,5 @@ // RUN: xchesscc_wrapper aie2 -I %aietools/include -DBIT_WIDTH=8 -DSCALAR -DUINT8_ACT -c %S/../../../../aie_kernels/aie2/conv2dk1_skip.cc -o conv2dk1_skip.o // RUN: %python %S/aie2.py | aie-opt -cse -canonicalize -o ./aie.mlir // RUN: %python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu --no-compile-host --xclbin-name=aie.xclbin --ipu-insts-name=insts.txt ./aie.mlir -// RUN: %run_on_ipu %python %S/test.py | FileCheck %s -// CHECK: PASS! \ No newline at end of file +// RUN: %run_on_ipu %python %S/test.py -x aie.xclbin -i insts.txt -k MLIR_AIE | FileCheck %s +// CHECK: PASS! 
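The test.py rewrite that follows applies the same harness refactor already seen in the bottleneck, conv2d, and conv2d_fused_relu tests: the hard-coded build/final.xclbin and build/insts.txt paths move behind a `main(opts)` entry point fed by `aie.utils.test.create_default_argparser()`, matching the Makefile invocation `python3 test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE`. A minimal sketch of that shape, distilled from the hunks (the body comment stands in for the design-specific setup/run/compare code):

```python
# Harness shape shared by the refactored test.py files in this patch.
import sys
import aie.utils.test as test_utils

def main(opts):
    xclbin_path = opts.xclbin  # from -x; previously a hard-coded abspath
    insts_path = opts.instr    # from -i; previously a hard-coded abspath
    # ... design-specific setup_aie(...), execute(...), golden comparison ...

if __name__ == "__main__":
    p = test_utils.create_default_argparser()
    opts = p.parse_args(sys.argv[1:])
    main(opts)
```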
diff --git a/programming_examples/ml/resnet/layers_conv2_x/test.py b/programming_examples/ml/resnet/layers_conv2_x/test.py index 02dc01b127..48b45b99ae 100755 --- a/programming_examples/ml/resnet/layers_conv2_x/test.py +++ b/programming_examples/ml/resnet/layers_conv2_x/test.py @@ -14,423 +14,473 @@ import os import numpy as np from aie.utils.xrt import setup_aie, extract_trace, write_out_trace, execute +import aie.utils.test as test_utils torch.use_deterministic_algorithms(True) torch.manual_seed(0) -design = "resnet_conv2_x_int8" -xclbin_path = os.path.abspath("build/final.xclbin") -insts_path = os.path.abspath("build/insts.txt") - -log_folder = "log/" -if not os.path.exists(log_folder): - os.makedirs(log_folder) - -num_iter = 1 -npu_time_total = 0 -npu_time_min = 9999999 -npu_time_max = 0 -trace_size = 16384 -enable_trace = False -trace_file = "log/trace_" + design + ".txt" -# ------------------------------------------------------ -# Configure this to match your design's buffer size -# ------------------------------------------------------ -dtype_in = np.dtype("int8") -dtype_wts = np.dtype("int8") -dtype_out = np.dtype("uint8") - -shape_in_act = (32, 8, 32, 8) -shape_total_wts = (212992, 1) -shape_out = (32, 32, 32, 8) - -# ------------------------------------------------------ -# Initialize activation, weights, scaling factor for int8 model -# ------------------------------------------------------ -int_inp = torch.randint(1, 10, (1, 64, 32, 32)).type(torch.FloatTensor) -block_0_int_weight_1 = torch.randint(10, 20, (64, 64, 1, 1)).type(torch.FloatTensor) -block_0_int_weight_2 = torch.randint(10, 20, (64, 64, 3, 3)).type(torch.FloatTensor) -block_0_int_weight_3 = torch.randint(10, 20, (256, 64, 1, 1)).type(torch.FloatTensor) -block_0_int_weight_skip = torch.randint(10, 20, (256, 64, 1, 1)).type(torch.FloatTensor) - -block_1_int_weight_1 = torch.randint(20, 30, (64, 256, 1, 1)).type(torch.FloatTensor) -block_1_int_weight_2 = torch.randint(20, 30, (64, 64, 3, 3)).type(torch.FloatTensor) -block_1_int_weight_3 = torch.randint(20, 30, (256, 64, 1, 1)).type(torch.FloatTensor) - -block_2_int_weight_1 = torch.randint(30, 40, (64, 256, 1, 1)).type(torch.FloatTensor) -block_2_int_weight_2 = torch.randint(30, 40, (64, 64, 3, 3)).type(torch.FloatTensor) -block_2_int_weight_3 = torch.randint(30, 40, (256, 64, 1, 1)).type(torch.FloatTensor) - -init_scale = 0.5 -block_0_relu_1 = 0.5 -block_0_relu_2 = 0.5 -block_0_relu_3 = 0.5 - -block_0_weight_scale1 = 0.5 -block_0_weight_scale2 = 0.5 -block_0_weight_scale3 = 0.5 -block_0_weight_scale_skip = 0.5 - -block_1_relu_1 = 0.5 -block_1_relu_2 = 0.5 -block_1_relu_3 = 0.5 - -block_1_weight_scale1 = 0.5 -block_1_weight_scale2 = 0.5 -block_1_weight_scale3 = 0.5 -block_1_quant_add_1 = 0.5 - -block_2_relu_1 = 0.5 -block_2_relu_2 = 0.5 -block_2_relu_3 = 0.5 - -block_2_weight_scale1 = 0.5 -block_2_weight_scale2 = 0.5 -block_2_weight_scale3 = 0.5 -block_2_quant_add_1 = 0.5 - -block_0_combined_scale1 = -math.log2( - init_scale * block_0_weight_scale1 / block_0_relu_1 -) # RHS after first conv1x1 | clip 0-->255 -block_0_combined_scale2 = -math.log2( - block_0_relu_1 * block_0_weight_scale2 / block_0_relu_2 -) # RHS after second conv3x3 | clip 0-->255 -block_0_combined_scale3 = -math.log2( - block_0_relu_2 * block_0_weight_scale3 / init_scale -) # RHS after third conv1x1 | clip -128-->+127 -block_0_combined_scale_skip = -math.log2( - init_scale * block_0_weight_scale_skip / init_scale -) # LHS after conv1x1 | clip -128-->+127 -block_0_combined_scale4 = -math.log2( - 
init_scale / block_0_relu_3 -) # After addition | clip 0-->255 - -block_1_combined_scale1 = -math.log2( - block_0_relu_3 * block_1_weight_scale1 / block_1_relu_1 -) # RHS after first conv1x1 | clip 0-->255 -block_1_combined_scale2 = -math.log2( - block_1_relu_1 * block_1_weight_scale2 / block_1_relu_2 -) # RHS after second conv3x3 | clip 0-->255 -block_1_combined_scale3 = -math.log2( - block_1_relu_2 * block_1_weight_scale3 / block_1_quant_add_1 -) # RHS after third conv1x1 | clip -128-->+127 -block_1_combined_scale4 = -math.log2( - block_1_quant_add_1 / block_1_relu_3 -) # After addition | clip 0-->255 - -block_2_combined_scale1 = -math.log2( - block_1_relu_3 * block_2_weight_scale1 / block_2_relu_1 -) # RHS after first conv1x1 | clip 0-->255 -block_2_combined_scale2 = -math.log2( - block_2_relu_1 * block_2_weight_scale2 / block_2_relu_2 -) # RHS after second conv3x3 | clip 0-->255 -block_2_combined_scale3 = -math.log2( - block_2_relu_2 * block_2_weight_scale3 / block_2_quant_add_1 -) # RHS after third conv1x1 | clip -128-->+127 -block_2_combined_scale4 = -math.log2( - block_2_quant_add_1 / block_2_relu_3 -) # After addition | clip 0-->255 - -min = 0 -max = 255 - -# ------------------------------------------------------ -# Get device, load the xclbin & kernel and register them -# ------------------------------------------------------ -app = setup_aie( - xclbin_path, - insts_path, - shape_in_act, - dtype_in, - shape_total_wts, - dtype_wts, - shape_out, - dtype_out, - enable_trace=enable_trace, - trace_size=trace_size, -) - - -# ------------------------------------------------------ -# Define your golden reference -# ------------------------------------------------------ -class resnet_conv2_x_int8(nn.Module): - expansion = 4 - - def __init__(self, in_planes=64, planes=64): - super(resnet_conv2_x_int8, self).__init__() - - self.shortcut = nn.Conv2d( - in_planes, self.expansion * planes, kernel_size=1, bias=False - ) - # Bottleneck 0 - self.block_0_conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False) - self.block_0_conv2 = nn.Conv2d( - planes, planes, kernel_size=3, padding=1, padding_mode="zeros", bias=False - ) - self.block_0_conv3 = nn.Conv2d( - planes, self.expansion * planes, kernel_size=1, bias=False - ) - - self.block_0_relu1 = nn.ReLU() - self.block_0_relu2 = nn.ReLU() - self.block_0_relu3 = nn.ReLU() - - # Bottleneck 1 - self.block_1_conv1 = nn.Conv2d( - self.expansion * planes, planes, kernel_size=1, bias=False - ) - self.block_1_conv2 = nn.Conv2d( - planes, planes, kernel_size=3, padding=1, padding_mode="zeros", bias=False - ) - self.block_1_conv3 = nn.Conv2d( - planes, self.expansion * planes, kernel_size=1, bias=False - ) - - self.block_1_relu1 = nn.ReLU() - self.block_1_relu2 = nn.ReLU() - self.block_1_relu3 = nn.ReLU() - - # Bottleneck 2 - self.block_2_conv1 = nn.Conv2d( - self.expansion * planes, planes, kernel_size=1, bias=False - ) - self.block_2_conv2 = nn.Conv2d( - planes, planes, kernel_size=3, padding=1, padding_mode="zeros", bias=False - ) - self.block_2_conv3 = nn.Conv2d( - planes, self.expansion * planes, kernel_size=1, bias=False - ) - - self.block_2_relu1 = nn.ReLU() - self.block_2_relu2 = nn.ReLU() - self.block_2_relu3 = nn.ReLU() - - def forward(self, x): - # **************** Bottleneck 0 **************** - block_0_conv1_out = self.block_0_conv1(x) * init_scale * block_0_weight_scale1 - block_0_relu1_out = torch.clamp( - torch.round(self.block_0_relu1(block_0_conv1_out) / block_0_relu_1), - min, - max, - ) # convert to int and apply relu - 
block_0_conv2_out = ( - self.block_0_conv2(block_0_relu1_out) - * block_0_relu_1 - * block_0_weight_scale2 - ) - block_0_relu2_out = torch.clamp( - torch.round(self.block_0_relu2(block_0_conv2_out) / block_0_relu_2), - min, - max, - ) - block_0_conv3_out = ( - self.block_0_conv3(block_0_relu2_out) - * block_0_relu_2 - * block_0_weight_scale3 - ) - block_0_rhf_same_scale = torch.clamp( - torch.round(block_0_conv3_out / init_scale), -128, 127 - ) - - block_0_lhs_conv = self.shortcut(x) * init_scale * block_0_weight_scale_skip - block_0_lhs_same_scale = torch.clamp( - torch.round(block_0_lhs_conv / init_scale), -128, 127 - ) - # convert to int and apply relu - - block_0_skip_add = init_scale * ( - block_0_rhf_same_scale + block_0_lhs_same_scale - ) - block_0_final_out = torch.clamp( - torch.round(self.block_0_relu3(block_0_skip_add) / block_0_relu_3), min, max - ) - # **************** Bottleneck 1 **************** - block_1_conv1_out = ( - self.block_1_conv1(block_0_final_out) - * block_0_relu_3 - * block_1_weight_scale1 - ) - block_1_relu1_out = torch.clamp( - torch.round(self.block_1_relu1(block_1_conv1_out) / block_1_relu_1), - min, - max, - ) # convert to int and apply relu - block_1_conv2_out = ( - self.block_1_conv2(block_1_relu1_out) - * block_1_relu_1 - * block_1_weight_scale2 - ) - block_1_relu2_out = torch.clamp( - torch.round(self.block_1_relu2(block_1_conv2_out) / block_1_relu_2), - min, - max, - ) - block_1_conv3_out = ( - self.block_1_conv3(block_1_relu2_out) - * block_1_relu_2 - * block_1_weight_scale3 - ) - block_1_rhf_same_scale = torch.clamp( - torch.round(block_1_conv3_out / block_0_relu_3), -128, 127 - ) - - block_1_skip_add = block_0_relu_3 * (block_1_rhf_same_scale + block_0_final_out) - block_1_final_out = torch.clamp( - torch.round(self.block_1_relu3(block_1_skip_add) / block_1_relu_3), min, max - ) - - # **************** Bottleneck 2 **************** - block_2_conv1_out = ( - self.block_2_conv1(block_1_final_out) - * block_1_relu_3 - * block_2_weight_scale1 - ) - block_2_relu1_out = torch.clamp( - torch.round(self.block_2_relu1(block_2_conv1_out) / block_2_relu_1), - min, - max, - ) # convert to int and apply relu - block_2_conv2_out = ( - self.block_2_conv2(block_2_relu1_out) - * block_2_relu_1 - * block_2_weight_scale2 - ) - block_2_relu2_out = torch.clamp( - torch.round(self.block_2_relu2(block_2_conv2_out) / block_2_relu_2), - min, - max, - ) - block_2_conv3_out = ( - self.block_2_conv3(block_2_relu2_out) - * block_2_relu_2 - * block_2_weight_scale3 - ) - block_2_rhf_same_scale = torch.clamp( - torch.round(block_2_conv3_out / block_1_relu_3), -128, 127 - ) - - block_2_skip_add = block_1_relu_3 * (block_2_rhf_same_scale + block_1_final_out) - block_2_final_out = block_2_relu_3 * ( - torch.clamp( - torch.round(self.block_2_relu3(block_2_skip_add) / block_2_relu_3), + +def main(opts): + design = "resnet_conv2_x_int8" + xclbin_path = opts.xclbin + insts_path = opts.instr + + log_folder = "log/" + if not os.path.exists(log_folder): + os.makedirs(log_folder) + + num_iter = 1 + npu_time_total = 0 + npu_time_min = 9999999 + npu_time_max = 0 + trace_size = 16384 + enable_trace = False + trace_file = "log/trace_" + design + ".txt" + # ------------------------------------------------------ + # Configure this to match your design's buffer size + # ------------------------------------------------------ + dtype_in = np.dtype("int8") + dtype_wts = np.dtype("int8") + dtype_out = np.dtype("uint8") + + shape_in_act = (32, 8, 32, 8) + shape_total_wts = (212992, 1) + shape_out = (32, 
32, 32, 8) + + # ------------------------------------------------------ + # Initialize activation, weights, scaling factor for int8 model + # ------------------------------------------------------ + int_inp = torch.randint(1, 10, (1, 64, 32, 32)).type(torch.FloatTensor) + block_0_int_weight_1 = torch.randint(10, 20, (64, 64, 1, 1)).type(torch.FloatTensor) + block_0_int_weight_2 = torch.randint(10, 20, (64, 64, 3, 3)).type(torch.FloatTensor) + block_0_int_weight_3 = torch.randint(10, 20, (256, 64, 1, 1)).type( + torch.FloatTensor + ) + block_0_int_weight_skip = torch.randint(10, 20, (256, 64, 1, 1)).type( + torch.FloatTensor + ) + + block_1_int_weight_1 = torch.randint(20, 30, (64, 256, 1, 1)).type( + torch.FloatTensor + ) + block_1_int_weight_2 = torch.randint(20, 30, (64, 64, 3, 3)).type(torch.FloatTensor) + block_1_int_weight_3 = torch.randint(20, 30, (256, 64, 1, 1)).type( + torch.FloatTensor + ) + + block_2_int_weight_1 = torch.randint(30, 40, (64, 256, 1, 1)).type( + torch.FloatTensor + ) + block_2_int_weight_2 = torch.randint(30, 40, (64, 64, 3, 3)).type(torch.FloatTensor) + block_2_int_weight_3 = torch.randint(30, 40, (256, 64, 1, 1)).type( + torch.FloatTensor + ) + + init_scale = 0.5 + block_0_relu_1 = 0.5 + block_0_relu_2 = 0.5 + block_0_relu_3 = 0.5 + + block_0_weight_scale1 = 0.5 + block_0_weight_scale2 = 0.5 + block_0_weight_scale3 = 0.5 + block_0_weight_scale_skip = 0.5 + + block_1_relu_1 = 0.5 + block_1_relu_2 = 0.5 + block_1_relu_3 = 0.5 + + block_1_weight_scale1 = 0.5 + block_1_weight_scale2 = 0.5 + block_1_weight_scale3 = 0.5 + block_1_quant_add_1 = 0.5 + + block_2_relu_1 = 0.5 + block_2_relu_2 = 0.5 + block_2_relu_3 = 0.5 + + block_2_weight_scale1 = 0.5 + block_2_weight_scale2 = 0.5 + block_2_weight_scale3 = 0.5 + block_2_quant_add_1 = 0.5 + + block_0_combined_scale1 = -math.log2( + init_scale * block_0_weight_scale1 / block_0_relu_1 + ) # RHS after first conv1x1 | clip 0-->255 + block_0_combined_scale2 = -math.log2( + block_0_relu_1 * block_0_weight_scale2 / block_0_relu_2 + ) # RHS after second conv3x3 | clip 0-->255 + block_0_combined_scale3 = -math.log2( + block_0_relu_2 * block_0_weight_scale3 / init_scale + ) # RHS after third conv1x1 | clip -128-->+127 + block_0_combined_scale_skip = -math.log2( + init_scale * block_0_weight_scale_skip / init_scale + ) # LHS after conv1x1 | clip -128-->+127 + block_0_combined_scale4 = -math.log2( + init_scale / block_0_relu_3 + ) # After addition | clip 0-->255 + + block_1_combined_scale1 = -math.log2( + block_0_relu_3 * block_1_weight_scale1 / block_1_relu_1 + ) # RHS after first conv1x1 | clip 0-->255 + block_1_combined_scale2 = -math.log2( + block_1_relu_1 * block_1_weight_scale2 / block_1_relu_2 + ) # RHS after second conv3x3 | clip 0-->255 + block_1_combined_scale3 = -math.log2( + block_1_relu_2 * block_1_weight_scale3 / block_1_quant_add_1 + ) # RHS after third conv1x1 | clip -128-->+127 + block_1_combined_scale4 = -math.log2( + block_1_quant_add_1 / block_1_relu_3 + ) # After addition | clip 0-->255 + + block_2_combined_scale1 = -math.log2( + block_1_relu_3 * block_2_weight_scale1 / block_2_relu_1 + ) # RHS after first conv1x1 | clip 0-->255 + block_2_combined_scale2 = -math.log2( + block_2_relu_1 * block_2_weight_scale2 / block_2_relu_2 + ) # RHS after second conv3x3 | clip 0-->255 + block_2_combined_scale3 = -math.log2( + block_2_relu_2 * block_2_weight_scale3 / block_2_quant_add_1 + ) # RHS after third conv1x1 | clip -128-->+127 + block_2_combined_scale4 = -math.log2( + block_2_quant_add_1 / block_2_relu_3 + ) # After 
addition | clip 0-->255 + + min = 0 + max = 255 + + # ------------------------------------------------------ + # Get device, load the xclbin & kernel and register them + # ------------------------------------------------------ + app = setup_aie( + xclbin_path, + insts_path, + shape_in_act, + dtype_in, + shape_total_wts, + dtype_wts, + shape_out, + dtype_out, + enable_trace=enable_trace, + trace_size=trace_size, + ) + + # ------------------------------------------------------ + # Define your golden reference + # ------------------------------------------------------ + class resnet_conv2_x_int8(nn.Module): + expansion = 4 + + def __init__(self, in_planes=64, planes=64): + super(resnet_conv2_x_int8, self).__init__() + + self.shortcut = nn.Conv2d( + in_planes, self.expansion * planes, kernel_size=1, bias=False + ) + # Bottleneck 0 + self.block_0_conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False) + self.block_0_conv2 = nn.Conv2d( + planes, + planes, + kernel_size=3, + padding=1, + padding_mode="zeros", + bias=False, + ) + self.block_0_conv3 = nn.Conv2d( + planes, self.expansion * planes, kernel_size=1, bias=False + ) + + self.block_0_relu1 = nn.ReLU() + self.block_0_relu2 = nn.ReLU() + self.block_0_relu3 = nn.ReLU() + + # Bottleneck 1 + self.block_1_conv1 = nn.Conv2d( + self.expansion * planes, planes, kernel_size=1, bias=False + ) + self.block_1_conv2 = nn.Conv2d( + planes, + planes, + kernel_size=3, + padding=1, + padding_mode="zeros", + bias=False, + ) + self.block_1_conv3 = nn.Conv2d( + planes, self.expansion * planes, kernel_size=1, bias=False + ) + + self.block_1_relu1 = nn.ReLU() + self.block_1_relu2 = nn.ReLU() + self.block_1_relu3 = nn.ReLU() + + # Bottleneck 2 + self.block_2_conv1 = nn.Conv2d( + self.expansion * planes, planes, kernel_size=1, bias=False + ) + self.block_2_conv2 = nn.Conv2d( + planes, + planes, + kernel_size=3, + padding=1, + padding_mode="zeros", + bias=False, + ) + self.block_2_conv3 = nn.Conv2d( + planes, self.expansion * planes, kernel_size=1, bias=False + ) + + self.block_2_relu1 = nn.ReLU() + self.block_2_relu2 = nn.ReLU() + self.block_2_relu3 = nn.ReLU() + + def forward(self, x): + # **************** Bottleneck 0 **************** + block_0_conv1_out = ( + self.block_0_conv1(x) * init_scale * block_0_weight_scale1 + ) + block_0_relu1_out = torch.clamp( + torch.round(self.block_0_relu1(block_0_conv1_out) / block_0_relu_1), + min, + max, + ) # convert to int and apply relu + block_0_conv2_out = ( + self.block_0_conv2(block_0_relu1_out) + * block_0_relu_1 + * block_0_weight_scale2 + ) + block_0_relu2_out = torch.clamp( + torch.round(self.block_0_relu2(block_0_conv2_out) / block_0_relu_2), + min, + max, + ) + block_0_conv3_out = ( + self.block_0_conv3(block_0_relu2_out) + * block_0_relu_2 + * block_0_weight_scale3 + ) + block_0_rhf_same_scale = torch.clamp( + torch.round(block_0_conv3_out / init_scale), -128, 127 + ) + + block_0_lhs_conv = self.shortcut(x) * init_scale * block_0_weight_scale_skip + block_0_lhs_same_scale = torch.clamp( + torch.round(block_0_lhs_conv / init_scale), -128, 127 + ) + # convert to int and apply relu + + block_0_skip_add = init_scale * ( + block_0_rhf_same_scale + block_0_lhs_same_scale + ) + block_0_final_out = torch.clamp( + torch.round(self.block_0_relu3(block_0_skip_add) / block_0_relu_3), min, max, ) - ) - return block_2_final_out - - -# ------------------------------------------------------ -# Pytorch baseline -# ------------------------------------------------------ -model = resnet_conv2_x_int8() -model.eval() 
-model.block_0_conv1.weight.data.copy_(block_0_int_weight_1) -model.block_0_conv2.weight.data.copy_(block_0_int_weight_2) -model.block_0_conv3.weight.data.copy_(block_0_int_weight_3) -model.shortcut.weight.data.copy_(block_0_int_weight_skip) - -model.block_1_conv1.weight.data.copy_(block_1_int_weight_1) -model.block_1_conv2.weight.data.copy_(block_1_int_weight_2) -model.block_1_conv3.weight.data.copy_(block_1_int_weight_3) - -model.block_2_conv1.weight.data.copy_(block_2_int_weight_1) -model.block_2_conv2.weight.data.copy_(block_2_int_weight_2) -model.block_2_conv3.weight.data.copy_(block_2_int_weight_3) - -golden_output = model(int_inp) - -# ------------------------------------------------------ -# Reorder input data-layout -# ------------------------------------------------------ -ds = DataShaper() -before_input = int_inp.squeeze().data.numpy().astype(dtype_in) -before_input.tofile(log_folder + "/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d") -ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") -ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") - -block0_wts1 = ds.reorder_mat( - block_0_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block0_wts2 = ds.reorder_mat( - block_0_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block0_wts3 = ds.reorder_mat( - block_0_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block0_wts_skip = ds.reorder_mat( - block_0_int_weight_skip.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) - -total_wts = np.concatenate( - (block0_wts1, block0_wts2, block0_wts3, block0_wts_skip), axis=None -) - -block1_wts1 = ds.reorder_mat( - block_1_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block1_wts2 = ds.reorder_mat( - block_1_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block1_wts3 = ds.reorder_mat( - block_1_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) - -total_wts2 = np.concatenate( - (total_wts, block1_wts1, block1_wts2, block1_wts3), axis=None -) - -block2_wts1 = ds.reorder_mat( - block_2_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block2_wts2 = ds.reorder_mat( - block_2_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) -block2_wts3 = ds.reorder_mat( - block_2_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" -) - -total_wts3 = np.concatenate( - (total_wts2, block2_wts1, block2_wts2, block2_wts3), axis=None -) - -total_wts3.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") - -# ------------------------------------------------------ -# Main run loop -# ------------------------------------------------------ -for i in range(num_iter): - start = time.time_ns() - aie_output = execute(app, ifm_mem_fmt, total_wts) * block_2_relu_3 - stop = time.time_ns() - - if enable_trace: - aie_output, trace = extract_trace(aie_output, shape_out, dtype_out, trace_size) - write_out_trace(trace, trace_file) - - npu_time = stop - start - npu_time_total = npu_time_total + npu_time - -# ------------------------------------------------------ -# Reorder output data-layout -# ------------------------------------------------------ -temp_out = aie_output.reshape(32, 32, 32, 8) -temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") -ofm_mem_fmt = temp_out.reshape(256, 32, 32) -ofm_mem_fmt.tofile(log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d") -ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) - -# 
------------------------------------------------------ -# Compare the AIE output and the golden reference -# ------------------------------------------------------ -print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) - -assert np.allclose( - ofm_mem_fmt_out.detach().numpy(), - golden_output.detach().numpy(), - rtol=0, - atol=block_2_relu_3, -) - -print("\nPASS!\n") + # **************** Bottleneck 1 **************** + block_1_conv1_out = ( + self.block_1_conv1(block_0_final_out) + * block_0_relu_3 + * block_1_weight_scale1 + ) + block_1_relu1_out = torch.clamp( + torch.round(self.block_1_relu1(block_1_conv1_out) / block_1_relu_1), + min, + max, + ) # convert to int and apply relu + block_1_conv2_out = ( + self.block_1_conv2(block_1_relu1_out) + * block_1_relu_1 + * block_1_weight_scale2 + ) + block_1_relu2_out = torch.clamp( + torch.round(self.block_1_relu2(block_1_conv2_out) / block_1_relu_2), + min, + max, + ) + block_1_conv3_out = ( + self.block_1_conv3(block_1_relu2_out) + * block_1_relu_2 + * block_1_weight_scale3 + ) + block_1_rhf_same_scale = torch.clamp( + torch.round(block_1_conv3_out / block_0_relu_3), -128, 127 + ) + + block_1_skip_add = block_0_relu_3 * ( + block_1_rhf_same_scale + block_0_final_out + ) + block_1_final_out = torch.clamp( + torch.round(self.block_1_relu3(block_1_skip_add) / block_1_relu_3), + min, + max, + ) + + # **************** Bottleneck 2 **************** + block_2_conv1_out = ( + self.block_2_conv1(block_1_final_out) + * block_1_relu_3 + * block_2_weight_scale1 + ) + block_2_relu1_out = torch.clamp( + torch.round(self.block_2_relu1(block_2_conv1_out) / block_2_relu_1), + min, + max, + ) # convert to int and apply relu + block_2_conv2_out = ( + self.block_2_conv2(block_2_relu1_out) + * block_2_relu_1 + * block_2_weight_scale2 + ) + block_2_relu2_out = torch.clamp( + torch.round(self.block_2_relu2(block_2_conv2_out) / block_2_relu_2), + min, + max, + ) + block_2_conv3_out = ( + self.block_2_conv3(block_2_relu2_out) + * block_2_relu_2 + * block_2_weight_scale3 + ) + block_2_rhf_same_scale = torch.clamp( + torch.round(block_2_conv3_out / block_1_relu_3), -128, 127 + ) + + block_2_skip_add = block_1_relu_3 * ( + block_2_rhf_same_scale + block_1_final_out + ) + block_2_final_out = block_2_relu_3 * ( + torch.clamp( + torch.round(self.block_2_relu3(block_2_skip_add) / block_2_relu_3), + min, + max, + ) + ) + return block_2_final_out + + # ------------------------------------------------------ + # Pytorch baseline + # ------------------------------------------------------ + model = resnet_conv2_x_int8() + model.eval() + model.block_0_conv1.weight.data.copy_(block_0_int_weight_1) + model.block_0_conv2.weight.data.copy_(block_0_int_weight_2) + model.block_0_conv3.weight.data.copy_(block_0_int_weight_3) + model.shortcut.weight.data.copy_(block_0_int_weight_skip) + + model.block_1_conv1.weight.data.copy_(block_1_int_weight_1) + model.block_1_conv2.weight.data.copy_(block_1_int_weight_2) + model.block_1_conv3.weight.data.copy_(block_1_int_weight_3) + + model.block_2_conv1.weight.data.copy_(block_2_int_weight_1) + model.block_2_conv2.weight.data.copy_(block_2_int_weight_2) + model.block_2_conv3.weight.data.copy_(block_2_int_weight_3) + + golden_output = model(int_inp) + + # ------------------------------------------------------ + # Reorder input data-layout + # ------------------------------------------------------ + ds = DataShaper() + before_input = int_inp.squeeze().data.numpy().astype(dtype_in) + before_input.tofile( + log_folder + 
"/before_ifm_mem_fmt_1x1.txt", sep=",", format="%d" + ) + ifm_mem_fmt = ds.reorder_mat(before_input, "YCXC8", "CYX") + ifm_mem_fmt.tofile(log_folder + "/after_ifm_mem_fmt_1x1.txt", sep=",", format="%d") + + block0_wts1 = ds.reorder_mat( + block_0_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block0_wts2 = ds.reorder_mat( + block_0_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block0_wts3 = ds.reorder_mat( + block_0_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block0_wts_skip = ds.reorder_mat( + block_0_int_weight_skip.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + + total_wts = np.concatenate( + (block0_wts1, block0_wts2, block0_wts3, block0_wts_skip), axis=None + ) + + block1_wts1 = ds.reorder_mat( + block_1_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block1_wts2 = ds.reorder_mat( + block_1_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block1_wts3 = ds.reorder_mat( + block_1_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + + total_wts2 = np.concatenate( + (total_wts, block1_wts1, block1_wts2, block1_wts3), axis=None + ) + + block2_wts1 = ds.reorder_mat( + block_2_int_weight_1.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block2_wts2 = ds.reorder_mat( + block_2_int_weight_2.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + block2_wts3 = ds.reorder_mat( + block_2_int_weight_3.data.numpy().astype(dtype_wts), "OIYXI8O8", "OIYX" + ) + + total_wts3 = np.concatenate( + (total_wts2, block2_wts1, block2_wts2, block2_wts3), axis=None + ) + + total_wts3.tofile(log_folder + "/weights_mem_fmt_final.txt", sep=",", format="%d") + + # ------------------------------------------------------ + # Main run loop + # ------------------------------------------------------ + for i in range(num_iter): + start = time.time_ns() + aie_output = execute(app, ifm_mem_fmt, total_wts) * block_2_relu_3 + stop = time.time_ns() + + if enable_trace: + aie_output, trace = extract_trace( + aie_output, shape_out, dtype_out, trace_size + ) + write_out_trace(trace, trace_file) + + npu_time = stop - start + npu_time_total = npu_time_total + npu_time + + # ------------------------------------------------------ + # Reorder output data-layout + # ------------------------------------------------------ + temp_out = aie_output.reshape(32, 32, 32, 8) + temp_out = ds.reorder_mat(temp_out, "CDYX", "YCXD") + ofm_mem_fmt = temp_out.reshape(256, 32, 32) + ofm_mem_fmt.tofile( + log_folder + "/after_ofm_mem_fmt_final.txt", sep=",", format="%d" + ) + ofm_mem_fmt_out = torch.from_numpy(ofm_mem_fmt).unsqueeze(0) + + # ------------------------------------------------------ + # Compare the AIE output and the golden reference + # ------------------------------------------------------ + print("\nAvg NPU time: {}us.".format(int((npu_time_total / num_iter) / 1000))) + + assert np.allclose( + ofm_mem_fmt_out.detach().numpy(), + golden_output.detach().numpy(), + rtol=0, + atol=block_2_relu_3, + ) + + print("\nPASS!\n") + + +if __name__ == "__main__": + p = test_utils.create_default_argparser() + opts = p.parse_args(sys.argv[1:]) + main(opts) diff --git a/programming_examples/vision/color_detect/Makefile b/programming_examples/vision/color_detect/Makefile index c8feea4cb6..9376fcd770 100755 --- a/programming_examples/vision/color_detect/Makefile +++ b/programming_examples/vision/color_detect/Makefile @@ -42,7 +42,7 @@ build/final_${COLORDETECT_WIDTH}.xclbin: 
build/aie2_lineBased_8b_${COLORDETECT_W cd ${@D} && aiecc.py --aie-generate-cdo --aie-generate-ipu --no-compile-host \ --xclbin-name=${@F} --ipu-insts-name=insts.txt $(<:%=../%) -build/${targetname}.exe: test.cpp +${targetname}.exe: test.cpp mkdir -p ${@D} rm -rf _build mkdir -p _build @@ -55,7 +55,7 @@ else cp _build/${targetname} $@ endif -run: build/${targetname}.exe build/final_${COLORDETECT_WIDTH}.xclbin build/insts.txt +run: ${targetname}.exe build/final_${COLORDETECT_WIDTH}.xclbin build/insts.txt ${powershell} ./$< -x build/final_${COLORDETECT_WIDTH}.xclbin -i build/insts.txt -k MLIR_AIE clean: diff --git a/programming_examples/vision/vision_passthrough/README.md b/programming_examples/vision/vision_passthrough/README.md index 31d4add65f..ebb86bc0f2 100644 --- a/programming_examples/vision/vision_passthrough/README.md +++ b/programming_examples/vision/vision_passthrough/README.md @@ -15,7 +15,7 @@ Single tile applies a pass through kernel on data from local memory. There are t To compile the design on Windows: ``` make -make build/passThrough.exe +make passThrough.exe ``` To run the design: diff --git a/programming_guide/README.md b/programming_guide/README.md index 87f6414e5a..0d9e19a9b8 100644 --- a/programming_guide/README.md +++ b/programming_guide/README.md @@ -16,7 +16,7 @@ The AI Engine (AIE) array is a spatial compute architecture: a modular and scala Programming the AIE-array configures all its spatial building blocks: the compute cores' program memory, the data movers' buffer descriptors, interconnect with switches, etc. This guide introduces our Interface Representation for hands-ON (IRON) close-to-metal programming of the AIE-array. IRON is an open-access toolkit enabling performance engineers to build fast and efficient, often specialized designs through a set of Python language bindings around mlir-aie, our MLIR-based representation of the AIE-array. mlir-aie provides the foundation from which complex and performant AI Engine designs can be defined and is supported by simulation and hardware implementation infrastructure. -> **NOTE:** For those interested in better understanding how AI Engine designs are defined at the MLIR level, take a look through the [MLIR tutorial](../tutorials/) material. mlir-aie also serves as a lower layer for other higher-level abstraction MLIR layers such as [mlir-air](https://github.com/Xilinx/mlir-air). +> **NOTE:** For those interested in better understanding how AI Engine designs are defined at the MLIR level, take a look through the [MLIR tutorial](../mlir_tutorials/) material. mlir-aie also serves as a lower layer for other higher-level abstraction MLIR layers such as [mlir-air](https://github.com/Xilinx/mlir-air). This IRON AIE programming guide first introduces the language bindings for AIE-array's structural elements ([section 1](./section-1/README.md)). After explaining how to set up explicit data movement ([section 2](./section-2/README.md)) to transport the necessary data, you can run your first program on the AIE compute core ([section 3](./section-3/README.md)). [Section 4](./section-4/README.md) adds tracing for performance analysis and explains how to exploit the compute-dense vector operations. More vector design examples, both basic and larger ones (ML or computer vision), are given in sections [5](./section-5/README.md) and [6](./section-6/README.md). Finally, the [quick reference](./quick_reference.md) summarizes the most important API elements.
diff --git a/programming_guide/assets/ComputeTile.png b/programming_guide/assets/ComputeTile.png new file mode 100644 index 0000000000..065fed189f Binary files /dev/null and b/programming_guide/assets/ComputeTile.png differ diff --git a/programming_guide/assets/ComputeTile_2.png b/programming_guide/assets/ComputeTile_2.png new file mode 100644 index 0000000000..6141e4edd7 Binary files /dev/null and b/programming_guide/assets/ComputeTile_2.png differ diff --git a/programming_guide/section-2/section-2a/README.md b/programming_guide/section-2/section-2a/README.md index 61b367145d..1b826b3efd 100644 --- a/programming_guide/section-2/section-2a/README.md +++ b/programming_guide/section-2/section-2a/README.md @@ -30,9 +30,9 @@ class object_fifo: ``` We will now go over each of the inputs, what they represent and why they are required by the abstraction. We will first focus on the mandatory inputs and in a later section of the guide on the default-valued ones (see Data Layout Transformations in [section-2c](../section-2c/README.md#data-layout-transformations)). -First of all, an Object FIFO has a unique `name` which is required for the lowering steps. It functions as an ordered buffer that has `depth`-many objects of specified `datatype`. Currently, all objects in an Object FIFO have to be of the same datatype. The `datatype` is a tensor-like attribute where the size of the tensor and the type of the individual elements are specified at the same time (i.e. `<16xi32>`). The `depth` can be either an integer or an array of integers. The latter is used to support a specific dependency that can arise when working with multiple Object FIFOs and it is further explained in the Key Object FIFO Patterns [section](../section-2b/02_Broadcast/README.md#object-fifo-broadcast-pattern). +First of all, an Object FIFO has a unique `name` which is required for the lowering steps. The Object FIFO functions as an ordered buffer that has `depth`-many objects of specified `datatype`. Currently, all objects in an Object FIFO have to be of the same datatype. The `datatype` is a tensor-like attribute where the size of the tensor and the type of the individual elements are specified at the same time (i.e. `<16xi32>`). The `depth` can be either an integer or an array of integers. The latter is explained further down in this section. -An Object FIFO is created between a producer, or source tile, and a consumer, or destination tile. The tiles are where producer and consumer processes accessing the Object FIFO will be executed. Below, you can see an example of an Object FIFO created between producer tile A and consumer tile B: +An Object FIFO is created between a producer, or source tile, and a consumer, or destination tile. The tiles are where producer and consumer processes accessing the Object FIFO will be executed. These processes are also referred to as the `actors` of the Object FIFO, based on dataflow theory terminology. Below, you can see an example of an Object FIFO created between producer tile A and consumer tile B: ```python A = tile(1, 3) B = tile(2, 4) @@ -115,5 +115,76 @@ def core_body(): yield_([]) ``` +### Specifying the Object FIFO Depth as an Array + +As was mentioned at the beginning of this section, the AIE architecture is a spatial architecture that requires explicit data movement.
As such, while the Object FIFO's conceptual design is that of an ordered buffer between two or more AIE tiles, in reality its conceptual depth is spread out over multiple resource pools that may be located at different levels of the memory hierarchy and on different tiles. + +A more in-depth yet still logical view of the Object FIFO's depth is that the producer and each consumer have their own working resource pool available in their local memory modules, which they can use to send and receive data as part of the data movement described by the Object FIFO. The Object FIFO primitive and its lowering typically allocate the depth of each of these pools such that the resulting behaviour matches that of the conceptual depth. + +The user does, however, have the option to manually choose the depth of these pools. This feature is available because, while the Object FIFO primitive tries to offer a unified representation of the data movement across the AIE array, it also aims to provide performance programmers with the tools to more finely control it. + +For example, in the code snippet below, `of0` describes the data movement between producer A and consumer B: ```python +A = tile(1, 3) +B = tile(2, 4) +of0 = object_fifo("objfifo0", A, B, 3, T.memref(256, T.i32())) +``` +The conceptual depth of the Object FIFO is `3`. The reasoning behind this choice of depth can be understood by looking at the acquire and release patterns of the two actors: ```python +@core(A) +def core_body(): + for _ in range_(9): + elem0 = of0.acquire(ObjectFifoPort.Produce, 1) + call(produce_func, [elem0]) + of0.release(ObjectFifoPort.Produce, 1) + yield_([]) + +@core(B) +def core_body(): + for _ in range_(9): + elems = of0.acquire(ObjectFifoPort.Consume, 2) + call(consume_func, [elems[0], elems[1]]) + of0.release(ObjectFifoPort.Consume, 2) + yield_([]) +``` +In each iteration: +* producer A acquires one object to produce into, calls the kernel function `produce_func` to store new data in it for B to consume, and releases the object; +* consumer B acquires two objects to consume, reads the data and applies kernel function `consume_func`, then releases both objects. + +A conceptual depth of `2` would have sufficed for this system to function without deadlocking. However, with a depth of `3`, A and B can execute concurrently, i.e., while B consumes two objects and applies the kernel function, A has one object available into which it can produce at the same time. + +The equivalent of this conceptual depth of `3` using an array of depths would be: ```python +of0 = object_fifo("objfifo0", A, B, [1, 2], T.memref(256, T.i32())) +``` +where `1` is the number of resources available locally to producer A and `2` is the number available to consumer B. + +> **NOTE:** For a correct lowering, this feature should be used in situations where the producers and consumers of the Object FIFO are running on different tiles. + +The feature of specifying the depths of the resource pools for different actors of the Object FIFO is used to support a specific dependency that can arise when working with multiple Object FIFOs; it is further explained in the Key Object FIFO Patterns [section](../section-2b/02_Broadcast/README.md#object-fifo-broadcast-pattern).
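+ +For reference, the sketch below assembles the snippets above into a single design that uses the array form of the depth directly. It follows this section's own conventions and is meant as an illustrative sketch rather than a complete, tested design (the kernel functions `produce_func` and `consume_func` are assumed to be declared elsewhere): ```python +A = tile(1, 3) +B = tile(2, 4) +# Producer A keeps a local pool of 1 object and consumer B a local pool of 2 +# objects, together matching the conceptual depth of 3 used earlier: +of0 = object_fifo("objfifo0", A, B, [1, 2], T.memref(256, T.i32())) + +@core(A) +def core_body(): + for _ in range_(9): + elem0 = of0.acquire(ObjectFifoPort.Produce, 1) + call(produce_func, [elem0]) + of0.release(ObjectFifoPort.Produce, 1) + yield_([]) + +@core(B) +def core_body(): + for _ in range_(9): + # B acquires two objects at once, so its local pool must hold at least 2. + elems = of0.acquire(ObjectFifoPort.Consume, 2) + call(consume_func, [elems[0], elems[1]]) + of0.release(ObjectFifoPort.Consume, 2) + yield_([]) +``` +Note that, since B acquires two objects at once, a depth of `[1, 1]` could never satisfy B's acquire request and the design would be expected to deadlock: a consumer's pool must be at least as deep as its largest acquire.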
+ +### Advanced Topic: Data Movement Accelerators + +**The following topic is not required to understand the rest of this guide.** + +This part of the guide introduces a few lower-level concepts in the AIE hardware and takes a closer look at the individual resource pools on each tile and the reasoning behind their depths. + +Every tile in the AIE array has its own dedicated Data Movement Accelerator (or `DMA`). The DMAs are responsible for moving data from the tile's memory module to the AXI stream interconnect or from the stream to the memory module. In the case of compute tiles, both the compute core and the tile's DMA are able to access the tile's memory module. Because of this, there is a need for a synchronization mechanism that allows the compute core and the DMA to signal to each other when data is available for the other party to read or write, in order to avoid data corruption. This is very similar to the concept of the Object FIFO where producers and consumers must first acquire objects before they can access them, and release them when they are done so they may be acquired by the other party. + +The figure below showcases a high-level view of a compute tile, where the compute core and the DMA are both reading and writing data to a location `buff` in the local memory module: + +![High-level view of a compute tile where the core and the DMA share a single buffer buff](../../assets/ComputeTile.png) + +The intent of this high-level view is to showcase how the DMA is able to interact with memory locations while the core is computing: it can send the data they contain over the AXI stream, and it can likewise receive data from the stream and write it into the memory locations, for example for the core to execute on. Because of this potential for concurrency, it is often the case that a ping-pong, or double, buffer is used instead of a single buffer. This is showcased in the figure below where the `buff` has been extended to a `buff_ping` and `buff_pong`: + +![High-level view of a compute tile where buff has been extended to a ping-pong pair buff_ping and buff_pong](../../assets/ComputeTile_2.png) + +> **NOTE:** It is possible to directly configure the DMAs without the use of the Object FIFO primitive to set up data movement between tiles. This is described in [Section 2f](../section-2f/README.md).
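+ +The resource pools from the previous subsection are what give the Object FIFO lowering room to create exactly this kind of ping-pong arrangement. As an illustrative sketch (assuming the typical lowering, rather than a guarantee), a local pool of depth `2` on a tile is what allows a `buff_ping`/`buff_pong` pair to be allocated there: ```python +A = tile(1, 3) +B = tile(2, 4) +# With a local depth of 2 on each side, the lowering is free to realize each +# pool as a ping-pong (double) buffer: the DMA can move one buffer over the +# AXI stream while the compute core works on the other. +of0 = object_fifo("objfifo0", A, B, [2, 2], T.memref(256, T.i32())) +``` +The compute core and the tile's DMA then play the producer and consumer roles described above, synchronizing through the same acquire and release discipline.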
+ +## Exercises +1. In the previous [subsection](./README.md/#specifying-the-object-fifo-depth-as-an-array) it was explained that the conceptual depth of `3` for `of0` could be represented as an array of depths `[1, 2]`. With this advanced knowledge of DMAs, do you think those depths suffice for the compute cores on tiles A and B to run concurrently with their local DMAs? + +1. How would you update the depths? + ----- [[Up](..)] [[Next - Section 2b](../section-2b/)] diff --git a/programming_guide/section-2/section-2b/02_Broadcast/README.md b/programming_guide/section-2/section-2b/02_Broadcast/README.md index 3410764df7..1842bf390c 100644 --- a/programming_guide/section-2/section-2b/02_Broadcast/README.md +++ b/programming_guide/section-2/section-2b/02_Broadcast/README.md @@ -27,9 +27,9 @@ of0 = object_fifo("objfifo0", A, [B, C, D], 3, T.memref(256, T.i32())) The `depth` input of an Object FIFO can also be specified as an array of integers, which describe the number of objects that are available to each tile (the producer tile plus each consumer tile) when accessing the Object FIFO. For the previous example, each of the four tiles has a resource pool of 3 objects available to perform the data movement of `of0`. -> **NOTE:** This functionality of the Object FIFO primitive exposes what is actually going on at the hardware level when the data movement is established for a broadcast. The object pool of the Object FIFO is not a single structure but rather composed of several pools of objects that are allocated in the memory module of each tile involved in the data movement. Specifying the `depth` as an array of integers allows the user full control to set the sizes of the pools on each individual tile. +> **NOTE:** This functionality of the Object FIFO primitive exposes what is actually going on at the hardware level when the data movement is established for a broadcast. The object pool of the Object FIFO is not a single structure but rather composed of several pools of objects that are allocated in the memory module of each tile involved in the data movement. Specifying the `depth` as an array of integers allows the user full control to set the sizes of the pools on each individual tile. Please see [Section 2a](../../section-2a/README.md/#specifying-the-object-fifo-depth-as-an-array) for more details. -The main advantage of this feature comes to light during a situation like the one showcased in the example below, which we refer to as a broadcast with a skip-connection. In the example below two Object FIFOs are created: `of0` is a broadcast from producer tile A to consumer tiles B and C, while `of1` is a 1-to-1 data movement from producer tile B to consumer tile C. We refer to `of1` as a skip-connection because it is a dependency between the two consumer tiles of the same broadcast connection. +The main advantage of this feature comes to light during a situation like the one showcased in the example below, which we refer to as a broadcast with a skip-connection. In the example below, two Object FIFOs are created: `of0` is a broadcast from producer tile A to consumer tiles B and C, while `of1` is a 1-to-1 data movement from producer tile B to consumer tile C. We refer to `of0` as a skip-connection because its direct A → C connection skips over B in the A → B → C chain. ```python A = tile(1, 3) B = tile(2, 3) diff --git a/programming_guide/section-2/section-2f/README.md b/programming_guide/section-2/section-2f/README.md index 88f9f28d13..3f619e7b63 100644 --- a/programming_guide/section-2/section-2f/README.md +++ b/programming_guide/section-2/section-2f/README.md @@ -10,7 +10,7 @@ # Section 2f - Data Movement Without Object FIFOs -Not all data movement patterns can be described with Object FIFOs. This section goes into detail about how a user can express data movement using the Data Movement Accelerators (or `DMA`) on AIE tiles. +Not all data movement patterns can be described with Object FIFOs. This **advanced** section goes into detail about how a user can express data movement using the Data Movement Accelerators (or `DMA`) on AIE tiles. To better understand the code and concepts introduced in this section, it is recommended to first read the [Advanced Topic of Section 2a on DMAs](../section-2a/README.md/#advanced-topic--data-movement-accelerators). The AIE architecture currently has three different types of tiles: compute tiles referred to as `tile`, memory tiles referred to as `Mem tile`, and external memory interface tiles referred to as `Shim tile`. Each of these tiles has its own attributes regarding compute capabilities and memory capacity, but the base design of their DMAs is the same.
The different types of DMAs can be initialized using the constructors in [aie.py](../../../python/dialects/aie.py): ```python diff --git a/programming_guide/section-6/README.md b/programming_guide/section-6/README.md index 83e8899002..f54c812ab3 100644 --- a/programming_guide/section-6/README.md +++ b/programming_guide/section-6/README.md @@ -26,8 +26,14 @@ There are a number of example designs available [here](../../programming_example | Design name | Data type | Description | |-|-|-| -|[bottleneck](../../programming_examples/ml/bottleneck/)|ui8|A Bottleneck Residual Block is a variant of the residual block that utilises 1x1 convolutions to create a bottleneck. The use of a bottleneck reduces the number of parameters and computations.| +|[bottleneck](../../programming_examples/ml/bottleneck/)|ui8|A Bottleneck Residual Block is a variant of the residual block that utilises three convolutions, using 1x1, 3x3 and 1x1 filter sizes, respectively. The use of a bottleneck reduces the number of parameters and computations.| |[resnet](../../programming_examples/ml/resnet/)|ui8|ResNet with offloaded conv2_x bottleneck blocks. The implementation features kernel fusion and dataflow optimizations highlighting the unique architectural capabilities of AI Engines.| +## Exercises + +1. In the [bottleneck](../../programming_examples/ml/bottleneck/) design, which follows a dataflow approach, how many input elements does the 3x3 convolution operation require before it can proceed with its computation? +2. Suppose you have a bottleneck block with input dimensions of 32x32x256. After passing through the 1x1 convolutional layer, the output dimensions become 32x32x64. What would be the output dimensions after the subsequent 3x3 convolutional layer, assuming a stride of 1, no padding, and 64 output channels? (Recall that a convolution maps an input of width W to an output of width (W - k + 2p)/s + 1, rounded down, where k is the kernel size, p the padding, and s the stride.) + ----- [[Prev - Section 5](../section-5/)] [[Top](..)] +