Merge upstream (Xilinx#191)
* install target.h for the memory allocator as well (Xilinx#606)

This is redundant, but reflects the true dependencies better.

* Fix path used for tests

Peano should come before the regular Vitis path due to name collisions.

* Add basic AIE2 tests.

* [AIE] Add decoding of DMA status

The test library now does this for both AIE1 and AIE2 DMAs.
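
As a rough illustration of what such decoding looks like (the bit layout below is a placeholder, not the real AIE register layout):

```cpp
#include <cstdint>

// Illustrative only: unpack a raw DMA status word into named fields.
// The field offsets here are hypothetical; the real AIE1/AIE2 layouts
// come from the architecture manuals.
struct DmaChannelStatus {
  unsigned state;     // e.g. idle / running
  bool stalledOnLock; // waiting on a lock acquire
};

DmaChannelStatus decodeDmaStatus(uint32_t raw) {
  DmaChannelStatus s;
  s.state = raw & 0x3;              // placeholder bit field
  s.stalledOnLock = (raw >> 2) & 1; // placeholder bit field
  return s;
}
```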

* [AIE] Add packet stream tests for ShimDMAs

This test looks at a common scenario where three tensors are input to a
tile from three independent DMAs, but only two receiving tile DMA channels
are available. Using packet routing, this scenario can be accommodated by
time-sharing one of the destination DMAs, as the sketch below illustrates.

Obsoletes Xilinx#85
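
A minimal, illustrative model of the time-sharing idea (not the actual test code; packet IDs and names are placeholders):

```cpp
#include <cstdint>
#include <vector>

// One packet-switched word: an ID tag selects the logical stream.
struct Packet {
  uint8_t id;
  uint32_t data;
};

// Channel 0 stays dedicated to the first tensor; the second channel is
// shared by the remaining two streams and demultiplexed by packet ID.
void demuxSharedChannel(const std::vector<Packet> &sharedChannel,
                        std::vector<uint32_t> &stream1,
                        std::vector<uint32_t> &stream2) {
  for (const Packet &p : sharedChannel) {
    if (p.id == 1)
      stream1.push_back(p.data); // words tagged for the second tensor
    else if (p.id == 2)
      stream2.push_back(p.data); // words tagged for the third tensor
  }
}
```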

* Fix error message if no device.

We need to return with an error message, or later code will
segfault.
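
A minimal sketch of the guard, assuming a test-harness entry point; the handle type and message are placeholders:

```cpp
#include <cstdio>

struct DeviceHandle; // placeholder for the harness's device type

// Return an error instead of falling through with a null handle, which is
// what previously led to the segfault in later code.
int requireDevice(DeviceHandle *dev) {
  if (!dev) {
    std::fprintf(stderr, "Error: no AIE device found.\n");
    return 1;
  }
  return 0;
}
```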

* Put exp lookup table into run_time_lib/AIE2 (Xilinx#604)
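
For context (grounded in the getExpBf16 code shown later in this diff): the table-based exponential converts the input to fixed point with 8 fractional bits, then uses the high bits for an integer-part lookup and the low bits for a fractional-part lookup, since

$$e^{x} = e^{i+f} = e^{i} \cdot e^{f},$$

so two parallel lookups and a single multiply reconstruct the result.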

* Update chess_intrinsic_wrapper.cpp (Xilinx#610)

Remove event intrinsic declarations from AIEv1 wrapper

* Add TOSA tensor broadcast and mixed precision tests (Xilinx#609)

* Add the following TOSA integration tests to test/Integration/Dialect/TOSA/
* List of PASS tests:
i16xi16_add_elem (lane=32)
i16xi16_mul_elem (lane=32)
i16xi16_sel (lane=32)
i16xi16_sub_elem (lane=32)
i8xi8_add_elem (lane=64)
i8xi8_mul_elem (lane=32)
i8xi8_sel (lane=64)
i8xi8_sub_elem (lane=64)
bf16xbf16_sub_elem_2d_broadcast_1d (lane=16)
bf16xbf16_sub_elem_2d_broadcast_1d_reshape (lane=16)
* List of XFAIL tests:
i8xi16_sub_elem (lane=32)
bf16xbf16_sub_elem_2d_broadcast_2d (lane=16)
bf16xbf16_sub_elem_2d_broadcast_1d_unit_dim (lane=16)

* Fix include order (Xilinx#613)

* [aievec] Add hoisting patterns for arith.extsi

Hoisting cast operations as close as possible to the source of data can
make later patterns more robust to typical variations in the source
code.

We might need to revisit this one if, in the future, this process
causes unintended consequences.

* Implement inverse of a float by lookup tables (Xilinx#612)
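
The trick (visible in getInvBf16 further down this diff) works on the float's bit pattern: for $x = 2^{\,e-127}(1+f)$ with biased exponent $e$ and mantissa fraction $f$,

$$\frac{1}{x} = 2^{\,(253-e)-127} \cdot \frac{2}{1+f}, \qquad \frac{2}{1+f} \in (1, 2],$$

so the result's biased exponent is $253-e$ (bumped by one in the power-of-two case $f=0$, where $2/(1+f)=2$), and the new mantissa comes from the 128-entry m_inv_lut table.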

* Fix some test failures (Xilinx#614)

* Move aiecc.py implementation to python library (Xilinx#387)

* Use correct macros for C API (Xilinx#615)

* capi

* reformat

* Re-export MLIR target set in CMake (Xilinx#617)

* mlirconfig

* reformat

* Disable `-Wno-unknown-warning-option` on windows (Xilinx#620)

* unknownwarning

* reformat

* win32 (Xilinx#618)

* Use new policy CMP0091 for MSVC (Xilinx#619)

* msvc

* reformat

* Revert "Use new policy CMP0091 for MSVC (Xilinx#619)" (Xilinx#622)

This reverts commit 1520898.

* Use upstream CMake macros to find python (Xilinx#616)

* cmake

* reformat

* Bump cmakeModules (Xilinx#624)

* ObjFifo unroll dependency fixes (Xilinx#621)

* Fixes for objFifo unrolling algorithm.

* EOF

* clang-format (Xilinx#625)

* Use target model functions to get number of DMA channels.

* Clang format

* Fix function call

* Add shim tiles that are not in NOC columns to the getNumDestShimMuxConnections() functions.

* Add isShimNOCorPLTile() to the target model.

* Add missing target model.

* Add improvements to the doc

* Add the isShimNOCorPLTile() virtual function (sketched below).

* Clang format
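
A hedged sketch of the new target-model query; the exact class name and signature in the repository may differ:

```cpp
// Convenience predicate combining the two shim-tile kinds the routing code
// cares about. Assumes isShimNOCTile/isShimPLTile already exist on the
// target model, as in the mlir-aie target model interface.
bool AIETargetModel::isShimNOCorPLTile(int col, int row) const {
  return isShimNOCTile(col, row) || isShimPLTile(col, row);
}
```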

---------

Co-authored-by: abisca <[email protected]>
Co-authored-by: Joseph Melber <[email protected]>

* need ONLY option to make cmake find numpy (Xilinx#630)

* Split ccache database according to the parallel jobs (Xilinx#600)

This fixes a race condition in ccache database writes that happen at the
end of each job running in parallel. By using a unique key per job, each
database is written correctly and can be reused by the next CI run.
Also use the real LLVM commit hash as the ccache database key in CI,
instead of the previous hack that assumed the textual commit hash was
present inside utils/clone-llvm.sh.

* Fix TOSA broadcast and mixed precision tests (Xilinx#631)

Fix the following TOSA tests:
- bf16xbf16_sub_elem_2d_broadcast_2d
- i8xi16_sub_elem
Add the following new TOSA tests:
- i16xi16_sub_elem_2d_broadcast_scalar (pass)
- i16xi16_sub_elem_2d_broadcast_1d_unit_dim (pass)
- bf16xbf16_sub_elem_2d_broadcast_scalar (xfail)

* Fix ordering of putStream intrinsic.

The argument order for the intrinsic didn't match:
Argument 0: channel #
Argument 1: value
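
A hedged illustration of the corrected ordering (the wrapper name is a placeholder for the real intrinsic):

```cpp
// Argument 0 selects the stream channel; argument 1 carries the value.
extern "C" void putStream(unsigned channel, int value); // placeholder decl

void sendWord(int value) {
  putStream(/*channel=*/0, /*value=*/value); // channel first, then value
}
```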

* Fix decoding of tile status for stream stalls.

These were previously nonsensical.

* Add end-to-end tests for CPU stream access.

* Fix intrinsic wrapper for aie2 acquire/release

* Explicitly compile intrinsic-dependent code with the chess frontend.

* [tests] Remove address.

This address is ignored, resulting in a warning.

* catch up to TOM MLIR (Xilinx#590)

* catch up to llvm TOM

* Update VectorToAIEVecConversions.cpp

* Get `VectorType` instead of `Type`

* format

* xfail opaque pointer related tests and update test

* update finalize-memref-to-llvm

---------

Co-authored-by: Javier Setoain <[email protected]>

* Add softmax test cases (Xilinx#635)

* Revised the xchess compilation commands for lut test cases. (Xilinx#636)

* Add more combined precision tosa tests (Xilinx#637)

Add the following passing element-wise tosa tests:
- i32xi32_add_elem (lane=32)
- i32xi32_mul_elem (lane=16)
- i32xi32_sel (lane=16)
- i32xi32_sub_elem (lane=32)

Add the following passing combined precision element-wise tosa tests:
- i8xi16_add_elem (lane=32)
- i8xi16_sub_elem (lane=32)
- i8xi32_add_elem (lane=32)
- i8xi32_sub_elem (lane=32)
- i16xi32_add_elem_v16 (lane=16)
- i16xi32_sub_elem_v16 (lane=16)

Add the following XFAIL combined precision element-wise tosa tests:
- i16xi32_add_elem_v32 (lane=32)
- i16xi32_sub_elem_v32 (lane=32)

* [aievec] Generalize vector passes

Right now, vectorization passes are anchored on FuncOp, which prevents
conversion to AIEVec within other top-level operations, like AIE.device
ops.

This patch makes all passes generic and allows for conversion within
AIE.device.

* Implement tanh(x) based on linear approximation lookup tables (Xilinx#639)

* Refactor conversion of aievec.mul_elem to support combined precision (Xilinx#643)

* Refactor AIE-ML acc datatype emission
* Refactor arith.muli/mulf to aievec.mul_elem conversion pattern to make it extensible and clean
  - Reorganize the existing case-by-case patterns and decouple the pattern that requires two inputs to be the same type
  - Make it a cleaner pattern considering lhs/rhs/out datatype
  - Verified that all the dut.cc are identical before/after the refactor
* Add convertValueToTargetTypeAieML() which can be helpful for handling the vector lane mismatch issue later on.
* Add CPP emission for aievec.unpack op
* Add VectorToAIEVec lit tests to cover the lowering patterns
* Add new combined precision tosa tests for element-wise multiply:
  - i8xi16_mul_elem_v32 (out=i32, lane=32) (cycle count=144, PM=272), PASS
  - i8xi16_mul_elem_v16 (out=i32, lane=16) (cycle count=792, PM=368), XFAIL
    - No intent to work on this at the moment, but keep a record there
  - i16xi32_mul_elem (out=i32, lane=16) (cycle count=408, PM=384), PASS
  - i8xi32_mul_elem (out=i32, lane=16) (cycle count=728, PM=368), PASS

* Compute memref sizes by multiplying all shape sizes. (Xilinx#641)

Co-authored-by: abisca <[email protected]>
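
A minimal sketch of the corrected arithmetic, assuming a plain vector of dimension sizes:

```cpp
#include <cstdint>
#include <vector>

// A memref's element count is the product of all of its shape dimensions,
// not just the first one.
int64_t memrefNumElements(const std::vector<int64_t> &shape) {
  int64_t count = 1;
  for (int64_t dim : shape)
    count *= dim; // multiply every dimension size
  return count;
}
```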

* [aievec][nfc] Clean-up aievec to llvm conversion

This code needed to update its use of a couple of constructs and
namespaces.

* Add tosa-to-tensor pass to fix regression (Xilinx#645)

* Add tosa-to-tensor pass to fix regression of tosa broadcast tests

* Convert math.sqrt to a function call getSqrtBf16() for v16bfloat16 and v32bfloat16 types (Xilinx#646)

* Add comments for sqrt.h (Xilinx#648)

* Adding more tosa tests for combined precision inputs and broadcast (Xilinx#650)

* Add floatxfloat_sub_elem tosa test
* Add floatxfloat_add_elem tosa test
* Add floatxfloat_sel tosa test
* Add bf16xfloat_sub_elem tosa test
* Add bf16xfloat_add_elem tosa test
* Add i16xi16_sub_elem broadcast tests
* Add i8xi8_sub_elem broadcast tests
* Reorganize bf16xbf16 broadcast tosa tests
* Add floatxfloat_sub_elem broadcast tests
* Fix tosa lowering pipeline for bf16xbf16 sub_elem broadcast tests

* [aievec] Add missing conversion warnings for mac_elem and broadcast

This patch is a first step towards enabling AIEVec to LLVM Dialect
conversion for AIEml intrinsics.

* Add support of broadcast with vector width = 256 or 1024 and fix TOSA tests (Xilinx#653)

* Add support of broadcast_elem/broadcast_to_vxx for vector width == 256 (e.g. v16bf16) or 1024 (e.g. v32int32).
* Since we lower the vector.broadcast op to multiple aievec ops, we have to fix the FoldMulAddChainToConv pass to recognize the new aievec.broadcast patterns.
* Add the following list of PASS tests for implicit broadcast:
i32xi32_sub_elem_16x1024_broadcast_1
i32xi32_sub_elem_2d_broadcast_1d_unit_dim_v16 (out=i32, lane=16)
i32xi32_sub_elem_2d_broadcast_1d_unit_dim_v32 (out=i32, lane=32)
i32xi32_sub_elem_2d_broadcast_scalar_v16 (out=i32, lane=16)
i32xi32_sub_elem_2d_broadcast_scalar_v32 (out=i32, lane=32)
i32xi32_sub_elem_16x1024_broadcast_1024
i32xi32_sub_elem_2d_broadcast_1d_reshape_v16 (out=i32, lane=16)
i32xi32_sub_elem_2d_broadcast_1d_reshape_v32 (out=i32, lane=32)
i32xi32_sub_elem_2d_broadcast_1d_v16 (out=i32, lane=16)
i32xi32_sub_elem_2d_broadcast_1d_v32 (out=i32, lane=32)
i32xi32_sub_elem_2d_broadcast_2d_v16 (out=i32, lane=16)
i32xi32_sub_elem_2d_broadcast_2d_v32 (out=i32, lane=32)
* Add a dut.cc reference for the bf16xbf16_sub_elem_16x1024_broadcast_1 tests. The resulting dut.cc is legal, but it is blocked by a "broadcast_elem() of v32bfloat16" bug, so the tests remain marked XFAIL.
* Add conversion test coverage for aievec.broadcast and aievec.broadcast_scalar in test_broadcast.mlir.
* Fix the i8xi16_mul_elem_v32 mlir script.

* Convert tosa.erf and math.erf to a function call getErfBf16() for v16bfloat16 and v32bfloat16 types (Xilinx#652)

* Enable use of mlir pass manager in aiecc (Xilinx#628)

* Enable use of mlir pass manager in aiecc

* clang-format

* limit scope of mlir context, rebase

* fixup

* Revert "catch up to TOM MLIR (Xilinx#590)" (Xilinx#656)

This reverts commit 47ff7d3.

* Make pathfinder aware of the arch-specific routing constraints (Xilinx#657)

* Convert math.rsqrt to a function call getRsqrtBf16() for v16bfloat16 and v32bfloat16 types and reorganize files in aie_runtime_lib (Xilinx#655)

* Add more add/sub/mul mixed precision tests (Xilinx#659)

* Refactor the tosa-to-vector pipeline script in each test into a central place at test/Integration/lit.local.cfg for better maintainability. Also, make each .mlir test run in a unique workdir, so that multiple .mlir tests can live in a single directory.
* For the add/sub/mul mixed precision tests, add tests with swapped inputs.
* Per the TOSA spec at https://www.mlplatform.org/tosa/tosa_spec.html#_mul, add test coverage for i16xi16_mul_elem_i32 and i8xi8_mul_elem_i32. Our refactored mul_elem lowering pattern handles these two cases directly, and the acctype for the i8/i16 mac intrinsics we use is i32 (see the sketch below).
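
A minimal sketch of the widening multiply these tests exercise (scalar form of the i32-accumulating semantics; names are illustrative):

```cpp
#include <cstdint>

// TOSA-style element-wise multiply: i16 (or i8) operands, i32 accumulator.
// Widening before the multiply avoids overflow in the product.
int32_t mul_elem_i32(int16_t a, int16_t b) {
  return static_cast<int32_t>(a) * static_cast<int32_t>(b);
}
```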

* Enable AIEX dialect bindings (Xilinx#658)

* Enable AIEX dialect bindings

* Replace 'Aie' prefix with 'AIE' in python cmake

* Fixes after merge

* Apply xca_udm_dbg workaround to new tests

* Change runner to xsj

* XFAIL some tests after merge

Co-authored-by: Stephen Neuendorffer <[email protected]>
Co-authored-by: Hanchen Ye <[email protected]>
Co-authored-by: Lina Yu <[email protected]>
Co-authored-by: James Lin <[email protected]>
Co-authored-by: Javier Setoain <[email protected]>
Co-authored-by: Maksim Levental <[email protected]>
Co-authored-by: Andra Bisca <[email protected]>
Co-authored-by: abisca <[email protected]>
Co-authored-by: Joseph Melber <[email protected]>
Co-authored-by: Kristof Denolf <[email protected]>
Co-authored-by: Ronan Keryell <[email protected]>
Co-authored-by: Javier Setoain <[email protected]>
Co-authored-by: erwei-xilinx <[email protected]>
14 people authored and GitHub Enterprise committed Sep 29, 2023
1 parent b075eee commit a8d67c5
Showing 293 changed files with 7,224 additions and 874 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/buildAndTest.yml
@@ -26,7 +26,7 @@ jobs:
   # cache.
   build-llvm:
     name: Build pynqMLIR-AIE
-    runs-on: xrlabs-xco
+    runs-on: xrlabs-xsj
     steps:
       # - name: Configure Environment
       #   run: echo "$GITHUB_WORKSPACE/llvm/install/bin" >> $GITHUB_PATH
228 changes: 1 addition & 227 deletions aie_runtime_lib/AIE/lut_based_ops.cpp

Large diffs are not rendered by default.

83 changes: 1 addition & 82 deletions aie_runtime_lib/AIE/lut_based_ops.h
@@ -1,82 +1 @@
//===--- exp_lut.h - get exponential values from loopup tables ---===//
//
// This file is licensed under the Apache License v2.0 with LLVM Exceptions
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// (c) Copyright 2023 Xilinx Inc.
//
//
//===----------------------------------------------------------------------===//
// This is the implementation of getting exponential values for a bfloat16
// vector from exponential lookup tables.
//===----------------------------------------------------------------------===//
#ifndef __LUT_BASED_OPS_H__
#define __LUT_BASED_OPS_H__

#include "aie_api/aie.hpp"

alignas(aie::vector_decl_align) extern int16 exp_ilut_ab[512];
alignas(aie::vector_decl_align) extern int16 exp_ilut_cd[512];
alignas(aie::vector_decl_align) extern int16 exp_flut_ab[512];
alignas(aie::vector_decl_align) extern int16 exp_flut_cd[512];
alignas(aie::vector_decl_align) extern unsigned char m_inv_lut[128];

__attribute__((always_inline)) v16accfloat getExpBf16(v16bfloat16 x) {
  bfloat16 __aie_dm_resource_a *ilut_ab =
      (bfloat16 __aie_dm_resource_a *)exp_ilut_ab;
  bfloat16 __aie_dm_resource_b *ilut_cd =
      (bfloat16 __aie_dm_resource_b *)exp_ilut_cd;
  bfloat16 __aie_dm_resource_a *flut_ab =
      (bfloat16 __aie_dm_resource_a *)exp_flut_ab;
  bfloat16 __aie_dm_resource_b *flut_cd =
      (bfloat16 __aie_dm_resource_b *)exp_flut_cd;

  using lut_type = aie::lut<4, bfloat16, bfloat16>;
  const int LUT_elems = 256;
  const int step_i = 8;
  const int step_f = 0;

  lut_type lut_i(LUT_elems, ilut_ab, ilut_cd);
  lut_type lut_f(LUT_elems, flut_ab, flut_cd);
  aie::parallel_lookup<uint16, lut_type, aie::lut_oor_policy::truncate>
      lookup_i(lut_i, step_i);
  aie::parallel_lookup<uint16, lut_type, aie::lut_oor_policy::truncate>
      lookup_f(lut_f, step_f);

  aie::vector<bfloat16, 16> I_val_vec, F_val_vec;
  aie::accum<accfloat, 16> exp_val;
  aie::vector<bfloat16, 16> input_bf16 = x;

  // position of output decimal point = 8, making input become 8 bits, and for
  // LUT_elems = 256 lookup. aie::vector<int16, 16>
  // input=aie::to_fixed<int16>(input_bf16,8);
  aie::vector<int16, 32> input0 = v32int16(bfloat16_to_int(input_bf16, 8));
  aie::vector<int16, 16> input = aie::filter_even(input0);

  I_val_vec = lookup_i.fetch(input.cast_to<uint16>());
  F_val_vec = lookup_f.fetch(input.cast_to<uint16>());
  exp_val = aie::mul(I_val_vec, F_val_vec);
  return v16accfloat(exp_val);
}

__attribute__((always_inline)) bfloat16 getInvBf16(float x) {
  unsigned int *B_x;
  unsigned int exp_mask = 0x7F800000;
  unsigned int mantissa_mask = 0x007FFFFF;
  unsigned int mantissa_Q = 0x00008000;
  unsigned char exponent, mantissa;
  unsigned inv_exponent;
  unsigned short inv_x_val;
  unsigned int B_Q;
  bfloat16 *inv_x;
  B_x = (unsigned int *)&x;
  B_Q = *B_x + mantissa_Q;
  exponent = (B_Q & exp_mask) >> 23;
  mantissa = (B_Q & mantissa_mask) >> 16;
  inv_exponent = (mantissa == 0) + (253 - exponent);
  inv_x_val = (inv_exponent << 7) + m_inv_lut[mantissa];
  inv_x = (bfloat16 *)&inv_x_val;
  return *inv_x;
}
#endif //__LUT_BASED_OPS_H__
// Unsupported exp_lut.h for AIE1
1 change: 1 addition & 0 deletions aie_runtime_lib/AIE/vec_math.h
@@ -0,0 +1 @@
// Unsupported sqrt.h for AIE1
4 changes: 2 additions & 2 deletions aie_runtime_lib/AIE2/chess_intrinsic_wrapper.cpp
@@ -19,10 +19,10 @@
/// when parsing .ll code containing standard intrinsic names, so these symbols
/// are defined that way.

extern "C" void llvm___aie___lock___acquire___reg(unsigned id, unsigned val) {
extern "C" void llvm___aie2___acquire(unsigned id, unsigned val) {
acquire_equal(id, val);
}
extern "C" void llvm___aie___lock___release___reg(unsigned id, unsigned val) {
extern "C" void llvm___aie2___release(unsigned id, unsigned val) {
release(id, val);
}
extern "C" void llvm___aie___event0() { event0(); }
140 changes: 138 additions & 2 deletions aie_runtime_lib/AIE2/lut_based_ops.cpp
@@ -1,4 +1,4 @@
-//===--- exp_lut.cpp - exponential loopup tables ---===//
+//===--- lut_based_ops.cpp - lookup table based operations ---===//
//
// This file is licensed under the Apache License v2.0 with LLVM Exceptions
// See https://llvm.org/LICENSE.txt for license information.
@@ -8,7 +8,7 @@
//
//
//===----------------------------------------------------------------------===//
-// These are exponential lookup tables for bfloat16 type
+// Lookup table based operations
//===----------------------------------------------------------------------===//

#include "aie_api/aie.hpp"
@@ -225,3 +225,139 @@ alignas(aie::vector_decl_align) unsigned char m_inv_lut[128] = {
22, 22, 21, 20, 20, 19, 18, 18, 17, 16, 16, 15, 14, 14, 13,
13, 12, 11, 11, 10, 10, 9, 9, 8, 7, 7, 6, 6, 5, 5,
4, 4, 3, 3, 2, 2, 1, 1};

// Tanh look up tables: Divides into 32 segments between [-4,4], bank size:
// (32*2*2*4)*2=1k, one lut=512B
float chess_storage(% chess_alignof(v32int8)) tanh_lut_ab[128] = {
0.00000000000000000000000000000000, -1.00000000000000000000000000000000,
0.00283813476562500000000000000000, -0.98828125000000000000000000000000,
0.00000000000000000000000000000000, -1.00000000000000000000000000000000,
0.00283813476562500000000000000000, -0.98828125000000000000000000000000,
0.00509643554687500000000000000000, -0.98046875000000000000000000000000,
0.00750732421875000000000000000000, -0.97265625000000000000000000000000,
0.00509643554687500000000000000000, -0.98046875000000000000000000000000,
0.00750732421875000000000000000000, -0.97265625000000000000000000000000,
0.01269531250000000000000000000000, -0.95703125000000000000000000000000,
0.02124023437500000000000000000000, -0.93359375000000000000000000000000,
0.01269531250000000000000000000000, -0.95703125000000000000000000000000,
0.02124023437500000000000000000000, -0.93359375000000000000000000000000,
0.03540039062500000000000000000000, -0.89843750000000000000000000000000,
0.05639648437500000000000000000000, -0.85156250000000000000000000000000,
0.03540039062500000000000000000000, -0.89843750000000000000000000000000,
0.05639648437500000000000000000000, -0.85156250000000000000000000000000,
0.09179687500000000000000000000000, -0.78125000000000000000000000000000,
0.14550781250000000000000000000000, -0.68750000000000000000000000000000,
0.09179687500000000000000000000000, -0.78125000000000000000000000000000,
0.14550781250000000000000000000000, -0.68750000000000000000000000000000,
0.22949218750000000000000000000000, -0.56250000000000000000000000000000,
0.34765625000000000000000000000000, -0.41601562500000000000000000000000,
0.22949218750000000000000000000000, -0.56250000000000000000000000000000,
0.34765625000000000000000000000000, -0.41601562500000000000000000000000,
0.50390625000000000000000000000000, -0.25976562500000000000000000000000,
0.69140625000000000000000000000000, -0.11962890625000000000000000000000,
0.50390625000000000000000000000000, -0.25976562500000000000000000000000,
0.69140625000000000000000000000000, -0.11962890625000000000000000000000,
0.86718750000000000000000000000000, -0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, -0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, 0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, 0.03076171875000000000000000000000,
0.69140625000000000000000000000000, 0.11962890625000000000000000000000,
0.50390625000000000000000000000000, 0.25976562500000000000000000000000,
0.69140625000000000000000000000000, 0.11962890625000000000000000000000,
0.50390625000000000000000000000000, 0.25976562500000000000000000000000,
0.34765625000000000000000000000000, 0.41601562500000000000000000000000,
0.22949218750000000000000000000000, 0.56250000000000000000000000000000,
0.34765625000000000000000000000000, 0.41601562500000000000000000000000,
0.22949218750000000000000000000000, 0.56250000000000000000000000000000,
0.14550781250000000000000000000000, 0.68750000000000000000000000000000,
0.09179687500000000000000000000000, 0.78125000000000000000000000000000,
0.14550781250000000000000000000000, 0.68750000000000000000000000000000,
0.09179687500000000000000000000000, 0.78125000000000000000000000000000,
0.05639648437500000000000000000000, 0.85156250000000000000000000000000,
0.03540039062500000000000000000000, 0.89843750000000000000000000000000,
0.05639648437500000000000000000000, 0.85156250000000000000000000000000,
0.03540039062500000000000000000000, 0.89843750000000000000000000000000,
0.02124023437500000000000000000000, 0.93359375000000000000000000000000,
0.01269531250000000000000000000000, 0.95703125000000000000000000000000,
0.02124023437500000000000000000000, 0.93359375000000000000000000000000,
0.01269531250000000000000000000000, 0.95703125000000000000000000000000,
0.00750732421875000000000000000000, 0.97265625000000000000000000000000,
0.00509643554687500000000000000000, 0.98046875000000000000000000000000,
0.00750732421875000000000000000000, 0.97265625000000000000000000000000,
0.00509643554687500000000000000000, 0.98046875000000000000000000000000,
0.00283813476562500000000000000000, 0.98828125000000000000000000000000,
0.00000000000000000000000000000000, 1.00000000000000000000000000000000,
0.00283813476562500000000000000000, 0.98828125000000000000000000000000,
0.00000000000000000000000000000000, 1.00000000000000000000000000000000,
};

float chess_storage(% chess_alignof(v32int8)) tanh_lut_cd[128] = {
0.00000000000000000000000000000000, -1.00000000000000000000000000000000,
0.00283813476562500000000000000000, -0.98828125000000000000000000000000,
0.00000000000000000000000000000000, -1.00000000000000000000000000000000,
0.00283813476562500000000000000000, -0.98828125000000000000000000000000,
0.00509643554687500000000000000000, -0.98046875000000000000000000000000,
0.00750732421875000000000000000000, -0.97265625000000000000000000000000,
0.00509643554687500000000000000000, -0.98046875000000000000000000000000,
0.00750732421875000000000000000000, -0.97265625000000000000000000000000,
0.01269531250000000000000000000000, -0.95703125000000000000000000000000,
0.02124023437500000000000000000000, -0.93359375000000000000000000000000,
0.01269531250000000000000000000000, -0.95703125000000000000000000000000,
0.02124023437500000000000000000000, -0.93359375000000000000000000000000,
0.03540039062500000000000000000000, -0.89843750000000000000000000000000,
0.05639648437500000000000000000000, -0.85156250000000000000000000000000,
0.03540039062500000000000000000000, -0.89843750000000000000000000000000,
0.05639648437500000000000000000000, -0.85156250000000000000000000000000,
0.09179687500000000000000000000000, -0.78125000000000000000000000000000,
0.14550781250000000000000000000000, -0.68750000000000000000000000000000,
0.09179687500000000000000000000000, -0.78125000000000000000000000000000,
0.14550781250000000000000000000000, -0.68750000000000000000000000000000,
0.22949218750000000000000000000000, -0.56250000000000000000000000000000,
0.34765625000000000000000000000000, -0.41601562500000000000000000000000,
0.22949218750000000000000000000000, -0.56250000000000000000000000000000,
0.34765625000000000000000000000000, -0.41601562500000000000000000000000,
0.50390625000000000000000000000000, -0.25976562500000000000000000000000,
0.69140625000000000000000000000000, -0.11962890625000000000000000000000,
0.50390625000000000000000000000000, -0.25976562500000000000000000000000,
0.69140625000000000000000000000000, -0.11962890625000000000000000000000,
0.86718750000000000000000000000000, -0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, -0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, 0.03076171875000000000000000000000,
1.00000000000000000000000000000000, 0.00000000000000000000000000000000,
0.86718750000000000000000000000000, 0.03076171875000000000000000000000,
0.69140625000000000000000000000000, 0.11962890625000000000000000000000,
0.50390625000000000000000000000000, 0.25976562500000000000000000000000,
0.69140625000000000000000000000000, 0.11962890625000000000000000000000,
0.50390625000000000000000000000000, 0.25976562500000000000000000000000,
0.34765625000000000000000000000000, 0.41601562500000000000000000000000,
0.22949218750000000000000000000000, 0.56250000000000000000000000000000,
0.34765625000000000000000000000000, 0.41601562500000000000000000000000,
0.22949218750000000000000000000000, 0.56250000000000000000000000000000,
0.14550781250000000000000000000000, 0.68750000000000000000000000000000,
0.09179687500000000000000000000000, 0.78125000000000000000000000000000,
0.14550781250000000000000000000000, 0.68750000000000000000000000000000,
0.09179687500000000000000000000000, 0.78125000000000000000000000000000,
0.05639648437500000000000000000000, 0.85156250000000000000000000000000,
0.03540039062500000000000000000000, 0.89843750000000000000000000000000,
0.05639648437500000000000000000000, 0.85156250000000000000000000000000,
0.03540039062500000000000000000000, 0.89843750000000000000000000000000,
0.02124023437500000000000000000000, 0.93359375000000000000000000000000,
0.01269531250000000000000000000000, 0.95703125000000000000000000000000,
0.02124023437500000000000000000000, 0.93359375000000000000000000000000,
0.01269531250000000000000000000000, 0.95703125000000000000000000000000,
0.00750732421875000000000000000000, 0.97265625000000000000000000000000,
0.00509643554687500000000000000000, 0.98046875000000000000000000000000,
0.00750732421875000000000000000000, 0.97265625000000000000000000000000,
0.00509643554687500000000000000000, 0.98046875000000000000000000000000,
0.00283813476562500000000000000000, 0.98828125000000000000000000000000,
0.00000000000000000000000000000000, 1.00000000000000000000000000000000,
0.00283813476562500000000000000000, 0.98828125000000000000000000000000,
0.00000000000000000000000000000000, 1.00000000000000000000000000000000,
};
27 changes: 27 additions & 0 deletions aie_runtime_lib/AIE2/lut_based_ops.h
@@ -79,4 +79,31 @@ __attribute__((always_inline)) bfloat16 getInvBf16(float x) {
  inv_x = (bfloat16 *)&inv_x_val;
  return *inv_x;
}

extern float tanh_lut_ab[];
extern float tanh_lut_cd[];

inline __attribute__((always_inline)) v16bfloat16
getTanhBf16(v16bfloat16 vInput) {
  aie::vector<bfloat16, 16> input = vInput;

  int step_bits = -2;
  int bias = 16;
  int data_size = 16;
  int LUT_elems = 32;
  int shift_offset = 0; // unused

  using lut_type = aie::lut<4, float, bfloat16>;

  lut_type test_lut(LUT_elems, (bfloat16 *)tanh_lut_ab,
                    (bfloat16 *)tanh_lut_cd);

  aie::linear_approx<bfloat16, lut_type> lin_aprox(test_lut, step_bits, bias,
                                                   shift_offset);

  aie::vector<bfloat16, 16> output =
      lin_aprox.compute(input).to_vector<bfloat16>();

  return (v16bfloat16)output;
}
#endif //__LUT_BASED_OPS_H__
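
A hypothetical usage sketch of the new helper, assuming an AIE2 kernel context where the native bfloat16 vector types are available; the kernel name and pointer casts are illustrative only:

```cpp
// Apply the LUT-based linear-approximation tanh across a bf16 buffer,
// 16 lanes at a time. Assumes n is a multiple of 16 and the buffers are
// suitably aligned for vector access.
void tanh_kernel(const bfloat16 *in, bfloat16 *out, int n) {
  for (int i = 0; i < n; i += 16) {
    v16bfloat16 v = *(const v16bfloat16 *)(in + i); // load 16 bf16 lanes
    *(v16bfloat16 *)(out + i) = getTanhBf16(v);     // table-based tanh
  }
}
```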