diff --git a/README.md b/README.md index 25c05b61..46eeb232 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,7 @@ If you're just using Exo, install it using `pip`: ```sh $ pip install exo-lang ``` +In case of `ModuleNotFoundError: No module named 'attrs'` please upgrade your attrs module by `pip install --upgrade attrs`. ## Compile Exo @@ -29,11 +30,6 @@ You can use optional arguments to customize the output: - The `--stem` argument allows you to specify custom names for the C file and header file. -# Examples - -Take a look at [examples](examples/avx2_matmul/README.md) for scheduling examples, and [API documentation](docs/API.md) for scheduling interface documentation. - - # Build Exo from source We make active use of newer Python 3.x features. Please use Python 3.9 or 3.10 if you're getting errors about unsupported features. @@ -62,7 +58,6 @@ Finally, you can build and install Exo. (exo) $ pip install dist/*.whl ``` - ## PySMT Depending on your setup, getting PySMT to work correctly may be difficult. You @@ -117,26 +112,11 @@ pytest --cov=./ --cov-report=html Then, if you want to see annotated source files, open `./htmlcov/index.html`. +--- + +# Learn about Exo -# Repository structure - -In this repository, folders are structured as follows: - -1. `src/exo` is where the core Exo implementation resides. - - **APIs.** Documentation for the APIs can be found in the [API documentation](docs/API.md). - - `API.py` defines a stable API for top-level decorators (`proc`, `instr`, and `config`). - - `API_scheduling.py` defines a API for scheduling primitives. - - `API_cursors.py` defines a API for Cursors. - - **Standard libraries.** These could be user-defined, but we provide them for convenience. - - `libs/` contains some common memory definitions (`memories.py`) and custom malloc implementations. - - `platforms/` contains instruction definitions that are part of the release. - - `stdlib/` contains user-level scheduling functions such as `vectorize`. - - Other files are implementation details of Exo (e.g., `typecheck.py` implements typecheck), are not exposed to users. -2. `apps/` contains some sample applications written in Exo. -3. `dependencies/` contains submodules that Exo's apps and testing depends on. -4. `examples/` contains a step-by-step example of scheduling basic matrix multiplication on AVX2. -5. `tests/` contains the Exo test suite. -6. `docs/` contains additional Exo documentation. +Take a look at the [examples](examples/README.md) directory for scheduling examples and the [documentation](docs/README.md) directory for various documentation about Exo. # Contact diff --git a/docs/Cursors.md b/docs/Cursors.md index 317ea7a7..dfbaaf53 100644 --- a/docs/Cursors.md +++ b/docs/Cursors.md @@ -370,3 +370,7 @@ p2 = reorder_scope(p1, p1.forward(c).next(), ...) In this code, the navigation `.next()` is applied to the forwarded cursor `p1.forward(c)`. Attempting to change `p1.forward(c).next()` to `p1.forward(c.next())` will result in incorrect behavior. This is because navigation and forwarding are *not commutative*. +## Further Reading +More details of the design principles of Cursors can be found in our [ASPLOS '25 paper](.) or in [Kevin Qian's MEng thesis](https://dspace.mit.edu/handle/1721.1/157187). + + diff --git a/docs/Design.md b/docs/Design.md new file mode 100644 index 00000000..153aa7d2 --- /dev/null +++ b/docs/Design.md @@ -0,0 +1,70 @@ +# Design Document for Exo + +Exo is a domain-specific language designed to enable productive development of high-performance kernel libraries that target specialized hardware accelerators. + +The key design principles of Exo are: +- **Performance Transparency**: We do not do "magic optimizations" that are surprising and opaque to users. +- **WYSIWYG**: Exo IR closely models C-style code and will be trivially lowered to C code. +- **User Control**: Give the performance control back to users. + +--- + +# Exocompilation: Externalizing Hardware Targets + +One of the main ideas behind Exo is **exocompilation**, which allows users to define hardware targets externally to the compiler in user-level libraries. This has several advantages: + +- Hardware vendors can support new accelerators without maintaining compiler forks. +- The cost of adding support for new hardware is significantly reduced. +- Proprietary details of hardware can be protected. + +Users can model custom [memories](./memories.md), [instructions](./instructions.md), and configuration state in libraries to target a specific accelerator. These hardware abstractions can then be used to write hand-optimized code or as building blocks for higher-level scheduling transformations. + +More info can be found in the [PLDI paper](https://people.csail.mit.edu/yuka/pdf/exo_pldi2022_full.pdf), [instructions.md](./instructions.md), and [memories.md](./memories.md). + +## Fine-Grained Primitives for Performance Control + +Exo provides a set of fine-grained scheduling primitives that offer users low-level control over performance-critical aspects. These primitives can be combined to create complex transformation schedules. Some examples of these primitives include: + +- `replace`: Maps code fragments to custom instructions +- `delete_config`: Removes redundant configuration statements + +The key research contributions of Exo were supporting `replace` through unification and the ability to reason about configuration states. Explicit control over these low-level details allows Exo to achieve performance comparable to highly-tuned vendor libraries and hand-optimized assembly code. All the primitives can be found in the [primitives/](./primitives/) directory. + +## Rewrite-based Scheduling Language + +Exo employs a *rewrite-based* compilation process, which differs from the *lowering-based* approach used by popular frameworks like Halide and TVM. + +The rewrite-based approach offers several advantages: + +- Reduced complexity and less "magic" involved +- Easier to print and inspect the state of the scheduling process at any point + +--- + +# User-Defined Scheduling Operations + +While the flexibility of fine-grained primitives is necessary for achieving peak performance, directly using them can be verbose and laborious. To address this, Exo allows users to define new higher-level scheduling operations by composing the core primitives. + +These user-defined scheduling operations can encapsulate common optimization patterns and hardware-specific transformations such as auto-vectorize, tiling, and even simulate scheduling operations from other USLs (like Halide's `compute_at`). +They can be put together in reusable libraries, further enabling modularity and portability. + +More infomation can be found in the [ASPLOS paper](.) and [Cursor.md](./Cursor.md). + +## The AIR Framework: Action, Inspection, Reference + +We identified that Action, Inspection, and Reference are the key scheduling language design mechanisms that enable user-defined scheduling operations. + +- **[Actions](./primitives)** are scheduling operations that transform the code. This could be compiler-provided *primitive actions* (e.g., `divide_loop`, `reorder`), or *user-defined* (e.g., tile2D in the ASPLOS paper). +- **[Inspections](./inspection.md)** query properties of the code (e.g., loop bounds, memory access patterns). +- **References** point to specific parts of the code to apply actions to. + +Together, AIR allows scheduling operations to be defined as composable rewrites on the code. The language implementation guarantees the correctness of these primitive rewrites with a set of effect analyses. + +## Cursors: Enabling Relative References + +A novel feature in Exo's design is the concept of cursors, which serve as relative references into the code. Similar to a text editing cursor, an Exo cursor can refer to a specific location in the program AST, such as a statement, loop nest, or even the gap between statements. + +Cursors support navigation operations such as `next`, `prev`, `parent`, enabling powerful code transformations using relative positions. +Furthermore, Cursor _forwarding_ let users reuse the cursor from the previous procedure in the current procedure. +Multiple cursors can coexist, allowing different parts of the code to be referenced and modified simultaneously. + diff --git a/docs/Imports.md b/docs/Imports.md new file mode 100644 index 00000000..61eed7f1 --- /dev/null +++ b/docs/Imports.md @@ -0,0 +1,97 @@ +# Imports in Exo + +This document provides an overview of the imports used when writing Exo. + +Exo's parser only resolves names in the local and global namespaces, and Exo reserves the attribute syntax (foo.bar) for configurations. +Therefore, if users wish to utilize Exo constructs, they must import them into their local namespace. + +## Table of Contents + +1. [Standard Python Future Import](#1-standard-python-future-import) +2. [Core Exo Module](#2-core-exo-module) +3. [Memory Libraries](#3-memory-libraries) +4. [Instruction Libraries](#4-instruction-libraries) +5. [Extern Libraries](#5-extern-libraries) +6. [Frontend Syntax Utilities](#6-frontend-syntax-utilities) +7. [Standard Library Scheduling Functions](#7-standard-library-scheduling-functions) +8. [API Cursors](#8-api-cursors) + + +## 1. Standard Python Future Import + +```python +from __future__ import annotations +``` + +Enables postponed evaluation of type annotations, allowing you to use forward references in type hints without causing issues during runtime. This is necessary to support Exo's `x : f32` syntax. + + +## 2. Core Exo Module + +```python +from exo import * +``` + +Imports basic classes and functions necessary for defining and manipulating high-performance computational kernels, such as `proc`, `instr`, `config`, `Memory`, `Extern`, `DRAM`, and `SchedulingError`. + + +## 3. Memory Libraries + +Even though users can define memory definitions externally to the compiler in the user code (see [memories.md](./memories.md)), we provide memory definitions for some architectures for convinience. +The supported memory definitions can be found by looking into `src/exo/libs/memories.py`. + +```python +from exo.libs.memories import DRAM_STATIC, AVX2, AVX512 +``` + +For example, you can import `DRAM_STATIC`, `AVX2`, or `AVX512` as shown above. + + +## 4. Instruction Libraries + +Similar to memories, we provide some hardware instruction definitions for convinience (see [instructions.md](./instructions.md) to learn how to define your own accelerator instructions). + +```python +from exo.platforms.x86 import mm256_loadu_ps, mm256_setzero_ps, mm256_broadcast_ss +``` + +## 5. Extern Libraries + +Similary, convinience extern libraries can be imported as follows. See [externs.md](./externs.md) to learn how to define your own externs. + +```python +from exo.libs.externs import sin, relu +``` + + +## 6. Frontend Syntax Utilities + +```python +from exo.frontend.syntax import * +``` + +This module defines special symbols that are used inside Exo code. +Importing this can suppress warnings inside an IDE (like PyCharm). + + +## 7. Standard Library Scheduling Functions + +Exo provides users with the ability to define new scheduling operations using Cursors. For convenience, we have implemented scheduling libraries (standard library) that contain common scheduling operations users may want to use, such as vectorization and tiling. Users can import the standard library as follows: + +```python +from exo.stdlib.scheduling import repeat, replace_all +from exo.stdlib.stdlib import vectorize, tile_loops +``` + +Alternatively, users can define their own scheduling operations by composing scheduling primitives directly in their code. + +## 8. API Cursors + +Cursors (see [Cursors.md](./Cursors.md)) are Exo's reference mechanism that allows users to navigate and inspect object code. When users define new scheduling operators using Cursors, they may wish to write their own inspection pass (see [inspection.md](./inspection.md)). API Cursors define types that will be useful for user inspection. + +```python +from exo.API_cursors import ForCursor, AssignCursor, InvalidCursor +``` + +These API Cursors provide specific types, such as `ForCursor` for for-loops, `AssignCursor` for assignments, and `InvalidCursor` for invalid cursors. Users can leverage these types when inspecting and manipulating code using Cursors. + diff --git a/docs/API.md b/docs/Procedures.md similarity index 81% rename from docs/API.md rename to docs/Procedures.md index 64b267e9..6516a36a 100644 --- a/docs/API.md +++ b/docs/Procedures.md @@ -32,14 +32,3 @@ Cursors can be obtained by querying patterns on a procedure. All the Cursor rela - `.transpose(arg_cursor)`: Transposes a 2D buffer argument in the signature and the body. Returns a new procedure and is non-equivalence preserving because the signature has changed. - `.add_assertion(assertion)`: Adds an assertion to the procedure. - `.is_eq(other_proc)`: Checks the equivalence of this procedure with another procedure. - -## Scheduling Primitives - -We have classified scheduling primitives into six categories. Here are the links to each: - -- [Buffer Transformations](buffer_ops.md) -- [Loop and Scope Transformations](loop_ops.md) -- [Configuration States](config_ops.md) -- [Subprocedure Operations](subproc_ops.md) -- [Memory, Precision, and Parallelism Transformations](backend_ops.md) -- [Other Operations](other_ops.md) diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..32288783 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,30 @@ +# Documentation + +This directory provides detailed documentation about Exo's interface and internal systems. + +- To learn about the design principles of Exo, read [Design.md](Design.md). +- To understand how the Exo system is implemented, read [System.md](System.md). +- For information on writing Exo object code, APIs, and imports, refer to [Procedures.md](Procedures.md), [object_code.md](object_code.md), and [Imports.md](Imports.md). +- To learn how to define **hardware targets externally to the compiler**, refer to [externs.md](externs.md), [instructions.md](instructions.md), and [memories.md](memories.md). +- To learn how to define **new scheduling operations externally to the compiler**, refer to [Cursors.md](./Cursors.md) and [inspection.md](./inspection.md). +- To understand the available scheduling primitives and how to use them, look into the [primitives/](./primitives) directory. + +The scheduling primitives are classified into six categories: + +1. [Buffer Transformations](primitives/buffer_ops.md) +2. [Loop and Scope Transformations](primitives/loop_ops.md) +3. [Configuration States](primitives/config_ops.md) +4. [Subprocedure Operations](primitives/subproc_ops.md) +5. [Memory, Precision, and Parallelism Transformations](primitives/backend_ops.md) +6. [Other Operations](primitives/other_ops.md) + +# Further Reading + +The following papers provide a high-level and holistic view of Exo as a project: + +- [PLDI '22 paper](https://people.csail.mit.edu/yuka/pdf/exo_pldi2022_full.pdf) +- [ASPLOS '25 paper](.) +- [Kevin Qian's MEng thesis](https://dspace.mit.edu/handle/1721.1/157187) +- [Samir Droubi's MEng thesis](https://dspace.mit.edu/handle/1721.1/156752) + +For more documentation with running Exo code, refer to the [Examples](../examples/README.md) directory. diff --git a/docs/System.md b/docs/System.md new file mode 100644 index 00000000..c6869f8b --- /dev/null +++ b/docs/System.md @@ -0,0 +1,99 @@ +# System Overview + +This document provides an overview of the Exo compilation process, as illustrated in Figure 1 of the PLDI'22 paper. + +![System overview](images/system-overview.png) + +The Exo compiler's frontend starts by parsing the Python AST and constructing the Untyped Exo AST (UAST). +It then runs various frontend checks before converting the UAST into LoopIR, which serves as Exo's primary IR. +Exo supports rewrite-based user-scheduling, where scheduling primitives take a LoopIR and returns another (transformed) LoopIR. +These primitives take the immutable LoopIR and rewrite it into a new LoopIR. +Finally, in the backend, the optimized LoopIR is code-generated into C code. + +The input to the compiler is a set of Exo source files (`*.py`), and the output is generated C code (`*.c`). + +In this repository, folders are structured as follows: + +1. `src/exo` is where the core Exo implementation resides. + - **APIs.** + - `API.py` defines a stable API for top-level decorators (`proc`, `instr`, and `config`). + - `API_scheduling.py` defines a API for scheduling primitives. + - `API_cursors.py` defines a API for Cursors. + - **Standard libraries.** These could be user-defined, but we provide them for convenience. + - `libs/` contains some common memory definitions (`memories.py`) and custom malloc implementations. + - `platforms/` contains instruction definitions that are part of the release. + - `stdlib/` contains user-level scheduling functions such as `vectorize`. + - Other files are implementation of Exo (e.g., `typecheck.py` implements typecheck), are not exposed to users. +2. `apps/` contains some sample applications written in Exo. +3. `dependencies/` contains submodules that Exo's apps and testing depends on. +4. `examples/` contains examples of scheduling with Exo. +5. `tests/` contains the Exo test suite. +6. `docs/` contains additional Exo documentation. + +--- + +## Core + +`src/exo/core` defines IRs used in Exo and other core implementations. +- `LoopIR.py` is the main file that defines IRs (LoopIR, UAST, PAST), and their visitor functions (LoopIR_Do, LoopIR_Rewrite). +- `LoopIR_pprint.py` implements a printing procedure for the IRs defined in `LoopIR.py`. +- `prelude.py` defines `Sym` and `Srcinfo`. + +User-defined features like config, externs, and Memory's parent class implementations are in `configs.py`, `extern.py`, and `memory.py`, respectively. + +`internal_cursors` defines primitive cursor movements (see Section 5.2 "Cursor implementation" of our ASPLOS paper) that are used internally by `LoopIR_scheduling` implementations of scheduling primitives. +`proc_eqv.py` defines a union-find tree which we use to track the equivalence of procedures. + +--- + +## Frontend + +`API.py` provides various user-facing entry points to Exo. The frontend consists of three types of parsing passes, all of which are located in the `src/exo/frontend` directory. + +### Procedures + +The `@proc` and `@instr` decorators are defined in this section and call into the `Pyparser`. The frontend workflow is as follows: +``` +API -> Parser -> TypeCheck -> BoundsCheck/AssertCheck +``` + +`frontend/pyparser.py` defines a parser that translates the Python AST to UAST/PAST. Instead of implementing a custom lexer, Exo relies on the Python lexer to build the Python AST and hijacks it to translate it into Exo's internal ASTs. UAST (Untyped AST) is an untyped version of LoopIR (LoopIR is the "Exo IR" in the paper terminology). UAST is used when parsing full procedure definitions (`@proc` or `@instr`). PAST (Pattern AST) is an AST with holes, used to parse fragments from the user code outside the procedure (see next two sections). + +`typecheck.py` performs type checking and converts UAST to LoopIR. +`boundscheck.py` checks for any out-of-bounds errors in the frontend code and ensures that all assertions in the code are satisfiable by invoking an SMT solver. + +### New LoopIR Expressions + +Some scheduling primitives (such as `expand_dim` and all primitives that take `NewExprA` as an argument) require the construction of new LoopIR expressions. +`parse_fragment.py` implements this pass by calling into `pyparser.pattern`, which invokes the parser with `is_fragment=True`. +When parsing new expressions, it is not possible to use holes `_`. Holes are used for pattern matching for obtaining a cursor referene. + +### Pattern Match for Reference + +Cursors can be obtained by pattern matching. The pattern gets parsed into PAST and then matched against the LoopIR to obtain a reference. +`frontend/pattern_match.py` implements this functionality. + + +--- + +## Rewrites (User-Scheduling) + +After the frontend pass, we obtain LoopIR. The files in `src/exo/rewrite` implement Exo's rewrite-based user-scheduling process. + +- `LoopIR_scheduling.py` is the main file that implements all the scheduling primitives. Many implementations of primitives call into `Check_...` functions, which are the safety checks implemented in `new_eff.py`. +- The handling of analysis to preserve functional equivalence of rewrites is a separate topic not covered in detail here. `new_eff.py`, `new_analysis_core.py`, and `analysis_simplify.py` are all files related to the analysis. +- `LoopIR_unification.py` implements a unification process to support the `replace(p, ...)` rewrite primitive. + +--- + +## Backend + +The backend is responsible for lowering LoopIR to C code and performing backend checks, including precision analysis, window analysis, and parallelism analysis. + +- `LoopIR_compiler.py` is the main file in the backend, which compiles LoopIR to C code. +- `mem_analysis.py` implements a memory consistency check. For example, if a callee expects an `AVX2` annotation but the caller passes `DRAM` memory, it raises an error. +- `parallel_analysis.py` implements a parallel analysis. +- `prec_analysis.py` implements a precision consistency check and coerces the precision where possible. +- `win_analysis.py` implements a window analysis to check if callee and caller window annotations (tensor or window) match with each other. + + diff --git a/docs/externs.md b/docs/externs.md new file mode 100644 index 00000000..69adc849 --- /dev/null +++ b/docs/externs.md @@ -0,0 +1,172 @@ +# Externs + +Externs in Exo provide a mechanism to interface with external functions and libraries directly from your Exo code. By defining custom extern functions, you can extend the capabilities of Exo and leverage existing code written in other languages like C or C++. Externs can be used as expressions in your code, particularly on the right-hand side (RHS) of assignment and reduction statements. + +## Defining Externs in User Code + +Extern functions are defined by subclassing the `Extern` class provided by Exo. This allows you to specify how the extern function should behave, including type checking, compilation, and any global code it might require. + +### Step-by-Step Guide + +#### 1. Import the Extern Class + +Before you can define an extern function, you need to import the `Extern` class and the `_EErr` exception from `exo.core.extern`. + +```python +from exo.core.extern import Extern, _EErr +``` + +- `Extern`: The base class for creating custom extern functions. +- `_EErr`: An exception class used for error handling during type checking. + +#### 2. Subclass the Extern Class + +Create a new class that inherits from `Extern`. This class represents your custom extern function. + +```python +class _Sin(Extern): + # Implementation details will go here +``` + +#### 3. Implement Required Methods + +Your subclass must implement several methods to define the behavior of the extern function. + +##### `__init__(self)` + +Initialize your extern function with its name. + +```python +def __init__(self): + super().__init__("sin") +``` + +- `"sin"`: The name of the external function as it will appear in the printed Exo object code. + +##### `typecheck(self, args)` + +Define how the function checks the types of its arguments. + +```python +def typecheck(self, args): + if len(args) != 1: + raise _EErr(f"expected 1 argument, got {len(args)}") + + arg_type = args[0].type + if not arg_type.is_real_scalar(): + raise _EErr( + f"expected argument to be a real scalar value, but got type {arg_type}" + ) + return arg_type +``` + +- Checks that there is exactly one argument. +- Ensures the argument is a real scalar type (e.g., `float`, `double`). +- Returns the type of the argument as the return type of the function. + +##### `compile(self, args, prim_type)` + +Define how the function is compiled into target code. +- `args`: list of arguments as C strings +- `prim_type`: A C string representing the primitive data type. It could be one of the following C strings, mapping from LoopIR types to C strings: + - `f16` -> `"_Float16"` + - `f32` -> `"float"` + - `f64` -> `"double"` + - `i8` -> `"int8_t"` + - `ui8` -> `"uint8_t"` + - `ui16`-> `"uint16_t"` + - `i32` -> `"int32_t"` + +```python +def compile(self, args, prim_type): + return f"sin(({prim_type}){args[0]})" +``` + +- Generates the code that calls the external function, ensuring proper casting to the primitive type. + +##### `globl(self, prim_type)` + +Provide any global code or headers needed. + +```python +def globl(self, prim_type): + return "#include " +``` + +- Includes necessary headers required for the external function (e.g., `` for mathematical functions). +- `globl` is called and is instantiated for every `prim_type`s. + +#### 4. Instantiate the Extern Function + +Create an instance of your extern class to make it usable in your code. + +```python +sin = _Sin() +``` + +- `sin` now represents the extern function and can be used like any other expression in Exo. + +## Using Externs as Expressions + +Externs can be used as expressions on the RHS of assignment and reduction statements. This allows you to incorporate external functions seamlessly into your Exo computations. + +Unlike Exo procedures that do not allow aliasing in their arguments, you _can_ pass the same buffer to external arguments (e.g., `select(xi, xi, xi, xi)`). +This is because there is no concern about aliasing since all external arguments are read-only, as opposed to Exo procedure arguments which can have write effects on the input arguments. + +### Example: Using `sin` in an Expression + +Here's a complete example demonstrating how to define and use the `sin` extern function within an expression. + +```python +from __future__ import annotations +from exo import * +from exo.core.extern import Extern, _EErr + +class _Sin(Extern): + def __init__(self): + super().__init__("sin") + + def typecheck(self, args): + if len(args) != 1: + raise _EErr(f"expected 1 argument, got {len(args)}") + + arg_type = args[0].type + if not arg_type.is_real_scalar(): + raise _EErr( + f"expected argument to be a real scalar value, but got type {arg_type}" + ) + return arg_type + + def compile(self, args, prim_type): + return f"sin(({prim_type}){args[0]})" + + def globl(self, prim_type): + return "#include " + + def interpret(self, args): + import math + return math.sin(args[0]) + +# Instantiate the extern function +sin = _Sin() + +# Define an Exo procedure using the extern function in an expression +@proc +def foo(x: f32): + x = sin(x) * 3.0 + +print(foo) +``` + +### Output + +When you run the code above with `exocc`, the generated C code will be: +```c +#include +// foo( +// x : f32 @DRAM +// ) +void foo( void *ctxt, float* x ) { + *x = sin((float)*x) * 3.0f; +} +``` diff --git a/docs/images/system-overview.png b/docs/images/system-overview.png new file mode 100644 index 00000000..bb381714 Binary files /dev/null and b/docs/images/system-overview.png differ diff --git a/docs/inspection.md b/docs/inspection.md new file mode 100644 index 00000000..8760ede4 --- /dev/null +++ b/docs/inspection.md @@ -0,0 +1,50 @@ +# External Inspection Functions + +Inspection is a metaprogramming feature that enables metaprograms (like schedules) to dynamically examine the properties of object code. Exo provides inspection through [Cursors](./Cursors.md), allowing users to examine standard AST properties such as variable names, literal expression values, and annotations (e.g., memory spaces and precisions) at scheduling time. Cursors also support local AST navigation, for example, accessing loop bounds (`loop.hi()`) and bodies (`loop.body()`). Inspection functions can be written externally from the Exo compiler, giving users the ability to customize them according to their needs. +For convinience, standard library inspection functions are provided as `exo.stdlib.inspection` module. + +Cursor types (such as `ForCursor` and `IfCursor`) are defined in `exo.API_cursors`, so you should import it when writing inspection functions: + +```python +from exo.API_cursors import * +``` + +Here are some simple inspection functions: + +```python +def is_loop(proc, loop): + loop = proc.forward(loop) + return isinstance(loop, ForCursor) + +def get_top_level_stmt(proc, c): + c = proc.forward(c) + + while not isinstance(c.parent(), InvalidCursor): + c = c.parent() + return c +``` + +Explanation: +- The `is_loop` function takes a `proc` object and a `loop` cursor as input. It forwards the `loop` cursor using `proc.forward(loop)` and checks if the resulting cursor is an instance of `ForCursor`. This function determines whether the given cursor points to a loop statement. +- The `get_top_level_stmt` function takes a `proc` object and a cursor `c` as input. It forwards the cursor `c` using `proc.forward(c)` and then iteratively moves the cursor to its parent using `c.parent()` until it reaches an `InvalidCursor`, which means the cursor reached the outer-most level of the procedure. This function finds the top-level statement that wraps the given cursor. + +Exo also exposes `ExoType` for expression types (defined in `src/exo/API_types.py`), which users can access using constructs like `ExoType.F16` and branch on it. + +```python +class ExoType(Enum): + F16 = auto() + F32 = auto() + F64 = auto() + UI8 = auto() + I8 = auto() + UI16 = auto() + I32 = auto() + R = auto() + Index = auto() + Bool = auto() + Size = auto() + Int = auto() + Stride = auto() +``` + +All the Cursor types and the kind of navigation you can perform on them are documented in [Cursors.md](./Cursors.md). diff --git a/docs/instructions.md b/docs/instructions.md new file mode 100644 index 00000000..ca9fc228 --- /dev/null +++ b/docs/instructions.md @@ -0,0 +1,165 @@ +# External Instruction Definitions + +Exo allows users to define custom hardware instructions within their code using the `@instr` annotation. +These user-defined instructions can be leveraged during the scheduling process to replace specific code fragments with calls to hardware-optimized instructions. + +## Overview + +- **Custom Instructions**: Define hardware-specific instructions as procedures using the `@instr` decorator. +- **Replace**: Use the `replace` primitive to substitute code fragments with calls to these instructions. +- **Code Generation**: Custom instructions can emit arbitrary C code, including inline assembly, with placeholders for arguments. + +## Defining Custom Instructions + +Custom instructions are defined as procedures annotated with `@instr`. +The `@instr` decorator allows you to specify the C code to be emitted when the instruction is called. + +### Syntax + +```python +@instr("C code") +def instruction_name(args): + # Specification of the instruction's behavior +``` +- **`@instr`**: Decorator that specifies the C code to emit. In the string provided to `@instr`, you can include placeholders wrapped in `{}`. These placeholders will be replaced with the names of the arguments when the code is compiled. +- **`instruction_name`**: The name of your custom instruction. +- **`args`**: Arguments to the instruction. +- **semantics**: Semantics of the hardware instruction, written as Exo object code. + +### Example: Defining a Neon Load Instruction + +Below is an example of defining a NEON load instruction that loads four `f32` values into Neon memory. + +```python +from exo import * + +@instr("{dst_data} = vld1q_f32(&{src_data});") +def neon_vld_4xf32(dst: [f32][4] @ Neon, src: [f32][4] @ DRAM): + assert stride(src, 0) == 1 + assert stride(dst, 0) == 1 + + for i in seq(0, 4): + dst[i] = src[i] +``` + +- **`@instr(...)`**: Specifies the semantics of the hardware instruction and the C code to emit. + - `{dst_data}` and `{src_data}` are format strings that will be replaced with the actual arguments during codegen. You can put `_data` after the function argument names and surround them with curly braces (`{dst_data}`). + - `"{dst_data} = vld1q_f32(&{src_data});"`: The argument to `@instr` decorators specifies the C code to emit for this instruction. +- **`dst: [f32][4] @ Neon`**: Declares `dst` as a 4-element array of `f32` in `Neon` memory. +- **`src: [f32][4] @ DRAM`**: Declares `src` as a 4-element array of `f32` in `DRAM`. +- **Assertions**: Ensure that the strides of `src` and `dst` are 1 for correct memory access. +- **Body**: The function body specifies the semantics of the instruction (written in Exo object code), copying elements from `src` to `dst`. + +### Defining the Memory Annotation `Neon` + +The `Neon` memory type can be defined similarly to how custom memories are defined, as explained in [memories.md](memories.md). + +```python +class Neon(Memory): + @classmethod + def global_(cls): + return "#include " + + # Implement other required methods +``` + +## Using Custom Instructions + +Once you've defined a custom instruction, you can use it to replace code fragments in your procedures. + +### Define Your Procedure + +Define your Exo procedure as usual. + +```python +@proc +def foo(src: [f32][4] @ DRAM, dst: [f32][4] @ Neon): + ... + for i in seq(0, ...): + ... + for j in seq(0, 4): + dst[j] = src[j] + ... +``` + +### Use `replace` to Substitute the Instruction + +Use the `replace` primitive to substitute the loop with the custom instruction. + +```python +# Replace the loop with the custom instruction +foo = replace(foo, "for j in _:_", neon_vld_4xf32) +``` + +- **`replace(foo, "for i in _:_", neon_vld_4xf32)`**: + - **`foo`**: The procedure in which to perform the replacement. + - **`"for i in _:_"`**: A cursor pointing to the loop to replace. + - **`neon_vld_4xf32`**: The instruction to replace the loop with. + +After `replace`, the procedure `foo` will look like: +```python +@proc +def foo(M: size, src: [f32][4] @ DRAM, dst: [f32][4] @ Neon): + ... + for i in seq(0, M/4): + ... + neon_vld_4xf32(dst, src) + ... +``` + +#### How `replace` Works + +The `replace` primitive is used to substitute a fragment of code within a procedure with a call to another procedure (e.g., a custom instruction). The syntax for `replace` is as follows: + +```python +replace(proc, cursor_path, subproc) +``` + +- **`proc`**: The procedure containing the code to be replaced. +- **`cursor`**: A cursor pointing to the code fragment to be replaced. +- **`subproc`**: The procedure whose body will replace the code fragment. + +The `replace` primitive works by performing an unification modulo linear equalities. The process can be broken down into two main steps: + +1. **Pattern Matching**: The body of the sub-procedure `subproc` is unified (pattern matched) with the designated statement block `s` in the original procedure `proc`. During this process: + - The arguments of `subproc` are treated as unknowns. + - The free variables of `s` are treated as known symbols. + - Any symbols introduced or bound within the body of `subproc` or within `s` are unified. + + The ASTs (Abstract Syntax Trees) of `subproc` and `s` are required to match exactly with respect to statements and all expressions that are not simply integer-typed control. + +2. **Solving Linear Equations**: Any equivalences between integer-typed control expressions are recorded as a system of linear equations. These equations are then solved to determine the values of the unknowns and ensure a consistent substitution. + +By following this process, the `replace` primitive effectively replaces the designated code fragment with a call to the sub-procedure, while ensuring that the substitution is valid and consistent. + + +### Generated C Code + +`exocc` can be used to compile Exo code into C. + +```c +void foo(float src[4], float32x4_t dst) { + ... + for (int_fast32_t i = 0; i < ...; i++) { + ... + dst = vld1q_f32(&src[0]); + } + ... +} +``` + +- **`dst = vld1q_f32(&src[0]);`**: The custom instruction is emitted as specified in the `@instr` decorator, with arguments replaced. + +## Understanding the Magic + +By defining the behavior of hardware instructions in Python using Exo procedures, you can express the semantics of your accelerator or specialized hardware. The `replace` primitive allows Exo to reason about whether it's safe to offload certain computations to hardware instructions based on their specifications. + +- **No Compiler Backend Needed**: The heavy lifting is done within Exo, eliminating the need for a separate compiler backend. +- **Semantics Encoding**: The instruction's body acts as a specification, encoding its semantics for Exo's pattern matching. +- **Flexible and Extensible**: Users can define any instruction and specify how it should be matched and replaced. + + +## Further Reading and Examples + +- **RVM Tutorial**: [https://exo-lang.dev/tutorial.html](https://exo-lang.dev/tutorial.html) +- **Running Code Examples**: [examples/rvm_conv1d/exo/conv1d.py](https://github.com/exo-lang/exo/blob/main/examples/rvm_conv1d/exo/conv1d.py) diff --git a/docs/memories.md b/docs/memories.md new file mode 100644 index 00000000..2cc946df --- /dev/null +++ b/docs/memories.md @@ -0,0 +1,200 @@ +# External Memory Definitions + +Exo allows users to define custom memory types external to the compiler. +This feature enables modeling of specialized memory systems, such as vector machines and hardware accelerator memories, directly within your Exo code. +By defining custom memories, you can optimize your programs to target specific hardware architectures. + +## Overview + +- **Custom Memories**: Define your own memory types by subclassing the `Memory` class. +- **Usage**: Use custom memories as annotations in your Exo code or set them during scheduling. + +## Defining Custom Memories + +To define a custom memory, you need to create a class that inherits from `Memory` and implement the required methods. +Below is an example of defining an `AVX512` memory, which models the AVX-512 vector registers. + +### Example: Defining AVX512 Memory + +```python +class AVX512(Memory): + @classmethod + def global_(cls): + return "#include " + + @classmethod + def can_read(cls): + return False + + @classmethod + def alloc(cls, new_name, prim_type, shape, srcinfo): + if not shape: + raise MemGenError(f"{srcinfo}: AVX512 vectors are not scalar values") + if not prim_type == "float": + raise MemGenError(f"{srcinfo}: AVX512 vectors must be f32 (for now)") + if not shape[-1].isdecimal() and int(shape[-1]) == 16: + raise MemGenError(f"{srcinfo}: AVX512 vectors must be 16-wide") + shape = shape[:-1] + if shape: + result = f'__m512 {new_name}[{"][".join(map(str, shape))}];' + else: + result = f"__m512 {new_name};" + return result + + @classmethod + def free(cls, new_name, prim_type, shape, srcinfo): + return "" + + @classmethod + def window(cls, basetyp, baseptr, indices, strides, srcinfo): + assert strides[-1] == "1" + idxs = indices[:-1] or "" + if idxs: + idxs = "[" + "][".join(idxs) + "]" + return f"{baseptr}{idxs}" +``` + +#### Explanation of Methods + +- **`global_(cls)`**: Returns any global code or headers needed. Here, it includes the AVX-512 intrinsic header. + + ```python + @classmethod + def global_(cls): + return "#include " + ``` + +- **`can_read(cls)`**: Controls whether the memory can be read directly. Setting it to `False` means you cannot read/write directly to this memory using standard array access. + + ```python + @classmethod + def can_read(cls): + return False + ``` + +- **`alloc(cls, new_name, prim_type, shape, srcinfo)`**: Defines how to lower `LoopIR.Alloc` into C code. + Allocation in Exo is expressed as `x : f32[N, M]`. + - `new_name`: A C string representing the allocated variable name. In this example, it would be `"x"`. + - `prim_type`: A C string representing the primitive data type. In this example, it would be `"float"`. The mapping from LoopIR types to C types is as follows: + - `f16` -> `"_Float16"` + - `f32` -> `"float"` + - `f64` -> `"double"` + - `i8` -> `"int8_t"` + - `ui8` -> `"uint8_t"` + - `ui16`-> `"uint16_t"` + - `i32` -> `"int32_t"` + - `shape`: A list of C strings representing the shape of each dimension. In the example above, it would be `["N", "M"]`. + + For `AVX512` memory, the `alloc` method ensures that the allocated memory represents 16-wide vectors (`shape[-1].isdecimal() and int(shape[-1]) == 16`) of the `float` type (`prim_type == "float"`). + + +- **`free(cls, new_name, prim_type, shape, srcinfo)`**: Handles memory deallocation. For `AVX512`, no action is needed. + + ```python + @classmethod + def free(cls, new_name, prim_type, shape, srcinfo): + return "" + ``` + +- **`window(cls, basetyp, baseptr, indices, strides, srcinfo)`**: Defines how array accesses are lowered into C code. + + Usually, you cannot access your specialized hardware accelerator memory from C code, and you will need to use your accelerator instructions to access it, like the following: + + ```python + x : f32[16,16] @ your_memory + your_instr(x[0, 0:16]) + ``` + + The `window` member defines how `x[0, 0:16]` should be lowered to C code, as different accelerator instructions and memory have different addressing schemes. + + For example, the Gemmini accelerator's scratchpad memory is 2D and has a fixed column width of 16. The Gemmini instruction expects accessing the scratchpad by *number of rows* only, and accessing columns is not permitted. Therefore, the window definition will look like: + + ```python + @classmethod + def window(cls, basetyp, baseptr, indices, strides, srcinfo): + # Assume that strides[-1] == 1 + # and that strides[-2] == 16 (if there is a strides[-2]) + assert len(indices) == len(strides) and len(strides) >= 2 + prim_type = basetyp.basetype().ctype() + offset = generate_offset(indices, strides) + return ( + f"*({prim_type}*)((uint64_t)( " + f"((uint32_t)((uint64_t){baseptr})) + " + f"({offset})/16))" + ) + ``` + + Explanation of arguments: + - `basetyp`: type of the buffer in `LoopIR.type` + - `baseptr`: C pointer string to the buffer (e.g., `x`) + - `indices`: List of C strings for index accesses for each dimension + - `strides`: List of C strings for strides for each dimension + - `srcinfo`: Source location information, Can be used for error messages + + Both tensor and window expressions will be resolved to vanilla indices and strides. + + +## Understanding `can_read` + +The `can_read` method controls whether direct array access is allowed for the memory type. When `can_read` is set to `False`, you cannot read or write to the memory using standard array indexing in Exo or the generated C code. This models hardware that requires special instructions for memory access, such as vector registers. + +### Invalid Usage + +Attempting to read or write directly results in an error. + +```python +x: f32[16] @ AVX512 +x[0] = 3.0 # Invalid when can_read() is False +``` + +### Valid Usage + +To interact with the memory, you must use specific instructions or operations designed for that memory type (e.g., AVX-512 intrinsics). + +```python +# Use AVX-512 instructions to manipulate x +x: f32[16] @ AVX512 +mm512_loadu_ps(x, inp[16*i : 16*i+16]) +``` +To learn more about how to define and use instructions in Exo, see [instructions.md](./instructions.md). + +## Using Custom Memories + +There are two primary ways to use custom memories in Exo: + +1. **Direct Annotation**: Annotate variables with the custom memory type using the `@` symbol. +2. **Scheduling Primitive**: Change the memory annotation during scheduling using `set_memory`. + +### 1. Direct Annotation + +Annotate buffers at the time of declaration. +```python +from exo import * +from exo.libs.memories import AVX512 + +@proc +def foo(x: f32[16] @ AVX512): + y: f32[16] @ AVX512 + # Function body +``` + +- **`x: f32[16] @ AVX512`**: Declares `x` as a 16-element array of `f32` stored in `AVX512` memory. +- **`y: f32[16] @ AVX512`**: Similarly declares `y` in `AVX512` memory. + +### 2. Changing Memory During Scheduling + +Use the `set_memory` primitive to change the memory annotation of a variable during scheduling. +- **`set_memory(p, "C", AVX512)`**: Changes the memory of variable `C` in procedure `p` to `AVX512`. +- This is common when optimizing simple object code (e.g., GEMM) for specific hardware. + +#### Documentation for `set_memory` + +The `set_memory` primitive is documented in [primitives/buffer_ops.md](primitives/buffer_ops.md). + + +## Additional Examples + +- **Memory Definitions**: More examples of custom memory definitions can be found in [src/exo/libs/memories.py](https://github.com/exo-lang/exo/blob/main/src/exo/libs/memories.py). +- **Usage in Applications**: Examples of using custom memories in real applications are available in [examples/rvm_conv1d/exo/conv1d.py](https://github.com/exo-lang/exo/blob/main/examples/rvm_conv1d/exo/conv1d.py). + + diff --git a/docs/object_code.md b/docs/object_code.md new file mode 100644 index 00000000..86328826 --- /dev/null +++ b/docs/object_code.md @@ -0,0 +1,367 @@ +# Exo Object Code Syntax + +In Exo, object code can be defined using Python-like syntax with specific annotations and constructs to model low-level programming concepts. + +This documentation explains Exo's object code syntax using the following example of a 1D convolution operation: + +```python +@proc +def generic_conv1d( + data: i32[IC, N] @ DRAM, + kernels: i32[OC, IC, W] @ DRAM, + out: i32[OC, N] @ DRAM, +): + # Perform the convolution + for i in seq(0, OC): + for j in seq(0, N): + # Zero out the output memory + out[i, j] = 0.0 + for c in seq(0, IC): + for r in seq(0, W): + y: i32 + if j + r < N: + y = data[c, j + r] + else: + y = 0 + out[i, j] += kernels[i, c, r] * y +``` + +## Table of Contents + +- [Annotations and Decorators](#annotations-and-decorators) + - [`@proc` Decorator](#proc-decorator) + - [Type and Memory Annotations](#type-and-memory-annotations) + - [Procedure Arguments](#procedure-arguments) + - [Allocations](#allocations) + - [Memories](#memories) +- [Loops](#loops) + - [`for` Loop Syntax](#for-loop-syntax) +- [Conditional Statements](#conditional-statements) +- [Assignments](#assignments) +- [Understanding the Example](#understanding-the-example) + +## Annotations and Decorators + +### `@proc` Decorator + +The `@proc` decorator is used to define an Exo procedure (analogous to a function in other programming languages). It indicates that the following function definition should be treated as Exo object code (not Python), which can be further optimized and transformed. + +```python +@proc +def function_name(arguments): + # Function body +``` + +### Type and Memory Annotations + +In Exo, types and memory spaces are explicitly annotated. The syntax is: + +```python +name: type[size] @ memory +``` + +- **`name`**: The variable name. +- **`type`**: The data type. Supported precision types are: `f16`, `f32`, `f64`, `i8`, `i32`, `ui8`, and `ui16`. +- **`[size]`**: The dimensions of the array (optional for scalars). +- **`@ memory`**: The memory space where the variable resides. + + +### Procedure Arguments + +Procedure arguments are declared with their types, sizes, and memory spaces. They can have sizes that depend on other arguments. + +Example from the code: + +```python +data: i32[IC, N] @ DRAM +``` + +- **`data`**: The name of the argument. +- **`i32`**: The data type (32-bit integer). +- **`[IC, N]`**: A 2D array with dimensions `IC` and `N`. +- **`@ DRAM`**: Specifies that `data` resides in DRAM memory. + +The `data` buffer above represents **tensor** types, which means the stride of the innermost dimension is 1, and the strides of other dimensions are simple multiples of the shapes of the inner dimensions. + +Exo allows **window expressions** as well, which are similar to array slicing in Python. Instead of accessing the buffer point-wise (e.g., `x[i]`), users can *window* the array as `x[i:i+2]`. This will create a windowed array of size 2. +Exo procedures take tensor expressions when annotated with `x:f32[3]` syntax and take window expressions when annotated with `x:[f32][3]`, with square brackets around the types. + +```python +@proc +def foo(x: [f32][3]): + for i in seq(0, 3): + x[i] = 0.0 + +@proc +def bar(y: f32[10], z: f32[20, 20]): + foo(y[2:5]) + foo(z[1, 10:13]) +``` + +In this example, `foo` takes a window array of size 3, and `bar` calls `foo` by slicing `y` and `z`, respectively. Running `exocc` on this will generate the following C code: + +```c +#include "tmp.h" + +#include +#include + +// bar( +// y : f32[10] @DRAM, +// z : f32[20, 20] @DRAM +// ) +void bar(void *ctxt, float* y, float* z) { + foo(ctxt, (struct exo_win_1f32){ &y[2], { 1 } }); + foo(ctxt, (struct exo_win_1f32){ &z[20 + 10], { 1 } }); +} + +// foo( +// x : [f32][3] @DRAM +// ) +void foo(void *ctxt, struct exo_win_1f32 x) { + for (int_fast32_t i = 0; i < 3; i++) { + x.data[i * x.strides[0]] = 0.0f; + } +} +``` + +Moreover, Exo checks the consistency of tensor and window bounds in the frontend. If you modify `foo(y[2:5])` to `foo(y[2:6])` in the code above, the bounds check will fail and emit the following error: + +``` +TypeError: Errors occurred during effect checking: +/private/tmp/tmp.py:12:8: type-shape of calling argument may not equal the required type-shape: [Effects.BinOp(op='-', lhs=Effects.Const(val=6, type=LoopIR.Int(), srcinfo=), rhs=Effects.Const(val=2, type=LoopIR.Int(), srcinfo=), type=LoopIR.Index(), srcinfo=)] vs. [Effects.Const(val=3, type=LoopIR.Int(), srcinfo=)]. It could be non equal when: + y_stride_0 = 1, z_stride_0 = 20, z_stride_1 = 1 +``` + +#### Aliasing Limitations + +When passing buffers to procedure arguments, aliasing is not allowed. Concretely, you cannot write something like: + +```python +foo(y, y) +foo(y[0:5], y[2:7]) +``` + +This limitation exists because the analysis would be imprecise if we allowed such aliasing. This is similar to how C++ compilers can perform more optimization when you use the `__restrict__` keyword to explicitly indicate that you're not aliasing your buffers. + + +#### Passing Tensor Window Slices to Functions Expecting Non-Window Tensors + +It is not allowed to pass a _window_ to a function that expects a non-window tensor as an argument. Consider the following example: + +```python +@proc +def callee(x: f32[10]): + pass + +@proc +def caller(x: f32[2, 10]): + callee(x[0]) # Error: Passing a window slice to a function expecting a non-window tensor + callee(x[1, :]) # Error: Passing a window slice to a function expecting a non-window tensor +``` + +In this code snippet, the `callee` function expects a non-window tensor `x` of shape `f32[10]`. However, in the `caller` function, we attempt to pass slices of the `x` tensor (`x[0]` and `x[1]`) to the `callee` function. These slices are windows of the original tensor, and passing them to a function expecting a non-window tensor is not allowed. + +To resolve this issue, you can either: +1. Modify the `callee` function to accept a window tensor as an argument, or +2. Create a new non-window tensor from the slice before passing it to the `callee` function. + + +### Allocations + +Variables within the procedure are declared similarly to arguments. + +Example: + +```python +y: i32 +``` + +- **`y`**: The variable name. +- **`i32`**: The data type (32-bit integer). +- **No memory annotation**: Defaults to `DRAM` if memory is unspecified. + +### Memories + +Memory annotations in Exo are used to model different hardware memory regions, such as DRAM, caches, or specialized memories. The `@` symbol is used to specify the memory space, for example: `@DRAM`, `@AVX2`, or `@Neon`. +Memory annotations for your custom hardware accelerators can be defined externally to Exo and can be used as annotations in the same way. +While Exo provides default memory (`DRAM`) and some library memory definitions for convenience (`AVX2`, `AVX512`, `Neon`, `GEMM_SCRATCH`, etc.), it is recommended and encouraged that users define their own memory annotations for their specific hardware. For more information on defining custom memory annotations, refer to [memories.md](./memories.md). + + + +## Loops + +### `for` Loop Syntax + +Exo uses explicit loop constructs to model iteration. The `for` loop syntax is: + +```python +for loop_variable in seq(start, end): + # Loop body +``` + +- **`loop_variable`**: The loop counter variable. +- **`seq(start, end)`**: Iterates from `start` to `end - 1`. + +Example from the code: + +```python +for i in seq(0, OC): + # Iterates i from 0 to OC - 1 +``` + +## Conditional Statements + +Conditional logic is expressed using `if` and `else` statements. + +Syntax: + +```python +if condition: + # True branch +else: + # False branch +``` + +Example: + +```python +if j + r < N: + y = data[c, j + r] +else: + y = 0 +``` + +- Checks if `j + r` is less than `N`. +- Assigns `y` accordingly. + +## Assignments + +- **Assignment (`=`)**: Assigns a value to a variable. + + ```python + y = data[c, j + r] + ``` + +- **Reduction (`+=`)**: Adds a value to a variable and stores the result back. + + ```python + out[i, j] += kernels[i, c, r] * y + ``` + +- **Array Access**: Uses square brackets to access array elements. + + ```python + data[c, j + r] + ``` + +- **Window Statements**: Creates a slice (in other words, _window_) of the buffer and assign a new name. + ```python + y = x[0:3] + ``` + +## Limitations + +Exo has a few limitations that users should be aware of: + +1. **Non-affine indexing**: Exo does not support non-affine indexing. This means that any indexing operation must be a linear combination of loop variables and constants. For example, the following expressions are not allowed: + + ```python + data[i * j + r] = 0.0 # i * j is non-affine + if n * m < 30: # n * m is non-affine + pass + ``` + + Exo allows quasi-affine indexing by division (e.g., `i/3`) and modulo (e.g., `i%3`) by constants. + + To work around this limitation, you may need to restructure your code or use additional variables to represent the non-affine expressions. + +2. **Value-dependent control flow**: Exo separates control values from buffer values, which means that it is not possible to write value-dependent control flow. For instance, the following code is not allowed: + + ```python + if data[i] < 3.0: + pass + ``` + + If you need to express such operations, consider using externs (see [externs documentation](./externs.md)). + + +## Understanding the Example + +Let's break down the example code step by step. + +### Procedure Definition + +```python +@proc +def generic_conv1d( + data: i32[IC, N] @ DRAM, + kernels: i32[OC, IC, W] @ DRAM, + out: i32[OC, N] @ DRAM, +): +``` + +- **`generic_conv1d`**: The procedure name. +- **Arguments**: + - **`data`**: Input data array of shape `[IC, N]` in DRAM. + - **`kernels`**: Kernel weights array of shape `[OC, IC, W]` in DRAM. + - **`out`**: Output data array of shape `[OC, N]` in DRAM. +- **Variables**: + - **`IC`**, **`OC`**, **`N`**, **`W`**: Dimensions, assumed to be defined elsewhere or passed as parameters. + +### Loop Nest + +```python +for i in seq(0, OC): + for j in seq(0, N): + # Zero out the output memory + out[i, j] = 0.0 + for c in seq(0, IC): + for r in seq(0, W): + y: i32 + if j + r < N: + y = data[c, j + r] + else: + y = 0 + out[i, j] += kernels[i, c, r] * y +``` + +#### Outer Loops + +- **`for i in seq(0, OC):`**: Iterates over the output channels. +- **`for j in seq(0, N):`**: Iterates over the spatial dimension of the output. + +#### Initialization + +- **`out[i, j] = 0.0`**: Initializes the output element at `(i, j)` to zero. + +#### Inner Loops + +- **`for c in seq(0, IC):`**: Iterates over the input channels. +- **`for r in seq(0, W):`**: Iterates over the kernel width. + +#### Conditional Data Access + +```python +y: i32 +if j + r < N: + y = data[c, j + r] +else: + y = 0 +``` + +- **Purpose**: Handles boundary conditions where the kernel extends beyond the input data. +- **`y`**: Temporary variable to hold the input data or zero. +- **Condition**: + - **If `j + r < N`**: Valid index; assign `data[c, j + r]` to `y`. + - **Else**: Out-of-bounds; assign `0` to `y`. + +#### Accumulation + +```python +out[i, j] += kernels[i, c, r] * y +``` + +- **Operation**: Accumulates the product of the kernel weight and the input data into the output. +- **`kernels[i, c, r]`**: Kernel weight for output channel `i`, input channel `c`, at position `r`. +- **`y`**: The input data value or zero. diff --git a/docs/backend_ops.md b/docs/primitives/backend_ops.md similarity index 100% rename from docs/backend_ops.md rename to docs/primitives/backend_ops.md diff --git a/docs/buffer_ops.md b/docs/primitives/buffer_ops.md similarity index 100% rename from docs/buffer_ops.md rename to docs/primitives/buffer_ops.md diff --git a/docs/config_ops.md b/docs/primitives/config_ops.md similarity index 100% rename from docs/config_ops.md rename to docs/primitives/config_ops.md diff --git a/docs/loop_ops.md b/docs/primitives/loop_ops.md similarity index 100% rename from docs/loop_ops.md rename to docs/primitives/loop_ops.md diff --git a/docs/other_ops.md b/docs/primitives/other_ops.md similarity index 100% rename from docs/other_ops.md rename to docs/primitives/other_ops.md diff --git a/docs/subproc_ops.md b/docs/primitives/subproc_ops.md similarity index 100% rename from docs/subproc_ops.md rename to docs/primitives/subproc_ops.md diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 00000000..caf1c065 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,10 @@ +# Scheduling Examples + +This directory contains several examples, along with documentation and code. +If you are new to Exo, we recommend going through the examples in the following order: + +1. [AVX2 Matmul](./avx2_matmul/README.md): This example demonstrates how to take a simple matrix multiplication kernel and transform it into an implementation that can make use of AVX2 instructions. It provides an overview of Exo and its scheduling system. + +2. [Cursor](./cursors/README.md): This example shows how to use Cursors to efficiently write schedules and define a new scheduling operator. + +3. [RVM](./rvm_conv1d/README.md): This example illustrates how to use Exo to define and target a new hardware accelerator entirely in the user code. diff --git a/examples/avx2_matmul/README.md b/examples/avx2_matmul/README.md index 4328da35..4db6742c 100644 --- a/examples/avx2_matmul/README.md +++ b/examples/avx2_matmul/README.md @@ -259,10 +259,9 @@ This will print out the results of running kernel with and without the AVX instr [blas]: https://www.netlib.org/blas/ [blis]: https://github.com/flame/blis -## Stay tuned for more automation! +## More Automation? Congratulations on completing this example! -You might have felt that the scheduling operations in this example were very low-level and might be laborious to write. -We felt the same! We have a pre-release version of Exo that provides scheduling automation _external_ to the compiler implementation. -By sharing the repeated pattern of schedules and using our novel reference mechanism called Cursors, we achieve fewer lines of code than what we've shown here in the upcoming release. Please contact Exo developers at exo@mit.edu if you want to learn more or wish to collaborate! - +You might have felt that the scheduling operations in this example were very low-level and could be laborious to write. +We felt the same! We implemented a new feature called Cursors that provides scheduling automation *external* to the compiler implementation. +To learn more, please take a look at the [cursors example](cursors/README.md) and our ASPLOS '25 paper. diff --git a/examples/cursors/.gitignore b/examples/cursors/.gitignore new file mode 100644 index 00000000..6cbfadf7 --- /dev/null +++ b/examples/cursors/.gitignore @@ -0,0 +1 @@ +cursors/ diff --git a/examples/cursors/README.md b/examples/cursors/README.md new file mode 100644 index 00000000..90a3cef5 --- /dev/null +++ b/examples/cursors/README.md @@ -0,0 +1,22 @@ +# Cursor Step-by-Step Tutorial + +This example demonstrates Cursors using the tile2D example (as shown in our ASPLOS '25 paper). + +## Overview + +This example covers the key concepts presented in the paper: +- Finding Cursors with pattern-matching +- Cursor navigation +- Applying scheduling primitives using cursors +- Cursor forwarding after code transformations +- Defining a new scheduling operation + +## Getting Started + +To run this example: +```bash +exocc cursors.py +``` +Running `exocc` on `cursors.py` will generate the C code in the `cursors/cursors.c` file. +It will also print out the intermediate steps of the example. + diff --git a/examples/cursors/cursors.py b/examples/cursors/cursors.py new file mode 100644 index 00000000..8cc58dda --- /dev/null +++ b/examples/cursors/cursors.py @@ -0,0 +1,142 @@ +from __future__ import annotations +from exo import * +from exo.API_scheduling import * + +""" +Cursor Example + +This example introduces the concept of Cursors in Exo 2 paper and demonstrates +how to use scheduling operators with them to manipulate loops and optimize code. + +Cursors allow you to select and refer to parts of the code such as expressions, +statements, and code blocks. They also support spatial navigation within a procedure +to proximate locations. + +Key concepts covered: +- Finding Cursors with pattern-matching +- Cursor navigation +- Applying scheduling primitives using cursors +- Cursor forwarding after code transformations +- Defining a new scheduling operation +""" + + +""" +1: Basic loop example using Exo 2 + +GEMV kernel: y = A * x +Args: + M (size): Number of rows in matrix A + N (size): Number of columns in matrix A + A (tensor): M x N matrix stored in DRAM + x (tensor): N-dimensional vector stored in DRAM + y (tensor): M-dimensional vector stored in DRAM +""" + + +@proc +def gemv(M: size, N: size, A: f32[M, N], x: f32[N], y: f32[M]): + assert M % 8 == 0 + assert N % 8 == 0 + + for i in seq(0, M): + for j in seq(0, N): + y[i] += A[i, j] * x[j] + + +print("1: Original GEMV kernel") +print(gemv) +print() + + +""" +2: Finding cursors +""" +# Find a cursor to the i loop by name +i_loop = gemv.find_loop("i") + +# Find the same i loop by pattern +i_loop2 = gemv.find("for i in _: _") + +# Check that two cursors are pointing to the same 'i' loop +assert i_loop == i_loop2 + +print("2: i_loop points to:") +print(i_loop) +print() + + +""" +3: Navigating with cursors +""" +# Find cursors to key parts of the code +j_loop = i_loop.body()[0] # j is the only statement in i's body +C_store = j_loop.body()[0] # y[i] = ... is the only statement in j's body +j_loop_parent = j_loop.parent() # The parent of the j loop + +# Check that j_loop's parent is indeed pointing to the i_loop +assert i_loop == j_loop_parent + +print("3: j_loop points to:") +print(j_loop) +print() + + +""" +4: Applying scheduling primitives & Cursor forwarding +""" +# First, rename the gemv +g = rename(gemv, "gemv_scheduled") + +# Divide the i loop by 8 +g = divide_loop(g, i_loop, 8, ["io", "ii"], perfect=True) + +# Divide the j loop by 8 +g = divide_loop(g, j_loop, 8, ["jo", "ji"], perfect=True) + +# Now, we want to reorder ii and jo loops, by lifting the scope of j_loop +# We can still use the j_loop cursor! +g1 = lift_scope(g, j_loop) +g2 = lift_scope(g, g.forward(j_loop)) + +# Assert that g1 and g2 are the same (`j_loop` is implicitly forwarded in the first line) +assert g1 == g2 + +print("4: Tiled gemv") +print(g1) +print("4: g.forward(j_loop) points to:") +print(g.forward(j_loop)) +print() + + +""" +5: Defining a new scheduling operator +""" + + +def tile_2D(p, i_lp, j_lp, i_itrs, j_itrs, i_sz, j_sz): + """ + Perform a 2D tiling of the i and j loops. + Args: + p: Procedure to be tiled + i_lp: Name of the i loop + j_lp: Name of the j loop + i_itrs: New iterators for the i loop + j_itrs: New iterators for the j loop + i_sz: Tile size for the i loop + j_sz: Tile size for the j loop + """ + p = divide_loop(p, i_lp, i_sz, i_itrs, perfect=True) + p = divide_loop(p, j_lp, j_sz, j_itrs, perfect=True) + p = lift_scope(p, j_itrs[0]) + return p + + +# Example usage of tile_2D to perform 2D tiling on the gemv kernel. +final_g = tile_2D(gemv, i_loop, j_loop, ["io", "ii"], ["jo", "ji"], 8, 8) + +print("5: tile_2D applied gemv:") +print(final_g) + + +__all__ = ["final_g"]