## Benchmark

We evaluate the following categories of operations:

- FP16 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- INT8 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- Dequantization Operations
  - Weight Quantization (WQ) GEMM and GEMV
- Contiguous batching performance for enhanced GPU utilization
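The categories above boil down to timing matrix kernels of various shapes. As a minimal, framework-agnostic sketch of what such a measurement looks like, the loop below times a plain NumPy GEMV on deliberately reduced shapes; the actual benchmarks run tuned BitBLAS GPU kernels, which are not shown here, and `bench_gemv` is an illustrative helper, not a BitBLAS API:

```python
import time
import numpy as np

def bench_gemv(m, n, k, dtype=np.float32, iters=10):
    """Time y = x @ W.T for an (m, k) activation against an (n, k) weight."""
    x = np.random.rand(m, k).astype(dtype)
    w = np.random.rand(n, k).astype(dtype)
    y = x @ w.T  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        y = x @ w.T
    elapsed = (time.perf_counter() - start) / iters
    gflops = 2 * m * n * k / elapsed / 1e9  # 2*M*N*K FLOPs per GEMM/GEMV
    return y.shape, gflops

# Scaled-down stand-in for config V0 (real shape: M=1, N=16384, K=16384)
shape, gflops = bench_gemv(1, 1024, 1024)
print(shape, f"{gflops:.2f} GFLOPS")
```

The GEMM cases differ only in M (8192 or 16384 instead of 1); the dequantize variants additionally unpack low-bit weights before or during the multiply.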
### FP16 GEMM and GEMV

### Dequantize GEMM and GEMV

### Contiguous Batching Performance
### Benchmark Configuration
The benchmark configurations for each test scenario are detailed below:
Config | Provider | M | N | K |
---|---|---|---|---|
V0 | None | 1 | 16384 | 16384 |
V1 | BLOOM | 1 | 43008 | 14336 |
V2 | BLOOM | 1 | 14336 | 14336 |
V3 | BLOOM | 1 | 57344 | 14336 |
V4 | BLOOM | 1 | 14336 | 57344 |
V5 | OPT | 1 | 9216 | 9216 |
V6 | OPT | 1 | 36864 | 9216 |
V7 | OPT | 1 | 9216 | 36864 |
V8 | LLAMA | 1 | 22016 | 8192 |
V9 | LLAMA | 1 | 8192 | 22016 |
V10 | LLAMA-2 | 1 | 8192 | 8192 |
V11 | LLAMA-2 | 1 | 28672 | 8192 |
V12 | LLAMA-2 | 1 | 8192 | 28672 |
M0 | None | 16384 | 16384 | 16384 |
M1 | BLOOM | 8192 | 43008 | 14336 |
M2 | BLOOM | 8192 | 14336 | 14336 |
M3 | BLOOM | 8192 | 57344 | 14336 |
M4 | BLOOM | 8192 | 14336 | 57344 |
M5 | OPT | 8192 | 9216 | 9216 |
M6 | OPT | 8192 | 36864 | 9216 |
M7 | OPT | 8192 | 9216 | 36864 |
M8 | LLAMA | 8192 | 22016 | 8192 |
M9 | LLAMA | 8192 | 8192 | 22016 |
M10 | LLAMA-2 | 8192 | 8192 | 8192 |
M11 | LLAMA-2 | 8192 | 28672 | 8192 |
M12 | LLAMA-2 | 8192 | 8192 | 28672 |
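The V-series rows are decode-style GEMV workloads (M = 1) while the M-series rows are prefill-style GEMM workloads (M = 8192 or 16384); each implies 2·M·N·K floating-point operations. The sketch below transcribes a few rows from the table into a small Python structure to make the operation counts concrete (`CONFIGS` and `flops` are illustrative names, not part of BitBLAS):

```python
# A few of the benchmark configurations from the table above, as (M, N, K).
CONFIGS = {
    "V0":  (1, 16384, 16384),      # GEMV, no provider
    "V10": (1, 8192, 8192),        # LLAMA-2 decode shape
    "M0":  (16384, 16384, 16384),  # square GEMM
    "M10": (8192, 8192, 8192),     # LLAMA-2 prefill shape
}

def flops(m, n, k):
    """Multiply-add count for C[M, N] = A[M, K] @ B[K, N]."""
    return 2 * m * n * k

for name, (m, n, k) in CONFIGS.items():
    print(f"{name}: {flops(m, n, k) / 1e9:.2f} GFLOPs")
```

Comparing, for example, V10 against M10 shows why GEMV benchmarks stress memory bandwidth while GEMM benchmarks stress compute throughput: the weight traffic is identical but the arithmetic differs by a factor of M.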
## What's Changed
- fix typos by @xzyaoi in #23
- [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
- [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
- [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
- [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
- [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
- [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
- [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
- [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
- [Target] Improve TVM Target related items by @LeiWang1999 in #45
- [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
- [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
- [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
- [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
- [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
- [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
- [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
- [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
- [Dev] Issue #24: Fix a bug when repacking AutoGPTQ quantized parameters by @tzj-fxz in #57
- [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
- [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
- Fix gpu model missing from tvm target remap by @Qubitium in #61
- [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
- [Readme] Update support matrix in README by @LeiWang1999 in #67
- [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
- [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
- [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
- [CI] Auto Format Checking and test checking. by @LeiWang1999 in #73
- [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
- [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
- [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
- [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
- [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
- [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
- [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
- [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
- Fix database path default by @janEbert in #85
- [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
- Limiting parallel jobs for local build by @bibo-msft in #88
- [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
- [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
- [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
- [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
- [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
- [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
- [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
- [Dev] Improve benchmark scripts by @LeiWang1999 in #99
- Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
- [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
- Update benchmark.yml by @LeiWang1999 in #103
- [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
- Ref GPTQModel for 3rd support/integration by @Qubitium in #106
- [Dev] Complete benchmark op sets of ci by @LeiWang1999 in #100
- [Dev] Remove Redundant Dynamic Shared Memory sync by @LeiWang1999 in #107
- [Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory by @LeiWang1999 in #110
- [Dev] Enhance Lower Warp memory to support multi stage tensorization by @LeiWang1999 in #111
- Refactor benchmark yml to disable alters on issue by @LeiWang1999 in #113
- [Dev] Enhance LOP3 Instruction Registration to support incoming warp level lop3 instructions by @LeiWang1999 in #114
- [Dev] Merge BlockReduce with naive schedule template by @LeiWang1999 in #119
- [Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding by @LeiWang1999 in #124
- [Fix][Dev] Typo fix for our workflow and enhance lop3 decode to support scaling by @LeiWang1999 in #125
- [Dev] Convert the quant compress from numpy into tvm runtime by @LeiWang1999 in #126
- Update documents by @xysmlx in #129
- [Dev] Refactor the weight transformation to support upcoming stage3 transform by @LeiWang1999 in #130
- [Dev] Bring Block Reduction into our search space and policy by @LeiWang1999 in #132
- Fix retrieve head commit in benchmark by @LeiWang1999 in #134
- [Integration] Upload tutorial for making a bitnet ckpt for vLLM by @LeiWang1999 in #135
- [Typo] Fix missing links in the bitnet integration's docs by @LeiWang1999 in #136
- fix BitNet integration for vLLM by @xysmlx in #137
- fix BitNet integration for vLLM by @xysmlx in #139
- [Dev] Set default weight transformation into Ladder Stage3 LDMatrixTransform by @LeiWang1999 in #133
- [Dev] Disable Block reduction for int8 by default by @LeiWang1999 in #140
- [Dev] BUG Fix for bitnet integration by @LeiWang1999 in #141
- [Feature] Register Missing FastDecoding for INT8xINT4 by @LeiWang1999 in #142
- [BUG Fix] Fix the NVCC Compile options for CUDA Version >= 12.5 by @LeiWang1999 in #143
- [Integration] Compress Gateup and QKV for bitnet integration by @LeiWang1999 in #144
- [Enhancement] Improve elementwise schedule via vectorization by @LeiWang1999 in #145
- [Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering by @LeiWang1999 in #146
- [Fix] Fix scale and zero scopes for scale only template by @LeiWang1999 in #147
- [Dev] Support Numeric Precision BFloat16 as activation type by @LeiWang1999 in #148
- [Version] Bump Version to 0.0.1.dev15 by @LeiWang1999 in #149
- [Dev] Serialize Generated Kernel Name with Operator Config and Hint by @LeiWang1999 in #153
- [BUG] Set Device when a kernel is applied across Multiple GPUs by @LeiWang1999 in #155
- [Benchmark] Fast Decoding Benchmark by @LeiWang1999 in #158
- [BUGFix] Disable tensorcore when shape is really small by @LeiWang1999 in #159
- [BUGFix] Register missing FP8 LDMATRIX Instructions for dynamic shared memory by @LeiWang1999 in #162
- [Docs] Update install command from github repo by @LeiWang1999 in #163
- [BugFix] Fix BitBLAS Linear with BFloat16 input by @LeiWang1999 in #164
- [BUGFix] Fix LowerThreadAllReduce Pass for Hopper Arch by @LeiWang1999 in #165
- [Dev] Enhance Thread Sync Injector for Stream-K Implementation by @LeiWang1999 in #166
- [Dev] Revert Hack impl for memory caching by @LeiWang1999 in #167
- [TL] Update several TL Examples by @LeiWang1999 in #168
- [TL] Enhance Layout Annotate Pass to handle PTX Inst by @LeiWang1999 in #170
- chore(deps): bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in #175
- [TL] Add TL Layout and Macro utils by @LeiWang1999 in #174
- [TL] Support GEMM_SS Macro to perform gemm directly from shared memory by @LeiWang1999 in #176
- [TL] Inject Storage Sync Scope Automatically for TL by @LeiWang1999 in #177
- [TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle by @LeiWang1999 in #178
- [TL] Enhance TL to import customized c headers by @LeiWang1999 in #179
- [Dev] Bug fix for Block Reduce Template and improve TL by @LeiWang1999 in #183
- [BugFix] Disable 8bit TensorCore for SM Version lower than 80 by @LeiWang1999 in #185
- [Dev] Dequantize SIMT Matmul Implementation by @LeiWang1999 in #188
- [Dev] Improve Dequant performance on CUDA Simt by @LeiWang1999 in #189
- [TL] Append Macro Test Case for GEMM and Dequant GEMM by @LeiWang1999 in #190
- [TL] Add example usage/test case for Dynamic Symbolic by @LeiWang1999 in #191
- [BugFix]Fix llvm install bug by @tzj-fxz in #193
- [Test] Add Thread Level Macro Dequantize Gemm Test Cases by @LeiWang1999 in #194
- [TL][BugFix] Add implementation of TL Gemm and Fix a bug for TL Jit by @LeiWang1999 in #195
- [TL] test flashattention script by @tzj-fxz in #196
- [TL][BugFix] Disable Buffer Vectorization and Add OP Related TL Test Cases by @LeiWang1999 in #197
- [TL] Wrap TL Kernel with Scheduler by @LeiWang1999 in #199
- [Dev][TL] Add TL BaseScheduler and Library Generator by @LeiWang1999 in #200
- [Dev][TL] Hardware Aware Tuning Examples with TL by @LeiWang1999 in #201
- [TL] initial implement flashattention op in TL by @tzj-fxz in #202
- [Dev] Enhance Operator Cache to support multi-thread environments by @LeiWang1999 in #205
- [TL] Adapt TL Hardware-aware Search Space with Roller by @LeiWang1999 in #207
- [TL] [Doc] add flash attention usage document by @tzj-fxz in #210
- [Dev] Add support and test case for Ladder Weight only Transformation Matmul Operator by @LeiWang1999 in #212
- [Dev][TL] Merge Hopper and Pipeline Modifications by @LeiWang1999 in #213
- [Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs by @LeiWang1999 in #214
- [TL] [Issue215] add simplify pass for TL and test script, fixing issue by @tzj-fxz in #216
- [Bugfix] Enhance LowerAsyncCopy Pass to handle INT8 dma copy with predicate by @LeiWang1999 in #219
- [Dev] Disable smooth layout rewrite for buffer store in some case by @LeiWang1999 in #220
- [Dev][TL] Enhance TL Parser to support flexible tile lang kernel implementation by @LeiWang1999 in #222
- [Dev][TL] Implement Tile Language Dequant Matmul and Test Case by @LeiWang1999 in #224
- [Issue 192] Tail split support for dynamic matmul by @tzj-fxz in #227
- [Dev][TL] Following updates of Tile Language Backend by @LeiWang1999 in #226
- [Dev] Add some tests and examples by @LeiWang1999 in #228
- [AMD][HIP] Add HIP Code Generation with Block Primitives from Composable kernel Tile by @LeiWang1999 in #223
- [Dev][Bugfix] Add target argument and remove override register for hip callback compile by @LeiWang1999 in #229
- [Bugfix] Fix build bug due to submodule update by @LeiWang1999 in #230
- [Dev] Support Tile Lang INT8xINT8 TensorCore Macro by @LeiWang1999 in #231
- [Dev][TL] Implement MMA INT4 Tensor Core and Correctness Test Case. by @LeiWang1999 in #232
- [Dev][BitNET] Implement INT4xINT2 GEMM by @LeiWang1999 in #233
- [Dev][Bitnet] Implement Operator with INT4xINT4/INT2 by @LeiWang1999 in #234
- [Dev] Update News in Readme by @LeiWang1999 in #235
- [Dev] Enhance TileLang Backend and fix a bug for INT4xINT2 by @LeiWang1999 in #236
- [DEV][TL] Support AMD Matrix Code Implementation by @LeiWang1999 in #237
- [Dev][HIP] Fix MFMA Codegen by @LeiWang1999 in #238
- [CI] Disable Benchmark workflow due to github action v4 updates by @LeiWang1999 in #239
- [Dev] Enhance Infra for ROCM by @LeiWang1999 in #240
- [Dev][AMD] Add AMD CDNA Arch by @Cunxiao2002 in #225
- [Dev] Fix some lint issues by @LeiWang1999 in #241
- [Dev][Relax] Update Bitblas end2end tuning example with relax by @LeiWang1999 in #242
- [Dev] Fix illegal pass order by @LeiWang1999 in #243
- [Docs] update the contributing's table of contents by @emmanuel-ferdman in #245
- [Dev][AMD] Implement LDS Async Copy for CDNA Arch by @LeiWang1999 in #246
- [Dev][AMD] Support LDS and Flash Attention for AMD Backend by @LeiWang1999 in #247
- [AMD][TL] Introduce K Pack and a Conflict Free swizzling into Matrix Core by @LeiWang1999 in #248
- [BUGFix] Introduce our own `assert_close` to allow a few mismatched elements in some cases by @LeiWang1999 in #249
- [Dev][AMD] Implement conditional async load for AMD HIP Backend by @LeiWang1999 in #250
- [BUGFix] Fix MatmulDequantize with FP4 Format by @LeiWang1999 in #254
- [Dev] Enhance Backend Abstraction for TileLang by @LeiWang1999 in #255
- [Docker] Add Dockerfile to set up the application environment by @LeiWang1999 in #260
- [Relax] Fix end2end tuning for relax graph by @LeiWang1999 in #261
- [Dev] Refactor codebase to save import time by @LeiWang1999 in #262
- [Enhancement][TileLang] Introduce Pass `LegalizeSafeMemoryAccess` to auto-protect memory access by injecting IfThenElse nodes by @LeiWang1999 in #267
- [TileLang][Dev] Enhance Layout Inference Pass to infer with complex parallel primitives by @LeiWang1999 in #268
- [Dev] Migrate default backend from tir into tilelang by @LeiWang1999 in #270
- [Dev] Fallback NF format to TIR backend as TileLang implementation is not currently supported. by @LeiWang1999 in #272
- [Dev] Implement TileLang NF Format Dequantize by @LeiWang1999 in #273
- [Release] Bump version to 0.1.0 by @LeiWang1999 in #274
- [Bugfix] Fix Mismatched Retnet LinearAttention Layout by @LeiWang1999 in #275
- [Bugfix] Fix correctness issue for float16xuint1 with fast dequantize by @LeiWang1999 in #277
- Fix: Remove composable_kernel include from setup.py by @LeslinD in #279
- [Bugfix] Fix VERSION FileNotFoundError bugs via pip installation by @senlyu163 in #285
- [Doc] Move Torch Tensors to GPU by @senlyu163 in #286
## New Contributors
- @xzyaoi made their first contribution in #23
- @tzj-fxz made their first contribution in #57
- @Qubitium made their first contribution in #58
- @janEbert made their first contribution in #85
- @dependabot made their first contribution in #175
- @Cunxiao2002 made their first contribution in #225
- @emmanuel-ferdman made their first contribution in #245
- @LeslinD made their first contribution in #279
- @senlyu163 made their first contribution in #285
**Full Changelog**: v0.0.1dev...v0.1.0