## Benchmark

We evaluate the following categories of operations:

- FP16 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- INT8 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- Dequantization Operations
  - Weight Quantization (WQ) GEMM and GEMV
- Contiguous batching performance for enhanced GPU utilization
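The categories above boil down to timing matrix kernels of various shapes. As a minimal, framework-agnostic sketch of what such a measurement looks like, the loop below times a plain NumPy GEMV on deliberately reduced shapes; the actual benchmarks run tuned BitBLAS GPU kernels, which are not shown here, and `bench_gemv` is an illustrative helper, not a BitBLAS API:

```python
import time
import numpy as np

def bench_gemv(m, n, k, dtype=np.float32, iters=10):
    """Time y = x @ W.T for an (m, k) activation against an (n, k) weight."""
    x = np.random.rand(m, k).astype(dtype)
    w = np.random.rand(n, k).astype(dtype)
    y = x @ w.T  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        y = x @ w.T
    elapsed = (time.perf_counter() - start) / iters
    gflops = 2 * m * n * k / elapsed / 1e9  # 2*M*N*K FLOPs per GEMM/GEMV
    return y.shape, gflops

# Scaled-down stand-in for config V0 (real shape: M=1, N=16384, K=16384)
shape, gflops = bench_gemv(1, 1024, 1024)
print(shape, f"{gflops:.2f} GFLOPS")
```

The GEMM cases differ only in M (8192 or 16384 instead of 1); the dequantize variants additionally unpack low-bit weights before or during the multiply.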
### FP16 GEMM and GEMV

### Dequantize GEMM and GEMV

### Contiguous Batching Performance
### Benchmark Configuration
The benchmark configurations for each test scenario are detailed below:
Config | Provider | M | N | K |
---|---|---|---|---|
V0 | None | 1 | 16384 | 16384 |
V1 | BLOOM | 1 | 43008 | 14336 |
V2 | BLOOM | 1 | 14336 | 14336 |
V3 | BLOOM | 1 | 57344 | 14336 |
V4 | BLOOM | 1 | 14336 | 57344 |
V5 | OPT | 1 | 9216 | 9216 |
V6 | OPT | 1 | 36864 | 9216 |
V7 | OPT | 1 | 9216 | 36864 |
V8 | LLAMA | 1 | 22016 | 8192 |
V9 | LLAMA | 1 | 8192 | 22016 |
V10 | LLAMA-2 | 1 | 8192 | 8192 |
V11 | LLAMA-2 | 1 | 28672 | 8192 |
V12 | LLAMA-2 | 1 | 8192 | 28672 |
M0 | None | 16384 | 16384 | 16384 |
M1 | BLOOM | 8192 | 43008 | 14336 |
M2 | BLOOM | 8192 | 14336 | 14336 |
M3 | BLOOM | 8192 | 57344 | 14336 |
M4 | BLOOM | 8192 | 14336 | 57344 |
M5 | OPT | 8192 | 9216 | 9216 |
M6 | OPT | 8192 | 36864 | 9216 |
M7 | OPT | 8192 | 9216 | 36864 |
M8 | LLAMA | 8192 | 22016 | 8192 |
M9 | LLAMA | 8192 | 8192 | 22016 |
M10 | LLAMA-2 | 8192 | 8192 | 8192 |
M11 | LLAMA-2 | 8192 | 28672 | 8192 |
M12 | LLAMA-2 | 8192 | 8192 | 28672 |
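The V-series rows are decode-style GEMV workloads (M = 1) while the M-series rows are prefill-style GEMM workloads (M = 8192 or 16384); each implies 2·M·N·K floating-point operations. The sketch below transcribes a few rows from the table into a small Python structure to make the operation counts concrete (`CONFIGS` and `flops` are illustrative names, not part of BitBLAS):

```python
# A few of the benchmark configurations from the table above, as (M, N, K).
CONFIGS = {
    "V0":  (1, 16384, 16384),      # GEMV, no provider
    "V10": (1, 8192, 8192),        # LLAMA-2 decode shape
    "M0":  (16384, 16384, 16384),  # square GEMM
    "M10": (8192, 8192, 8192),     # LLAMA-2 prefill shape
}

def flops(m, n, k):
    """Multiply-add count for C[M, N] = A[M, K] @ B[K, N]."""
    return 2 * m * n * k

for name, (m, n, k) in CONFIGS.items():
    print(f"{name}: {flops(m, n, k) / 1e9:.2f} GFLOPs")
```

Comparing, for example, V10 against M10 shows why GEMV benchmarks stress memory bandwidth while GEMM benchmarks stress compute throughput: the weight traffic is identical but the arithmetic differs by a factor of M.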
## What's Changed
- fix typos by @xzyaoi in #23
- [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
- [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
- [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
- [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
- [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
- [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
- [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
- [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
- [Target] Improve TVM Target related items by @LeiWang1999 in #45
- [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
- [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
- [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
- [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
- [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
- [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
- [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
- [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
- [Dev] Issue #24: Fix a bug when repacking AutoGPTQ quantized parameters by @tzj-fxz in #57
- [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
- [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
- Fix gpu model missing from tvm target remap by @Qubitium in #61
- [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
- [Readme] Update support matrix in README by @LeiWang1999 in #67
- [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
- [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
- [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
- [CI] Auto Format Checking and test checking. by @LeiWang1999 in #73
- [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
- [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
- [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
- [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
- [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
- [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
- [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
- [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
- Fix database path default by @janEbert in #85
- [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
- Limiting parallel jobs for local build by @bibo-msft in #88
- [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
- [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
- [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
- [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
- [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
- [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
- [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
- [Dev] Improve benchmark scripts by @LeiWang1999 in #99
- Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
- [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
- Update benchmark.yml by @LeiWang1999 in #103
- [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
- Ref GPTQModel for 3rd support/integration by @Qubitium in #106
- [Dev] Complete benchmark op sets of ci by @LeiWang1999 in #100
- [Dev] Remove Redundant Dynamic Shared Memory sync by @LeiWang1999 in #107
- [Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory by @LeiWang1999 in #110
- [Dev] Enhance Lower Warp memory to support multi stage tensorization by @LeiWang1999 in #111
- Refactor benchmark yml to disable alters on issue by @LeiWang1999 in #113
- [Dev] Enhance LOP3 Instruction Registration to support incoming warp level lop3 instructions by @LeiWang1999 in #114
- [Dev] Merge BlockReduce with naive schedule template by @LeiWang1999 in #119
- [Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding by @LeiWang1999 in #124
- [Fix][Dev] Typo fix for our workflow and enhance lop3 decode to support scaling by @LeiWang1999 in #125
- [Dev] Convert the quant compress from numpy into tvm runtime by @LeiWang1999 in #126
- Update documents by @xysmlx in #129
- [Dev] Refactor the weight transformation to support upcoming stage3 transform by @LeiWang1999 in #130
- [Dev] Bring Block Reduction into our search space and policy by @LeiWang1999 in #132
- Fix retrieve head commit in benchmark by @LeiWang1999 in #134
- [Integration] Upload tutorial for making a bitnet ckpt for vLLM by @LeiWang1999 in #135
- [Typo] Fix missing links in the bitnet integration's docs by @LeiWang1999 in #136
- fix BitNet integration for vLLM by @xysmlx in #137
- fix BitNet integration for vLLM by @xysmlx in #139
- [Dev] Set default weight transformation into Ladder Stage3 LDMatrixTransform by @LeiWang1999 in #133
- [Dev] Disable Block reduction for int8 by default by @LeiWang1999 in #140
- [Dev] BUG Fix for bitnet integration by @LeiWang1999 in #141
- [Feature] Register Missing FastDecoding for INT8xINT4 by @LeiWang1999 in #142
- [BUG Fix] Fix the NVCC Compile options for CUDA Version >= 12.5 by @LeiWang1999 in #143
- [Integration] Compress Gateup and QKV for bitnet integration by @LeiWang1999 in #144
- [Enhancement] Improve elementwise schedule via vectorization by @LeiWang1999 in #145
- [Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering by @LeiWang1999 in #146
- [Fix] Fix scale and zero scopes for scale only template by @LeiWang1999 in #147
- [Dev] Support Numeric Precision BFloat16 as activation type by @LeiWang1999 in #148
- [Version] Bump Version to 0.0.1.dev15 by @LeiWang1999 in #149
- [Dev] Serialize Generated Kernel Name with Operator Config and Hint by @LeiWang1999 in #153
- [BUG] Set Device when a kernel is applied across Multiple GPUs by @LeiWang1999 in #155
- [Benchmark] Fast Decoding Benchmark by @LeiWang1999 in #158
- [BUGFix] Disable tensorcore when shape is really small by @LeiWang1999 in #159
- [BUGFix] Register missing FP8 LDMATRIX Instructions for dynamic shared memory by @LeiWang1999 in #162
- [Docs] Update install command from github repo by @LeiWang1999 in #163
- [BugFix] Fix BitBLAS Linear with BFloat16 input by @LeiWang1999 in #164
- [BUGFix] Fix LowerThreadAllReduce Pass for Hopper Arch by @LeiWang1999 in #165
- [Dev] Enhance Thread Sync Injector for Stream-K Implementation by @LeiWang1999 in #166
- [Dev] Revert Hack impl for memory caching by @LeiWang1999 in #167
- [TL] Update several TL Examples by @LeiWang1999 in #168
- [TL] Enhance Layout Annotate Pass to handle PTX Inst by @LeiWang1999 in #170
- chore(deps): bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in #175
- [TL] Add TL Layout and Macro utils by @LeiWang1999 in #174
- [TL] Support GEMM_SS Macro to perform gemm directly from shared memory by @LeiWang1999 in #176
- [TL] Inject Storage Sync Scope Automatically for TL by @LeiWang1999 in #177
- [TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle by @LeiWang1999 in #178
- [TL] Enhance TL to import customized c headers by @LeiWang1999 in #179
- [Dev] Bug fix for Block Reduce Template and improve TL by @LeiWang1999 in #183
- [BugFix] Disable 8bit TensorCore for SM Version lower than 80 by @LeiWang1999 in #185
- [Dev] Dequantize SIMT Matmul Implementation by @LeiWang1999 in #188
- [Dev] Improve Dequant performance on CUDA Simt by @LeiWang1999 in #189
- [TL] Append Macro Test Case for GEMM and Dequant GEMM by @LeiWang1999 in #190
- [TL] Add example usage/test case for Dynamic Symbolic by @LeiWang1999 in #191
- [BugFix]Fix llvm install bug by @tzj-fxz in #193
- [Test] Add Thread Level Macro Dequantize Gemm Test Cases by @LeiWang1999 in #194
- [TL][BugFix] Add implementation of TL Gemm and Fix a bug for TL Jit by @LeiWang1999 in #195
- [TL] test flashattention script by @tzj-fxz in #196
- [TL][BugFix] Disable Buffer Vectorization and Add OP Related TL Test Cases by @LeiWang1999 in #197
- [TL] Wrap TL Kernel with Scheduler by @LeiWang1999 in #199
- [Dev][TL] Add TL BaseScheduler and Library Generator by @LeiWang1999 in #200
- [Dev][TL] Hardware Aware Tuning Examples with TL by @LeiWang1999 in #201
- [TL] initial implement flashattention op in TL by @tzj-fxz in #202
- [Dev] Enhance Operator Cache to support multi-thread environments by @LeiWang1999 in #205
- [TL] Adapt TL Hardware-aware Search Space with Roller by @LeiWang1999 in #207
- [TL] [Doc] add flash attention usage document by @tzj-fxz in #210
- [Dev] Add support and test case for Ladder Weight only Transformation Matmul Operator by @LeiWang1999 in #212
- [Dev][TL] Merge Hopper and Pipeline Modifications by @LeiWang1999 in #213
- [Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs by @LeiWang1999 in #214
- [TL] [Issue215] add simplify pass for TL and test script, fixing issue by @tzj-fxz in #216
- [Bugfix] Enhance LowerAsyncCopy Pass to handle INT8 dma copy with predicate by @LeiWang1999 in #219
- [Dev] Disable smooth layout rewrite for buffer store in some case by @LeiWang1999 in #220
- [Dev][TL] Enhance TL Parser to support flexible tile lang kernel implementation by @LeiWang1999 in #222
- [Dev][TL] Implement Tile Language Dequant Matmul and Test Case by @LeiWang1999 in #224
- [Issue 192] Tail split support for dynamic matmul by @tzj-fxz in #227
- [Dev][TL] Following updates of Tile Language Backend by @LeiWang1999 in #226
- [Dev] Add some tests and examples by @LeiWang1999 in #228
- [AMD][HIP] Add HIP Code Generation with Block Primitives from Composable kernel Tile by @LeiWang1999 in #223
- [Dev][Bugfix] Add target argument and remove override register for hip callback compile by @LeiWang1999 in #229
- [Bugfix] Fix build bug due to submodule update by @LeiWang1999 in #230
- [Dev] Support Tile Lang INT8xINT8 TensorCore Macro by @LeiWang1999 in #231
- [Dev][TL] Implement MMA INT4 Tensor Core and Correctness Test Case. by @LeiWang1999 in #232
- [Dev][BitNET] Implement INT4xINT2 GEMM by @LeiWang1999 in #233
- [Dev][Bitnet] Implement Operator with INT4xINT4/INT2 by @LeiWang1999 in #234
- [Dev] Update News in Readme by @LeiWang1999 in #235
- [Dev] Enhance TileLang Backend and fix a bug for INT4xINT2 by @LeiWang1999 in #236
- [DEV][TL] Support AMD Matrix Code Implementation by @LeiWang1999 in #237
- [Dev][HIP] Fix MFMA Codegen by @LeiWang1999 in #238
- [CI] Disable Benchmark workflow due to github action v4 updates by @LeiWang1999 in #239
- [Dev] Enhance Infra for ROCM by @LeiWang1999 in #240
- [Dev][AMD] Add AMD CDNA Arch by @Cunxiao2002 in #225
- [Dev] Fix some lint issues by @LeiWang1999 in #241
- [Dev][Relax] Update Bitblas end2end tuning example with relax by @LeiWang1999 in #242
- [Dev] Fix illegal pass order by @LeiWang1999 in #243
- [Docs] update the contributing's table of contents by @emmanuel-ferdman in #245
- [Dev][AMD] Implement LDS Async Copy for CDNA Arch by @LeiWang1999 in #246
- [Dev][AMD] Support LDS and Flash Attention for AMD Backend by @LeiWang1999 in #247
- [AMD][TL] Introduce K Pack and a Conflict Free swizzling into Matrix Core by @LeiWang1999 in #248
- [BUGFix] Introduce our own `assert_close` to allow a few mismatched elements in some cases by @LeiWang1999 in #249
- [Dev][AMD] Implement conditional async load for AMD HIP Backend by @LeiWang1999 in #250
- [BUGFix] Fix MatmulDequantize with FP4 Format by @LeiWang1999 in #254
- [Dev] Enhance Backend Abstraction for TileLang by @LeiWang1999 in #255
- [Docker] Add Dockerfile to set up the application environment by @LeiWang1999 in #260
- [Relax] Fix end2end tuning for relax graph by @LeiWang1999 in #261
- [Dev] Refactor codebase to save import time by @LeiWang1999 in #262
- [Enhancement][TileLang] Introduce Pass `LegalizeSafeMemoryAccess` to auto-protect memory access by injecting IfThenElse nodes by @LeiWang1999 in #267
- [TileLang][Dev] Enhance Layout Inference Pass to infer with complex parallel primitives by @LeiWang1999 in #268
- [Dev] Migrate default backend from tir into tilelang by @LeiWang1999 in #270
- [Dev] Fallback NF format to TIR backend as TileLang implementation is not currently supported. by @LeiWang1999 in #272
- [Dev] Implement TileLang NF Format Dequantize by @LeiWang1999 in #273
- [Release] Bump version to 0.1.0 by @LeiWang1999 in #274
- [Bugfix] Fix Mismatched Retnet LinearAttention Layout by @LeiWang1999 in #275
- [Bugfix] Fix correctness issue for float16xuint1 with fast dequantize by @LeiWang1999 in #277
- Fix: Remove composable_kernel include from setup.py by @LeslinD in #279
- [Bugfix] Fix VERSION FileNotFoundError bugs via pip installation by @senlyu163 in #285
- [Doc] Move Torch Tensors to GPU by @senlyu163 in #286
## New Contributors
- @xzyaoi made their first contribution in #23
- @tzj-fxz made their first contribution in #57
- @Qubitium made their first contribution in #58
- @janEbert made their first contribution in #85
- @dependabot made their first contribution in #175
- @Cunxiao2002 made their first contribution in #225
- @emmanuel-ferdman made their first contribution in #245
- @LeslinD made their first contribution in #279
- @senlyu163 made their first contribution in #285
**Full Changelog**: v0.0.1dev...v0.1.0