Releases · microsoft/BitBLAS
v0.1.0
Benchmark
We evaluate the following categories of operations:
- FP16 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- INT8 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- Dequantization Operations
  - Weight Quantization (WQ) GEMM and GEMV
- Contiguous batching performance for enhanced GPU utilization
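For reference, the snippet below sketches how one of these benchmarked operations can be driven through the BitBLAS Python API (`bitblas.MatmulConfig` / `bitblas.Matmul`); keyword names may vary between versions, so treat it as a minimal sketch rather than the benchmark harness itself. The shape matches config V0 from the table below.

```python
import bitblas
import torch

# FP16 GEMV, shape from benchmark config V0 (M=1, N=16384, K=16384).
config = bitblas.MatmulConfig(
    M=1,
    N=16384,
    K=16384,
    A_dtype="float16",    # activation dtype; "int8" would exercise the INT8 paths
    W_dtype="float16",    # weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",          # A row-major, W stored transposed
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((1, 16384), dtype=torch.float16).cuda()
W = torch.rand((16384, 16384), dtype=torch.float16).cuda()
C = matmul(A, W)  # dispatches the tuned kernel for this shape
```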
FP16 GEMM and GEMV
Dequantize GEMM and GEMV
Contiguous Batching Performance
Benchmark Configuration
The benchmark configurations for each test scenario are detailed below:
| Config | Provider | M | N | K |
|---|---|---|---|---|
| V0 | None | 1 | 16384 | 16384 |
| V1 | BLOOM | 1 | 43008 | 14336 |
| V2 | BLOOM | 1 | 14336 | 14336 |
| V3 | BLOOM | 1 | 57344 | 14336 |
| V4 | BLOOM | 1 | 14336 | 57344 |
| V5 | OPT | 1 | 9216 | 9216 |
| V6 | OPT | 1 | 36864 | 9216 |
| V7 | OPT | 1 | 9216 | 36864 |
| V8 | LLAMA | 1 | 22016 | 8192 |
| V9 | LLAMA | 1 | 8192 | 22016 |
| V10 | LLAMA-2 | 1 | 8192 | 8192 |
| V11 | LLAMA-2 | 1 | 28672 | 8192 |
| V12 | LLAMA-2 | 1 | 8192 | 28672 |
| M0 | None | 16384 | 16384 | 16384 |
| M1 | BLOOM | 8192 | 43008 | 14336 |
| M2 | BLOOM | 8192 | 14336 | 14336 |
| M3 | BLOOM | 8192 | 57344 | 14336 |
| M4 | BLOOM | 8192 | 14336 | 57344 |
| M5 | OPT | 8192 | 9216 | 9216 |
| M6 | OPT | 8192 | 36864 | 9216 |
| M7 | OPT | 8192 | 9216 | 36864 |
| M8 | LLAMA | 8192 | 22016 | 8192 |
| M9 | LLAMA | 8192 | 8192 | 22016 |
| M10 | LLAMA-2 | 8192 | 8192 | 8192 |
| M11 | LLAMA-2 | 8192 | 28672 | 8192 |
| M12 | LLAMA-2 | 8192 | 8192 | 28672 |
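The dequantization (WQ) benchmarks follow the same pattern, except the weight is stored in a low-bit format and repacked into the kernel's layout before the call. Below is a minimal sketch for an FP16 x UINT4 GEMM at config M10's shape, assuming the same `MatmulConfig`/`Matmul` API plus its `transform_weight` helper; scaling and zero-point variants add per-group scale tensors on top of this.

```python
import bitblas
import torch

# WQ GEMM: FP16 activations x UINT4 weights, shape from benchmark config M10.
M, N, K = 8192, 8192, 8192
config = bitblas.MatmulConfig(
    M=M,
    N=N,
    K=K,
    A_dtype="float16",
    W_dtype="uint4",      # 4-bit weight, dequantized to FP16 inside the kernel
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((M, K), dtype=torch.float16).cuda()
W = torch.randint(0, 16, (N, K), dtype=torch.int8).cuda()  # raw 4-bit values

W_packed = matmul.transform_weight(W)  # repack into the low-bit storage layout
C = matmul(A, W_packed)
```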
What's Changed
- fix typos by @xzyaoi in #23
- [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
- [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
- [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
- [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
- [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
- [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
- [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
- [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
- [Target] Improve TVM Target related items by @LeiWang1999 in #45
- [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
- [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
- [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
- [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
- [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
- [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
- [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
- [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
- [Dev] Issue #24: Fix a bug when repacking AutoGPTQ quantized parameters by @tzj-fxz in #57
- [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
- [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
- Fix gpu model missing from tvm target remap by @Qubitium in #61
- [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
- [Readme] Update support matrix in README by @LeiWang1999 in #67
- [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
- [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
- [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
- [CI] Auto format checking and test checking by @LeiWang1999 in #73
- [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
- [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
- [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
- [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
- [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
- [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
- [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
- [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
- Fix database path default by @janEbert in #85
- [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
- Limiting parallel jobs for local build by @bibo-msft in #88
- [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
- [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
- [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
- [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
- [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
- [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
- [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
- [Dev] Improve benchmark scripts by @LeiWang1999 in #99
- Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
- [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
- Update benchmark.yml by @LeiWang1999 in #103
- [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
- Ref GPTQMo...
BitBLAS v0.0.1 Pre-release
Pre-release for v0.0.1. Under testing.