Releases · microsoft/BitBLAS
v0.1.0
Benchmark
We evaluate the following categories of operations:
- FP16 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- INT8 Matrix Operations
  - GEMM (Matrix Multiplication)
  - GEMV (Matrix-Vector Multiplication)
- Dequantization Operations
  - Weight Quantization (WQ) GEMM and GEMV
- Contiguous batching performance for enhanced GPU utilization
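For reference, the snippet below sketches how one of these benchmarked operations can be driven through the BitBLAS Python API (`bitblas.MatmulConfig` / `bitblas.Matmul`); keyword names may vary between versions, so treat it as a minimal sketch rather than the benchmark harness itself. The shape matches config V0 from the table below.

```python
import bitblas
import torch

# FP16 GEMV, shape from benchmark config V0 (M=1, N=16384, K=16384).
config = bitblas.MatmulConfig(
    M=1,
    N=16384,
    K=16384,
    A_dtype="float16",    # activation dtype; "int8" would exercise the INT8 paths
    W_dtype="float16",    # weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",          # A row-major, W stored transposed
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((1, 16384), dtype=torch.float16).cuda()
W = torch.rand((16384, 16384), dtype=torch.float16).cuda()
C = matmul(A, W)  # dispatches the tuned kernel for this shape
```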
FP16 GEMM and GEMV
Dequantize GEMM and GEMV
Contiguous Batching Performance
Benchmark Configuration
The benchmark configurations for each test scenario are detailed below:
| Config | Provider | M | N | K |
|---|---|---|---|---|
| V0 | None | 1 | 16384 | 16384 |
| V1 | BLOOM | 1 | 43008 | 14336 |
| V2 | BLOOM | 1 | 14336 | 14336 |
| V3 | BLOOM | 1 | 57344 | 14336 |
| V4 | BLOOM | 1 | 14336 | 57344 |
| V5 | OPT | 1 | 9216 | 9216 |
| V6 | OPT | 1 | 36864 | 9216 |
| V7 | OPT | 1 | 9216 | 36864 |
| V8 | LLAMA | 1 | 22016 | 8192 |
| V9 | LLAMA | 1 | 8192 | 22016 |
| V10 | LLAMA-2 | 1 | 8192 | 8192 |
| V11 | LLAMA-2 | 1 | 28672 | 8192 |
| V12 | LLAMA-2 | 1 | 8192 | 28672 |
| M0 | None | 16384 | 16384 | 16384 |
| M1 | BLOOM | 8192 | 43008 | 14336 |
| M2 | BLOOM | 8192 | 14336 | 14336 |
| M3 | BLOOM | 8192 | 57344 | 14336 |
| M4 | BLOOM | 8192 | 14336 | 57344 |
| M5 | OPT | 8192 | 9216 | 9216 |
| M6 | OPT | 8192 | 36864 | 9216 |
| M7 | OPT | 8192 | 9216 | 36864 |
| M8 | LLAMA | 8192 | 22016 | 8192 |
| M9 | LLAMA | 8192 | 8192 | 22016 |
| M10 | LLAMA-2 | 8192 | 8192 | 8192 |
| M11 | LLAMA-2 | 8192 | 28672 | 8192 |
| M12 | LLAMA-2 | 8192 | 8192 | 28672 |
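The dequantization (WQ) benchmarks follow the same pattern, except the weight is stored in a low-bit format and repacked into the kernel's layout before the call. Below is a minimal sketch for an FP16 x UINT4 GEMM at config M10's shape, assuming the same `MatmulConfig`/`Matmul` API plus its `transform_weight` helper; scaling and zero-point variants add per-group scale tensors on top of this.

```python
import bitblas
import torch

# WQ GEMM: FP16 activations x UINT4 weights, shape from benchmark config M10.
M, N, K = 8192, 8192, 8192
config = bitblas.MatmulConfig(
    M=M,
    N=N,
    K=K,
    A_dtype="float16",
    W_dtype="uint4",      # 4-bit weight, dequantized to FP16 inside the kernel
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((M, K), dtype=torch.float16).cuda()
W = torch.randint(0, 16, (N, K), dtype=torch.int8).cuda()  # raw 4-bit values

W_packed = matmul.transform_weight(W)  # repack into the low-bit storage layout
C = matmul(A, W_packed)
```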
What's Changed
- fix typos by @xzyaoi in #23
- [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
- [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
- [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
- [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
- [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
- [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
- [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
- [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
- [Target] Improve TVM Target related items by @LeiWang1999 in #45
- [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
- [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
- [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
- [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
- [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
- [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
- [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
- [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
- [Dev] Issue #24: Fix a bug when repacking AutoGPTQ quantized parameters by @tzj-fxz in #57
- [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
- [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
- Fix gpu model missing from tvm target remap by @Qubitium in #61
- [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
- [Readme] Update support matrix in README by @LeiWang1999 in #67
- [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
- [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
- [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
- [CI] Auto format checking and test checking by @LeiWang1999 in #73
- [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
- [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
- [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
- [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
- [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
- [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
- [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
- [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
- Fix database path default by @janEbert in #85
- [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
- Limiting parallel jobs for local build by @bibo-msft in #88
- [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
- [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
- [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
- [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
- [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
- [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
- [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
- [Dev] Improve benchmark scripts by @LeiWang1999 in #99
- Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
- [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
- Update benchmark.yml by @LeiWang1999 in #103
- [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
- Ref GPTQMo...
BitBLAS v0.0.1 Pre-release
Pre-release for v0.0.1. Under testing.