Releases: microsoft/BitBLAS

v0.1.0

31 Jan 17:40
1082fbf

Benchmark

We evaluate the following categories of operations:

  1. FP16 Matrix Operations
    • GEMM (Matrix Multiplication)
    • GEMV (Matrix-Vector Multiplication)
  2. INT8 Matrix Operations
    • GEMM (Matrix Multiplication)
    • GEMV (Matrix-Vector Multiplication)
  3. Dequantization Operations
    • Weight Quantization (WQ) GEMM and GEMV
  4. Contiguous batching performance for enhanced GPU utilization
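
For reference, the sketch below shows how one of these operations (an FP16 GEMV, i.e. a GEMM with M=1) can be instantiated through BitBLAS's Python API. The keyword arguments mirror the public `bitblas.MatmulConfig` interface as documented in the project README, but exact defaults may differ between versions, so treat this as a minimal sketch rather than canonical usage.

```python
import torch
import bitblas

# Minimal sketch: FP16 GEMV (a GEMM with M=1) through the BitBLAS Matmul API.
# Keyword names follow bitblas.MatmulConfig; defaults may vary across versions.
config = bitblas.MatmulConfig(
    M=1,                   # M=1 turns the GEMM into a GEMV
    N=16384,
    K=16384,
    A_dtype="float16",     # activation dtype
    W_dtype="float16",     # weight dtype
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",           # A non-transposed, W transposed
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((1, 16384), dtype=torch.float16, device="cuda")
W = torch.rand((16384, 16384), dtype=torch.float16, device="cuda")  # (N, K) for "nt"
C = matmul(A, W)
```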

FP16 GEMM and GEMV

(Figure: op_benchmark_a100_fp16_gemm)
(Figure: op_benchmark_a100_fp16_gemv)

INT8 GEMM and GEMV

(Figure: op_benchmark_a100_int8_gemm)
(Figure: op_benchmark_a100_int8_gemv)

Dequantize GEMM and GEMV

(Figure: op_benchmark_a100_wq_gemm)
(Figure: op_benchmark_a100_wq_gemv)
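
For the weight-quantized (WQ) kernels, the weight is stored in a low-bit format and dequantized to FP16 inside the kernel. Below is a minimal sketch of an FP16xINT4 configuration following the `transform_weight` pattern from the project README; group scaling and zero-point options exist but are omitted here for brevity, and the small shapes are only illustrative.

```python
import torch
import bitblas

# Sketch: weight-quantized (WQ) GEMV with FP16 activations and INT4 weights.
# The low-bit weight must be repacked into BitBLAS's layout via transform_weight.
config = bitblas.MatmulConfig(
    M=1,
    N=1024,              # any (N, K) pair from the configuration table works here
    K=1024,
    A_dtype="float16",
    W_dtype="int4",      # weights are dequantized to FP16 inside the kernel
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

W = torch.randint(0, 7, (1024, 1024), dtype=torch.int8, device="cuda")
W_int4 = matmul.transform_weight(W)  # pack INT8-held values into the INT4 layout
A = torch.rand((1, 1024), dtype=torch.float16, device="cuda")
C = matmul(A, W_int4)
```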

Contiguous Batching Performance

(Figure: contiguous_batching_benchmark_a100)

Benchmark Configuration

The benchmark configurations for each test scenario are detailed below:

| config | Provider | M | N | K |
|--------|----------|---|---|---|
| V0 | None | 1 | 16384 | 16384 |
| V1 | BLOOM | 1 | 43008 | 14336 |
| V2 | BLOOM | 1 | 14336 | 14336 |
| V3 | BLOOM | 1 | 57344 | 14336 |
| V4 | BLOOM | 1 | 14336 | 57344 |
| V5 | OPT | 1 | 9216 | 9216 |
| V6 | OPT | 1 | 36864 | 9216 |
| V7 | OPT | 1 | 9216 | 36864 |
| V8 | LLAMA | 1 | 22016 | 8192 |
| V9 | LLAMA | 1 | 8192 | 22016 |
| V10 | LLAMA-2 | 1 | 8192 | 8192 |
| V11 | LLAMA-2 | 1 | 28672 | 8192 |
| V12 | LLAMA-2 | 1 | 8192 | 28672 |
| M0 | None | 16384 | 16384 | 16384 |
| M1 | BLOOM | 8192 | 43008 | 14336 |
| M2 | BLOOM | 8192 | 14336 | 14336 |
| M3 | BLOOM | 8192 | 57344 | 14336 |
| M4 | BLOOM | 8192 | 14336 | 57344 |
| M5 | OPT | 8192 | 9216 | 9216 |
| M6 | OPT | 8192 | 36864 | 9216 |
| M7 | OPT | 8192 | 9216 | 36864 |
| M8 | LLAMA | 8192 | 22016 | 8192 |
| M9 | LLAMA | 8192 | 8192 | 22016 |
| M10 | LLAMA-2 | 8192 | 8192 | 8192 |
| M11 | LLAMA-2 | 8192 | 28672 | 8192 |
| M12 | LLAMA-2 | 8192 | 8192 | 28672 |
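
To reproduce a single table entry, a simple CUDA-event timing loop is enough. The sketch below times configuration M10 (LLAMA-2: M=N=K=8192) in FP16; it is an illustrative harness only, and the repository's own benchmark scripts remain the authoritative setup.

```python
import torch
import bitblas

# Illustrative timing harness for table entry M10 (LLAMA-2: M=N=K=8192), FP16.
M, N, K = 8192, 8192, 8192
config = bitblas.MatmulConfig(
    M=M, N=N, K=K,
    A_dtype="float16", W_dtype="float16",
    accum_dtype="float16", out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=config)

A = torch.rand((M, K), dtype=torch.float16, device="cuda")
W = torch.rand((N, K), dtype=torch.float16, device="cuda")

for _ in range(10):   # warm-up iterations
    matmul(A, W)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    matmul(A, W)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / iters:.3f} ms")
```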

What's Changed

  • fix typos by @xzyaoi in #23
  • [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
  • [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
  • [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
  • [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
  • [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
  • [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
  • [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
  • [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
  • [Target] Improve TVM Target related items by @LeiWang1999 in #45
  • [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
  • [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
  • [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
  • [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
  • [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
  • [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
  • [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
  • [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
  • [Dev] Issue #24: Fix a bug when repacking AutoGPTQ quantized parameters by @tzj-fxz in #57
  • [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
  • [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
  • Fix gpu model missing from tvm target remap by @Qubitium in #61
  • [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
  • [Readme] Update support matrix in README by @LeiWang1999 in #67
  • [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
  • [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
  • [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
  • [CI] Auto Format Checking and test checking. by @LeiWang1999 in #73
  • [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
  • [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
  • [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
  • [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
  • [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
  • [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
  • [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
  • [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
  • Fix database path default by @janEbert in #85
  • [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
  • Limiting parallel jobs for local build by @bibo-msft in #88
  • [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
  • [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
  • [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
  • [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
  • [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
  • [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
  • [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
  • [Dev] Improve benchmark scripts by @LeiWang1999 in #99
  • Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
  • [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
  • Update benchmark.yml by @LeiWang1999 in #103
  • [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
  • Ref GPTQMo...

BitBLAS v0.0.1 Pre-release

19 Apr 08:54

Pre-release of v0.0.1; under testing.