v0.1.0

Latest
Released by @LeiWang1999 on 31 Jan 17:40
· 4 commits to main since this release
1082fbf

Benchmark

We evaluate the following categories of operations:

  1. FP16 Matrix Operations
    • GEMM (Matrix Multiplication)
    • GEMV (Matrix-Vector Multiplication)
  2. INT8 Matrix Operations
    • GEMM (Matrix Multiplication)
    • GEMV (Matrix-Vector Multiplication)
  3. Dequantization Operations
    • Weight Quantization (WQ) GEMM and GEMV
  4. Contiguous batching performance for enhanced GPU utilization

FP16 GEMM and GEMV

op_benchmark_a100_fp16_gemm
op_benchmark_a100_fp16_gemv
INT8 GEMM and GEMV

op_benchmark_a100_int8_gemm
op_benchmark_a100_int8_gemv

Dequantize GEMM and GEMV

op_benchmark_a100_wq_gemm
op_benchmark_a100_wq_gemv
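The WQ kernels fuse weight dequantization into the GEMM/GEMV itself. As a minimal illustrative sketch (not BitBLAS code — the function name, group size, and zero-point handling here are assumptions), the per-group dequantization such kernels perform on packed 4-bit weights looks roughly like:

```python
def dequantize_uint4(packed, scales, zeros, group_size=128):
    """Unpack two 4-bit weights per byte, then apply a per-group
    scale and zero-point: w = (q - zero) * scale."""
    out = []
    for byte in packed:
        for shift in (0, 4):          # low nibble first, then high nibble
            q = (byte >> shift) & 0xF
            g = len(out) // group_size  # index of the quantization group
            out.append((q - zeros[g]) * scales[g])
    return out

# Two bytes -> four 4-bit weights, all in one quantization group of 4
weights = dequantize_uint4([0x21, 0x43], scales=[0.5], zeros=[0], group_size=4)
# -> [0.5, 1.0, 1.5, 2.0]
```

In the fused kernels this unpacking happens in registers right before the multiply-accumulate, so the quantized weights never materialize in global memory at full precision.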

Contiguous Batching Performance

contiguous_batching_benchmark_a100

Benchmark Configuration

The benchmark configurations for each test scenario are detailed below:

| Config | Provider | M     | N     | K     |
|--------|----------|-------|-------|-------|
| V0     | None     | 1     | 16384 | 16384 |
| V1     | BLOOM    | 1     | 43008 | 14336 |
| V2     | BLOOM    | 1     | 14336 | 14336 |
| V3     | BLOOM    | 1     | 57344 | 14336 |
| V4     | BLOOM    | 1     | 14336 | 57344 |
| V5     | OPT      | 1     | 9216  | 9216  |
| V6     | OPT      | 1     | 36864 | 9216  |
| V7     | OPT      | 1     | 9216  | 36864 |
| V8     | LLAMA    | 1     | 22016 | 8192  |
| V9     | LLAMA    | 1     | 8192  | 22016 |
| V10    | LLAMA-2  | 1     | 8192  | 8192  |
| V11    | LLAMA-2  | 1     | 28672 | 8192  |
| V12    | LLAMA-2  | 1     | 8192  | 28672 |
| M0     | None     | 16384 | 16384 | 16384 |
| M1     | BLOOM    | 8192  | 43008 | 14336 |
| M2     | BLOOM    | 8192  | 14336 | 14336 |
| M3     | BLOOM    | 8192  | 57344 | 14336 |
| M4     | BLOOM    | 8192  | 14336 | 57344 |
| M5     | OPT      | 8192  | 9216  | 9216  |
| M6     | OPT      | 8192  | 36864 | 9216  |
| M7     | OPT      | 8192  | 9216  | 36864 |
| M8     | LLAMA    | 8192  | 22016 | 8192  |
| M9     | LLAMA    | 8192  | 8192  | 22016 |
| M10    | LLAMA-2  | 8192  | 8192  | 8192  |
| M11    | LLAMA-2  | 8192  | 28672 | 8192  |
| M12    | LLAMA-2  | 8192  | 8192  | 28672 |
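The V-series (M=1) and M-series (large M) rows stress very different hardware regimes. A back-of-the-envelope sketch (a hypothetical helper, assuming FP16 operands and that each matrix is read or written once) of arithmetic intensity shows why the M=1 cases behave as memory-bandwidth-bound GEMV while the large-M cases are compute-bound GEMM:

```python
def gemm_stats(m, n, k, bytes_per_elem=2):
    """Return (FLOPs, bytes moved, arithmetic intensity) for an
    M x K by K x N matmul, counting each operand matrix once."""
    flops = 2 * m * n * k                              # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved, flops / bytes_moved

# V0 from the table (GEMV: M=1) vs M0 (GEMM: M=16384), FP16
_, _, ai_gemv = gemm_stats(1, 16384, 16384)      # ~1 FLOP per byte
_, _, ai_gemm = gemm_stats(16384, 16384, 16384)  # ~5461 FLOPs per byte
```

At roughly 1 FLOP/byte, the M=1 configurations are limited by how fast the weight matrix streams from HBM, which is why dequantized (smaller) weights can speed up GEMV even though they add compute.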

What's Changed

  • fix typos by @xzyaoi in #23
  • [Kernel] Extend Fast Decoding to UINT2 + QZeros by @LeiWang1999 in #25
  • [FP8] Support FP8 MatrixCore Code gen and related test by @LeiWang1999 in #29
  • [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by @LeiWang1999 in #30
  • [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by @LeiWang1999 in #32
  • [BUG] Make sure the torch tensor is contiguous by @LeiWang1999 in #34
  • [BitNet] Disable accelerate for BitNET by @LeiWang1999 in #36
  • [FP8] Support Weight Dequantize FP16xFP8_E4M3 by @LeiWang1999 in #42
  • [DEV][FP8] Improve e4m3 decoding by @LeiWang1999 in #43
  • [Target] Improve TVM Target related items by @LeiWang1999 in #45
  • [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by @LeiWang1999 in #46
  • [Feature] Enhancing MatmulOps with Splitk Support by @LeiWang1999 in #48
  • [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by @LeiWang1999 in #49
  • [Dev] Improve General Matmul With Splitk by @LeiWang1999 in #50
  • [Dev] Bump Version to 0.0.1.dev9 by @LeiWang1999 in #51
  • [Dev] Fix GEMV Dynamic Scheduling with Splitk by @LeiWang1999 in #52
  • [BugFix] Fix a bug in Static shape build by @LeiWang1999 in #53
  • [Dev] Fix a bug within FP8 E4M3 Fast Decoding by @LeiWang1999 in #54
  • [Dev] Issue#24: Fix a bug of repack AutoGPTQ quantized parameters by @tzj-fxz in #57
  • [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by @Qubitium in #58
  • [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by @Qubitium in #59
  • Fix gpu model missing from tvm target remap by @Qubitium in #61
  • [Dev] Potentially improve performance through block reduction by @LeiWang1999 in #63
  • [Readme] Update support matrix in README by @LeiWang1999 in #67
  • [Dev] Move bitblas package to the project root by @LeiWang1999 in #68
  • [Dev] Refactor scripts based on our new directory structure by @LeiWang1999 in #69
  • [Dev] Refactor testing scripts and fix security issues by @LeiWang1999 in #72
  • [CI] Auto Format Checking and test checking. by @LeiWang1999 in #73
  • [Fix] Fix Bitblas Relax relevant pass and test by @LeiWang1999 in #74
  • [CI] Edit the notify setting in our CI by @LeiWang1999 in #76
  • [Dev] Move Relax Pass from testing to integration by @LeiWang1999 in #77
  • [Dev] Refactor the ops script implementation with SE by @LeiWang1999 in #78
  • [Dev] Fix a bug in general matmul ops with zero by @LeiWang1999 in #79
  • [Dev] Append Efficient CUDA test for low precision batch decoding by @LeiWang1999 in #80
  • [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by @LeiWang1999 in #83
  • [Dev] Refactor Modeling BitNet to support vLLM quant linear by @LeiWang1999 in #84
  • Fix database path default by @janEbert in #85
  • [Issue 62] flexible whl for different cuda version by @tzj-fxz in #86
  • Limiting parallel jobs for local build by @bibo-msft in #88
  • [Dev] Bump version to 0.0.1.dev13 by @LeiWang1999 in #87
  • [Dev] Feature Improves for bitnet and block reduction by @LeiWang1999 in #92
  • [Dev] Bug fix within block reduce schedule template by @LeiWang1999 in #93
  • [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by @LeiWang1999 in #94
  • [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by @LeiWang1999 in #95
  • [Dev] Append CUTLASS submodule by @LeiWang1999 in #96
  • [Dev] Add Basic Benchmark Implementation for operators by @LeiWang1999 in #98
  • [Dev] Improve benchmark scripts by @LeiWang1999 in #99
  • Fix virtual env issue for our benchmark workflow by @LeiWang1999 in #101
  • [BUG Fix] Add missing checkout statements in benchmark workflow by @LeiWang1999 in #102
  • Update benchmark.yml by @LeiWang1999 in #103
  • [BUG Fix] remove ref assignments of the pr commit by @LeiWang1999 in #104
  • Ref GPTQModel for 3rd support/integration by @Qubitium in #106
  • [Dev] Complete benchmark op sets of ci by @LeiWang1999 in #100
  • [Dev] Remove Redundant Dynamic Shared Memory sync by @LeiWang1999 in #107
  • [Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory by @LeiWang1999 in #110
  • [Dev] Enhance Lower Warp memory to support multi stage tensorization by @LeiWang1999 in #111
  • Refactor benchmark yml to disable alters on issue by @LeiWang1999 in #113
  • [Dev] Enhance LOP3 Instruction Registration to support incoming warp level lop3 instructions by @LeiWang1999 in #114
  • [Dev] Merge BlockReduce with naive schedule template by @LeiWang1999 in #119
  • [Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding by @LeiWang1999 in #124
  • [Fix][Dev] Typo fix for our workflow and enhance lop3 decode to support scaling by @LeiWang1999 in #125
  • [Dev] Convert the quant compress from numpy into tvm runtime by @LeiWang1999 in #126
  • Update documents by @xysmlx in #129
  • [Dev] Refactor the weight transformation to support upcoming stage3 transform by @LeiWang1999 in #130
  • [Dev] Bring Block Reduction into our search space and policy by @LeiWang1999 in #132
  • Fix retrieve head commit in benchmark by @LeiWang1999 in #134
  • [Integration] Upload tutorial for making a bitnet ckpt for vLLM by @LeiWang1999 in #135
  • [Typo] Fix missing links in the bitnet integration's docs by @LeiWang1999 in #136
  • fix BitNet integration for vLLM by @xysmlx in #137
  • fix BitNet integration for vLLM by @xysmlx in #139
  • [Dev] Set default weight transformation into Ladder Stage3 LDMatrixTransform by @LeiWang1999 in #133
  • [Dev] Disable Block reduction for int8 by default by @LeiWang1999 in #140
  • [Dev] BUG Fix for bitnet integration by @LeiWang1999 in #141
  • [Feature] Register Missing FastDecoding for INT8xINT4 by @LeiWang1999 in #142
  • [BUG Fix] Fix the NVCC Compile options for CUDA Version >= 12.5 by @LeiWang1999 in #143
  • [Integration] Compress Gateup and QKV for bitnet integration by @LeiWang1999 in #144
  • [Enhancement] Improve elementwise schedule via vectorization by @LeiWang1999 in #145
  • [Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering by @LeiWang1999 in #146
  • [Fix] Fix scale and zero scopes for scale only template by @LeiWang1999 in #147
  • [Dev] Support Numeric Precision BFloat16 as activation type by @LeiWang1999 in #148
  • [Version] Bump Version to 0.0.1.dev15 by @LeiWang1999 in #149
  • [Dev] Serialize Generated Kernel Name with Operator Config and Hint by @LeiWang1999 in #153
  • [BUG] Set Device when the kernel is applied to Multiple GPUs by @LeiWang1999 in #155
  • [Benchmark] Fast Decoding Benchmark by @LeiWang1999 in #158
  • [BUGFix] Disable tensorcore when shape is really small by @LeiWang1999 in #159
  • [BUGFix] Register missing FP8 LDMATRIX Instructions for dynamic shared memory by @LeiWang1999 in #162
  • [Docs] Update install command from github repo by @LeiWang1999 in #163
  • [BugFix] Fix BitBLAS Linear with BFloat16 input by @LeiWang1999 in #164
  • [BUGFix] Fix LowerThreadAllReduce Pass for Hopper Arch by @LeiWang1999 in #165
  • [Dev] Enhance Thread Sync Injector for Stream-K Implementation by @LeiWang1999 in #166
  • [Dev] Revert Hack impl for memory caching by @LeiWang1999 in #167
  • [TL] Update several TL Examples by @LeiWang1999 in #168
  • [TL] Enhance Layout Annotate Pass to handle PTX Inst by @LeiWang1999 in #170
  • chore(deps): bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in #175
  • [TL] Add TL Layout and Macro utils by @LeiWang1999 in #174
  • [TL] Support GEMM_SS Macro to perform gemm directly from shared memory by @LeiWang1999 in #176
  • [TL] Inject Storage Sync Scope Automatically for TL by @LeiWang1999 in #177
  • [TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle by @LeiWang1999 in #178
  • [TL] Enhance TL to import customized c headers by @LeiWang1999 in #179
  • [Dev] Bug fix for Block Reduce Template and improve TL by @LeiWang1999 in #183
  • [BugFix] Disable 8bit TensorCore for SM Version lower than 80 by @LeiWang1999 in #185
  • [Dev] Dequant SIMT Matmul Implementation by @LeiWang1999 in #188
  • [Dev] Improve Dequant performance on CUDA Simt by @LeiWang1999 in #189
  • [TL] Append Macro Test Case for GEMM and Dequant GEMM by @LeiWang1999 in #190
  • [TL] Add example usage/test case for Dynamic Symbolic by @LeiWang1999 in #191
  • [BugFix]Fix llvm install bug by @tzj-fxz in #193
  • [Test] Add Thread Level Macro Dequantize Gemm Test Cases by @LeiWang1999 in #194
  • [TL][BugFix] Add implementation of TL Gemm and Fix a bug for TL Jit by @LeiWang1999 in #195
  • [TL] test flashattention script by @tzj-fxz in #196
  • [TL][BugFix] Disable Buffer Vectorization and Add OP Related TL Test Cases by @LeiWang1999 in #197
  • [TL] Wrap TL Kernel with Scheduler by @LeiWang1999 in #199
  • [Dev][TL] Add TL BaseScheduler and Library Generator by @LeiWang1999 in #200
  • [Dev][TL] Hardware Aware Tuning Examples with TL by @LeiWang1999 in #201
  • [TL] initial implement flashattention op in TL by @tzj-fxz in #202
  • [Dev] Enhance Operator Cache to support multi-thread environments by @LeiWang1999 in #205
  • [TL] Adapt TL Hardware-aware Search Space with Roller by @LeiWang1999 in #207
  • [TL] [Doc] add flash attention usage document by @tzj-fxz in #210
  • [Dev] Add support and test case for Ladder Weight only Transformation Matmul Operator by @LeiWang1999 in #212
  • [Dev][TL] Merge Hopper and Pipeline Modifications by @LeiWang1999 in #213
  • [Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs by @LeiWang1999 in #214
  • [TL] [Issue215] add simplify pass for TL and test script, fixing issue by @tzj-fxz in #216
  • [Bugfix] Enhance LowerAsyncCopy Pass to handle INT8 dma copy with predicate by @LeiWang1999 in #219
  • [Dev] Disable smooth layout rewrite for buffer store in some case by @LeiWang1999 in #220
  • [Dev][TL] Enhance TL Parser to support flexible tile lang kernel implementation by @LeiWang1999 in #222
  • [Dev][TL] Implement Tile Language Dequant Matmul and Test Case by @LeiWang1999 in #224
  • [Issue 192] Tail split support for dynamic matmul by @tzj-fxz in #227
  • [Dev][TL] Following updates of Tile Language Backend by @LeiWang1999 in #226
  • [Dev] Add some tests and examples by @LeiWang1999 in #228
  • [AMD][HIP] Add HIP Code Generation with Block Primitives from Composable kernel Tile by @LeiWang1999 in #223
  • [Dev][Bugfix] Add target argument and remove override register for hip callback compile by @LeiWang1999 in #229
  • [Bugfix] Fix build bug due to submodule update by @LeiWang1999 in #230
  • [Dev] Support Tile Lang INT8xINT8 TensorCore Macro by @LeiWang1999 in #231
  • [Dev][TL] Implement MMA INT4 Tensor Core and Correctness Test Case. by @LeiWang1999 in #232
  • [Dev][BitNET] Implement INT4xINT2 GEMM by @LeiWang1999 in #233
  • [Dev][Bitnet] Implement Operator with INT4xINT4/INT2 by @LeiWang1999 in #234
  • [Dev] Update News in Readme by @LeiWang1999 in #235
  • [Dev] Enhance TileLang Backend and fix a bug for INT4xINT2 by @LeiWang1999 in #236
  • [DEV][TL] Support AMD Matrix Code Implementation by @LeiWang1999 in #237
  • [Dev][HIP] Fix MFMA Codegen by @LeiWang1999 in #238
  • [CI] Disable Benchmark workflow due to github action v4 updates by @LeiWang1999 in #239
  • [Dev] Enhance Infra for ROCM by @LeiWang1999 in #240
  • [Dev][AMD] Add AMD CDNA Arch by @Cunxiao2002 in #225
  • [Dev] Fix some lint issues by @LeiWang1999 in #241
  • [Dev][Relax] Update Bitblas end2end tuning example with relax by @LeiWang1999 in #242
  • [Dev] Fix illegal pass order by @LeiWang1999 in #243
  • [Docs] update the contributing's table of contents by @emmanuel-ferdman in #245
  • [Dev][AMD] Implement LDS Async Copy for CDNA Arch by @LeiWang1999 in #246
  • [Dev][AMD] Support LDS and Flash Attention for AMD Backend by @LeiWang1999 in #247
  • [AMD][TL] Introduce K Pack and a Conflict Free swizzling into Matrix Core by @LeiWang1999 in #248
  • [BUGFix] Introduce our own assert_close to allow a few mismatched elements in some cases by @LeiWang1999 in #249
  • [Dev][AMD] Implement conditional async load for AMD HIP Backend by @LeiWang1999 in #250
  • [BUGFix] Fix MatmulDequantize with FP4 Format by @LeiWang1999 in #254
  • [Dev] Enhance Backend Abstraction for TileLang by @LeiWang1999 in #255
  • [Docker] Add Dockerfile to set up the application environment by @LeiWang1999 in #260
  • [Relax] Fix end2end tuning for relax graph by @LeiWang1999 in #261
  • [Dev] Refactor codebase to save import time by @LeiWang1999 in #262
  • [Enhancement][TileLang] Introduce Pass LegalizeSafeMemoryAccess to auto protect memory access by Injecting IfThenElse Node by @LeiWang1999 in #267
  • [TileLang][Dev] Enhance Layout Inference Pass to infer with complex parallel primitives by @LeiWang1999 in #268
  • [Dev] Migrate default backend from tir into tilelang by @LeiWang1999 in #270
  • [Dev] Fallback NF format to TIR backend as TileLang implementation is not currently supported. by @LeiWang1999 in #272
  • [Dev] Implement TileLang NF Format Dequantize by @LeiWang1999 in #273
  • [Release] Bump version to 0.1.0 by @LeiWang1999 in #274
  • [Bugfix] Fix Mismatched Retnet LinearAttention Layout by @LeiWang1999 in #275
  • [Bugfix] Fix correctness issue for float16xuint1 with fast dequantize by @LeiWang1999 in #277
  • Fix: Remove composable_kernel include from setup.py by @LeslinD in #279
  • [Bugfix] Fix VERSION FileNotFoundError bugs via pip installation by @senlyu163 in #285
  • [Doc] Move Torch Tensors to GPU by @senlyu163 in #286

New Contributors

Full Changelog: v0.0.1dev...v0.1.0