- Add getAllAlgos extension APIs
- TensileLite support new epilogues: gradient gelu, gradient D, gradient A/B, aux
- Add sample package including three sample apps
- Add a new C++ GEMM class in the hipBLASLt extension
- Refactor the GroupGemm APIs as a C++ class in the hipBLASLt extension
- Rename the scaleD vector enum to HIPBLASLT_MATMUL_DESC_D_SCALE_VECTOR_POINTER
- Enable norm check validation for CI
- GSU kernel optimization: wider memory, PGR N
- Update logic YAML files to improve some FP16 NN sizes
- Add GSU kernel support to GroupGemm
- Add grouped GEMM tuning for Aldebaran
- Added CI tests for TensileLite
- Initialized extension grouped GEMM APIs (FP16 only)
- Added group gemm sample app: example_hipblaslt_groupedgemm
- Fixed incorrect results from the ScaleD kernel
- Tuned equality sizes for HHS data type
- Reduced host-side overhead for hipblasLtMatmul()
- Removed unused kernel arguments
- Scheduled value setup before the first s_waitcnt
- Refactored TensileLite host code
- Optimized build time
- Enable hipBLASLt APIs
- Support gfx90a
- Support problem types: FP32, FP16, BF16
- Support activations: ReLU, GELU
- Support bias vectors
- Integrate with the TensileLite kernel generator
- Add Gtest: hipblaslt-test
- Add full-function benchmarking tool: hipblaslt-bench
- Add sample app: example_hipblaslt_preference
- Grid-based solution search algorithm for untuned sizes
- Tune 10k sizes for each problem type