
CUTLASS Grouped GEMM #6

Open · imoneoi wants to merge 5 commits into main
Conversation

@imoneoi commented Dec 27, 2023

Use CUTLASS for grouped GEMM (covering the no-transposition, trans_a, and trans_b variants).

Yields a ~20% speedup on A100.
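For illustration, here is a rough sketch of what instantiating a CUTLASS grouped GEMM looks like, modeled on the CUTLASS 2.x `GemmGrouped` API (as in CUTLASS's `examples/24_gemm_grouped`). The tile shapes and types below are assumptions, not necessarily the configuration this PR uses:

```cpp
#include "cutlass/gemm/device/gemm_grouped.h"
#include "cutlass/gemm/kernel/default_gemm_grouped.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"
#include "cutlass/epilogue/thread/linear_combination.h"

// One grouped-GEMM instantiation for SM80 (A100) with bfloat16 inputs and
// float accumulation. The trans_a / trans_b variants would swap the RowMajor
// layouts below for ColumnMajor.
using GemmKernel = cutlass::gemm::kernel::DefaultGemmGrouped<
    cutlass::bfloat16_t, cutlass::layout::RowMajor,       // A: element, layout
    cutlass::ComplexTransform::kNone, 8,                  // A: transform, alignment
    cutlass::bfloat16_t, cutlass::layout::RowMajor,       // B
    cutlass::ComplexTransform::kNone, 8,
    cutlass::bfloat16_t, cutlass::layout::RowMajor,       // C/D
    float,                                                // accumulator
    cutlass::arch::OpClassTensorOp, cutlass::arch::Sm80,  // tensor cores, A100
    cutlass::gemm::GemmShape<128, 128, 32>,               // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,                 // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,                  // MMA instruction
    cutlass::epilogue::thread::LinearCombination<
        cutlass::bfloat16_t, 8, float, float>,            // D = alpha*AB + beta*C
    cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle,
    4                                                     // pipeline stages
>::GemmKernel;

using GemmGrouped = cutlass::gemm::device::GemmGrouped<GemmKernel>;
```

A single `GemmGrouped` launch then consumes an array of per-group problem sizes and per-group A/B/C/D pointers, instead of one kernel launch per group.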

@152334H commented Jan 1, 2024

NVIDIA/cutlass#1286 should be available on H100 now as well

@tgale96 (Owner) commented Jan 2, 2024

Hi! Thanks for the PR!

We have users who currently rely on the cuBLAS path for Hopper, which I believe this PR deletes.

> NVIDIA/cutlass#1286 should be available on H100 now as well

Since this is now available, it'd be great to add support for SM90! It looks like it requires a very new version of CUDA, so perhaps it would be best to keep the simple cuBLAS implementation around to fall back to if we can't support CUTLASS grouped GEMM.
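For what it's worth, a minimal sketch of the fallback dispatch being proposed (hypothetical helper name; the SM/toolkit thresholds are assumptions based on the discussion above):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>

// Decide at runtime whether the CUTLASS grouped GEMM path can be used on the
// current device; otherwise the caller keeps the simple cuBLAS implementation.
bool CutlassGroupedGemmSupported(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  int sm = prop.major * 10 + prop.minor;
#if CUDA_VERSION >= 12000
  // With a new enough toolkit, SM90 grouped GEMM (NVIDIA/cutlass#1286) is an option.
  return sm >= 80;
#else
  // Older toolkits: CUTLASS path for Ampere only, cuBLAS fallback for Hopper.
  return sm >= 80 && sm < 90;
#endif
}
```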

@152334H commented Jan 7, 2024

Is there any critical reason why grouped GEMM is hardcoded to use BFloat16? Or would a string replacement of bfloat16 with float16 just work?

@tgale96 (Owner) commented Jan 8, 2024

There is no reason why we only support BFloat16. I implemented only BFloat16 because that is what the user who needed this feature uses. It would be relatively easy to template our helpers and dispatch based on the input tensor type.
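As a sketch of that dispatch (hypothetical names; the actual helper signatures in this repo may differ):

```cpp
#include <torch/extension.h>
#include <cutlass/bfloat16.h>
#include <cutlass/half.h>

// Hypothetical CUTLASS-backed helper, templated on the element type and
// defined elsewhere.
template <typename Element>
void CutlassGroupedGemm(torch::Tensor a, torch::Tensor b, torch::Tensor c);

// Dispatch on the input tensor's dtype instead of hardcoding bfloat16.
void GroupedGemm(torch::Tensor a, torch::Tensor b, torch::Tensor c) {
  switch (a.scalar_type()) {
    case torch::kBFloat16:
      CutlassGroupedGemm<cutlass::bfloat16_t>(a, b, c);
      break;
    case torch::kFloat16:
      CutlassGroupedGemm<cutlass::half_t>(a, b, c);
      break;
    default:
      TORCH_CHECK(false, "grouped GEMM: unsupported dtype ", a.scalar_type());
  }
}
```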

@dfyz (Contributor) commented Jun 19, 2024

@tgale96
Hi! Are there any plans for this branch to eventually be merged? I'm not sure what the exact CUDA requirements for the newer CUTLASS versions are, but it might indeed be a good idea to restore the simple cuBLAS fallback for H100, so that this PR uses CUTLASS only on A100 (for now).

For context: I'm working on a branch that removes the CPU<->GPU sync for batch_sizes, and having both the forward and backward passes use CUTLASS is a prerequisite for that (since using cuBLAS effectively requires knowing the batch sizes on the host).
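To illustrate the host-size requirement, this is roughly what a cuBLAS-based grouped GEMM loop looks like (hypothetical names; a bf16, row-major forward pass is assumed; note the host-side loop over `batch_sizes`):

```cpp
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cstdint>
#include <vector>

// One cublasGemmEx call per group: m comes from batch_sizes, which therefore
// has to be known on the CPU (hence the CPU<->GPU sync). Row-major C = A * B
// is computed as column-major C^T = B^T * A^T by swapping the operands.
void CublasGroupedGemm(cublasHandle_t handle,
                       const std::vector<int64_t>& batch_sizes,  // host-side!
                       int64_t k, int64_t n,
                       const __nv_bfloat16* a,  // [sum(batch_sizes), k]
                       const __nv_bfloat16* b,  // [num_groups, k, n]
                       __nv_bfloat16* c) {      // [sum(batch_sizes), n]
  const float alpha = 1.0f, beta = 0.0f;
  for (int64_t m : batch_sizes) {
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 static_cast<int>(n), static_cast<int>(m), static_cast<int>(k),
                 &alpha,
                 b, CUDA_R_16BF, static_cast<int>(n),
                 a, CUDA_R_16BF, static_cast<int>(k),
                 &beta,
                 c, CUDA_R_16BF, static_cast<int>(n),
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    a += m * k; b += k * n; c += m * n;  // advance to the next group
  }
}
```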

I also stumbled upon a nasty CUTLASS bug when one of the elements in batch_sizes is 0. I'm not sure when/if the fix is going to be merged upstream, but it might be a good idea to backport the relevant changes to the CUTLASS version used in this repo.
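Until the upstream fix lands, one possible host-side workaround is to drop empty groups when building the problem list, though this needs the sizes on the host and so works against the no-sync goal above (hypothetical helper; the real fix belongs in CUTLASS itself):

```cpp
#include <cstdint>
#include <vector>
#include <cutlass/gemm_coord.h>

// Build the per-group problem sizes for CUTLASS grouped GEMM, skipping groups
// with zero rows to sidestep the zero-sized-problem bug.
std::vector<cutlass::gemm::GemmCoord> BuildProblemSizes(
    const std::vector<int64_t>& batch_sizes, int64_t k, int64_t n) {
  std::vector<cutlass::gemm::GemmCoord> problems;
  for (int64_t m : batch_sizes) {
    if (m == 0) continue;  // empty group: no work, and it would trip the bug
    problems.emplace_back(static_cast<int>(m), static_cast<int>(n),
                          static_cast<int>(k));
  }
  return problems;
}
```

(The per-group A/B/C pointers would have to be filtered consistently.)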

@tgale96 (Owner) commented Jun 24, 2024

Hey! It would be great to have a full CUTLASS path, but I don't personally have the cycles for it at the moment. Contributions would be very welcome, and I'd be happy to provide any guidance that is necessary!

@dfyz (Contributor) commented Jun 24, 2024

Cool! I opened #14 as a starting point. Any guidance would be much appreciated! :)
