Reasons for switching to CUTLASS-based kernel instead of custom kernel #2

Open
Yard1 opened this issue Nov 9, 2023 · 2 comments

Yard1 commented Nov 9, 2023

Hey folks, awesome and really impactful work with the repo and the paper.

I was wondering what the reason was for switching from the original bgmv kernel to a CUTLASS-based sgmv one. I understand that one advantage of sgmv is that it doesn't require the LoRA tensors to be in a single contiguous block of memory, but aside from that, were there any performance considerations that made you switch?

I can also see that there is a custom sgmv shrink kernel implementation but the expand version is WIP. Is that something you are planning to work on in the near future?

Furthermore, do the performance results in the paper concern the CUTLASS kernel or the custom kernel? From the description of the implementation I inferred the latter, but I was confused by the absence of the custom expand kernel in the repo.

Thanks, and great work!

@abcdabcd987 (Contributor)

Really good questions. And thanks for taking a close look at our code.

The semantics of BGMV and SGMV differ. BGMV was our first attempt [1]. It was designed for the case where every input is for a different LoRA model. We realized that the performance improvement of BGMV comes solely from utilizing more compute units. It does not scale well; the "free lunch" has a limit. But if we consider other use cases, where not every input is for a different model, we can extend this free lunch to much bigger batch sizes. You can see this difference in the roofline plot in the paper.
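
(For concreteness, here is a minimal NumPy sketch of the two semantics as described above; the shapes and function names are illustrative, not the actual kernel API. BGMV lets every row of the batch pick its own LoRA weight by index, while SGMV groups rows that share a LoRA into segments, so each segment becomes one small GEMM.)

```python
# Minimal reference sketch of the two semantics (NumPy; illustrative shapes and
# names, not the actual kernel signatures in the repo).
import numpy as np

def bgmv_ref(x, W_stack, idx):
    """BGMV: every input row picks its own LoRA weight.
    x: [B, d_in], W_stack: [num_loras, d_in, d_out] (one contiguous stack), idx: [B]."""
    return np.stack([x[i] @ W_stack[idx[i]] for i in range(x.shape[0])])

def sgmv_ref(x, weights, seg_starts):
    """SGMV: rows that share a LoRA are grouped into segments, so each segment
    becomes one small GEMM instead of many GEMVs.
    x: [B, d_in], weights: list of [d_in, d_out] arrays (need not be contiguous),
    seg_starts: [num_segments + 1] row offsets with seg_starts[-1] == B."""
    return np.concatenate([
        x[seg_starts[j]:seg_starts[j + 1]] @ W_j
        for j, W_j in enumerate(weights)
    ])
```

When every row uses a different LoRA, SGMV degenerates into one-row segments and behaves like BGMV; the extra headroom appears once segments contain more than one row.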

You are correct that our BGMV implementation requires a contiguous weight block, but that is not our primary reason for using cutlass. We could pass pointers to hand-written kernels as well, just as we do for SGMV shrink. We used cutlass simply because we were curious about it and gave it a try. Cutlass turned out to work quite well for expand.

However, we were not able to obtain an efficient shrink from cutlass. We searched the whole configuration space, and it just does not work. We also briefly tried Triton, but it didn't work either. So we rolled our own CUDA implementation.

For the benchmarks, we use cutlass expand and our hand-written shrink. We will release the hand-written expand soon.
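
(For readers wondering what "shrink" and "expand" refer to: they are the two halves of the LoRA delta. A minimal illustrative sketch with assumed shapes follows; the deployed kernels are the batched/segmented CUDA versions discussed above.)

```python
# Illustrative sketch of the LoRA shrink/expand split (NumPy; assumed shapes,
# not the repo's API).
import numpy as np

def lora_delta(x, A, B):
    """x: [batch, d], A: [d, r] with r << d, B: [r, d]."""
    v = x @ A      # "shrink": [batch, d] @ [d, r] -> [batch, r]
    return v @ B   # "expand": [batch, r] @ [r, d] -> [batch, d]
```

The shrink half produces a very narrow output (only r columns), which is the GEMM shape that, as noted above, cutlass could not handle efficiently; expand, with its wide output, is where cutlass worked well.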

[1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/

@Yard1 (Author) commented Nov 9, 2023

Awesome, thank you! Looking forward to custom expand.
