Reasons for switching to CUTLASS-based kernel instead of custom kernel #2

Open
Yard1 opened this issue Nov 9, 2023 · 2 comments

Yard1 commented Nov 9, 2023

Hey folks, awesome and really impactful work with the repo and the paper.

I was wondering what the reason was for switching from the original bgmv kernel to a CUTLASS-based sgmv one. I understand that one advantage of sgmv is that it doesn't require the LoRA tensors to be in a single contiguous block of memory, but aside from that, were there any performance considerations that made you switch?

I can also see that there is a custom sgmv shrink kernel implementation but the expand version is WIP. Is that something you are planning to work on in the near future?

Furthermore, do the performance results in the paper concern the CUTLASS kernel or the custom kernel? From the description of the implementation I inferred the latter, but I was confused by the absence of the custom expand kernel in the repo.

Thanks, and great work!

@abcdabcd987 (Contributor)

Really good questions. And thanks for taking a close look at our code.

The semantics of BGMV and SGMV differ. BGMV was our first attempt [1]. It was designed for the case where every input is for a different LoRA model. We realized that the performance improvement of BGMV comes solely from utilizing more compute units. It does not scale well; the "free lunch" has a limit. But if we consider other use cases, where not every input is for a different model, we can extend this free lunch to much bigger batch sizes. You can see this difference in the roofline plot in the paper.
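
(For concreteness, here is a minimal NumPy sketch of the two semantics as described above; the shapes and function names are illustrative, not the actual kernel API. BGMV lets every row of the batch pick its own LoRA weight by index, while SGMV groups rows that share a LoRA into segments, so each segment becomes one small GEMM.)

```python
# Minimal reference sketch of the two semantics (NumPy; illustrative shapes and
# names, not the actual kernel signatures in the repo).
import numpy as np

def bgmv_ref(x, W_stack, idx):
    """BGMV: every input row picks its own LoRA weight.
    x: [B, d_in], W_stack: [num_loras, d_in, d_out] (one contiguous stack), idx: [B]."""
    return np.stack([x[i] @ W_stack[idx[i]] for i in range(x.shape[0])])

def sgmv_ref(x, weights, seg_starts):
    """SGMV: rows that share a LoRA are grouped into segments, so each segment
    becomes one small GEMM instead of many GEMVs.
    x: [B, d_in], weights: list of [d_in, d_out] arrays (need not be contiguous),
    seg_starts: [num_segments + 1] row offsets with seg_starts[-1] == B."""
    return np.concatenate([
        x[seg_starts[j]:seg_starts[j + 1]] @ W_j
        for j, W_j in enumerate(weights)
    ])
```

When every row uses a different LoRA, SGMV degenerates into one-row segments and behaves like BGMV; the extra headroom appears once segments contain more than one row.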

You are correct that our BGMV implementation requires a contiguous weight block, but that is not our primary reason for using cutlass. We could pass pointers to hand-written kernels as well, just as we do for SGMV shrink. We used cutlass simply because we were curious about it and gave it a try. Cutlass turned out to work quite well for expand.

However, we were not able to obtain an efficient shrink from cutlass. We searched the whole configuration space, and it just does not work. We also briefly tried Triton, but it didn't work either. So we rolled our own CUDA implementation.

For the benchmarks, we use cutlass expand and our hand-written shrink. We will release the hand-written expand soon.
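
(For readers wondering what "shrink" and "expand" refer to: they are the two halves of the LoRA delta. A minimal illustrative sketch with assumed shapes follows; the deployed kernels are the batched/segmented CUDA versions discussed above.)

```python
# Illustrative sketch of the LoRA shrink/expand split (NumPy; assumed shapes,
# not the repo's API).
import numpy as np

def lora_delta(x, A, B):
    """x: [batch, d], A: [d, r] with r << d, B: [r, d]."""
    v = x @ A      # "shrink": [batch, d] @ [d, r] -> [batch, r]
    return v @ B   # "expand": [batch, r] @ [r, d] -> [batch, d]
```

The shrink half produces a very narrow output (only r columns), which is the GEMM shape that, as noted above, cutlass could not handle efficiently; expand, with its wide output, is where cutlass worked well.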

[1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/

@Yard1 (Author) commented Nov 9, 2023

Awesome, thank you! Looking forward to custom expand.
