The smallest rank supported is 16? #17
Hi, grateful for this nice work.
I'm confused: it seems that only ranks of 16, 32, and 64 are supported right now? (ref)

Comments
Correct. Would you mind clarifying your use case? I'm interested to learn.
Thanks for your quick response. Actually, we are interested in serving LoRA adapters with a variety of ranks, ranging from 8 to 128, at the same time. BTW, your code implementation is quite neat and well written.
Thanks for the kind words. Kudos to @yzh119. Currently, the rank needs to be either 16 or a multiple of 32, and all models need to have the same rank. If I understand correctly, you have two usages that the current Punica doesn't support.
Here's how you can mitigate it today if you don't want to be blocked by me (essentially the same solution to both problems): in the case of other ranks, my current suggestion is to round the rank up to the closest bin (e.g., 12 -> 16, 17 -> 32, 45 -> 64); in the case of running different ranks in one batch, my suggestion is to round everything up to the biggest rank (see the sketch at the end of this comment).

I haven't tested this, but here's why I expect it to work well: Figure 9 in our paper shows the latency difference across different ranks. The gist is that the latency difference is not that big, especially when a model is used more than once in the batch. More importantly, Figure 10 shows that the SGMV latency difference will be submerged by other parts of the transformer layer (self attention, dense projections of the base model). That's why I'd expect to see negligible overhead when you round all models up to the biggest rank.

My TODO would be:
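To make the round-up workaround concrete, here is a minimal sketch (the helper names are ours, not part of Punica's API) that bins a rank to 16 or the next multiple of 32 and zero-pads the LoRA factors, so the padded adapter produces exactly the same output as the original:

```python
import math

import torch


def round_up_rank(r: int) -> int:
    # Round an arbitrary LoRA rank up to a supported bin:
    # 16, or the next multiple of 32 (e.g., 12 -> 16, 17 -> 32, 45 -> 64).
    if r <= 16:
        return 16
    return math.ceil(r / 32) * 32


def pad_lora_weights(A: torch.Tensor, B: torch.Tensor, target_rank: int):
    # A: (d_in, r), B: (r, d_out). Zero-pad the rank dimension to target_rank.
    # The padded rows/columns are zero, so x @ A_pad @ B_pad == x @ A @ B.
    r = A.shape[1]
    A_pad = torch.zeros(A.shape[0], target_rank, dtype=A.dtype, device=A.device)
    B_pad = torch.zeros(target_rank, B.shape[1], dtype=B.dtype, device=B.device)
    A_pad[:, :r] = A
    B_pad[:r, :] = B
    return A_pad, B_pad
```

To mix ranks in one batch, you would pad every adapter up to the largest rounded-up rank in that batch before handing the weights to the kernel.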
@abcdabcd987 Furthermore, I have a question about BGMV vs. SGMV, and I found some discussion in this repo about this topic. Which pattern/path do you think is better for production deployment? Thanks again.
BGMV was our first attempt at multi-LoRA serving; it assumes each input is for a different model. Although it was already a very good fit for this particular use case, we identified that BGMV has limitations and that there are some other opportunities:
That's why we created the SGMV kernel. BGMV semantics are a strict subset of SGMV, and @yzh119 found a few cool tricks such that SGMV runs faster than BGMV even under the same semantics. Even if your use case is 100% BGMV semantics, we'd still recommend using SGMV. Just treat it as a faster version of BGMV with a better API. I'll do some benchmarks to clear up the confusion.

I'm very glad that our idea and early implementation got recognized by the research community. Our early version (BGMV) was open sourced on 9/11. While we kept developing actively, we didn't push new commits to the OSS repo after 9/16. After we submitted the paper and cleaned up the code a little bit, we pushed the latest version (SGMV) on 11/2 (#1).
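In case the difference between the two semantics is unclear, here is a rough PyTorch reference of the math (not the actual CUDA kernels; the function names and argument layout are our own sketch). BGMV gathers one adapter per input, while SGMV applies one adapter per contiguous segment of inputs, so BGMV is the special case where every segment has length 1:

```python
import torch


def bgmv_ref(y, x, w, indices):
    # y: (B, d_out), x: (B, d_in), w: (num_models, d_in, d_out), indices: (B,)
    # Each input i uses its own adapter weight w[indices[i]].
    for i in range(x.size(0)):
        y[i] += x[i] @ w[indices[i]]


def sgmv_ref(y, x, w, seg_starts):
    # seg_starts has num_segments + 1 entries; segment j covers rows
    # [seg_starts[j], seg_starts[j+1]), and all of them share adapter w[j].
    for j in range(len(seg_starts) - 1):
        s, e = seg_starts[j], seg_starts[j + 1]
        y[s:e] += x[s:e] @ w[j]


# BGMV expressed as SGMV: one unit-length segment per input.
# sgmv_ref(y, x, w[indices], list(range(len(x) + 1))) gives the same result
# as bgmv_ref(y, x, w, indices).
```

Grouping inputs that share an adapter into one segment lets the kernel reuse each adapter's weights across the whole segment, which is where the batching benefit comes from.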