
The smallest rank supported is 16? #17

Closed
jcao-ai opened this issue Nov 21, 2023 · 5 comments

jcao-ai commented Nov 21, 2023

Hi, I'm grateful for this nice work.

I'm a bit confused: it seems only ranks of 16, 32, and 64 are supported right now? (ref)

@abcdabcd987
Contributor

Correct. Would you mind clarifying your use case? I'm interested to learn.


jcao-ai commented Nov 21, 2023

> Correct. Would you mind clarifying your use case? I'm interested to learn.

Thanks for your quick response. We are interested in serving LoRA adapters with varying ranks, from 8 to 128, at the same time. BTW, your code implementation is quite neat.

@abcdabcd987
Contributor

Thanks for the kind words. Kudos to @yzh119

Currently, the rank needs to be either 16 or a multiple of 32, and all models in a batch need to have the same rank.

If I understand correctly, you have two usages that the current Punica doesn't support.

  1. Run with ranks other than 16, 32, or 64.
  2. Run different ranks in one batch.

Here's how you can mitigate it today if you don't want to be blocked on me (essentially the same solution to both problems):

In the case of other ranks, my current suggestion is to round up the rank to the closest bin (e.g., 12 -> 16, 17 -> 32, 45 -> 64). In the case of running different ranks in one batch, my suggestion is to round up to the biggest rank (a padding sketch follows below). I haven't tested r=128, but I'd say that if you round up all models to r=64, you'll see negligible overhead. This is my educated guess; I'll run a benchmark soon.

Here's why:

Figure 9 in our paper shows the latency difference across different ranks. The gist is that the latency difference is not that big, especially when a model is used more than once in the batch.

More importantly, Figure 10 shows that the SGMV latency difference will be submerged by the other parts of the transformer layer (self-attention, dense projections of the base model). That's why I'd expect to see negligible overhead when you round up all models to r=64.
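
To make the rounding concrete, here is a minimal padding sketch. This is not Punica's API; the helper names and the (in_features, r) / (r, out_features) weight layout are assumptions for illustration.

```python
import torch

# Supported rank bins in the current kernels: 16 or multiples of 32 (illustrative list).
SUPPORTED_RANKS = [16, 32, 64]

def round_up_rank(r: int) -> int:
    """Round an arbitrary rank up to the closest supported bin (12 -> 16, 17 -> 32, 45 -> 64)."""
    for candidate in SUPPORTED_RANKS:
        if r <= candidate:
            return candidate
    raise ValueError(f"rank {r} exceeds the largest supported rank {SUPPORTED_RANKS[-1]}")

def pad_lora_pair(wa: torch.Tensor, wb: torch.Tensor, target_rank: int):
    """Zero-pad a LoRA pair wa (in_features, r) and wb (r, out_features) up to target_rank.

    The extra columns of wa and rows of wb are zero, so x @ wa_pad @ wb_pad equals
    x @ wa @ wb; only the storage shape changes.
    """
    r = wa.shape[1]
    assert wb.shape[0] == r and r <= target_rank
    wa_pad = torch.zeros(wa.shape[0], target_rank, dtype=wa.dtype, device=wa.device)
    wb_pad = torch.zeros(target_rank, wb.shape[1], dtype=wb.dtype, device=wb.device)
    wa_pad[:, :r] = wa
    wb_pad[:r, :] = wb
    return wa_pad, wb_pad
```

To mix different ranks in one batch, the same idea applies: pad every adapter up to the largest (rounded-up) rank in the batch so they all share one shape.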

My TODO would be:

  1. Do some benchmarks to assure users that this mitigation is not slow.
  2. Improve API so that users don't need to implement this mitigation.


jcao-ai commented Nov 21, 2023

@abcdabcd987
Thanks a lot for your explanation. Padding adapters into the same shape is indeed a workaround; I will give it a try soon.

Furthermore, I have a question about BGMV and SGMV. There is another project named S-LoRA, which is based on this project; they seem to stick with the BGMV pattern and report quite good throughput.

I also found some discussion in this repo about this topic.

Which pattern/path do you think is better for production deployment? Thanks again.


abcdabcd987 commented Nov 21, 2023

BGMV was our first attempt at multi-LoRA serving; it assumes each input is for a different model. Although it was already a very good fit for that particular use case, we identified some limitations of BGMV and some other opportunities:

  1. Using BGMV for prefill is inefficient.
  2. The BGMV speedup primarily comes from utilizing more compute units on batches of small computations. This free lunch has a limit.
  3. I believe there are use cases in the decode stage where multiple inputs map to the same LoRA model. For example, AAABCC is a batch of size 6 but with only 3 LoRA models. When such sharing exists, the speedup comes from one more dimension, i.e., increasing arithmetic intensity (Figure 7 in our paper). As such, the free lunch can extend to even higher batch sizes (Figures 8 and 9).

That's why we created the SGMV kernel. BGMV semantics are a strict subset of SGMV's, and @yzh119 found a few cool tricks so that SGMV runs faster than BGMV even with the same semantics. Even if your use case is 100% BGMV semantics, we'd still recommend using SGMV; just treat it as a faster version of BGMV with a better API. I'll do some benchmarks to clear up the confusion.
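
To illustrate the difference in semantics, here is a rough sketch of how the AAABCC example above might be described to each kernel. The tensor names and layout are illustrative assumptions, not the actual Punica kernel signatures.

```python
import torch

# Decode batch "AAABCC": 6 tokens, but only 3 distinct LoRA models.

# BGMV semantics: one LoRA index per token, so every token is handled as if it
# belonged to its own model, even when neighbors share one.
bgmv_lora_indices = torch.tensor([0, 0, 0, 1, 2, 2])

# SGMV semantics: tokens that use the same model are grouped into contiguous
# segments, so the kernel sees 3 segments instead of 6 independent rows and can
# multiply each segment's tokens against its LoRA weights in one small GEMM,
# raising arithmetic intensity when models are shared.
segment_offsets = torch.tensor([0, 3, 4, 6])  # segment i covers rows [offsets[i], offsets[i+1])
segment_lora_ids = torch.tensor([0, 1, 2])    # which LoRA model each segment uses

# BGMV is the degenerate case where every segment has length 1, which is why
# BGMV semantics are a strict subset of SGMV's.
```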

I'm very glad that our idea and early implementation were recognized by the research community. Our early version (BGMV) was open-sourced on 9/11. While we kept developing actively, we didn't push new commits to the OSS repo after 9/16. After we submitted the paper and cleaned up the code a little bit, we pushed the latest version (SGMV) on 11/2 (#1).

jcao-ai closed this as completed on Nov 23, 2023