Prompt lookup decoding #202
Conversation
Haha, awesome! I was actually planning this after finishing my speculative decoding but didn't get to it. Glad someone else did! :D
Thanks for laying the groundwork! I had a review comment on your PR that was ignored, but I think it's quite important. Any thoughts on that?
What was your comment? I couldn't find it.
You've since changed that part of the code but it was about this line in my
I'm fairly confident the accepted tokens should be
Yeah, I think you are correct, and it should have that behavior now. Let me know if you still think it's an issue.
Nice, looks good! I like that you avoided main model sampling and instead just compare the draft / main probs directly. Do you have a reference for this code? I haven't seen it before. Note:
Everything there except the inclusion of the
The intention is to normalize them, since the raw output of the model is not normalized.
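(For readers of this thread: below is a minimal, self-contained sketch of the kind of acceptance rule being discussed, i.e. comparing draft and main probabilities directly instead of sampling from the main model. The softmax normalization mirrors the comment above; the function names and the exact criterion, accepting a drafted token while the main model assigns it at least as much probability as the draft did, are illustrative assumptions and not necessarily the actual code in #149 or this PR.)

```python
import math

def normalize(logits):
    # The raw model outputs (logits) are unnormalized, so convert them to
    # probabilities with a softmax before comparing draft and main models.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def count_accepted(draft_tokens, draft_logits, main_logits):
    # Illustrative acceptance rule (an assumption, not the PR's exact code):
    # walk the drafted tokens in order and accept while the main model gives
    # the token at least as much probability as the draft model did.
    accepted = 0
    for tok, d, m_ in zip(draft_tokens, draft_logits, main_logits):
        if normalize(m_)[tok] >= normalize(d)[tok]:
            accepted += 1
        else:
            break
    return accepted
```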
Thanks!
I missed the
Hey @LeonEricsson, I like this PR a lot, but TBH I'm not entirely sure what to do with it. Prompt lookup decoding is a bit niche to dedicate a whole example to. So one thing we could do is try to merge it into the speculative decoding example (and support different draft model strategies). The only challenge there is that I changed that example to use T5 instead of a more traditional causal LM. So I see a couple of options for moving forward (both of which involve integrating into speculative decoding).
What do you think?
Gotchu, it could probably slide into the speculative example; I just want to make sure things remain modular so as not to strain users attempting to understand and reimplement the examples. I haven't looked through the speculative example thoroughly since the change to T5, but I'll give it a look and try to decide what's most appropriate between 1) and 2). Does T5 vs. an LLM really change that much in
Would you like a new PR for 1) and/or 2), or keep it all here?
Not really that much TBH
Whatever is easier for you.
I very much agree with that goal. These examples are meant to be instructive and hackable (hence simple). With that in mind, I would say worry a bit less about code duplication and a bit more about keeping the actual implementation simple and modular. For example, if you need to make a different class in
I've begun implementing a PromptLookupDecoder class inside the speculative
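(A rough, hypothetical skeleton of that split, with all names and defaults made up for illustration rather than taken from this PR or #237: the prompt lookup class only replaces the draft-generation step, drafting tokens by searching the existing context instead of calling a separate draft model, while verification against the main model can stay shared with the speculative decoding path.)

```python
class PromptLookupDecoder:
    """Illustrative sketch only: proposes draft tokens by ngram lookup in
    the existing context instead of running a separate draft model."""

    def __init__(self, model, ngram_size=3, num_draft=10):
        self.model = model            # main model, shared with the speculative example
        self.ngram_size = ngram_size  # length of the suffix to match in the context
        self.num_draft = num_draft    # maximum number of tokens to copy per draft

    def generate_draft(self, tokens):
        # Replaces the draft-model forward pass in speculative decoding:
        # find an earlier occurrence of the last `ngram_size` tokens and
        # propose the tokens that followed it (empty list if no match).
        ngram = tokens[-self.ngram_size:]
        for i in range(len(tokens) - self.ngram_size - 1, -1, -1):
            if tokens[i : i + self.ngram_size] == ngram:
                start = i + self.ngram_size
                return tokens[start : start + self.num_draft]
        return []
```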
I don't disagree. I would prefer the example to work with an LLM instead of T5. If you prefer to wait, I might have some time to look into it a bit more this weekend / next week.
I have done most of the work now; I just have to debug one error and then it should be ready (I've started a draft PR, #237). That being said, I don't mind updating PromptLookupDecoder if you have some time to fix the speculative example. Please double-check that you're satisfied with the division of speculative // prompt lookup in #237; I'd be happy to change things around if needed.
Implementation moved to #237. |
This PR implements an example of the "Prompt Lookup Decoding" technique:
https://github.com/apoorvumang/prompt-lookup-decoding
using a local Mistral model.
There is an ongoing PR for speculative decoding (#149), and I imagine we'll want to sync our implementations before merging. I've mostly followed #149, but some things differ, mainly because #149 uses a Llama model (loaded from HF) while I load a local Mistral model. The Mistral and Llama example implementations are inherently different, which also leads to some minor differences.
I have confirmed speedups through a simple repetition test:
[Screenshot: with prompt lookup (accepted draft tokens in blue)]
[Screenshot: without prompt lookup]
There are a couple of key considerations when implementing the prompt lookup search (see the sketch after this list):
- Should we accept a match even if it only lets us draft a few tokens, or move on to look for a match with a smaller ngram? We risk not finding a match at all if we move on, but on the other hand we might draft just a single token.
- What token acceptance scheme should be used?
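To make the first trade-off concrete, here is a minimal sketch of one possible prompt lookup search (an illustration under assumed names and defaults, not this PR's implementation): it tries the largest ngram first and only falls back to smaller ngrams when the best match found so far would draft fewer than a minimum number of tokens.

```python
def find_draft(tokens, max_ngram=3, num_draft=10, min_draft=2):
    # Search for earlier occurrences of the trailing ngram and copy the
    # tokens that followed the match as the draft. Start with the largest
    # ngram; only move on to a smaller one if the best draft so far is
    # shorter than `min_draft`, which is exactly the trade-off described above.
    best = []
    for n in range(max_ngram, 0, -1):
        ngram = tokens[-n:]
        for i in range(len(tokens) - n):  # occurrences strictly before the suffix
            if tokens[i : i + n] == ngram:
                candidate = tokens[i + n : i + n + num_draft]
                if len(candidate) > len(best):
                    best = candidate
        if len(best) >= min_draft:
            break
    return best  # may be empty: fall back to normal decoding for one step
```

For example, with `tokens = [5, 6, 7, 8, 5, 6, 7]` the trailing trigram `[5, 6, 7]` matches at the start of the sequence, so `[8, 5, 6, 7]` would be proposed as the draft.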
This implementation:
TODO: