Implement DeepSeek V2 #2744
Conversation
```rust
// (n, topk_group)
let group_idx = scores.topk_unsorted(self.cfg.topk_group)?.indices;
// (n, n_group)
let mut group_mask = group_scores.zeros_like()?;
```
Can't you just avoid this `mut` by chaining calls or using a local scope? It seems fairly easy to do. Please also review the other remaining `mut`s.
I've removed this `mut`. I also removed all other `mut` occurrences in the model execution path, except where they were unavoidable.
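For reference, a minimal sketch of the chaining pattern suggested above, with illustrative names and shapes (`group_scores`, `group_idx`, and the `scatter_add` fill are assumptions, not the PR's exact code):

```rust
use candle_core::{Result, Tensor};

/// Build a (n, n_group) mask over the selected top-k groups without a `mut`
/// binding, by chaining the calls into a single expression.
/// `group_scores`: (n, n_group) float scores;
/// `group_idx`: (n, topk_group) integer indices of the selected groups.
fn group_mask(group_scores: &Tensor, group_idx: &Tensor) -> Result<Tensor> {
    // Float ones with the same shape as the indices, used as scatter sources.
    let ones = group_idx.ones_like()?.to_dtype(group_scores.dtype())?;
    group_scores
        .zeros_like()?
        .scatter_add(group_idx, &ones, 1)
}
```

The same trick works anywhere a tensor is created and then written exactly once: fold the write into the expression that produces the binding, and the binding stays immutable.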
```rust
(q_pe, k_pe) = self.rotary_emb.forward(&q_pe, &k_pe, seqlen_offset)?;

let q = Tensor::cat(&[q_nope, q_pe], D::Minus1)?;
let mut k = Tensor::cat(&[k_nope, k_pe], D::Minus1)?;
```
Hi @EricLBuehler, I got the following error when running the DeepSeek-V2-Lite-Chat model:

```
called `Result::unwrap()` on an `Err` value: APIError { data: "shape mismatch in cat for dim 1, shape for arg 1: [1, 16, 5, 128] shape for arg 2: [1, 1, 5, 64]" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

I found that you repeat the second dim of Tensor `k_pe` to match Tensor `k_nope` and pad Tensor `v` with zeros in mistral.rs, but not here. Is there something special about this case? I also tested the mistral.rs implementation, but the lite model gives random outputs.
Hi @guoqingbao!

> I found that you repeat the second dim of Tensor k_pe to match Tensor k_nope and pad Tensor v with zeros in mistral.rs, but not here. Is there something special about this case?

That is only so the v head dim matches q/k, as PagedAttention requires it (be sure to unpad, too). We don't have PagedAttention in Candle (yet?), so it isn't included here.
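For anyone hitting the same mismatch, here is a minimal sketch of the two pieces discussed above, under assumed shapes and with illustrative function names (not the PR's exact code): repeating the single shared rope head of `k_pe` across all attention heads before the cat, and the zero-pad/unpad pair that a PagedAttention path needs:

```rust
use candle_core::{Result, Tensor, D};

/// `k_pe` carries one shared rope head, (b, 1, seq, rope_dim); repeat it
/// across all heads so it can be concatenated with `k_nope`,
/// (b, n_heads, seq, nope_dim), along the last dim.
fn assemble_k(k_nope: &Tensor, k_pe: &Tensor, n_heads: usize) -> Result<Tensor> {
    let k_pe = k_pe.repeat((1, n_heads, 1, 1))?; // (b, n_heads, seq, rope_dim)
    Tensor::cat(&[k_nope, &k_pe], D::Minus1) // (b, n_heads, seq, nope+rope)
}

/// PagedAttention needs q, k, and v to share one head dim: zero-pad the last
/// dim of v from v_dim up to qk_dim before attention...
fn pad_v(v: &Tensor, v_dim: usize, qk_dim: usize) -> Result<Tensor> {
    v.pad_with_zeros(D::Minus1, 0, qk_dim - v_dim)
}

/// ...and narrow the attention output back afterwards so downstream
/// projections see the true v head dim.
fn unpad(attn_out: &Tensor, v_dim: usize) -> Result<Tensor> {
    attn_out.narrow(D::Minus1, 0, v_dim)
}
```

This matches the error above: `k_nope` is (1, 16, 5, 128) while `k_pe` is (1, 1, 5, 64), so the repeat over the head dim has to happen before the cat.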
> I also tested the mistral.rs implementation, but the lite model gives random outputs.

Just checked DS V2 Lite on Metal in mistral.rs, and it works there. Was it failing for you?
@LaurentMazare I updated the model with some fixes and removed all the `mut`s. I tested the model, and it is working.
Thank you! I'll update the DeepSeek V3 PR #2745. I also have some MoE-specific optimizations that I'll be opening a PR for shortly!
This PR implements the DeepSeek V2 architecture.