[Question] In the output results of the attention_with_fused_qkv function, some slice accuracies are abnormal #3093
Comments
Hi @ifndefendif, thank you so much for bringing this up! Just to make sure I understand your description, would you mind elaborating a bit on this?
I'm wondering what the 0.78 factor here means.
It looks like a proportional-range issue. The dimensions of qkv are [batch_size, token_lens, hidden_state]. With 100 tokens, the accuracy is correct for the first 78 tokens but wrong for the last 22; with 200 tokens, the results are approximately correct for the first 156 tokens. This pattern suggests that the abnormal accuracy is related to token_lens. I would like to understand which factors inside the function could influence these results.
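To pin down exactly where the error starts, a small host-side check along the lines below could help (a minimal sketch, not from the original report; `out_fused` stands for the attention_with_fused_qkv result copied back to host memory and `out_ref` for a float32 reference, both hypothetical names, and the hidden size 896 for Qwen2.5-0.5B is an assumption):

```python
import numpy as np

def first_bad_token(out_fused: np.ndarray, out_ref: np.ndarray, atol: float = 1e-2) -> int:
    """Return the first token index (axis 1) whose max abs error exceeds atol,
    or -1 if every token is within tolerance. Shapes: [batch, tokens, hidden]."""
    err = np.abs(out_fused.astype(np.float32) - out_ref.astype(np.float32))
    per_token = err.max(axis=(0, 2))              # max error per token position
    bad = np.nonzero(per_token > atol)[0]
    return int(bad[0]) if bad.size else -1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out_ref = rng.standard_normal((1, 200, 896), dtype=np.float32)  # hidden size 896 assumed
    out_fused = out_ref.copy()
    out_fused[:, 156:, :] += 0.1                  # simulate divergence near 0.78 * 200
    print(first_bad_token(out_fused, out_ref))    # -> 156
```

Printing or plotting `per_token` directly would also show whether the error really starts at a sharp 0.78 boundary or grows gradually along the sequence.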
Thanks @ifndefendif! Did you observe this on CUDA or on other platforms? It would also be very helpful if you could share the commands you used to run.
And just to make sure I'm understanding correctly: this observation is for prefill (not decode), right?
Thank you for your response. The hardware device is NVIDIA A10. |
Yes, the issue does occur during the prefill stage. However, since the calculation was already wrong at the prefill stage, the correctness of the decode-stage results could not be verified.
❓ General Questions
Hello, I encountered an issue while deploying with mlc_llm in C++.
The model is Qwen2.5-0.5B.
The KV cache is created with "create_tir_paged_kv_cache".
When running prefill, the result of "paged_kv_cache.attention_with_fused_qkv" did not meet expectations.
The qkv input here is normal, and the output dimensions are [b, s, hq * d]. The results in roughly the [b, s * 0.78 :, hq * d] slice are abnormal (prefill tests with different token lengths all follow this rule); the later part of the sequence shows a significant accuracy error.
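For comparison, here is a minimal NumPy reference (a sketch, not MLC-LLM code) of what fused-QKV causal attention is expected to compute during prefill; the head counts assume the commonly cited Qwen2.5-0.5B configuration (14 query heads, 2 KV heads, head dim 64) and should be adjusted to the actual config:

```python
import numpy as np

def ref_attention_with_fused_qkv(qkv, h_q=14, h_kv=2, d=64):
    """qkv: [b, s, (h_q + 2*h_kv) * d] -> [b, s, h_q * d], causal prefill attention."""
    b, s, _ = qkv.shape
    qkv = qkv.reshape(b, s, h_q + 2 * h_kv, d)
    q, k, v = qkv[:, :, :h_q], qkv[:, :, h_q:h_q + h_kv], qkv[:, :, h_q + h_kv:]
    group = h_q // h_kv                                  # GQA: queries per KV head
    k = np.repeat(k, group, axis=2)                      # [b, s, h_q, d]
    v = np.repeat(v, group, axis=2)
    scores = np.einsum("bqhd,bkhd->bhqk", q, k) / np.sqrt(d)
    scores = scores + np.triu(np.full((s, s), -np.inf), k=1)   # causal mask
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("bhqk,bkhd->bqhd", probs, v)         # [b, s, h_q, d]
    return out.reshape(b, s, h_q * d)

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal((1, 8, (14 + 2 * 2) * 64), dtype=np.float32)
    print(ref_attention_with_fused_qkv(x).shape)         # (1, 8, 896)
```

Comparing slices of this reference with the kernel output token by token should show whether the divergence really starts around s * 0.78 or elsewhere.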
What could be the reason?
Thanks~