
[Question] In the output of the attention_with_fused_qkv function, some slices have abnormal accuracy #3093

Open · ifndefendif opened this issue on Jan 17, 2025 · 6 comments
Labels: question (Question about the usage)

Comments

ifndefendif commented Jan 17, 2025

❓ General Questions

Hello, I encountered an issue while deploying with mlc_llm in C++.
The model is Qwen2.5-0.5B.
The kv_cache is created using "create_tir_paged_kv_cache".
When performing prefill, I found that the result of "paged_kv_cache.attention_with_fused_qkv" did not meet expectations (the call pattern is sketched at the end of this comment).
The qkv input here is normal and the output has shape [b, s, hq * d], but the slice [b, s * 0.78 :, hq * d] is abnormal (prefill tests with different token lengths all follow this rule); the trailing part of the result shows a significant accuracy error.
What could be the reason?
Thanks~
(Screenshot of the output values attached.)
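For reference, this is roughly how the attention call is wired up in my model definition, following the usual MLC-LLM fused-QKV pattern. This is a paraphrased sketch rather than the exact library code; module names such as c_attn and o_proj and the exact attention_with_fused_qkv signature are assumptions and may differ across versions.

```python
# Paraphrased sketch of a fused-QKV attention block using PagedKVCache
# (names and the exact call signature are assumptions, not verbatim code).
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import op


class FusedQKVAttention(nn.Module):
    def __init__(self, hidden_size: int, num_q_heads: int, num_kv_heads: int, head_dim: int):
        self.num_q_heads = num_q_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        # Single projection producing Q, K and V concatenated along the head axis.
        self.c_attn = nn.Linear(hidden_size, (num_q_heads + 2 * num_kv_heads) * head_dim, bias=True)
        self.o_proj = nn.Linear(num_q_heads * head_dim, hidden_size, bias=False)

    def forward(self, hidden_states: nn.Tensor, paged_kv_cache, layer_id: int):
        b, s, _ = hidden_states.shape
        h_q, h_kv, d = self.num_q_heads, self.num_kv_heads, self.head_dim
        # Fused QKV projection, reshaped to [b, s, h_q + 2 * h_kv, d].
        qkv = op.reshape(self.c_attn(hidden_states), (b, s, h_q + 2 * h_kv, d))
        # Attention over the paged KV cache; depending on the version a
        # scaling-factor argument may also be required here.
        attn = paged_kv_cache.attention_with_fused_qkv(layer_id, qkv, h_q)
        # After this reshape the output has shape [b, s, h_q * d]; the abnormal
        # slice described above is roughly output[:, int(s * 0.78):, :].
        output = op.reshape(attn, (b, s, h_q * d))
        return self.o_proj(output)
```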

MasterJH5574 (Member) commented

Hi @ifndefendif, thank you so much for bringing this up! Just want to make sure I understand your description, do you mind elaborating a bit on this?

the output has shape [b, s, hq * d], but the slice [b, s * 0.78 :, hq * d] is abnormal

I'm wondering what the 0.78 factor here means.

ifndefendif (Author) commented

I'm wondering what the 0.78 factor here means.

It seems like there might be a proportional range issue. The dimensions of qkv are [batch_size, token_lens, hidden_state]. When there are 100 tokens, the accuracy is correct for the first 78 tokens, but problematic for the last 22. Similarly, with 200 tokens, the results are approximately correct for the first 156 tokens. This pattern suggests that the abnormal calculation accuracy could be related to the length of token_lens. I would like to understand what potential factors within the function itself might be influencing these calculation results.
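As a rough illustration, the breakpoint can be located by comparing the MLC output against a trusted reference computed for the same input. The helper below is hypothetical; mlc_out and ref_out are assumed to be numpy arrays of shape [b, s, hq * d], and the tolerance is arbitrary.

```python
import numpy as np


def first_bad_token(mlc_out: np.ndarray, ref_out: np.ndarray, atol: float = 1e-2) -> int:
    """Return the first token index whose max abs error exceeds atol, or -1 if none."""
    # Per-token maximum absolute error, reduced over the batch and hidden dims.
    err = np.abs(mlc_out.astype(np.float32) - ref_out.astype(np.float32)).max(axis=(0, 2))
    bad = np.nonzero(err > atol)[0]
    return int(bad[0]) if bad.size else -1
```

With 100 prefill tokens this comes out around index 78, and with 200 tokens around index 156, i.e. the breakpoint tracks roughly 0.78 * token_lens.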
Thanks~

MasterJH5574 (Member) commented

Thanks @ifndefendif! Did you observe this on CUDA or on other platforms? It would be very helpful if you can share the commands you used to run Qwen2.5-0.5B. It can be a TIR attention kernel bug in this case and we need to look into it.

MasterJH5574 (Member) commented

And just to make sure I'm understanding correctly--this observation is for prefill (not decode) right?

ifndefendif (Author) commented

Thanks @ifndefendif! Did you observe this on CUDA or on other platforms? It would be very helpful if you can share the commands you used to run Qwen2.5-0.5B. It can be a TIR attention kernel bug in this case and we need to look into it.

Thank you for your response. The hardware device is NVIDIA A10.

ifndefendif (Author) commented

And just to make sure I'm understanding correctly--this observation is for prefill (not decode) right?

Yes, the issue indeed occurs during the prefill stage. However, since the prefill results are already incorrect, I could not verify whether the decode results are correct.
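For what it's worth, one way to cross-check the prefill outputs is to run the same prompt through HuggingFace transformers and compare the per-layer hidden states against the MLC intermediates. A rough sketch (the checkpoint name, dtype, and prompt here are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

inputs = tok("the same prefill prompt used above", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[i] has shape [1, seq_len, hidden_size]; comparing each layer
# (and each token position) against the corresponding MLC tensor should show
# where the trailing ~22% of positions start to diverge.
```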
