Question about chunked prefill #10

Open
SimonSongg opened this issue Nov 11, 2024 · 0 comments
Comments

@SimonSongg

Hi,
While reading the LONG-CONTEXT BENCHMARKS section of your paper, I noticed that there is no latency improvement when the chunk size equals the context length. This result makes sense to me, since that setting amounts to plain full attention, with no KV cache pruned.
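
For concreteness, here is how I picture the degenerate case, as a minimal sketch assuming a HuggingFace-style causal LM interface (the function and variable names below are mine, not this repo's API): with `chunk_size == seq_len` the loop body runs exactly once, so there is never a point between chunks at which pruning could happen.

```python
import torch

# Sketch of chunked prefill, assuming a HuggingFace-style causal LM
# (names here are mine, not this repo's API).
def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int):
    past_kv = None
    for start in range(0, input_ids.size(1), chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        out = model(chunk, past_key_values=past_kv, use_cache=True)
        # Between chunks is where KV-cache pruning would happen; with
        # chunk_size == input_ids.size(1) this loop runs exactly once,
        # i.e. full attention with nothing pruned.
        past_kv = out.past_key_values
    return out.logits, past_kv
```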

I also saw in the appendix of the paper that block-sparse attention was used for training. Could it also be used at inference time to prefill the full context, memory permitting? Since it computes over the full context while skipping some of the intermediate attention blocks, it seems like it could speed up prefill without chunking.
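
To illustrate what I mean (this is just my own toy sketch of a block-sparse pattern, not the one from the paper or this repo): the full-context prefill would use a causal attention mask in which some off-diagonal "middle" blocks are dropped, so part of the attention computation is skipped while every token is still prefilled in one pass.

```python
import torch

# Illustrative block-sparse causal mask (my own toy pattern, not the paper's):
# each query block attends to a few initial "sink" blocks and to its local
# neighborhood; the remaining middle blocks are skipped entirely.
def block_sparse_mask(seq_len: int, block_size: int,
                      keep_sink: int = 1, keep_local: int = 1) -> torch.Tensor:
    n_blocks = seq_len // block_size
    block_mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
    for i in range(n_blocks):
        block_mask[i, :keep_sink] = True                    # initial "sink" blocks
        block_mask[i, max(0, i - keep_local):i + 1] = True  # local/diagonal blocks
    # Expand to token level and intersect with the causal lower triangle.
    token_mask = block_mask.repeat_interleave(block_size, dim=0) \
                           .repeat_interleave(block_size, dim=1)
    return token_mask & torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```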

Thanks!
