Python libraries & slow inference with large context windows for v.0.1.8 vs 0.1.7 #572
-
I'm not seeing any significant difference here. Did you update flash-attn when you updated PyTorch?
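A quick way to compare the two environments is to print the relevant versions inside each venv (a minimal sketch; it only assumes torch is importable and that flash-attn may or may not be installed):

```python
# Print the library versions most likely to affect kernel selection.
# Run this in each virtual environment and diff the output.
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed")
```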
-
flash-attn isn't included in requirements.txt because it isn't a requirement. It enables a lot of features, but making it a strict requirement would prevent a lot of people from using the library. Trying to pre-inference a set of system prompts doesn't sound practical, and you'd run out of system memory faster than you probably think. Or maybe I'm misunderstanding what you're trying to do. The dynamic generator does have prompt caching, which sounds similar to what you're suggesting: it retains as many previous sequences as it can and reuses their keys/values as much as possible.
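For illustration, a minimal sketch of using the dynamic generator so that shared prompt prefixes can be reused from the cache (based on the exllamav2 examples; the model path, sequence length, and prompts below are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder model directory
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)

# Lazy cache + autosplit load across the available GPUs
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# As far as I know, paged attention (and with it most of the cache reuse)
# needs flash-attn; without it, pass paged=False.
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

system_prompt = "You are a helpful assistant.\n\n"

# Both prompts share the system-prompt prefix; the second call can reuse
# the cached keys/values for that prefix instead of recomputing them.
out1 = generator.generate(prompt=system_prompt + "First question", max_new_tokens=128, add_bos=True)
out2 = generator.generate(prompt=system_prompt + "Second question", max_new_tokens=128, add_bos=True)
```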
-
In my 70B Llama3.1 journey, I noticed that with version 0.1.8 (the Llama3.1 update), inference latency increases substantially as the context window grows. The exact numbers depend on the setup, but the degradation is evident.
While troubleshooting, I discovered that the issue appears to be related to the installed Python packages. Ultimately I ran a diff between the two virtual environments (the new 0.1.8 one and my original environment, created at 0.1.0 and gradually upgraded to 0.1.7).
Findings:
The first number is the version installed through requirements.txt for 0.1.8; the second comes from the environment upgraded from 0.1.0 to 0.1.7 over time.
Some of these differences, I suspect, can be attributed to the inner workings of torch 2.4 vs. 2.3 or the updated CUDA libraries -- posting it here for visibility.
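For anyone reproducing the comparison, a small helper like this (an illustrative sketch, not part of the original report) can dump each venv's package versions to a file for diffing:

```python
# List every installed package and its version, one per line.
# Run inside each virtual environment, redirect to a file, and diff the two outputs.
from importlib.metadata import distributions

pkgs = sorted((dist.metadata["Name"], dist.version) for dist in distributions())
for name, version in pkgs:
    print(f"{name}=={version}")
```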
The hardware: 4x A10G GPUs (96GB VRAM), 48 CPUs, 196GB RAM, using auto-split during load
The model: a Llama3 fine-tune, max window size 256k, allocated length 100k, BW6 (the exact same model is used for both tests).
Using the base setup, running 0.1.7 in the venv initialized by 0.1.8 (the newer torch and CUDA libraries) makes it behave the same way: degradation with large context windows. Conversely, using the 0.1.7 environment (with the older torch and CUDA libraries) makes 0.1.8 behave normally. Given these findings, the versions below refer to the environments initialized by exllamav2, not the code itself.
Context window 0
Context window 12,500
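As a rough sketch of how latency at a given context fill can be measured (assuming the generator object from the snippet in the first reply; the filler prompt and token estimate are only approximations):

```python
import time

def tokens_per_second(generator, prompt, new_tokens=128):
    # Time a single end-to-end generation (including prompt ingestion)
    # and return a rough tokens/second figure for the new tokens.
    start = time.perf_counter()
    generator.generate(prompt=prompt, max_new_tokens=new_tokens, add_bos=True)
    return new_tokens / (time.perf_counter() - start)

# Empty context vs. roughly 12,500 tokens of filler context
# (one common word is approximately one token; this is not exact).
short_prompt = "Question: ...\nAnswer:"
long_prompt = ("hello " * 12500) + short_prompt

print("ctx ~0      :", tokens_per_second(generator, short_prompt))
print("ctx ~12,500 :", tokens_per_second(generator, long_prompt))
```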
Additional notes:
Version 0.1.7 doesn't support Llama3.1, so that comparison can't be tested directly. However, in version 0.1.8, Llama3.1's performance is consistent with Llama3.0's, exhibiting the same degradation as the context window grows.
Profiling with the context window filled to 12K tokens shows that the main offender appears to be calls to gemm_half_q_half:
0.1.8:
0.1.7:
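For reference, a profile like this can be captured with something along these lines (a sketch only, assuming the generator and long_prompt from the earlier snippets as the workload):

```python
# Capture CPU + CUDA kernel timings around one generation call,
# then print the kernels sorted by total CUDA time.
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    generator.generate(prompt=long_prompt, max_new_tokens=32, add_bos=True)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```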