Python libraries & slow inference with large context windows for v.0.1.8 vs 0.1.7 #572
-
I'm not seeing any significant difference here. Did you update flash-attn when you updated PyTorch?
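A quick way to compare the two environments is to print the relevant versions inside each venv (a minimal sketch; it only assumes torch is importable and that flash-attn may or may not be installed):

```python
# Print the library versions most likely to affect kernel selection.
# Run this in each virtual environment and diff the output.
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed")
```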
-
flash-attn isn't included in requirements.txt because it isn't a requirement. It enables a lot of features, but making it a strict requirement would prevent a lot of people from using the library. Trying to pre-inference a set of system prompts doesn't sound practical, and you'd run out of system memory faster than you probably think. Or maybe I'm misunderstanding what you're trying to do. The dynamic generator does have prompt caching, which sounds similar to what you're suggesting: it retains as many previous sequences as it can and reuses their keys/values as much as possible.
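For illustration, a minimal sketch of using the dynamic generator so that shared prompt prefixes can be reused from the cache (based on the exllamav2 examples; the model path, sequence length, and prompts below are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder model directory
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)

# Lazy cache + autosplit load across the available GPUs
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

# As far as I know, paged attention (and with it most of the cache reuse)
# needs flash-attn; without it, pass paged=False.
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

system_prompt = "You are a helpful assistant.\n\n"

# Both prompts share the system-prompt prefix; the second call can reuse
# the cached keys/values for that prefix instead of recomputing them.
out1 = generator.generate(prompt=system_prompt + "First question", max_new_tokens=128, add_bos=True)
out2 = generator.generate(prompt=system_prompt + "Second question", max_new_tokens=128, add_bos=True)
```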
-
In my 70B Llama3.1 journey, I noticed that with version 0.1.8 (the Llama3.1 update), inference latency increases substantially as the context window grows. The exact numbers depend on the setup, but the degradation is evident.
While troubleshooting, I discovered that the issue appears to be related to the installed Python packages. Ultimately I ran a diff between the two virtual environments (the new 0.1.8 one and my original environment, created at 0.1.0 and gradually upgraded to 0.1.7).
Findings:
The first number is the version installed through requirements.txt for 0.1.8; the second comes from the environment upgraded from 0.1.0 to 0.1.7 over time.
Some of these differences, I suspect, can be attributed to the inner workings of torch 2.4 vs. 2.3 or the updated CUDA libraries -- posting it here for visibility.
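For anyone reproducing the comparison, a small helper like this (an illustrative sketch, not part of the original report) can dump each venv's package versions to a file for diffing:

```python
# List every installed package and its version, one per line.
# Run inside each virtual environment, redirect to a file, and diff the two outputs.
from importlib.metadata import distributions

pkgs = sorted((dist.metadata["Name"], dist.version) for dist in distributions())
for name, version in pkgs:
    print(f"{name}=={version}")
```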
The hardware: 4x A10G GPUs (96GB VRAM), 48 CPUs, 196GB RAM, using auto-split during load
The model: a Llama3 fine-tune, max window size 256k, allocated length 100k, BW6 (the exact same model is used for both tests).
Using the base setup, running 0.1.7 in the venv initialized by 0.1.8 (the newer torch and CUDA libraries) makes it behave the same way: degradation with large context windows. Conversely, using the 0.1.7 environment (with the older torch and CUDA libraries) makes 0.1.8 behave normally. Given these findings, the versions below refer to the environments initialized by exllamav2, not the code itself.
Context window 0
Context window 12,500
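As a rough sketch of how latency at a given context fill can be measured (assuming the generator object from the snippet in the first reply; the filler prompt and token estimate are only approximations):

```python
import time

def tokens_per_second(generator, prompt, new_tokens=128):
    # Time a single end-to-end generation (including prompt ingestion)
    # and return a rough tokens/second figure for the new tokens.
    start = time.perf_counter()
    generator.generate(prompt=prompt, max_new_tokens=new_tokens, add_bos=True)
    return new_tokens / (time.perf_counter() - start)

# Empty context vs. roughly 12,500 tokens of filler context
# (one common word is approximately one token; this is not exact).
short_prompt = "Question: ...\nAnswer:"
long_prompt = ("hello " * 12500) + short_prompt

print("ctx ~0      :", tokens_per_second(generator, short_prompt))
print("ctx ~12,500 :", tokens_per_second(generator, long_prompt))
```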
Additional notes:
Version 0.1.7 doesn't support Llama3.1, so that comparison can't be tested directly. However, in version 0.1.8, Llama3.1's performance is consistent with Llama3.0's, exhibiting the same degradation as the context window grows.
Profiling with the context window filled to 12K tokens shows that the main offender appears to be calls to gemm_half_q_half:
0.1.8:
0.1.7:
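For reference, a profile like this can be captured with something along these lines (a sketch only, assuming the generator and long_prompt from the earlier snippets as the workload):

```python
# Capture CPU + CUDA kernel timings around one generation call,
# then print the kernels sorted by total CUDA time.
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    generator.generate(prompt=long_prompt, max_new_tokens=32, add_bos=True)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```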