Importing Exllamav2 taking so much time #342
-
Why am I frequently getting stuck while importing exllamav2?
If I try to interrupt it, this error shows up; it seems like there's a time.sleep holding it up(?):
The import works just fine if I cancel the first import and then import it again. Is there anything I can do to prevent this?
-
The first time you import the library it's going to build the C++ extension, which can take a little while depending on your CPU. After building it once it should be cached in ~/.cache/torch_extensions, and subsequent imports should be very quick. If something's wrong with the build system (ninja), the caching might not work correctly, I guess? One solution is to install ExLlamaV2 with the extension already built: either get one of the prebuilt wheels from the releases, or build and install from the repo directory.
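A minimal sketch of that source install, assuming the standard pip invocation (the exact command from the original reply isn't reproduced in this excerpt):

# After a successful first import, the JIT-built extension is cached here:
ls ~/.cache/torch_extensions
# Alternatively, build the extension at install time rather than at import time.
# Run from a clone of the exllamav2 repo (assumed standard pip source install):
pip install .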
2+ minutes sounds excessive, but I guess it's possible on a slow CPU?
I don't know if copying the cache works, but if you can't use any of the prebuilt wheels you can also build your own wheel with something like:
pip wheel --no-deps -w dist .
This should create a .whl file in the dist directory, containing both the exllamav2 and exllamav2_ext packages. Then install it as part of the Docker image build with pip install whatever.whl.
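As a concrete sketch of the Docker side (the dist/ location and wheel filename pattern are assumptions; adjust them to whatever pip wheel actually produced):

# Inside the image build, e.g. in a RUN step after copying dist/ into the image,
# install the prebuilt wheel so importing exllamav2 never triggers a JIT build:
pip install dist/exllamav2-*.whl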