ExllamaV2 tensor parallelism to increase multi gpu inference speeds code help #6356

RandomInternetPreson · 2024-08-29T22:27:00Z

Checklist:

[x] I have read the Contributing guidelines.

One needs to add "--enable_tp" to the CMD_FLAGS.txt to enable tensor parallelism with exllamav2.
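A minimal sketch of adding the flag without duplicating it on repeated runs. It assumes CMD_FLAGS.txt holds whitespace-separated flags and that lines starting with `#` are comments; the helper name `ensure_flag` is hypothetical and not part of textgen.

```python
from pathlib import Path


def ensure_flag(flags_file: Path, flag: str = "--enable_tp") -> bool:
    """Append `flag` to the flags file unless it is already present.

    Returns True if the file was modified, False otherwise.
    """
    text = flags_file.read_text() if flags_file.exists() else ""
    # Assumption: lines starting with '#' are comments and do not count.
    active = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    if any(flag in ln.split() for ln in active):
        return False
    # Make sure the new flag starts on its own line.
    sep = "" if text == "" or text.endswith("\n") else "\n"
    flags_file.write_text(text + sep + flag + "\n")
    return True
```

Calling `ensure_flag(Path("CMD_FLAGS.txt"))` from the textgen folder would add the flag once and leave the file untouched on later calls.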

I'm offering the code as a potentially useful reference point for your own code. I don't know if this will help you or save you any time, but I'm offering it up in case it is useful in any way.

https://github.com/RandomInternetPreson/TextGenTips?tab=readme-ov-file#exllamav2-tensor-parallelism-for-oob-v114

This is how I'm currently using it, with a 33%+ increase in inference output speed on my GPU setup. The speedup increases if I don't use auto-split and fit the model onto fewer cards.

Merge dev branch

Code to get exllamaV2 tensor parallelization working.

Ph0rk0z · 2024-08-30T13:08:36Z

The TP has been good. May as well add Q6 cache too.

Inktomi93 · 2024-08-30T18:39:46Z

@RandomInternetPreson I went through and mostly adapted the exllamav2_hf module to work with your tweaks too. Only thing not working is CFG, I'm not smart enough to get that working. I honestly just did this so I could use XTC while also using the new TP Option.

        # Check if TP is enabled and load model with TP
        if shared.args.enable_tp:
            split = None
            if shared.args.gpu_split:
                split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
            self.ex_model.load_tp(split)  # Ensure TP loading is used
        else:
            if not shared.args.autosplit:
                split = None
                if shared.args.gpu_split:
                    split = [float(alloc) for alloc in shared.args.gpu_split.split(",")]
                self.ex_model.load(split)

        # Determine the correct cache type
        if shared.args.cache_8bit:
            self.ex_cachetype = ExLlamaV2Cache_8bit
        elif shared.args.cache_4bit:
            self.ex_cachetype = ExLlamaV2Cache_Q4
        else:
            self.ex_cachetype = ExLlamaV2Cache

        # Use TP if specified
        if shared.args.enable_tp:
            self.ex_cache = ExLlamaV2Cache_TP(self.ex_model, base=self.ex_cachetype)
        else:
            self.ex_cache = self.ex_cachetype(self.ex_model, lazy=shared.args.autosplit)

        # Apply autosplit if specified and TP not enabled
        if shared.args.autosplit and not shared.args.enable_tp:
            self.ex_model.load_autosplit(self.ex_cache)

        self.past_seq = None

        # Determine the correct cache type for negative cache, also considering TP
        if shared.args.cfg_cache:
            base_cache_type = None
            if shared.args.cache_8bit:
                base_cache_type = ExLlamaV2Cache_8bit
            elif shared.args.cache_4bit:
                base_cache_type = ExLlamaV2Cache_Q4
            else:
                base_cache_type = ExLlamaV2Cache

            # Apply TP if specified for negative cache
            if shared.args.enable_tp:
                self.ex_cache_negative = ExLlamaV2Cache_TP(self.ex_model, base=base_cache_type)
            else:
                self.ex_cache_negative = base_cache_type(self.ex_model)

            self.past_seq_negative = None
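
The loading logic in the snippet above can be separated from exllamav2 and tested on its own. This is only a sketch: the helper names `parse_gpu_split` and `choose_load_path` are hypothetical, and the returned strings simply name the model method the snippet calls in each case (`load_tp`, `load_autosplit`, `load`).

```python
from typing import List, Optional


def parse_gpu_split(gpu_split: Optional[str]) -> Optional[List[float]]:
    """Turn a comma-separated string such as "20,24" into per-GPU allocations.

    Returns None when no split was given, matching the snippet's `split = None`.
    """
    if not gpu_split:
        return None
    return [float(alloc) for alloc in gpu_split.split(",")]


def choose_load_path(enable_tp: bool, autosplit: bool) -> str:
    """Name the loading method the snippet would call.

    TP takes priority over autosplit; a plain load is the fallback.
    """
    if enable_tp:
        return "load_tp"
    if autosplit:
        return "load_autosplit"
    return "load"


print(parse_gpu_split("20,24"), choose_load_path(enable_tp=True, autosplit=True))
```

Keeping the decision in one place makes it explicit that autosplit is ignored whenever TP is enabled, which the snippet expresses through two separate `if` checks.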

RandomInternetPreson · 2024-08-30T20:37:21Z

@Inktomi93

Give this code a try: https://github.com/RandomInternetPreson/TextGenTips/blob/main/ExllamaV2_TensorParallel_Files/exllamav2_hf.py

I had Mistral Large 2 make it locally. The loader seemed to work, but I don't use CFG (I checked the box and the model loaded, but I didn't do any testing beyond that). Test it out and let me know if it works for you.

https://github.com/RandomInternetPreson/TextGenTips/blob/main/ExllamaV2_TensorParallel_Files/20240830-16-23-25.json

RandomInternetPreson · 2024-08-30T20:43:43Z

@Ph0rk0z is there a reason 6bit cache isn't implemented in textgen? I vaguely recall something being funky about the 6bit cache or something. I started out quantizing using exllama then switched to llama.cpp and am now back with exllama, so I missed some of the latest developments.

Ph0rk0z · 2024-08-31T11:34:31Z

Mainly just that nobody implemented it. It's yet another checkbox. There's also Q8 cache but we're still using the truncating one.

I've not tried to get CFG working, I think it probably needs the batching generator and CFG is waaay too much vram on the models I run for too little effect. Mistral large cranks with TP and I've been using it in HF since TP came out.

I just noticed: #6280

RandomInternetPreson · 2024-08-31T12:48:19Z

Are you me? I've been loving Mistral Large, and being able to use it with TP was the reason I started doing any of this. The HF loader code I linked to is working for me; it loads the model with and without the CFG checkbox checked.

Thanks for the link, I didn't realize there were so many good PRs just sitting around waiting to be merged. I'm interested in incorporating a lot of them into my local install. Lots of good things to look forward to in future releases.

oobabooga · 2024-09-28T03:23:39Z

Thanks @RandomInternetPreson, that's super helpful. In my test this makes prompt processing slower but generation after that faster:

Before:

Prompt processing: 555.39
Text generation: 14.00

After:

Prompt processing: 182.54
Text generation: 26.80

(numbers in tokens/second)
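
As a quick worked check of those figures, the ratios can be computed directly. The dictionary names are just for illustration; the numbers are the ones quoted above.

```python
# Throughput in tokens/second, as reported in the test above.
before = {"prompt": 555.39, "generation": 14.00}
after = {"prompt": 182.54, "generation": 26.80}

# Ratio > 1 means faster with TP, < 1 means slower.
ratios = {stage: after[stage] / before[stage] for stage in before}

for stage, ratio in ratios.items():
    print(f"{stage}: {ratio:.2f}x")
```

With these numbers, prompt processing runs at roughly a third of its previous speed while generation is a bit under twice as fast.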

RandomInternetPreson · 2024-09-28T20:26:22Z

:3 glad to help where I can!

When the dev posted about the TP implementation on LocalLLaMA, they mentioned that prompt processing took longer. I haven't noticed it as much as the inference speed boost. Over 5-7 GPUs the inference speed increase is substantial.

One thing I've noticed is that the VRAM is used more efficiently for context. I was able to go from ~100k to the full 130k context window with the same amount of VRAM.

oobabooga · 2024-09-29T02:41:03Z

It should only be noticeable if you are feeding a very long context, like an entire codebase with 100k tokens or something.
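
A rough back-of-the-envelope sketch of that trade-off, using the throughput figures from the test reported earlier in this thread (555.39 → 182.54 tokens/second for prompt processing, 14.00 → 26.80 for generation). It assumes total time is simply prompt time plus generation time, which ignores caching and other overhead; the function name `total_seconds` is hypothetical.

```python
def total_seconds(prompt_tokens: int, gen_tokens: int,
                  prompt_tps: float, gen_tps: float) -> float:
    """Estimated wall time: prompt processing plus generation."""
    return prompt_tokens / prompt_tps + gen_tokens / gen_tps


NO_TP = {"prompt_tps": 555.39, "gen_tps": 14.00}
WITH_TP = {"prompt_tps": 182.54, "gen_tps": 26.80}

# Generated-to-prompt token ratio above which TP wins overall:
# G * (1/14.00 - 1/26.80) > P * (1/182.54 - 1/555.39)
break_even = ((1 / WITH_TP["prompt_tps"] - 1 / NO_TP["prompt_tps"])
              / (1 / NO_TP["gen_tps"] - 1 / WITH_TP["gen_tps"]))

for prompt, gen in [(2_000, 500), (100_000, 500)]:
    t_no = total_seconds(prompt, gen, **NO_TP)
    t_tp = total_seconds(prompt, gen, **WITH_TP)
    print(f"prompt={prompt:>7} gen={gen}: no TP {t_no:7.1f}s, TP {t_tp:7.1f}s")
print(f"TP is faster when gen/prompt > {break_even:.3f}")
```

Under these assumptions, TP comes out ahead once the reply is longer than about a tenth of the uncached prompt, so a short chat turn benefits while a 100k-token prompt with a short answer does not.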

Ph0rk0z · 2024-09-30T12:45:55Z

I still get prompt processing in the 400 range on most gens.

Kaszebe · 2024-10-04T06:17:09Z

Do we still need to pass --enable_tp in the command flags before we start ooba?

Ph0rk0z · 2024-10-04T13:25:42Z

You can check the check box. That's how it's supposed to work, like using 4bit cache or anything else.

oobabooga and others added 7 commits July 25, 2024 12:12

Merge pull request oobabooga#6271 from oobabooga/dev

dd97a83

Merge dev branch

UI: fix saving characters

498fec2

Merge pull request oobabooga#6300 from oobabooga/dev

d011040

Merge dev branch

Merge pull request oobabooga#6336 from oobabooga/dev

073694b

Merge dev branch

Merge pull request oobabooga#6337 from oobabooga/dev

1b62cd8

Merge dev branch

Merge pull request oobabooga#6339 from oobabooga/dev

5522584

Merge dev branch

Add files via upload

377018e

Code to get exllamaV2 tensor parallelization working.

RandomInternetPreson and others added 4 commits August 31, 2024 11:17

Updated hf version too

3e44373

Merge branch 'dev' into RandomInternetPreson-main

dc06495

Lint

1a4c054

Simplify

725a463

oobabooga merged commit 46996f6 into oobabooga:dev Sep 28, 2024
