How to use 4-bit AWQ? #1776

Open
BBC-Esq opened this issue Sep 10, 2024 · 9 comments

Comments

@BBC-Esq

BBC-Esq commented Sep 10, 2024

In reviewing the updated docs I noticed a few things that prompted some questions...

  1. None of AWQ, Int-4, or int32_float16 is mentioned in the "Quantize on model conversion" or "Quantize on model loading" sections here:

https://opennmt.net/CTranslate2/quantization.html

  2. Near the bottom it gives some helpful information, stating "We have to quantize the model with AWQ first..." However, the code snippet it gives apparently converts to CT2 format a model that is already in AWQ format (by TheBloke):

ct2-transformers-converter --model TheBloke/Llama-2-7B-AWQ --copy_files tokenizer.model --output_dir ct2_model

This was confusing to me because by using "we" it implies that ctranslate2 itself can quantize a model to AWQ format. Is this the case or not?

  3. Is it still true that even if a model is in AWQ format, it will only be runnable if it originated from one of the model architectures that ctranslate2 supports? This is probably a stupid question, but I wanted to double-check...

  4. Can we please get at least one example of how to actually run a model using 4-bit AWQ? I was not able to find a simple example, especially one using a transformers-based model.

Thanks yet again!

@BBC-Esq
Author

BBC-Esq commented Sep 10, 2024

Follow-up questions...

Here's a script I used to quantize using AWQ. Note the usage of q_group_size and w_bit. The API may have changed, because HERE they use group_size and bits instead.

MY CONVERSION SCRIPT
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3"
quant_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3-AWQ"

# model_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3"
# quant_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3-AWQ"

# model_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3"
# quant_path = r"D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3-AWQ"

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemm"}


# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True, low_cpu_mem_usage=True, use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

'''
# group size can be 64 or 32 as well

The possible values for version in the Auto-AWQ library are:

"gemm": GEMM version of the quantization method.
"gemv": GEMV version of the quantization method.
"marlin": Marlin version of the quantization method.
"gemv_fast": Fast GEMV version of the quantization method.
'''

Regardless, however, my questions are:

  1. First, if ctranslate2 doesn't convert to AWQ, do any of these parameters (or any others, for that matter) matter as far as ctranslate2 being able to convert and run the model?

  2. Second, if ctranslate2 DOES convert to AWQ, how can I specify these parameters (and any others) during the conversion step?

Again, examples and a more thorough explanation of how to use AWQ in the docs would be much appreciated. Thanks!

@BBC-Esq
Author

BBC-Esq commented Sep 10, 2024

I might have answered my own question, but can you confirm?

My understanding is that you can quantize (using a calibration dataset, no less) per these instructions:

https://casper-hansen.github.io/AutoAWQ/examples/

...and then convert with ctranslate2, presumably with something like the command below? And then how would it be run? By specifying int32_float16? What about int32_bfloat16?
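(To be concrete, this is the sort of conversion command I have in mind, pointing the converter at the AWQ output directory from my script above; the output directory name is just a placeholder on my part:)

ct2-transformers-converter --model D:\Scripts\benchmark_chat\models\Mistral-7B-Instruct-v0.3-AWQ --output_dir ct2_mistral_awq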

Thanks.

@minhthuc2502
Collaborator

Thanks for pointing that out! I’ll update the documentation to clarify things soon.

To clarify, int32_float16 is just the internal compute type used with AWQ models. You don't need to specify it when generating tokens—it will default to int32_float16 automatically.

Follow these steps:

Step 1: You can either use an AWQ quantized model from Hugging Face (as shown in the example) or quantize one yourself using this guide. Then, convert the AWQ quantized model to a CT2 model as described in the documentation.

Step 2: Run inference as you would with other models in CT2, just by specifying the model path.
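For step 2, here is a minimal sketch in Python. The model directory, tokenizer repo, prompt, and sampling settings are placeholders based on the example above; note that no compute type is specified because int32_float16 is selected automatically for AWQ models.

import ctranslate2
import transformers

# Load the converted CT2 model. AWQ models run on GPU with the int32_float16 compute type,
# so there is no need to pass compute_type explicitly.
generator = ctranslate2.Generator("ct2_model", device="cuda")

# Reuse the tokenizer of the original Hugging Face model.
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

# generate_batch expects string tokens, not token ids.
prompt = "What is AWQ quantization?"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

results = generator.generate_batch(
    [tokens],
    max_length=256,
    sampling_topk=10,
    include_prompt_in_result=False,
)

print(tokenizer.decode(results[0].sequences_ids[0]))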

@BBC-Esq
Author

BBC-Esq commented Sep 13, 2024

> Thanks for pointing that out! I’ll update the documentation to clarify things soon.
>
> To clarify, int32_float16 is just the internal compute type used with AWQ models. You don't need to specify it when generating tokens—it will default to int32_float16 automatically.
>
> Follow these steps:
>
> Step 1: You can either use an AWQ quantized model from Hugging Face (as shown in the example) or quantize one yourself using this guide. Then, convert the AWQ quantized model to a CT2 model as described in the documentation.
>
> Step 2: Run inference as you would with other models in CT2, just by specifying the model path.

Why don't you say it just like you did to me now, but in the documentation? ;-) And then give an example or two as well.

I'm learning to convert to AWQ. When converting, it's possible to use a calibration dataset as well as to specify a "version" for the type of quantization. The AutoAWQ docs mention "marlin" but, if I understand correctly, ct2 4.4 only supports gemm and gemv? Will it simply not run correctly if I quantize using the marlin kernel?

To further complicate matters, when running the model (with autoawq) I can specify a "version" such as "gemm," "gemv" or "exllama." The documentation says that you can only use "exllama" with a model that has been converted using "gemm". I'm confused by all of this.

How does this relate to running on ctranslate2? Here's the relevant portion of the conversion script I'm referring to:

SCRIPT HERE
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

model_path = r"D:\Scripts\bench_chat\models\Qwen2-7B-Instruct"
# quant_path = r"D:\Scripts\bench_chat\models\Qwen2-7B-Instruct-AWQ"
quant_path = r"D:\Scripts\bench_chat\models\Qwen2-7B-Instruct-AWQ-marlin"

# model_path = r"D:\Scripts\bench_chat\models\Yi-1.5-9B-Chat"
# quant_path = r"D:\Scripts\bench_chat\models\Yi-1.5-9b-Chat-AWQ"

# quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemm" }
# quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }
# quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "marlin" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

def load_cosmopedia():
    data = load_dataset('HuggingFaceTB/cosmopedia-100k', split="train")
    data = data.filter(lambda x: x["text_token_length"] >= 2048)

    return [text for text in data["text"]]

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=load_cosmopedia(),
    n_parallel_calib_samples=16,
    max_calib_samples=256,
    max_calib_seq_len=8192
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

@minhthuc2502
Collaborator

Our version only supports gemm and gemv, so you'll need to choose between those two. For now, I believe these options are sufficient.

When running with CT2, you don’t need to specify gemm or gemv—it will be automatically detected based on the weight format.
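For example, here is a sketch adapted from your script above, only changing the quantization config to a kernel that CT2 supports (the paths are the ones from your script; the output directory name is arbitrary):

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemm" }  # or "gemv"

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Then convert as usual; whether the weights use gemm or gemv is detected automatically:

ct2-transformers-converter --model D:\Scripts\bench_chat\models\Qwen2-7B-Instruct-AWQ --output_dir ct2_qwen2_awq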

@BBC-Esq
Author

BBC-Esq commented Sep 14, 2024

And can I run an AWQ model whose architecture Ctranslate2 doesn't normally support? For example, some of the Zephyr models can't be converted to Ctranslate2, but they can be quantized using AWQ. Is it now possible to use them with Ctranslate2?

@minhthuc2502
Collaborator

No, it works only with models supported by Ctranslate2.

@BBC-Esq
Author

BBC-Esq commented Sep 16, 2024

Anxiously awaiting the updated documentation to test further. Thanks.

@BBC-Esq
Author

BBC-Esq commented Sep 19, 2024

Any update on this?
