Allow GPTQModel to auto select Marlin or faster kernels for inference only ops #2138

LRL-ModelCloud · 2024-12-25T04:46:20Z

No description provided.

Qubitium · 2024-12-25T05:55:17Z

This PR addresses the issue where GPTQModel is unable to auto-select the fastest quant linear (Marlin) because it has no information about whether the model is being loaded from a pre-trained checkpoint for quantization or via from_quantized. Adding a pack=True requirement to hf_select_quant_linear for the pre-trained quantization path solves this. Without this PR, users need to manually set backend="marlin" to use Marlin. With this PR, Marlin is selected automatically, if compatible, in the default auto mode.

Qubitium · 2025-01-07T03:40:11Z

@SunMarc @IlyasMoutawwakil Ready for review. Unlike AutoGPTQ, GPTQModel has fairly robust internal code to select the best/fastest quant linear available for model inference. However, for this to happen, it needs a signal indicating whether packing (pack: bool) will be used. When loading a post-quantized model, packing is not needed, so we can auto-select Marlin on platforms where NVIDIA CUDA is available, or the next-best kernel for the current environment. Some fast inference kernels are inference-only and cannot perform packing.
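
To illustrate the idea, here is a minimal, self-contained sketch of pack-aware kernel selection. It is not GPTQModel's actual code: the `Kernel` class, the `KERNELS` registry, the speed ordering, and the `select_kernel` function are all hypothetical names made up for this example, and the per-kernel `supports_pack` values are assumptions. Only the rule itself comes from the discussion above: inference-only kernels are excluded whenever packing is required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical description of a quant linear kernel."""
    name: str
    speed_rank: int      # lower is faster (illustrative ordering only)
    supports_pack: bool  # False for inference-only kernels
    needs_cuda: bool


# Hypothetical registry. The real kernel list, capabilities, and ordering
# live inside GPTQModel and may differ from what is assumed here.
KERNELS = (
    Kernel("marlin", 0, supports_pack=False, needs_cuda=True),
    Kernel("triton", 1, supports_pack=True, needs_cuda=True),
    Kernel("torch", 2, supports_pack=True, needs_cuda=False),
)


def select_kernel(pack: bool, cuda_available: bool,
                  backend: str = "auto", kernels=KERNELS) -> Kernel:
    """Return the fastest kernel compatible with the request.

    pack=True means the caller will quantize and pack weights, so
    inference-only kernels must be skipped. pack=False means an already
    quantized model is being loaded, so every kernel is a candidate.
    """
    candidates = [
        k for k in kernels
        if (cuda_available or not k.needs_cuda)
        and (k.supports_pack or not pack)
    ]
    if backend != "auto":
        # An explicit backend narrows the choice to that single kernel.
        candidates = [k for k in candidates if k.name == backend]
    if not candidates:
        raise ValueError(
            f"no kernel for backend={backend!r}, pack={pack}, "
            f"cuda_available={cuda_available}"
        )
    return min(candidates, key=lambda k: k.speed_rank)
```

In this sketch, `select_kernel(pack=False, cuda_available=True)` picks the Marlin entry without the caller naming a backend, while `pack=True` falls back to the fastest kernel that can pack. Passing the flag explicitly, rather than guessing from the loading path, keeps the selection logic independent of how the model was loaded.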

HuggingFaceDocBuilderDev · 2025-01-07T10:43:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

LGTM !

LRL-ModelCloud added 2 commits December 25, 2024 12:07

select quant_linear with pack

c762c14

up GPTQMODEL_MINIMUM_VERSION

2cf0637

Qubitium and others added 3 commits January 7, 2025 10:25

Update quantizer.py

e265069

update gptqmodel version

8c62473

Merge remote-tracking branch 'main/main' into select-quant-linear-with-pack

9ba8920

LRL-ModelCloud marked this pull request as ready for review January 7, 2025 03:19

LRL-ModelCloud changed the title ~~[WIP] GPTQModel select quant linear with pack~~ GPTQModel select quant linear with pack Jan 7, 2025

LRL-ModelCloud changed the title ~~GPTQModel select quant linear with pack~~ Allow GPTQModel to auto select Marlin or faster kernels for inference only ops Jan 7, 2025

SunMarc approved these changes Jan 7, 2025

SunMarc requested a review from IlyasMoutawwakil January 7, 2025 17:15

IlyasMoutawwakil merged commit 53240c3 into huggingface:main Jan 8, 2025
39 of 48 checks passed

Qubitium deleted the select-quant-linear-with-pack branch January 8, 2025 13:33
