Allow GPTQModel to auto select Marlin or faster kernels for inference only ops #2138


LRL-ModelCloud
Contributor

No description provided.

@Qubitium
Contributor

This PR addresses the issue where GPTQModel cannot auto-select the fastest quant linear (Marlin) because it has no information about whether the model is loaded via pre-trained or from_quantized. Adding a pack=True requirement to hf_select_quant_linear (for pretrained quantization) solves this. Without this PR, users need to manually set backend="marlin" to use Marlin. With it, Marlin is auto-selected, if compatible, in the default auto mode.

@LRL-ModelCloud LRL-ModelCloud marked this pull request as ready for review January 7, 2025 03:19
@LRL-ModelCloud LRL-ModelCloud changed the title [WIP] GPTQModel select quant linear with pack GPTQModel select quant linear with pack Jan 7, 2025
@Qubitium
Contributor

Qubitium commented Jan 7, 2025

@SunMarc @IlyasMoutawwakil Ready for review. Unlike AutoGPTQ, GPTQModel has fairly robust internal code to select the best/fastest quant linear available for model inference. For this to work, however, it needs a signal indicating whether packing (pack: bool) will be used. Loading a post-quantized model does not require packing, so we can auto-select Marlin on platforms where Nvidia CUDA is available, or the next-best kernel for the current environment. Some fast inference kernels are inference-only and cannot perform packing.
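
For readers skimming the thread, the selection idea described above can be sketched roughly as follows. This is a minimal illustration, not GPTQModel's actual code: the `Kernel` class, the `select_kernel` function, the kernel names other than Marlin, and all capability flags are hypothetical and chosen only to show how a `pack` signal can filter out inference-only kernels.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Kernel:
    """Hypothetical description of a quant linear kernel."""
    name: str
    supports_packing: bool  # False means the kernel is inference-only
    requires_cuda: bool


# Hypothetical registry, ordered fastest first. The flags are illustrative
# assumptions, not the real capabilities reported by GPTQModel.
KERNELS = (
    Kernel("marlin", supports_packing=False, requires_cuda=True),
    Kernel("other_gpu_kernel", supports_packing=True, requires_cuda=True),
    Kernel("cpu_fallback", supports_packing=True, requires_cuda=False),
)


def select_kernel(pack: bool, cuda_available: bool,
                  kernels: Sequence[Kernel] = KERNELS) -> Kernel:
    """Return the first (fastest) kernel compatible with the request."""
    for kernel in kernels:
        # Quantizing a pretrained model needs packing, so skip
        # inference-only kernels when pack is requested.
        if pack and not kernel.supports_packing:
            continue
        # Skip CUDA-only kernels when CUDA is not available.
        if kernel.requires_cuda and not cuda_available:
            continue
        return kernel
    raise ValueError("no compatible kernel for this environment")


# Loading an already-quantized model: no packing, fastest kernel wins.
print(select_kernel(pack=False, cuda_available=True).name)
# Quantizing a pretrained model: packing required, Marlin is skipped.
print(select_kernel(pack=True, cuda_available=True).name)
```

Under these assumptions, without the `pack` signal the selector would have to assume packing might be needed and could never pick an inference-only kernel automatically, which matches the motivation given for the change.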

@LRL-ModelCloud LRL-ModelCloud changed the title GPTQModel select quant linear with pack Allow GPTQModel to auto select Marlin or faster kernels for inference only ops Jan 7, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@SunMarc SunMarc left a comment


LGTM !

@IlyasMoutawwakil IlyasMoutawwakil merged commit 53240c3 into huggingface:main Jan 8, 2025
39 of 48 checks passed
@Qubitium Qubitium deleted the select-quant-linear-with-pack branch January 8, 2025 13:33