add gptqmodel support #2247
Conversation
Thanks for this PR adding support for gptqmodel. Let's wait for the transformers PR to be merged before proceeding with this one.
Signed-off-by: jiqing-feng <[email protected]>
Hi @BenjaminBossan, there is a weird pytest error I've never seen before. The pytest works only when I removed this code.
Testing changes contain:
For gptq testing:
This comes from a pytest plugin we use to monitor code coverage. Running (btw. code coverage is great to check if the new code you added is covered by unit tests)
Signed-off-by: jiqing-feng <[email protected]>
* add get_gptq_quant_linear
* cleanup
* rename to quant_linear
* rename to get_gptqmodel_quant_linear
* rename to QuantLinear
* fix get device_map
* import hf_select_quant_linear
* pass checkpoint_format
* fix lora
* if is_gptqmodel_available, pass backend auto_trainable
* pass backend auto_trainable
* cleanup
* Update gptq.py

Co-authored-by: Qubitium-ModelCloud <[email protected]>
@BenjaminBossan This PR is ready and has been tested on CPU, Intel/XPU, and NVIDIA/CUDA. However, please note that this PR is one of a cumulative set of three: the primary PR is Optimum PR huggingface/optimum#2064 (awaiting review/approval), while this PR and the Transformers PR huggingface/transformers#35012 depend on the Optimum PR being merged first.
Thanks for the update. Let's wait for the optimum and transformers PRs to be merged first, since changes there could affect this PR. Feel free to ping me as soon as those PRs are merged.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
not stale
The transformers PR was merged! Let's wait a few days before merging this one to make sure that there are no issues on the transformers side.
Hi @BenjaminBossan, could you please trigger the CI for this PR? Thanks!
@SunMarc @jiqing-feng We have tested on our side today with latest
Nice to see progress with gptqmodel. I gave this PR a review and there are a couple of changes needed. Please check my comments. On top of those changes:
- Please run make style to make the linter happy.
- Please update the PEFT quantization docs: make sure to mention that gptqmodel is preferred over autogptq and give instructions on how to install it. It would also be nice to explain why gptqmodel is preferred; for this, a link to the transformers docs would be sufficient.
src/peft/tuners/adalora/model.py (Outdated)
@@ -135,7 +137,7 @@ def _create_and_replace(
            # If it is not an AdaLoraLayer, create a new module, else update it with new adapters
            if not isinstance(target, AdaLoraLayer):
-               new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
+               new_module = self._create_new_module(lora_config, adapter_name, target, self.model.hf_device_map, **kwargs)
We can't assume that the model has a .hf_device_map attribute. Theoretically, the model could be any PyTorch model; it doesn't have to be a transformers model -- and even transformers models don't necessarily have the attribute. Therefore, we have to check for this attribute here and, if it doesn't exist, not pass it (or pass None).
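A minimal sketch of the kind of guard being suggested; the helper name is hypothetical, the getattr fallback is the point:

```python
import torch.nn as nn

def get_device_map(model: nn.Module):
    # transformers models loaded with device_map="auto" expose hf_device_map;
    # a plain nn.Module does not, so fall back to None in that case.
    return getattr(model, "hf_device_map", None)

print(get_device_map(nn.Linear(4, 4)))  # prints None for a plain PyTorch module
```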
@BenjaminBossan Done.
src/peft/tuners/lora/model.py (Outdated)
@@ -232,7 +232,7 @@ def _create_and_replace(
                lora_bias=lora_config.lora_bias,
            )
        else:
-           new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
+           new_module = self._create_new_module(lora_config, adapter_name, target, device_map=self.model.hf_device_map, **kwargs)
Same comment about hf_device_map.
@BenjaminBossan Done.
tests/test_common_gpu.py (Outdated)
@@ -79,8 +80,7 @@
    from peft.tuners.vera import Linear4bit as VeraLinear4bit

-@require_non_cpu
I don't think we can just remove this decorator, since this class contains a bunch of tests that are unrelated to gptqmodel. From my understanding, gptqmodel can be run on CPU. In this case, I would suggest moving the gptqmodel tests to a different test file and leaving this class untouched otherwise. They can be combined there with the PeftGPTQTests tests from test_gpu_examples.py.
The new test file (e.g. tests/test_gptqmodel.py) can then be run during the normal GitHub CI, which will just use CPU. However, if the tests are very slow on CPU, we need to either accelerate them or not run them on normal CI after all.
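A rough sketch of what such a gated test module could look like (the file name follows the suggestion above; the class and test body are purely illustrative placeholders):

```python
# tests/test_gptqmodel.py -- hypothetical skeleton, not the actual test file
import pytest

# Skip the whole module on runners where gptqmodel is not installed,
# so the normal CPU-only CI simply reports the tests as skipped.
gptqmodel = pytest.importorskip("gptqmodel")


class TestGptqModelCpu:
    def test_lora_adapter_can_be_attached(self):
        # Placeholder: load a pre-quantized model, wrap it with get_peft_model,
        # and assert that trainable LoRA parameters exist.
        ...
```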
Yes. This is correct. Both
@BenjaminBossan Thanks for the review. The requested changes have been made. I have added a short gptq section in the docs linking to the gptqmodel and HF gptq docs so readers can find out more about gptq in general. There is a very short sample code snippet for gptq model quantization plus a one-line install instruction.
Makefile (Outdated)
@@ -35,6 +35,9 @@ tests_common_gpu:
	python -m pytest tests/test_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
	python -m pytest tests/test_encoder_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_encoder_decoder.log",)

+test_gptqmodel_gpu:
This won't work, as this make target is never called anywhere. What I meant is to just add the line to tests_common_gpu above, which is already called in the appropriate setting.
fixed.
Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
@BenjaminBossan The Makefile changes have been made, and @stevhilu fixed some doc grammar and corrected the link to the HF transformers quantization doc. Ready for re-review.
Thanks for the updates.
Unfortunately, the CI is failing. I think this is because LoraModel._create_new_module has not been updated yet; it should add device_map=None as a parameter. Also, let's always pass device_map as a keyword argument and not positionally.
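A hypothetical sketch of the requested change (the real method has more parameters; this only shows the None default and the keyword-only usage):

```python
class LoraModel:  # simplified stand-in for peft.tuners.lora.LoraModel
    @staticmethod
    def _create_new_module(lora_config, adapter_name, target, device_map=None, **kwargs):
        # device_map defaults to None so callers whose model has no
        # hf_device_map (plain PyTorch models) still work; callers should
        # always pass it by keyword, e.g. device_map=device_map.
        ...
```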
On top of this, I think we got a bit into a messy situation with the gptqmodel support in other HF libs. As a consequence, right now, the PEFT nightly CI is failing. This is because we have the following conditions there:
- auto-gptq is installed
- gptqmodel is not installed
- transformers is installed from source
- optimum uses the latest release
This results in the following error:
tests/test_gpu_examples.py:1414:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:564: in from_pretrained
return model_class.from_pretrained(
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/modeling_utils.py:4135: in from_pretrained
hf_quantizer.preprocess_model(
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/quantizers/base.py:194: in preprocess_model
return self._process_model_before_weight_loading(model, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <transformers.quantizers.quantizer_gptq.GptqHfQuantizer object at 0x7787215014d0>
model = OPTForCausalLM(
(model): OPTModel(
(decoder): OPTDecoder(
(embed_tokens): Embedding(50272, 768, padding_id...entwise_affine=True)
)
)
)
)
(lm_head): Linear(in_features=768, out_features=50272, bias=False)
)
kwargs = {'device_map': 'auto', 'keep_in_fp32_modules': []}
def _process_model_before_weight_loading(self, model: "PreTrainedModel", **kwargs):
if model.__class__.main_input_name != "input_ids":
raise RuntimeError("We can only quantize pure text model.")
if self.pre_quantized:
> model = self.optimum_quantizer.convert_model(model, **kwargs)
E TypeError: GPTQQuantizer.convert_model() got an unexpected keyword argument 'device_map'
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/quantizers/quantizer_gptq.py:104: TypeError
One step to close this is to ensure that gptqmodel is installed in the corresponding Dockerfile. But even so, this indicates that there are some missing checks in transformers. IIUC, the cause can be found here: the optimum version is checked, which is good, but only if is_gptqmodel_available. Yet, even if it is unavailable, we later pass device_map to optimum_quantizer.convert_model, which fails for older optimum versions. This means we either have to always check the optimum version, even for auto-gptq, thus forcing the latest install (or rather, a source install, until the next release), or _process_model_before_weight_loading should not pass the device_map if auto-gptq is being used.
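A hedged sketch of the second option (dropping device_map on the auto-gptq path); this is illustrative only, not the actual transformers code:

```python
from transformers.utils import is_gptqmodel_available

def _process_model_before_weight_loading(self, model, **kwargs):
    if model.__class__.main_input_name != "input_ids":
        raise RuntimeError("We can only quantize pure text model.")
    if self.pre_quantized:
        if not is_gptqmodel_available():
            # Older optimum releases (used with auto-gptq) don't accept
            # device_map, so don't forward it in that case.
            kwargs.pop("device_map", None)
        model = self.optimum_quantizer.convert_model(model, **kwargs)
```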
Can you confirm whether the AutoGPTQ release or source is installed? The reason I ask is that if AutoGPTQ is built from source, I can fix this compat issue in AutoGPTQ with a PR/main push there so it accepts device_map even if unused, resolving the compat issue for both libs. Another option is to add an optimum version check plus gptqmodel/autogptq checks in PEFT, as you noted. Writing this on my phone; I will do another pass tomorrow to see the best way out of this. But to confirm: you want the test to pass with a non-source build of optimum to ensure backwards compat with old optimum, correct?
Thanks for the quick response @Qubitium
The release version.
That would probably help with potential issues that users may encounter because of incompatible versions. We can switch to a from-source install for auto-gptq; a new release would be even better though :)
Honestly, I'm not quite sure how much a typical PEFT user relies on specific optimum versions or if it's fine to ask them to install the latest version or even to install from source. We can probably do the latter if they need GPTQ, but we have to ensure that the user sees an error message that instructs them to do that; otherwise they'll be confused. And IIUC, this updated check has to be added to transformers.
The problem is that the core autogptq maintainer is MIA: no CI control, no CI, and I was never given permission to cut a release. I don't even think the PyPI release process works anymore. I will test a dev/draft release to see if an autogptq PyPI release is even feasible and whether I have release permissions.
Btw., the GPTQ test failure described above causes other subsequent tests to fail in our PEFT nightly CI (which runs GPU tests). The reason has to be some strange side effect of the test failure, probably not directly related to GPTQ. Still, it would be great if the situation were resolved soon. Update: We figured out the source of the bug causing those subsequent tests to fail; it's not directly related to gptq, but the failing gptq test causes this bug to surface.
@BenjaminBossan Please check whether PR huggingface/transformers#35830 resolves the current backward compat issue.
Thanks @Qubitium, LMK when this PR is ready for review.
@BenjaminBossan The HF compat PR has been merged. The CI can be restarted to test compat.
Thanks for your work on adding support for gptqmodel. The PR LGTM, I just have a small comment, but it's not directly related to the PEFT integration.
I noticed that for now, gptqmodel is not being tested on CI, neither in the normal CI that runs on PRs (CPU) nor in the nightly CI (GPU). This is because the package is not installed, so the tests are skipped (CPU) or auto-gptq is used (GPU). However, I think we cannot just install gptqmodel, as the support also depends on the not-yet-released optimum changes (the nightly CI uses transformers installed from main, so we're fine there). So my suggestion for right now would be to keep things as they are and then, once optimum has a release, make the switch to gptqmodel on CI. WDYT?
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
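For context, a self-contained version of the quoted snippet might look like this (the model id and tokenizer setup are assumptions, not part of the diff):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder; any causal LM repo works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize on the fly to 4-bit GPTQ using wikitext2 as the calibration dataset.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```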
When trying to run this locally with 2 CUDA devices, I encountered a CUDA error after 50% progress:
File ~/work/forks/transformers/src/transformers/models/opt/modeling_opt.py:559, in OPTDecoderLayer.forward(self, hidden_states, attention_mask, layer_head_mask, past_key_value, output_attentions, use_cache, position_ids)
556 hidden_states = self.fc2(hidden_states)
557 hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
--> 559 hidden_states = (residual + hidden_states).view(hidden_states_shape)
561 # 350m applies layer norm AFTER attention
562 if not self.do_layer_norm_before:
RuntimeError: CUDA error: an illegal memory access was encountered
Is this a known problem? Using 1 CUDA device or setting CUDA_LAUNCH_BLOCKING=1 resolves the error. I suspect that the error occurs at the "switch" from GPU 0 to GPU 1, since that's exactly after half the layers when using device_map="auto".
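One way to check where the layer split lands (assuming the model was loaded with device_map="auto" as in the snippet above):

```python
# hf_device_map is populated by accelerate when device_map="auto" is used;
# printing it shows which decoder layers ended up on GPU 0 vs. GPU 1.
for module_name, device in quantized_model.hf_device_map.items():
    print(module_name, "->", device)
```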
We will double check this to see whether it is a) accelerate-specific or b) OPT-specific.
- For GPTQModel, we do not test multi-GPU quantization since it's net-negative in terms of quantization speed.
- For optimum, the GPU splitting is performed by accelerate, so maybe this is related to accelerate or to the OPT model.
For the next GPTQModel CI tests PR, I would recommend we move all model testing from OPT to Llama 1B. I believe OPT was chosen due to its tiny size, but in our experience there are some strange issues with the OPT modeling code (that I can't recall) that cause odd problems here and there. We recently dropped all CI OPT tests in favor of Llama for this reason. Again, I can't seem to recall the exact reasons. =(
Basically no one uses OPT anymore, and modeling changes heavily favor Llama, so any fringe bugs are much less likely to occur on llama-class models.
Agreed that opt is very outdated at this point, and we mainly use it since it's small, but at least for PEFT it hasn't caused any problems yet.
I ran the code above using meta-llama/Llama-3.2-1B and again got an error after 50%:
File ~/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py:125, in Linear.forward(self, input)
124 def forward(self, input: Tensor) -> Tensor:
--> 125 return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Thus it's unlikely to be related to the model architecture. CUDA_LAUNCH_BLOCKING=1 again was enough to resolve the issue.
Yes, my thoughts align with this too.
Thanks everyone for the good work. Then let's add gptqmodel to the CI later when all necessary releases are out (optimum 1.24.0, transformers 4.49.0). I created #2342 to track this.
Support gptqmodel; we plan to replace autogptq with gptqmodel in the future.