add gptqmodel support #2247

Merged · 27 commits · merged into huggingface:main on Jan 23, 2025

Conversation

jiqing-feng (Contributor)

Support gptqmodel; we plan to replace autogptq with gptqmodel in the future.

BenjaminBossan (Member)

Thanks for this PR to add support for gptqmodel. Let's wait for the transformers PR to be merged before proceeding with this one.

jiqing-feng (Contributor, Author)

jiqing-feng commented Dec 4, 2024

> Thanks for this PR to add support for gptqmodel. Let's wait for the transformers PR to be merged before proceeding with this one.

Hi @BenjaminBossan, there is a weird pytest error I've never seen before:

(idp) root@sprocean:/home/jiqingfe/peft# pytest
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov=src/peft --cov-report=term-missing
  inifile: /home/jiqingfe/peft/pyproject.toml
  rootdir: /home/jiqingfe/peft

(idp) root@sprocean:/home/jiqingfe/peft# pip list | grep peft
peft                        0.13.3.dev0            /home/jiqingfe/peft

pytest only works when I remove this configuration. Do you know why this happens?

jiqing-feng (Contributor, Author)

jiqing-feng commented Dec 4, 2024

The testing changes include:

  1. Remove the GPU requirement for the GPTQ tests.
  2. GPTQ library: @require_gptq means these tests can run with either gptqmodel or auto-gptq (see the decorator sketch after the test commands below).

To run the GPTQ tests:

pytest tests/test_gpu_examples.py::PeftGPTQGPUTests
pytest tests/test_common_gpu.py::PeftCommonTests::test_lora_gptq_quantization_from_pretrained_safetensors
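For reference, a minimal sketch of what such a backend-agnostic decorator could look like (the helper names and import path, is_gptqmodel_available / is_auto_gptq_available from transformers, are assumptions and may differ from the actual PR):

# Sketch only: assumes transformers exposes these availability helpers.
import pytest

from transformers.utils.import_utils import is_auto_gptq_available, is_gptqmodel_available


def require_gptq(test_case):
    """Skip the decorated test unless either gptqmodel or auto-gptq is installed."""
    return pytest.mark.skipif(
        not (is_gptqmodel_available() or is_auto_gptq_available()),
        reason="test requires gptqmodel or auto-gptq",
    )(test_case)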

BenjaminBossan (Member)

> there is a weird pytest error I've never seen before,

This comes from a pytest plugin we use to monitor code coverage. Running python -m pip install pytest-cov should fix that for you.

(btw. code coverage is great to check if the new code you added is covered by unit tests)

jiqing-feng and others added 2 commits December 4, 2024 16:21
Signed-off-by: jiqing-feng <[email protected]>
* add get_gptq_quant_linear

* cleanup

* rename to quant_linear

* rename to get_gptqmodel_quant_linear

* rename to QuantLinear

* fix get device_map

* import hf_select_quant_linear

* pass checkpoint_format

* fix lora

* if is_gptqmodel_available, pass backend auto_trainable

* pass backend auto_trainable

* cleanup

* Update gptq.py

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>
jiqing-feng marked this pull request as ready for review on December 7, 2024 03:56
Qubitium (Contributor)

Qubitium commented Dec 10, 2024

@BenjaminBossan This PR is ready and has been tested on CPU, Intel/XPU, and NVIDIA/CUDA. However, please note that this PR is one of a cumulative set of three PRs: the primary one is the Optimum PR huggingface/optimum#2064 (awaiting review/approval), and both this PR and the Transformers PR huggingface/transformers#35012 depend on the Optimum PR being merged first.

jiqing-feng changed the title from "[WIP] add gptqmodel support" to "add gptqmodel support" on Dec 10, 2024
BenjaminBossan (Member)

Thanks for the update. Let's wait for the optimum and the transformers PRs to be merged first, since changes there could affect this PR. Feel free to ping me as soon as those PRs are merged.


github-actions bot commented Jan 9, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

BenjaminBossan (Member)

not stale

SunMarc (Member)

SunMarc commented Jan 15, 2025

The transformers PR was merged! Let's wait a few days before merging this one to make sure that there are no issues on the transformers side.

jiqing-feng (Contributor, Author)

Hi @BenjaminBossan , could you please trigger the CI for this PR? Thanks!

Qubitium (Contributor)

Qubitium commented Jan 16, 2025

@SunMarc @jiqing-feng We tested on our side today with the latest transformers main + optimum main + this PR, and the tests pass in our local environment.

BenjaminBossan (Member) left a comment

Nice to see progress with gptqmodel. I gave this PR a review and there are a couple of changes needed. Please check my comments. On top of those changes:

  1. Please run make style to make the linter happy.
  2. Please update the PEFT quantization docs: make sure to mention that gptqmodel is preferred over autogptq and give instructions on how to install it. It would also be nice to explain why gptqmodel is preferred; for this, a link to the transformers docs would be sufficient.

@@ -135,7 +137,7 @@ def _create_and_replace(

# If it is not an AdaLoraLayer, create a new module, else update it with new adapters
if not isinstance(target, AdaLoraLayer):
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
new_module = self._create_new_module(lora_config, adapter_name, target, self.model.hf_device_map, **kwargs)
BenjaminBossan (Member):

We can't assume that the model has a .hf_device_map attribute. Theoretically, the model could by any PyTorch model, it doesn't have to be a transformers model -- and even transformers models don't necessarily have the attribute. Therefore, we have to check for this attribute here and if it doesn't exist, don't pass it (or pass None).
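For illustration, the kind of guard being asked for could look roughly like this inside _create_and_replace (a sketch, not necessarily the code that ended up in the PR):

# Only pass a device map if the wrapped model actually exposes one; plain
# PyTorch models (and some transformers models) have no hf_device_map attribute.
device_map = getattr(self.model, "hf_device_map", None)
new_module = self._create_new_module(
    lora_config, adapter_name, target, device_map=device_map, **kwargs
)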

src/peft/tuners/adalora/model.py (outdated; resolved)
@@ -232,7 +232,7 @@ def _create_and_replace(
lora_bias=lora_config.lora_bias,
)
else:
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
new_module = self._create_new_module(lora_config, adapter_name, target, device_map=self.model.hf_device_map, **kwargs)
BenjaminBossan (Member):

Same comment about hf_device_map.

@@ -79,8 +80,7 @@
from peft.tuners.vera import Linear4bit as VeraLinear4bit


@require_non_cpu
BenjaminBossan (Member):

I don't think we can just remove this decorator, since this class contains a bunch of tests that are unrelated to gptqmodel. From my understanding, gptqmodel can be run on CPU. In this case, I would suggest moving the gptqmodel tests to a different test file and leaving this class otherwise untouched. They can be combined there with the PeftGPTQTests tests from test_gpu_examples.py.

The new test file (e.g. tests/test_gptqmodel.py) can then be run during the normal GitHub CI, which will just use CPU. However, if the tests are very slow on CPU, we need to either accelerate them or not run them on normal CI after all.

src/peft/tuners/lora/gptq.py (resolved)
Qubitium (Contributor)

> Regarding the tests: I assume for them to pass, we would need to install optimum and transformers from main, is that right?

Yes, this is correct. Both optimum and transformers need to be built from main.

> Regarding the quantization docs in PEFT, I agree that they could use some clean up and updates and that it should mostly refer to other sources. For historical context, when QLoRA was added to PEFT, this whole topic was still very new and it made sense to add the information here, as there were few other sources out there.

> Still, I think it would make sense, and require only a small effort, to add a section about gptqmodel. By providing a small snippet + install instructions (incl. min optimum + transformers versions), you can boost adoption of gptqmodel. Let's not get blocked by a possible rewrite of the quantization section.

@BenjaminBossan Thanks for the review. The requested changes have been made. I have added a short GPTQ section to the docs, linking to the gptqmodel and HF GPTQ docs for more about GPTQ in general. There is a very short sample snippet for GPTQ model quantization plus a one-line install instruction.
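For context, the kind of docs snippet described here presumably follows the standard transformers GPTQ flow, similar to the excerpt reviewed further below; the model ID and exact wording are assumptions:

# pip install gptqmodel  (plus recent optimum and transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # example model, not necessarily the one in the docs
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize to 4-bit GPTQ; gptqmodel is used automatically when it is installed.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)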

Makefile (outdated)
@@ -35,6 +35,9 @@ tests_common_gpu:
python -m pytest tests/test_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_decoder.log",)
python -m pytest tests/test_encoder_decoder_models.py $(if $(IS_GITHUB_CI),--report-log "common_encoder_decoder.log",)

test_gptqmodel_gpu:
BenjaminBossan (Member) commented Jan 17, 2025

This won't work, as this make target is never invoked anywhere. What I meant is to just add the line to tests_common_gpu above, which is already called in the appropriate setting.

Contributor:

fixed.

stevhliu (Member) left a comment

Yes, your proposal makes sense to me @Qubitium, especially if/when more quantization methods are added, this will get longer and longer. Best to either let the Optimum or Transformers docs handle it 👍

docs/source/developer_guides/quantization.md (outdated review threads; resolved)
Qubitium (Contributor)

@BenjaminBossan The Makefile changes have been made, and @stevhliu fixed some doc grammar and corrected the link to the HF Transformers quantization doc. Ready for re-review.

BenjaminBossan (Member) left a comment

Thanks for the updates.

Unfortunately, the CI is failing. I think this is because LoraModel._create_new_module has not been updated yet; it should add device_map=None as a parameter. Also, let's always pass device_map as a keyword argument and not positionally.
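A sketch of what the requested change amounts to (illustrative, not the merged code):

# In LoraModel, accept the device map as an optional keyword argument ...
@staticmethod
def _create_new_module(lora_config, adapter_name, target, device_map=None, **kwargs):
    ...

# ... and always pass it by keyword at the call sites, guarding the attribute access:
new_module = self._create_new_module(
    lora_config,
    adapter_name,
    target,
    device_map=getattr(self.model, "hf_device_map", None),
    **kwargs,
)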

On top of this, I think we got into a bit of a messy situation with the gptqmodel support in other HF libs. As a consequence, right now, the PEFT nightly CI is failing. This is because we have the following conditions there:

  • auto-gptq is installed
  • gptqmodel is not installed
  • transformers is installed from source
  • optimum uses the latest release

This results in the following error:

tests/test_gpu_examples.py:1414: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:564: in from_pretrained
    return model_class.from_pretrained(
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/modeling_utils.py:4135: in from_pretrained
    hf_quantizer.preprocess_model(
../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/quantizers/base.py:194: in preprocess_model
    return self._process_model_before_weight_loading(model, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <transformers.quantizers.quantizer_gptq.GptqHfQuantizer object at 0x7787215014d0>
model = OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_id...entwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=768, out_features=50272, bias=False)
)
kwargs = {'device_map': 'auto', 'keep_in_fp32_modules': []}

    def _process_model_before_weight_loading(self, model: "PreTrainedModel", **kwargs):
        if model.__class__.main_input_name != "input_ids":
            raise RuntimeError("We can only quantize pure text model.")
    
        if self.pre_quantized:
>           model = self.optimum_quantizer.convert_model(model, **kwargs)
E           TypeError: GPTQQuantizer.convert_model() got an unexpected keyword argument 'device_map'

../../../anaconda3/envs/peft-test/lib/python3.11/site-packages/transformers/quantizers/quantizer_gptq.py:104: TypeError

One step to close this is to ensure that gptqmodel is installed in the corresponding Dockerfile. But even so, this indicates that there are some missing checks in transformers. IIUC, the cause can be found here:

https://github.com/huggingface/transformers/blob/a142f161313199bcfa67afe1990d1f0f39a973bb/src/transformers/quantizers/quantizer_gptq.py#L76-L80

the optimum version is checked, which is good, but only if is_gptqmodel_available. Yet, even if it is unavailable, we later pass device_map to optimum_quantizer.convert_model, which fails for older optimum versions. This means we either have to always check the optimum version, even for auto-gptq, thus forcing the latest install (or rather, a source install, until the next release), or _process_model_before_weight_loading should not pass the device_map if auto-gptq is being used.
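To make the second option concrete, a hypothetical guard in GptqHfQuantizer._process_model_before_weight_loading could look roughly like this (a sketch only, not the fix that was eventually merged in transformers):

def _process_model_before_weight_loading(self, model: "PreTrainedModel", **kwargs):
    if model.__class__.main_input_name != "input_ids":
        raise RuntimeError("We can only quantize pure text model.")

    if self.pre_quantized:
        if not is_gptqmodel_available():
            # Older optimum releases do not accept device_map in
            # GPTQQuantizer.convert_model, so drop it on the auto-gptq path.
            kwargs.pop("device_map", None)
        model = self.optimum_quantizer.convert_model(model, **kwargs)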

Qubitium (Contributor)

Qubitium commented Jan 20, 2025

@BenjaminBossan

> • auto-gptq is installed
> • gptqmodel is not installed
> • transformers is installed from source
> • optimum uses the latest release

Can you confirm whether the AutoGPTQ release or source is installed?

The reason I ask is that if AutoGPTQ is built from source, I can fix this compat issue in AutoGPTQ with a PR/main push there so that it accepts device_map (even if unused), resolving the compat issue for both libs.

Another option is to add an optimum version check plus gptqmodel/auto-gptq checks in PEFT, as you noted.

I'm writing this on my phone and will do another pass tomorrow to find the best way out of this.

But to confirm: you want the tests to pass with a non-source build of optimum to ensure backward compatibility with older optimum, correct?

BenjaminBossan (Member)

Thanks for the quick response @Qubitium

> Can you confirm whether the AutoGPTQ release or source is installed?

The release version.

> The reason I ask is that if AutoGPTQ is built from source, I can fix this compat issue in AutoGPTQ with a PR/main push there so that it accepts device_map (even if unused), resolving the compat issue for both libs.

That would probably help with potential issues that users may encounter because of incompatible versions. We can switch to a from-source install for auto-gptq; a new release would be even better though :)

> you want the tests to pass with a non-source build of optimum to ensure backward compatibility with older optimum, correct?

Honestly, I'm not quite sure how much a typical PEFT user relies on specific optimum versions or if it's fine to ask them to install the latest version or even to install from source. We can probably do the latter if they need GPTQ, but we have to ensure that the user sees an error message that instructs them to do that, otherwise they'll be confused. And IIUC, this updated check has to be added to transformers.

Qubitium (Contributor)

> That would probably help with potential issues that users may encounter because of incompatible versions. We can switch to a from-source install for auto-gptq; a new release would be even better though :)

The problem is that the core AutoGPTQ maintainer is MIA: no CI control, no CI, and I was never given permission to cut a release. I don't even think the PyPI release process works anymore. I will test a dev/draft release to see whether an AutoGPTQ PyPI release is even feasible and whether I have release permissions.

BenjaminBossan (Member)

BenjaminBossan commented Jan 21, 2025

Btw., the GPTQ test failure caused by what I described above makes other, subsequent tests fail in our PEFT nightly CI (which runs GPU tests). The reason must be some strange side effect of the test failure, probably not directly related to GPTQ. Still, it would be great if the situation were resolved soon.

Update: We figured out the source of the bug causing those subsequent tests to fail; it's not directly related to GPTQ, but the failing GPTQ test causes this bug to surface.

Qubitium (Contributor)

@BenjaminBossan Please check PR huggingface/transformers#35830; it resolves the current backward compat issue.

BenjaminBossan (Member)

Thanks @Qubitium, LMK when this PR is ready for review.

Qubitium (Contributor)

Qubitium commented Jan 23, 2025

> Thanks @Qubitium, LMK when this PR is ready for review.

@BenjaminBossan The HF compat PR has been merged. The CI can be restarted to test compatibility.

BenjaminBossan (Member) left a comment

Thanks for your work on adding support for gptqmodel. The PR LGTM, I just have a small comment, but it's not directly related to the PEFT integration.

I noticed that for now, gptqmodel is not being tested on CI, neither for the normal CI that runs on PRs (CPU) nor for the nightly CI (GPU). This is because the package is not installed, so the tests are skipped (CPU) or auto-gptq is being used (GPU). However, I think we cannot just install gptqmodel, as the support also depends on the not-yet-released optimum changes (the nightly CI uses transformers installed from main, so we're fine there). So my suggestion for right now would be to keep things as they are and then, once optimum has a release, make the switch to gptqmodel on CI. WDYT?


gptq_config = GPTQConfig(bits=4, group_size=128, dataset="wikitext2", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
BenjaminBossan (Member):

When trying to run this locally with 2 CUDA devices, I encountered a CUDA error after 50% progress:

File ~/work/forks/transformers/src/transformers/models/opt/modeling_opt.py:559, in OPTDecoderLayer.forward(self, hidden_states, attention_mask, layer_head_mask, past_key_value, output_attentions, use_cache, position_ids)
    556 hidden_states = self.fc2(hidden_states)
    557 hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)
--> 559 hidden_states = (residual + hidden_states).view(hidden_states_shape)
    561 # 350m applies layer norm AFTER attention
    562 if not self.do_layer_norm_before:

RuntimeError: CUDA error: an illegal memory access was encountered

Is this a known problem? Using 1 CUDA device or setting CUDA_LAUNCH_BLOCKING=1 resolves the error.

I suspect that the error occurs at the "switch" from GPU 0 to GPU 1, since that's exactly after half the layers when using device_map="auto".
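For reproduction, the CUDA_LAUNCH_BLOCKING workaround mentioned above can be applied like this (the variable has to be set before any CUDA work, ideally before importing torch):

# Force synchronous CUDA kernel launches so errors surface at the offending op;
# as noted above, this also makes this particular error disappear.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the env var on purpose)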

Qubitium (Contributor) commented Jan 23, 2025

We will double-check this to see whether it is a) accelerate-specific or b) OPT-specific.

  • For GPTQModel, we do not test multi-GPU quantization since it is a net negative in terms of quantization speed.
  • For optimum, the GPU splitting is performed by accelerate, so this may be related to accelerate or to the OPT model.

For the next GPTQModel CI tests PR, I would recommend we move all model testing from OPT to Llama 1B. I believe OPT was chosen due to its tiny size, but in our experience there are some issues with the OPT modeling code (that I can't recall exactly) that cause strange failures here and there. We recently dropped all CI OPT tests in favor of Llama for this reason. Again, I can't seem to recall the exact reasons. =(

Basically, no one uses OPT anymore and modeling changes heavily favor Llama, so fringe bugs are much less likely to occur on Llama-class models.

BenjaminBossan (Member):

Agreed that opt is very outdated at this point, and we mainly use it since it's small, but at least for PEFT it hasn't caused any problems yet.

I ran the code above using meta-llama/Llama-3.2-1B and again got an error after 50%:

File ~/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/linear.py:125, in Linear.forward(self, input)
    124 def forward(self, input: Tensor) -> Tensor:
--> 125     return F.linear(input, self.weight, self.bias)

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Thus it's unlikely to be related to the model architecture. CUDA_LAUNCH_BLOCKING=1 again was enough to resolve the issue.

Qubitium (Contributor)

> So my suggestion for right now would be to keep things as they are and then, once optimum has a release, make the switch to gptqmodel on CI. WDYT?

Yes, my thoughts align with this too.

BenjaminBossan (Member) left a comment

Thanks everyone for the good work. Then let's add gptqmodel to the CI later when all necessary releases are out (optimum 1.24.0, transformers 4.49.0). I created #2342 to track this.

BenjaminBossan merged commit 6e30991 into huggingface:main on Jan 23, 2025
14 checks passed