[INTEGRATION] Expose stable kernel/packing/repacking apis #726
Comments
@wenhuach21 It appears there are two issues.
We need your Linux OS version, kernel version, and libc/glibc version.
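The requested details can be gathered with Python's standard `platform` module. The sketch below is only an illustration: the helper names `parse_version`, `glibc_is_new_enough`, and `collect_env_info` are made up for this example, and the `(2, 32)` threshold is taken from the `GLIBC_2.32' not found` message reported in this issue. On systems that do not use glibc, `platform.libc_ver()` may return empty strings, which the helper reports as "unknown" (`None`).

```python
import platform

# Threshold taken from the "GLIBC_2.32 not found" error quoted in this issue.
REQUIRED_GLIBC = (2, 32)


def parse_version(version):
    """Turn '2.31' into (2, 31); return None if it is not dotted integers."""
    try:
        return tuple(int(part) for part in version.split("."))
    except ValueError:
        return None


def glibc_is_new_enough(version, required=REQUIRED_GLIBC):
    """Return True/False, or None when the version string cannot be parsed."""
    parsed = parse_version(version)
    if parsed is None:
        return None
    return parsed >= required


def collect_env_info():
    """Collect the OS, kernel, libc, and Python details asked for above."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "os": platform.platform(),
        "kernel": platform.release(),
        "libc_name": libc_name,
        "libc_version": libc_version,
        "python": platform.python_version(),
    }


info = collect_env_info()
for key, value in info.items():
    print(f"{key}: {value}")
print("glibc new enough for prebuilt kernel:",
      glibc_is_new_enough(info["libc_version"]))
```

Pasting this output into the issue gives the maintainers the OS, kernel, and libc versions in one go.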
Thanks. @CSY-ModelCloud
We have renamed gptqmodel_marlin_cuda_inference. Can you pull the latest code, delete the build directory, and then pip install again?
Got it. It would be beneficial for GPTQModel to provide a backward-compatible API for layer packing and repacking that accommodates both the original AutoGPTQ linear layer and your/AutoRound fixed zero-point layer in future implementations. This would let the AutoRound side rely seamlessly on your CUDA kernels for Marlin, asymmetric quantization, and other operations.
We are adding this. Tracking PR: #713. The code is not ready; we are still finalizing it. That PR holds links to the hf/optimum PR that will be submitted upstream.
[1-3] https://github.com/ModelCloud/GPTQModel/pull/727/files We will expose the 3. Correction: 4.
Thanks for the info. However, this may not help on our side: we need layer-wise packing and repacking, since AutoRound could support mixed bits or mixed group sizes.
@wenhuach21 We are currently refactoring to make sure gptqmodel is correctly integrated into transformers/optimum/peft. Can you list the exact APIs you want? Feel free to imagine any and all APIs you would like to have so that AutoRound can work with our kernels. API stability can be enforced by locking the package dependency to a specific release, as we can't promise that internal APIs will always be stable. Please send a detailed description, preferably with pseudocode illustrating the usage, so I can visualize actual usage scenarios. Be as detailed as possible.
@wenhuach21 Our refactor is complete, and we are preparing for the transformers/optimum/peft upstream PRs to be merged and integrated. Now is a good time to review exactly what you and the intel/auto-round team need from us, explicitly at the code level. Please provide detailed examples (pseudocode is okay) showing which APIs we need to expose.
Sorry for the delayed response. At the moment, the following come to mind, as we want to support mixed-bit quantization later.
Symmetric Quantization
layer.pack(xxx, backend="marlin") ## Packs the layer using the specified format. Actually, WrapperLinear is OK if there is no big change in the future.
check_packing_feasibility(xxx, backend) ## Checks whether the layer and its quantization config can be packed with the specified backend.
check_best_packing_format(xxx, target_device="cuda") ## Returns the best-performing format in your repository for the specified bit-width and group size.
Asymmetric Quantization
Sorry, I forgot the repacking API and the pos_init API.
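To make the request above easier to visualize, here is a minimal, self-contained sketch of how `check_packing_feasibility` and `check_best_packing_format` might behave for a model with mixed bits per layer. Everything in it is an assumption for illustration: `LayerQuantConfig`, `BACKEND_SUPPORT`, `DEVICE_PREFERENCE`, the `"generic"` backend name, and all capability values are placeholders, not GPTQModel's actual API or the real constraints of any kernel. The repacking and pos_init APIs are not sketched because their intended signatures are not described in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerQuantConfig:
    """Hypothetical per-layer quantization settings."""
    bits: int
    group_size: int
    sym: bool


# Placeholder capability table. These values are NOT the real kernel
# constraints; a real implementation would query the library instead.
BACKEND_SUPPORT = {
    "marlin": {"bits": {4}, "group_sizes": {-1, 128}, "sym_only": True},
    "generic": {"bits": {2, 3, 4, 8}, "group_sizes": {-1, 32, 64, 128},
                "sym_only": False},
}

# Placeholder preference order per device, fastest first.
DEVICE_PREFERENCE = {"cuda": ["marlin", "generic"], "cpu": ["generic"]}


def check_packing_feasibility(cfg, backend):
    """Return True if this layer config could be packed with `backend`."""
    caps = BACKEND_SUPPORT.get(backend)
    if caps is None:
        return False
    if cfg.bits not in caps["bits"]:
        return False
    if cfg.group_size not in caps["group_sizes"]:
        return False
    if caps["sym_only"] and not cfg.sym:
        return False
    return True


def check_best_packing_format(cfg, target_device="cuda"):
    """Return the first preferred backend that can pack `cfg`, else None."""
    for backend in DEVICE_PREFERENCE.get(target_device, []):
        if check_packing_feasibility(cfg, backend):
            return backend
    return None


# Mixed-bit model: each layer carries its own config, so the backend is
# chosen layer by layer rather than once for the whole model.
layer_configs = {
    "layers.0.q_proj": LayerQuantConfig(bits=4, group_size=128, sym=True),
    "layers.0.k_proj": LayerQuantConfig(bits=4, group_size=128, sym=False),
    "layers.1.q_proj": LayerQuantConfig(bits=2, group_size=32, sym=True),
}
chosen = {name: check_best_packing_format(cfg)
          for name, cfg in layer_configs.items()}
```

With these placeholder tables, only the symmetric 4-bit layer is routed to `"marlin"`; the asymmetric and 2-bit layers fall back to `"generic"`. That per-layer decision is what a layer-wise packing API would need to support.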
@wenhuach21 Feel free to open a WIP PR and make core changes as you see fit. I can monitor it, and we can also connect on Teams to smooth out ideas. The only things I would require are below:
When installing with pip, the Marlin kernel could not be found:
ValueError: Trying to use the marlin backend, but could not import the C++/CUDA dependencies with the following error: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/wenhuach/anaconda3/envs/autoround/lib/python3.10/site-packages/gptqmodel_marlin_cuda_inference.cpython-310-x86_64-linux-gnu.so)
When installing from source