AutoGPTQ or AutoGPTQ-bugfix? #57
Comments
Actually, I also tried to quantize some models with AutoGPTQ from https://github.com/AutoGPTQ/AutoGPTQ, and the quality seemed to be worse.
Whether or not the PPL is better with AutoGPTQ-bugfix, the following is worth noting: if you save a checkpoint with AutoGPTQ-bugfix, the model will not work properly with vLLM, because their GPTQ kernels seem to make use of this "zeros +- 1" trick: https://github.com/vllm-project/vllm/blob/main/csrc/quantization/gptq/q_gemm.cu#L172-L175.
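To illustrate what that trick means, here is a minimal Python sketch of the dequantization math (not vLLM's actual CUDA code; all names here are mine):

```python
import numpy as np

def dequantize_gptq(qweight, qzeros, scales):
    """Affine dequantization with the "zeros + 1" offset.

    Sketch only: the real kernel unpacks 4-bit values from int32
    words in CUDA; here qweight/qzeros are already plain integers.
    """
    # Checkpoints packed by the original AutoGPTQ store (zero_point - 1)
    # in qzeros, so a compatible kernel has to add the one back:
    zeros = qzeros + 1
    # Standard affine dequantization: w = (q - z) * s
    return (qweight.astype(np.float32) - zeros) * scales
```

A checkpoint saved by AutoGPTQ-bugfix stores the true zero point with no -1 at pack time, so a kernel like this shifts every weight by one quantization step, which matches the breakage described above.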
AutoGPTQ-bugfix is OK. Sorry for the previous confusion: the official AutoGPTQ repo had merged the "zeros +- 1" solution before. However, the solution was reverted due to some incompatibility; please refer to AutoGPTQ/AutoGPTQ#354 for more details.
@ChenMnZ Ok! Thank you. Well, the picture became clearer, but still not quite 😅 Which is better: the fixed version or not? And why was there even a need for such a fix?

As far as I understood, the situation is the following. AutoGPTQ assumes that quantization is symmetric. However, that may not be the case (OmniQuant uses asymmetric quantization by default). What is more, there is no way to tell AutoGPTQ's QuantLinear that the quantization is not symmetric. All public GPTQ-packed real-quantized models that I have come across are symmetrically quantized (for example, Llama-2-13B by TheBloke). Personally, I tested OmniQuant on the Phi-2 model: symmetric quantization resulted in a good-quality GPTQ model, whereas asymmetric quantization led to a broken real-quant GPTQ model.

So it seems that this "AutoGPTQ or AutoGPTQ-bugfix" question may really be a "symmetric or asymmetric quantization" question. At the very least, the real-quant (GPTQ) backend may not always be compatible with the way a model was actually quantized.

P.S. Sorry for my late response 😅
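To make the symmetric-vs-asymmetric distinction concrete, here is a simplified sketch of how the two schemes derive their parameters (my own code with hypothetical names, not OmniQuant's quantizer):

```python
import torch

def quant_params(w, bits=4, symmetric=True):
    """Per-group scale and zero point for uniform quantization
    (simplified sketch)."""
    qmax = 2 ** bits - 1
    if symmetric:
        # The zero point is the same constant (8 for 4-bit) in every
        # group, so any fixed packing convention treats all groups
        # identically.
        scale = w.abs().amax(dim=-1, keepdim=True) / (2 ** (bits - 1) - 1)
        zero = torch.full_like(scale, 2 ** (bits - 1))
    else:
        # The zero point varies per group and must round-trip through
        # qzeros (i.e., survive the "zeros +- 1" convention) exactly.
        wmin = w.amin(dim=-1, keepdim=True)
        wmax = w.amax(dim=-1, keepdim=True)
        scale = (wmax - wmin).clamp(min=1e-5) / qmax
        zero = torch.round(-wmin / scale)
    return scale, zero

def quantize(w, scale, zero, bits=4):
    # q = clamp(round(w / s) + z, 0, 2^bits - 1)
    return torch.clamp(torch.round(w / scale) + zero, 0, 2 ** bits - 1)
```

With the symmetric scheme the zero point is a known constant, so a packer and a kernel that hard-code matching conventions agree for every group; with the asymmetric scheme the stored per-group zero points are the only source of truth, so a packer/kernel mismatch corrupts them.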
Could you please tell me a way to make the model produced by running …
@lqzzy Hello! I am afraid I can't help you with this 😅 Personally, I used vLLM only with GPTQ: GPTQ reduces the model size, and vLLM boosts the model (it accelerates inference even of FP-precision models). So maybe there is no need to quantize activations if you use vLLM?) However, if you really want to use OmniQuant …
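For reference, serving a weight-only GPTQ checkpoint with vLLM looks like this (the model name is just an example of a public GPTQ-packed model):

```python
from vllm import LLM, SamplingParams

# Any GPTQ-packed checkpoint works here; this one is an example.
llm = LLM(model="TheBloke/Llama-2-13B-GPTQ", quantization="gptq")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```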
Some time ago, the README linked to the "fixed version" of AutoGPTQ: AutoGPTQ-bugfix. However, the current README links to the original repo: AutoGPTQ.

So, does this mean that everything is OK with AutoGPTQ real quantization now and we no longer need the fixed repo?
I am asking because, for example, the fix for the triton qlinear was the following (link1, link2):
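(The linked diff is not reproduced here; as far as I can tell, it amounts to dropping the +-1 offset around the zero points. A paraphrase, not the literal patch:)

```python
# My reading of the fix, not the literal diff from link1/link2.

def pack_zeros_original(zeros):      # original AutoGPTQ qlinear
    return zeros - 1                 # stores (zero point - 1)

def unpack_zeros_original(qzeros):
    return qzeros + 1                # kernel adds the one back

def pack_zeros_bugfix(zeros):        # AutoGPTQ-bugfix qlinear
    return zeros                     # stores the true zero point

def unpack_zeros_bugfix(qzeros):
    return qzeros

z = 5
assert unpack_zeros_original(pack_zeros_original(z)) == z    # consistent pair
assert unpack_zeros_bugfix(pack_zeros_bugfix(z)) == z        # consistent pair
assert unpack_zeros_original(pack_zeros_bugfix(z)) == z + 1  # mixed: broken
```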
However, in AutoGPTQ there is still such a `zeros` modification (link). So, it seems that the original AutoGPTQ might still have some problems?..