Implement AWQ quantization support for LLaMA #1032
Conversation
Add awq improvements
Merge linear layers
More improvements awq
…d_awq_quant_support
Optimizing GEMM kernels for high throughput is an open problem when it comes to quantized models. My own tests indicated the kernel is fastest below batch size 8 and reaches performance equal to FP16 at batch size 16.
Thanks for letting us know! Do you think it would be possible to use OpenAI Triton to write a more optimized AWQ kernel?
Great work on the merge. I'll start releasing AWQ models this weekend. Will aim to do a couple of hundred by early next week. Primarily Llama 2 models, plus some of the old still-popular Llama 1s.
I would absolutely love it if someone wrote a better kernel, whether in Triton or CUDA. I do believe a Triton kernel is possible; there are just a few obstacles, like how to dequantize and how to run optimized instructions like asm volatile. Triton kernels exist for GPTQ, but it has also been shown that CUDA kernels are faster (ExLlama v2).
Got it. Thanks for letting us know! Actually, there's no blocker for us to add AWQ support to other model types. Once you upload the AWQ models, we can add other model support accordingly.
If you want something to test with, I have repositories for Falcon and MPT. I am sure The Bloke will send out so many more models soon though. https://huggingface.co/casperhansen/mpt-7b-8k-chat-awq
Got it. So you mean the dequantization logic is not easy to implement in Triton, right? Hmm... I think for now we will focus on code cleanup & supporting different quantization schemes without considering their performance much, and we will get back to this performance issue later.
Yes, dequantization is easily the trickiest part to get right. The rest is normal FP16 operations. I agree on the strategy. When you get to GPTQ, you should definitely explore all the optimized work like ExLlama v2. It's the fastest repository for running GPTQ models at batch size 1 but has not been tested much for high throughput.
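To make the dequantization step discussed above more concrete, here is a minimal, illustrative PyTorch sketch of group-wise int4 dequantization. This is not the kernel from this PR: the real AWQ CUDA kernels use an interleaved packing order and fused FP16 math, whereas this sketch assumes a simple sequential nibble layout and the tensor shapes noted in the comments.

```python
import torch

def dequantize_int4(qweight, scales, qzeros, group_size=128):
    # Assumed (simplified) layout, not the exact AWQ on-disk format:
    #   qweight: [in_features, out_features // 8]               int32, 8 nibbles per entry
    #   scales:  [in_features // group_size, out_features]      fp16
    #   qzeros:  [in_features // group_size, out_features // 8] int32
    shifts = torch.arange(0, 32, 4, device=qweight.device)  # one shift per packed nibble

    # Unpack 4-bit weights -> [in_features, out_features]
    w = (qweight[:, :, None] >> shifts[None, None, :]) & 0xF
    w = w.reshape(qweight.shape[0], -1)

    # Unpack 4-bit zero points -> [num_groups, out_features]
    z = (qzeros[:, :, None] >> shifts[None, None, :]) & 0xF
    z = z.reshape(qzeros.shape[0], -1)

    # Map each input row to its quantization group and dequantize.
    g = torch.arange(w.shape[0], device=w.device) // group_size
    return (w.half() - z[g].half()) * scales[g]
```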
@WoosukKwon Doesn't it support the Turing arch? My GPU's compute capability is 7.5, CUDA 12.1. Build error message:
If not, I hope backward compatibility can be added for the kernel build.
No, it supports SM80 and up, so Ampere and later are supported.
@casper-hansen OK... do you mean AWQ quantization doesn't support compute capability <8.0? And for <8.0, is GPTQ the only choice for now?
Yes, Turing is not supported, so the T4 is not supported. The GEMM kernel makes use of tensor cores, but some instructions in the kernel make it incompatible with 7.5.
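For reference, checking the compute capability up front makes the constraint above easy to verify before enabling AWQ; a small sketch using PyTorch's device-capability query:

```python
import torch

# The AWQ GEMM kernels discussed above require SM80 (Ampere) or newer.
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 0):
    print(f"Compute capability {major}.{minor} < 8.0: AWQ kernels not supported (e.g. T4 / SM 7.5).")
else:
    print(f"Compute capability {major}.{minor}: AWQ kernels supported.")
```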
@casper-hansen Thanks for your reply.
@WoosukKwon Would you mind providing example usage of the AWQ support for a Llama 2 fine-tuned variant like this one?
@WoosukKwon After more research, I found int4 dequantization kernels for Triton below. Seems doable; they just need to fit the AWQ format.
We tested AWQ with vLLM as well and found it hard to get a throughput improvement with the W4A16 method. Therefore, we implemented a W8A8 method (producing int8 weights with SmoothQuant), which can increase throughput by 20%, in this PR: #1112
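For readers unfamiliar with the approach mentioned above, here is a minimal sketch of the SmoothQuant idea (not the actual code from the linked PR): per-channel smoothing factors migrate activation outliers into the weights, which are then quantized to int8. The function name and tensor layout are illustrative assumptions.

```python
import torch

def smooth_and_quantize_w8(weight, act_max, alpha=0.5):
    # weight:  [out_features, in_features] fp16/fp32 linear weight
    # act_max: [in_features] per-channel max |activation| from calibration data
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)             # [in_features]
    s = (act_max.clamp(min=1e-5) ** alpha) / (w_max ** (1 - alpha))
    w_smoothed = weight * s                                      # 1/s is absorbed into the activations
    scale = w_smoothed.abs().amax(dim=1, keepdim=True) / 127.0   # per-output-channel int8 scale
    w_int8 = torch.clamp((w_smoothed / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale, s
```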
It is not only about the throughput. It is also the much worse quality (perplexity) of 4-bit weight-quantized models compared to 8-bit ones. I've tested 4-bit AWQ for my use cases and it is just not cutting it. The heavily quantized model makes frequent small mistakes. Running the same model at full 16 bits (or from an 8-bit GGUF using
How did you guys deal with this: casper-hansen/AutoAWQ#234
I implemented Triton kernels for AWQ inference. They are much faster than the existing CUDA kernels, especially at larger batch sizes. They are also simpler (the core kernel is ~50-100 lines of Triton). The weight format is a bit different (it uses the most recent AWQ weight format, although the weights are transposed):
You can find the code here: https://github.com/vedantroy/gpu_kernels/tree/main Warning: I'm pretty sure my kernels are correct, but not 100%. I'll be emailing the authors to double-check.
Co-authored-by: Robert Irvine <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Casper <[email protected]>
Co-authored-by: julian-q <[email protected]>
This PR adds initial support for AWQ. To minimize the scope, it only covers LLaMA models for now. This PR is adapted from the great previous PRs #762 and #926 written by @ri938, @casper-hansen, and @julian-q.
NOTE: We need to do refactoring after this PR. In particular, the weight loading logic and ParallelLinear layers are messy at the moment.
Tested:
python examples/llm_engine_example.py --model casperhansen/vicuna-7b-v1.5-awq --quantization awq
python examples/llm_engine_example.py --model casperhansen/vicuna-7b-v1.5-awq --quantization awq -tp 2
python examples/llm_engine_example.py --model abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq --quantization awq -tp 2
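For reference, a rough Python-API equivalent of the tested commands above (a sketch that assumes the quantization argument added in this PR is exposed through the LLM entry point):

```python
from vllm import LLM, SamplingParams

# Sketch: mirrors `--model casperhansen/vicuna-7b-v1.5-awq --quantization awq -tp 2`.
llm = LLM(
    model="casperhansen/vicuna-7b-v1.5-awq",
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```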