Implement AWQ quantization support for LLaMA #1032
Conversation
Add awq improvements
Merge linear layers
More improvements awq
…d_awq_quant_support
Optimizing GEMM kernels for high throughput is an open problem when it comes to quantized models. My own tests indicated the kernel is fastest below batch size 8 and reaches performance equal to FP16 at batch size 16.
Thanks for letting us know! Do you think it would be possible to use OpenAI Triton to write a more optimized AWQ kernel?
Great work on the merge. I'll start releasing AWQ models this weekend. Will aim to do a couple of hundred by early next week. Primarily Llama 2 models, plus some of the old still-popular Llama 1s.
I would absolutely love it if someone wrote a better kernel, whether in Triton or CUDA. I do believe a Triton kernel is possible; there are just a few obstacles, like how to dequantize and how to run optimized instructions like asm volatile. Triton kernels exist for GPTQ, but it has also been shown that CUDA kernels are faster (ExLlama v2).
Got it. Thanks for letting us know! Actually, there's no blocker for us to add AWQ support to other model types. Once you upload the AWQ models, we can add other model support accordingly.
If you want something to test with, I have repositories for Falcon and MPT. I am sure The Bloke will send out so many more models soon though. https://huggingface.co/casperhansen/mpt-7b-8k-chat-awq
Got it. So you mean the dequantization logic is not easy to implement in Triton, right? Hmm... I think for now we will focus on code cleanup & supporting different quantization schemes without considering their performance much, and we will get back to this performance issue later.
Yes, dequantization is easily the trickiest part to get right. The rest is normal FP16 operations. I agree on the strategy. When you get to GPTQ, you should definitely explore all the optimized work like ExLlama v2. It's the fastest repository for running GPTQ models at batch size 1 but has not been tested much for high throughput.
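To make the dequantization step discussed above more concrete, here is a minimal, illustrative PyTorch sketch of group-wise int4 dequantization. This is not the kernel from this PR: the real AWQ CUDA kernels use an interleaved packing order and fused FP16 math, whereas this sketch assumes a simple sequential nibble layout and the tensor shapes noted in the comments.

```python
import torch

def dequantize_int4(qweight, scales, qzeros, group_size=128):
    # Assumed (simplified) layout, not the exact AWQ on-disk format:
    #   qweight: [in_features, out_features // 8]               int32, 8 nibbles per entry
    #   scales:  [in_features // group_size, out_features]      fp16
    #   qzeros:  [in_features // group_size, out_features // 8] int32
    shifts = torch.arange(0, 32, 4, device=qweight.device)  # one shift per packed nibble

    # Unpack 4-bit weights -> [in_features, out_features]
    w = (qweight[:, :, None] >> shifts[None, None, :]) & 0xF
    w = w.reshape(qweight.shape[0], -1)

    # Unpack 4-bit zero points -> [num_groups, out_features]
    z = (qzeros[:, :, None] >> shifts[None, None, :]) & 0xF
    z = z.reshape(qzeros.shape[0], -1)

    # Map each input row to its quantization group and dequantize.
    g = torch.arange(w.shape[0], device=w.device) // group_size
    return (w.half() - z[g].half()) * scales[g]
```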
@WoosukKwon Doesn't it support the Turing arch? My GPU's compute capability is 7.5, CUDA 12.1. Build error message:
If not, I hope backward compatibility can be added for the kernel build.
No, it supports SM80 and up, so Ampere and later are supported.
@casper-hansen OK... do you mean AWQ quantization doesn't support compute capability <8.0? And for <8.0, is GPTQ the only choice for now?
Yes, Turing is not supported, so the T4 is not supported. The GEMM kernel makes use of tensor cores, but some instructions in the kernel make it incompatible with 7.5.
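For reference, checking the compute capability up front makes the constraint above easy to verify before enabling AWQ; a small sketch using PyTorch's device-capability query:

```python
import torch

# The AWQ GEMM kernels discussed above require SM80 (Ampere) or newer.
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 0):
    print(f"Compute capability {major}.{minor} < 8.0: AWQ kernels not supported (e.g. T4 / SM 7.5).")
else:
    print(f"Compute capability {major}.{minor}: AWQ kernels supported.")
```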
@casper-hansen Thanks for your reply.
@WoosukKwon Would you mind providing example usage of the AWQ support for a Llama 2 fine-tuned variant like this one?
@WoosukKwon After more research, I found int4 dequantization kernels for Triton below. Seems doable; they just need to fit the AWQ format.
We tested AWQ with vLLM as well and found it hard to get a throughput improvement with the W4A16 method. Therefore, we implemented a W8A8 method (producing int8 weights with SmoothQuant), which can increase throughput by 20%, in this PR: #1112
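For readers unfamiliar with the approach mentioned above, here is a minimal sketch of the SmoothQuant idea (not the actual code from the linked PR): per-channel smoothing factors migrate activation outliers into the weights, which are then quantized to int8. The function name and tensor layout are illustrative assumptions.

```python
import torch

def smooth_and_quantize_w8(weight, act_max, alpha=0.5):
    # weight:  [out_features, in_features] fp16/fp32 linear weight
    # act_max: [in_features] per-channel max |activation| from calibration data
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)             # [in_features]
    s = (act_max.clamp(min=1e-5) ** alpha) / (w_max ** (1 - alpha))
    w_smoothed = weight * s                                      # 1/s is absorbed into the activations
    scale = w_smoothed.abs().amax(dim=1, keepdim=True) / 127.0   # per-output-channel int8 scale
    w_int8 = torch.clamp((w_smoothed / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale, s
```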
It is not only about the throughput. It is also the much worse quality (perplexity) of 4-bit weight-quantized models compared to 8-bit ones. I've tested 4-bit AWQ for my use cases and it is just not cutting it. The heavily quantized model makes frequent small mistakes. Running the same model at full 16 bits (or from an 8-bit GGUF using
How did you guys deal with this: casper-hansen/AutoAWQ#234
I implemented Triton kernels for AWQ inference. They are much faster than the existing CUDA kernels, especially at larger batch sizes. They are also simpler (the core kernel is ~50-100 lines of Triton). The weight format is a bit different (it uses the most recent AWQ weight format, although the weights are transposed):
You can find the code here: https://github.com/vedantroy/gpu_kernels/tree/main Warning: I'm pretty sure my kernels are correct, but not 100%. I'll be emailing the authors to double-check.
Co-authored-by: Robert Irvine <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Casper <[email protected]>
Co-authored-by: julian-q <[email protected]>
This PR adds initial support for AWQ. To minimize the scope, it only covers LLaMA models for now. This PR is adapted from the great previous PRs #762 and #926 written by @ri938, @casper-hansen, and @julian-q.
NOTE: We need to do refactoring after this PR. In particular, the weight loading logic and ParallelLinear layers are messy at the moment.
Tested:
python examples/llm_engine_example.py --model casperhansen/vicuna-7b-v1.5-awq --quantization awq
python examples/llm_engine_example.py --model casperhansen/vicuna-7b-v1.5-awq --quantization awq -tp 2
python examples/llm_engine_example.py --model abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq --quantization awq -tp 2
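For reference, a rough Python-API equivalent of the tested commands above (a sketch that assumes the quantization argument added in this PR is exposed through the LLM entry point):

```python
from vllm import LLM, SamplingParams

# Sketch: mirrors `--model casperhansen/vicuna-7b-v1.5-awq --quantization awq -tp 2`.
llm = LLM(
    model="casperhansen/vicuna-7b-v1.5-awq",
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```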