Support int8 KVCacheQuant and W8A8 inference in vllm #1112
Conversation
This is interesting work! I was going to implement int8 in AutoAWQ over time, as the authors of SmoothQuant (this PR) and AWQ are the same. My best guesstimate is that ...
I fully support this, since the 4-bit AWQ model proved to have inferior quality for my use cases. Having 8-bit weights with an 8-bit activation cache would be the best of both worlds, allowing almost no loss of quality (perplexity) while running inference more efficiently. I would also keep a W8A16 mode as an option, should the precision of the activations and the KV cache make a difference in specific use cases.
Hi vLLM geniuses @WoosukKwon @zhuohan123. This is the latest development from our team regarding quantization support for vLLM; we had done something similar to #1032 before. At that time, we didn't open a PR after the benchmark results showed a drop in throughput, but later we found out that #1032 was merged, which is very encouraging. Therefore, we continued to optimize performance on this basis and are sending out this PR in a WIP state in advance, hoping to get some comments and suggestions and finally merge it into the vLLM codebase smoothly. Cheers!
We implement quantization with the SmoothQuant method for W8A8; I will release the code later. The perplexity is identical to the standard SmoothQuant method if you do W8A8 inference without int8 KVCacheQuant. Quantization details are discussed in this paper (Xiao et al.).
SmoothQuant only supports OPT models. How can we test this PR when the SmoothQuant repository does not support LLaMA models? If you implement this PR without the quantization code, you will inevitably end up with bad perplexity if you naively use W8A8, as you have no calibration dataset. See this image: accuracy ends up being worse than INT4 if you naively convert weights to W8A8. You need the SmoothQuant or AWQ method to convert if you want to preserve accuracy. You need a framework for this, which is why I created AutoAWQ. I will look to implement INT8 quantization using the torch-int modules and would love your help with this so we can support all models in vLLM (LLaMA, MPT, Falcon, etc.) without accuracy degradation.
We implemented SmoothQuant for LLaMA ourselves; you can find the code here: https://github.com/AniZpZ/smoothquant/tree/llama-dev and easily quantize and export the model with export_int8_llama.py.
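For readers unfamiliar with the method, here is a minimal sketch of the SmoothQuant smoothing step, assuming PyTorch tensors and a per-channel activation absmax gathered from a calibration run; the function name and inputs are illustrative assumptions, and the actual implementation lives in the llama-dev branches linked above.

```python
import torch

@torch.no_grad()
def smooth_linear(weight: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Migrate activation outliers into the weights (SmoothQuant-style).

    weight:     [out_features, in_features]
    act_absmax: [in_features], per-channel max |activation| from calibration
    Returns the per-channel scales and the smoothed weight.
    """
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)              # [in_features]
    scales = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    # Activations are divided by `scales` (typically folded into the preceding
    # RMSNorm weights), while the weight input channels are multiplied by `scales`.
    smoothed_weight = weight * scales                                 # broadcast over rows
    return scales, smoothed_weight
```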
Hi @AniZpZ @zhyncs, thank you for your great work with this PR. I have now had more time to explore your fast implementation and found that NVIDIA only has high-throughput support for INT8, which lets this PR achieve higher throughput than INT4 due to software capabilities. Is your proposal to run W8A16? Your code does not appear to have A8 implemented; SmoothQuant implements W8A8, but it seems silly to run A8 as there should be little speed benefit. Therefore, I see this as a natural choice. I want to confirm this with you for my implementation in AutoAWQ, as I want to push INT8 models out using your initial LLaMA implementation, just using the AWQ method for minimum perplexity loss.
Our proposal is to run in W8A8. If you enable SmoothQuant, we replace RMSNorm and the linear layers with our custom int8 RMSNorm and W8A8 linears, which quantize activations and implement int8 GEMM. You can find the details in w8a8linear.py.
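To make the replacement concrete, here is a rough sketch of what such a W8A8 linear does, assuming symmetric per-tensor int8 quantization; this is not the actual w8a8linear.py, and the int8 GEMM (CUTLASS/cuBLAS in the PR) is emulated with a plain int32 matmul for clarity.

```python
import torch

class W8A8LinearSketch(torch.nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # Offline weight quantization with a single symmetric scale.
        w_scale = weight.abs().max() / 127.0
        self.register_buffer("w_int8", torch.round(weight / w_scale).clamp(-128, 127).to(torch.int8))
        self.register_buffer("w_scale", w_scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quantize activations to int8 on the fly.
        x_scale = x.abs().max() / 127.0
        x_int8 = torch.round(x / x_scale).clamp(-128, 127).to(torch.int8)
        # Int8 GEMM accumulating in int32 (stand-in for the custom CUDA kernel).
        acc = x_int8.to(torch.int32) @ self.w_int8.t().to(torch.int32)
        # Dequantize the int32 accumulator back to floating point.
        return acc.to(x.dtype) * (x_scale * self.w_scale)
```

The real kernel fuses the activation quantization and runs the matmul on int8 tensor cores, which is where the throughput gain comes from.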
__float22bfloat162_rn is a CUDA function; could you please check your CUDA version? The code has been tested with CUDA 11.8.
@AniZpZ
Could you please check your environment variable
It looks like a problem occurred because the GPU arch does not match the requirement. Please check your GPU arch.
@AniZpZ @HandH1998
Which branch of vLLM does KV cache int8 run on? My command is: python -m vllm.entrypoints.api_server --model=/home/models/quant/smoothquant/Codellama-13b-int8-02/CodeLlama-13b-hf-smoothquant/ --tokenizer /home/models/CodeLlama-13b-hf/ --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/home/models/quant/kv-cache-turbomind/
@HandH1998
Method 1 is weight-only quantization. Please use our new branch (#1508) to test W8A8 inference.
This branch mixes KV cache quant with W8A8 model quant. Please try KV cache quantization with our new branch (#1507).
With #1507, this issue also occurs. I think that if KV cache quant is not enabled, it should not be worse.
@AniZpZ
My calibration command: Additional info:
@warlock135 Hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get keys and values.
Any hope for 8-bit quantization to be merged any time soon? 4-bit models don't provide great results for complicated cases, unfortunately.
We hope this feature can be merged soon, too. But it doesn't seem to be getting much attention from the main reviewers. If possible, we hope you can raise an issue. Thanks.
@HandH1998 Can the latest transformers version currently be used for quantization with the latest AutoSmoothQuant (https://github.com/AniZpZ/AutoSmoothQuant.git)?
@HandH1998 When I used transformers 4.36.2 to perform quantization in the AutoSmoothQuant project, the following error message appeared:
You need to use transformers==4.34.0.
What do you think about this KV cache quant technique? https://github.com/jy-yuan/KIVI
Thank you. We will look into the technique.
Hi @AniZpZ, is there any progress on the lower-bit quantization?
@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vLLM kv-quant-merge branch?
We are working on lower-bit quantization.
The environment requirements should be similar to the original vLLM.
@AniZpZ When I build smoothquant from source, it requires CUDA 11.8, while building the kv-quant-merge branch of vllm from source requires CUDA 12.0. Is that expected? Can you kindly provide a list of CUDA/torch/transformers/ray dependencies for using smoothquant with your vllm branch? Thanks in advance.
Closing as we do have W8A8 and Int8 support nowadays. 🙏
Hi, is there any plan to support int4 KVCacheQuant?
I just found the kv_cache fp8 quant; could you provide the related PR that supports kv_cache int8 quant? Thanks!
We have recently implemented and tested int8 KV-cache quantization and W8A8 inference in vLLM. We found that our quantization implementation can increase throughput by over 20% and reduce first-token latency under heavy load. In contrast, the W4A16 quant methods (e.g. the AWQ-based method) provided in vLLM cannot improve throughput, according to PR #1032, because they cannot benefit from int8 tensor cores. So we propose this PR as an alternative quantization method.
Updates!!!
We have made some more progress in #1112 (comment)
More Updates!!!
If you want to properly evaluate the MMLU dataset with vLLM, some modifications to the sampler must be made. The code can be found in our mmlu_eval branch.
Important message!!!
We split the PR into two parts for easier review and use. The W8A8 inference part is in #1508 and the KV cache quant part is in #1507.
What we have right now:
Int8 KV cache quantization:
a. Quant/Dequant helper functions adapted from FasterTransformer (see the sketch after this list)
b. Quantized versions of the CUDA kernels
c. Unit tests for the added kernels
W8A8 inference:
a. Int8 GEMM kernels adapted from torch-int
b. W8A8 linear layer modules
c. Support for W8A8 inference on the LLaMA model
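As referenced in item (a) above, here is a minimal sketch of what such quant/dequant helpers compute, assuming a symmetric per-tensor int8 scale; the actual kernels are CUDA code adapted from FasterTransformer and operate on the paged KV cache blocks, so the names below are only illustrative.

```python
import torch

def quant_kv(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize fp16/fp32 key or value states to int8 with a precomputed scale."""
    return torch.round(x / scale).clamp(-128, 127).to(torch.int8)

def dequant_kv(x_int8: torch.Tensor, scale: float, dtype=torch.float16) -> torch.Tensor:
    """Recover floating-point key or value states before they are used in attention."""
    return x_int8.to(dtype) * scale
```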
What we plan to do:
How to test throughput
A. How to enable W8A8 inference
0. Install CUTLASS, since we currently use the CUTLASS GEMM kernel. We now support the cuBLAS GEMM kernel as well, so you can remove the CUTLASS GEMM kernel in setup.py.
B. How to enable KV cache quant
You can also use KV cache quant and W8A8 inference together.
Experiment Results
Current test results on our datasets on an A100 80G (updated with quant & RMSNorm fusion and a GEMM D2H bug fix):
Throughput of FP16 LLaMA-13B:
Throughput of Int8 LLaMA-13B with int8 KVCacheQuant:
Throughput of Int8 LLaMA-13B with int8 KVCacheQuant, using cublas gemm kernel:
How to evaluate model performance
We add an evaluation method for quantized models; currently the MMLU dataset is supported.
You can find the details in benchmarks/benchmark_evaluation.py
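As a purely hypothetical illustration of the kind of scoring such an evaluation performs (this is not the code in benchmarks/benchmark_evaluation.py, and the function name and input format are assumptions), one can pick the answer choice with the highest model log-probability and compare it with the label:

```python
from typing import Dict, List

def mmlu_accuracy(choice_logprobs: List[Dict[str, float]], labels: List[str]) -> float:
    """choice_logprobs[i] maps 'A'/'B'/'C'/'D' to the model's log-prob for question i."""
    correct = sum(
        max(lp, key=lp.get) == label          # predicted letter vs. ground truth
        for lp, label in zip(choice_logprobs, labels)
    )
    return correct / max(len(labels), 1)
```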
Updates
We have released SmoothQuant for LLaMA in
https://github.com/AniZpZ/smoothquant/tree/llama-dev
https://github.com/AniZpZ/torch-int/tree/llama-dev
The code for generating KV-cache quantization parameters is ready; check the vllm/kv_quant folder.
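As a hedged sketch of what generating such parameters can look like (the real code is in the vllm/kv_quant folder; the function and argument names below are assumptions), one can collect key/value activations from a calibration run and derive symmetric int8 scales per layer:

```python
import torch

def kv_quant_scales(calib_keys: torch.Tensor, calib_values: torch.Tensor):
    """Compute symmetric int8 scales for one layer's key and value cache.

    calib_keys / calib_values: activations gathered from a calibration run,
    any shape; only the global absolute maximum is used here.
    """
    k_scale = calib_keys.abs().max().item() / 127.0
    v_scale = calib_values.abs().max().item() / 127.0
    return k_scale, v_scale
```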
We replaced the int8 GEMM with the cuBLAS version, and the throughput increase now comes to around 30%.