Support int8 KVCacheQuant and W8A8 inference in vllm #1112
Conversation
This is interesting work! I was going to implement int8 in AutoAWQ over time, as the authors of SmoothQuant (this PR) and AWQ are the same. My best guesstimate is that ...
I fully support this, since the 4-bit AWQ model proved to have inferior quality for my use cases. Having 8-bit weights with an 8-bit activation cache would be the best of both worlds, allowing almost no loss of quality (perplexity) while running inference more efficiently. I would also keep a W8A16 mode as an option, should the precision of the activations and the KV cache make a difference in specific use cases.
Hi vLLM geniuses @WoosukKwon @zhuohan123. This is the latest development from our team regarding quantization support for vLLM; we had done something similar to #1032 before. At that time, we didn't open a PR after the benchmark results showed a drop in throughput, but later we found out that #1032 was merged, which is very encouraging. Therefore, we continued to optimize performance on this basis and are sending out this PR in a WIP state in advance, hoping to get some comments and suggestions and finally merge it into the vLLM codebase smoothly. Cheers!
We implement quantization with the SmoothQuant method for W8A8; I will release the code later. The perplexity is identical to the standard SmoothQuant method if you do W8A8 inference without int8 KVCacheQuant. Quantization details are discussed in this paper (Xiao et al.).
SmoothQuant only supports OPT models. How can we test this PR when the SmoothQuant repository does not support LLaMA models? If you implement this PR without the quantization code, you will inevitably end up with bad perplexity if you naively use W8A8, as you have no calibration dataset. See this image: accuracy ends up being worse than INT4 if you naively convert weights to W8A8. You need the SmoothQuant or AWQ method to convert if you want to preserve accuracy. You need a framework for this, which is why I created AutoAWQ. I will look to implement INT8 quantization using the torch-int modules and would love your help with this so we can support all models in vLLM (LLaMA, MPT, Falcon, etc.) without accuracy degradation.
We implemented SmoothQuant for LLaMA ourselves; you can find the code here: https://github.com/AniZpZ/smoothquant/tree/llama-dev and easily quantize and export the model with export_int8_llama.py.
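For readers unfamiliar with the method, here is a minimal sketch of the SmoothQuant smoothing step, assuming PyTorch tensors and a per-channel activation absmax gathered from a calibration run; the function name and inputs are illustrative assumptions, and the actual implementation lives in the llama-dev branches linked above.

```python
import torch

@torch.no_grad()
def smooth_linear(weight: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """Migrate activation outliers into the weights (SmoothQuant-style).

    weight:     [out_features, in_features]
    act_absmax: [in_features], per-channel max |activation| from calibration
    Returns the per-channel scales and the smoothed weight.
    """
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)              # [in_features]
    scales = (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    # Activations are divided by `scales` (typically folded into the preceding
    # RMSNorm weights), while the weight input channels are multiplied by `scales`.
    smoothed_weight = weight * scales                                 # broadcast over rows
    return scales, smoothed_weight
```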
Hi @AniZpZ @zhyncs, thank you for your great work with this PR. I have now had more time to explore your fast implementation and found that NVIDIA only has high-throughput support for INT8, which lets this PR achieve higher throughput than INT4 due to software capabilities. Is your proposal to run W8A16? Your code does not appear to have A8 implemented; SmoothQuant implements W8A8, but it seems silly to run A8 as there should be little speed benefit. Therefore, I see this as a natural choice. I want to confirm this with you for my implementation in AutoAWQ, as I want to push INT8 models out using your initial LLaMA implementation, just using the AWQ method for minimum perplexity loss.
Our proposal is to run in W8A8. If you enable SmoothQuant, we replace RMSNorm and the linear layers with our custom int8 RMSNorm and W8A8 linears, which quantize activations and implement int8 GEMM. You can find the details in w8a8linear.py.
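To make the replacement concrete, here is a rough sketch of what such a W8A8 linear does, assuming symmetric per-tensor int8 quantization; this is not the actual w8a8linear.py, and the int8 GEMM (CUTLASS/cuBLAS in the PR) is emulated with a plain int32 matmul for clarity.

```python
import torch

class W8A8LinearSketch(torch.nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        # Offline weight quantization with a single symmetric scale.
        w_scale = weight.abs().max() / 127.0
        self.register_buffer("w_int8", torch.round(weight / w_scale).clamp(-128, 127).to(torch.int8))
        self.register_buffer("w_scale", w_scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quantize activations to int8 on the fly.
        x_scale = x.abs().max() / 127.0
        x_int8 = torch.round(x / x_scale).clamp(-128, 127).to(torch.int8)
        # Int8 GEMM accumulating in int32 (stand-in for the custom CUDA kernel).
        acc = x_int8.to(torch.int32) @ self.w_int8.t().to(torch.int32)
        # Dequantize the int32 accumulator back to floating point.
        return acc.to(x.dtype) * (x_scale * self.w_scale)
```

The real kernel fuses the activation quantization and runs the matmul on int8 tensor cores, which is where the throughput gain comes from.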
__float22bfloat162_rn is a CUDA function; could you please check your CUDA version? The code has been tested with CUDA 11.8.
@AniZpZ
Could you please check your environment variable
It looks like a problem occurred because the GPU arch does not match the requirement. Please check your GPU arch.
@AniZpZ @HandH1998
Which branch of vLLM does KV cache int8 run on? My command is: python -m vllm.entrypoints.api_server --model=/home/models/quant/smoothquant/Codellama-13b-int8-02/CodeLlama-13b-hf-smoothquant/ --tokenizer /home/models/CodeLlama-13b-hf/ --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/home/models/quant/kv-cache-turbomind/
@HandH1998
Method 1 is weight-only quantization. Please use our new branch (#1508) to test W8A8 inference.
This branch mixes KV cache quant with W8A8 model quant. Please try KV cache quantization with our new branch (#1507).
With #1507, this issue also occurs. I think that if KV cache quant is not enabled, it should not be worse.
@AniZpZ
My calibration command: Additional info:
@warlock135 Hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get keys and values.
Any hope for 8-bit quantization to be merged any time soon? 4-bit models don't provide great results for complicated cases, unfortunately.
We hope this feature can be merged soon, too. But it doesn't seem to be getting much attention from the main reviewers. If possible, we hope you can raise an issue. Thanks.
@HandH1998 Can the latest transformers version currently be used for quantization with the latest AutoSmoothQuant (https://github.com/AniZpZ/AutoSmoothQuant.git)?
@HandH1998 When I used transformers 4.36.2 to perform quantization in the AutoSmoothQuant project, the following error message appeared:
You need to use transformers==4.34.0.
What do you think about this KV cache quant technique? https://github.com/jy-yuan/KIVI
Thank you. We will look into the technique.
Hi @AniZpZ, is there any progress on the lower-bit quantization?
@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vLLM kv-quant-merge branch?
We are working on lower-bit quantization.
The environment requirements should be similar to the original vLLM.
@AniZpZ When I build smoothquant from source, it requires CUDA 11.8, while building the kv-quant-merge branch of vllm from source requires CUDA 12.0. Is that expected? Can you kindly provide a list of CUDA/torch/transformers/ray dependencies for using smoothquant with your vllm branch? Thanks in advance.
Closing as we do have W8A8 and Int8 support nowadays. 🙏
Hi, is there any plan to support int4 KVCacheQuant?
I just found the kv_cache fp8 quant; could you provide the related PR that supports kv_cache int8 quant? Thanks!
We have recently implemented and tested int8 KV-cache quantization and W8A8 inference in vLLM. We found that our quantization implementation can increase throughput by over 20% and reduce first-token latency under heavy load. In contrast, the W4A16 quant methods (e.g. the AWQ-based method) provided in vLLM cannot improve throughput, according to PR #1032, because they cannot benefit from int8 tensor cores. So we propose this PR as an alternative quantization method.
Updates!!!
We have made some more progress in #1112 (comment)
More Updates!!!
If you want to properly evaluate the MMLU dataset with vLLM, some modifications to the sampler must be made. The code can be found in our mmlu_eval branch.
Important message!!!
We split the PR into two parts for easier review and use. The W8A8 inference part is in #1508 and the KV cache quant part is in #1507.
What we have right now:
Int8 KV cache quantization:
a. Quant/Dequant helper functions adapted from FasterTransformer (see the sketch after this list)
b. Quantized versions of the CUDA kernels
c. Unit tests for the added kernels
W8A8 inference:
a. Int8 GEMM kernels adapted from torch-int
b. W8A8 linear layer modules
c. Support for W8A8 inference on the LLaMA model
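As referenced in item (a) above, here is a minimal sketch of what such quant/dequant helpers compute, assuming a symmetric per-tensor int8 scale; the actual kernels are CUDA code adapted from FasterTransformer and operate on the paged KV cache blocks, so the names below are only illustrative.

```python
import torch

def quant_kv(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize fp16/fp32 key or value states to int8 with a precomputed scale."""
    return torch.round(x / scale).clamp(-128, 127).to(torch.int8)

def dequant_kv(x_int8: torch.Tensor, scale: float, dtype=torch.float16) -> torch.Tensor:
    """Recover floating-point key or value states before they are used in attention."""
    return x_int8.to(dtype) * scale
```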
What we plan to do:
How to test throughput
A. How to enable W8A8 inference
0. Install CUTLASS, since we currently use the CUTLASS GEMM kernel. We now support the cuBLAS GEMM kernel as well, so you can remove the CUTLASS GEMM kernel in setup.py.
B. How to enable KV cache quant
You can also use KV cache quant and W8A8 inference together.
Experiment Results
Current test results on our datasets on an A100 80G (updated with quant & RMSNorm fusion and a GEMM D2H bug fix):
Throughput of FP16 LLaMA-13B:
Throughput of Int8 LLaMA-13B with int8 KVCacheQuant:
Throughput of Int8 LLaMA-13B with int8 KVCacheQuant, using cublas gemm kernel:
How to evaluate model performance
We add an evaluation method for quantized models; currently the MMLU dataset is supported.
You can find the details in benchmarks/benchmark_evaluation.py
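As a purely hypothetical illustration of the kind of scoring such an evaluation performs (this is not the code in benchmarks/benchmark_evaluation.py, and the function name and input format are assumptions), one can pick the answer choice with the highest model log-probability and compare it with the label:

```python
from typing import Dict, List

def mmlu_accuracy(choice_logprobs: List[Dict[str, float]], labels: List[str]) -> float:
    """choice_logprobs[i] maps 'A'/'B'/'C'/'D' to the model's log-prob for question i."""
    correct = sum(
        max(lp, key=lp.get) == label          # predicted letter vs. ground truth
        for lp, label in zip(choice_logprobs, labels)
    )
    return correct / max(len(labels), 1)
```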
Updates
We have released SmoothQuant for LLaMA in
https://github.com/AniZpZ/smoothquant/tree/llama-dev
https://github.com/AniZpZ/torch-int/tree/llama-dev
The code for generating KV-cache quantization parameters is ready; check the vllm/kv_quant folder.
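As a hedged sketch of what generating such parameters can look like (the real code is in the vllm/kv_quant folder; the function and argument names below are assumptions), one can collect key/value activations from a calibration run and derive symmetric int8 scales per layer:

```python
import torch

def kv_quant_scales(calib_keys: torch.Tensor, calib_values: torch.Tensor):
    """Compute symmetric int8 scales for one layer's key and value cache.

    calib_keys / calib_values: activations gathered from a calibration run,
    any shape; only the global absolute maximum is used here.
    """
    k_scale = calib_keys.abs().max().item() / 127.0
    v_scale = calib_values.abs().max().item() / 127.0
    return k_scale, v_scale
```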
We replaced the int8 GEMM with the cuBLAS version, and the throughput increase now comes to around 30%.