[Kernel] Support running GPTQ 8-bit models in Marlin #4533
Conversation
tests/conftest.py
@@ -273,7 +273,9 @@ def generate_greedy_logprobs(
        return all_logprobs

    def __del__(self):
        del self.model
        if self.model is not None:
why needed? @alexm-nm
I got occasional warnings when self.model was already None during del self.model. It's not needed for correctness, so it can be dropped if it causes issues in CI.
I don't think we should touch this file.
removed
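For context, a minimal sketch of the guarded cleanup that was being discussed (the class and attribute names are illustrative, not the actual conftest code):

```python
class RunnerSketch:
    """Illustrative stand-in for the test runner discussed above."""

    def __init__(self, model):
        self.model = model

    def __del__(self):
        # Guard against the attribute already being None or missing, which is
        # what produced the occasional warnings mentioned in the thread.
        if getattr(self, "model", None) is not None:
            del self.model
```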
@@ -114,11 +115,20 @@ template <int lut> __device__ inline int lop3(int a, int b, int c) {
  return res;
}

template <int start_byte, int mask>
It wouldn't hurt to add a comment on what this does :)
@alexm-nm
Good idea, added
  FragC frag_c[thread_m_blocks][4][2];
  FragS frag_s[2][4];          // No act-order
  FragS act_frag_s[2][4][4];   // For act-order

  // if (blockIdx.x == 0 && threadIdx.x == 0) {
This looks like it is left over from debugging -- remove it before merging the PR?
@alexm-nm
Removed
@@ -62,7 +65,14 @@ def _get_perms():
    return perm, scale_perm, scale_perm_single


_perm, _scale_perm, _scale_perm_single = _get_perms()
_perm = {}
I wonder why you need these global variables -- if you want to avoid recomputing things, it is most likely cleaner / better to use a function with a @functools.lru_cache annotation :)
I just realized that we don't need most of this code anymore since we use the auto-repack GPU kernel. Refactored to use only scale shuffles.
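For reference, a minimal sketch of the reviewer's suggestion (the function name and the permutation values are placeholders, not the actual marlin utility code):

```python
import functools


@functools.lru_cache(maxsize=None)
def get_scale_perms():
    # Compute the scale permutation tables once and cache the result,
    # instead of keeping them in module-level globals.
    scale_perm = [i + 8 * j for i in range(8) for j in range(8)]          # placeholder values
    scale_perm_single = [2 * i + j for i in range(4) for j in range(4)]   # placeholder values
    return scale_perm, scale_perm_single
```

The cached result is computed on the first call and reused afterwards, giving the same "compute once" behavior as the globals without the module-import side effect.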
    elif num_bits == 8:
        interleave = numpy.array([0, 2, 1, 3])
    else:
        raise Exception("num_bits must be 4 or 8, got {}".format(num_bits))
ValueError? Raising generic exceptions is not a good idea :)
Removed
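A minimal sketch of the suggested fix (only the 8-bit branch comes from the snippet above; the 4-bit interleave values here are illustrative):

```python
import numpy


def get_interleave(num_bits: int) -> numpy.ndarray:
    # Raise a specific exception type instead of the generic Exception.
    if num_bits == 4:
        return numpy.array([0, 2, 4, 6, 1, 3, 5, 7])  # illustrative values
    elif num_bits == 8:
        return numpy.array([0, 2, 1, 3])
    else:
        raise ValueError(f"num_bits must be 4 or 8, got {num_bits}")
```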
  TORCH_CHECK(max_shared_mem / 2 > scales_cache_size);  // Sanity

  // printf("tb_k = %d, tb_n = %d, pipe_size = %f, scales_cache_size = %d,
ditto, remove debug code
Good catch, removed
  int scales_cache_size =
      get_scales_cache_size(th_config, prob_m, prob_n, prob_k, num_bits,
                            group_size, has_act_order, is_k_full);
  // printf("scales_cache_size = %d\n", scales_cache_size);
same here and below
Ditto
  int num_threads = th_config.num_threads;
  thread_k = th_config.thread_k;
  thread_n = th_config.thread_n;
  // printf("exec_config: max_m_blocks = %d, thread_k = %d, thread_n = %d\n",
debug code
Removed
Here are vLLM server TTFT and TPOT benchmark results for 8-bit GPTQ Llama-3-8B and Yi-34B on an A100 GPU on a GCP instance (based on benchmark_serving.py). Original PDFs:
Thanks for fixing the comments, this looks much nicer :)
tests/models/test_gptq_marlin.py
@@ -63,10 +70,11 @@ def test_models(
    gptq_marlin_model = vllm_runner(model_name=model_name,
                                    revision=revision,
                                    dtype=dtype,
                                    quantization="marlin",
                                    quantization="gptq",
@alexm-nm this test should have marlin for quantization. Also, is enforce_eager=True required?
Oh, must be a leftover from debugging. Good catch, will fix it in 30 min.
Fixed, tests pass
tests/models/test_gptq_marlin.py
                                    max_model_len=MAX_MODEL_LEN,
                                    tensor_parallel_size=1,
                                    disable_custom_all_reduce=True)
                                    disable_custom_all_reduce=True,
To make this cleaner, can we remove disable_custom_all_reduce?
Will try
works
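Putting the two threads above together, the cleaned-up test instantiation ends up looking roughly like this (a fragment from the test body, sketched only; parameter names follow the diff snippets above and the final test file may differ):

```python
# quantization switched back to "marlin"; the enforce_eager and
# disable_custom_all_reduce flags discussed above are dropped.
gptq_marlin_model = vllm_runner(model_name=model_name,
                                revision=revision,
                                dtype=dtype,
                                quantization="marlin",
                                max_model_len=MAX_MODEL_LEN,
                                tensor_parallel_size=1)
```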
@alexm-nm the model test failed
This PR adds 8-bit weight support to the Marlin GPU kernel and the Marlin on-the-fly repack. As a result, all GPTQ 8-bit models (with any group size or act_order) can run via Marlin (for Ampere GPUs and up).
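As a usage sketch of what this enables (the model path and sampling settings below are illustrative placeholders, not something benchmarked in this PR):

```python
from vllm import LLM, SamplingParams

# With this change, an 8-bit GPTQ checkpoint is detected and executed through
# the Marlin kernel on Ampere (or newer) GPUs; no extra quantization flag
# should be needed.
llm = LLM(model="your-org/llama-3-8b-gptq-8bit")  # placeholder: any 8-bit GPTQ checkpoint
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```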