DeepseekMoE support with Fused MoE kernel #2453
Conversation
@zwd003 Thanks for fixing my last PR! But have you seen that there seems to be no speed boost after adopting expert parallelism?
@esmeetu Can you help test and review this PR?
Hi @zwd003, I refactored based on this PR. Please refer to #2467.
In my setting (8x A100 40G, tp=8, max_tokens=256, number of prompts = 256, max_batch_size = 256, average input tokens per prompt 168), this implementation is faster:

| code | enforce_eager | speed (it/s) |
|---|---|---|
| baseline (original code) | True | 1.87 |
| this (20240116) | True | 7.04 |
| this (fused_moe) | False | 10.73 |
I developed a fused MoE kernel that achieves faster speeds (different from #2293). The code is ready for review @esmeetu @zhuohan123.
@zwd003 Running on a T4 GPU with float16 is not working: RuntimeError: Internal Triton PTX codegen error:
Now it can run with fp16.
LGTM!
In my setting (TP=2, A100 40G PCIe, max_batch_size = 256, max_tokens=256, prompts=256 with average input length 166), llama2-13b costs 51s and deepseek16b-moe costs 25s. For smaller models, a smaller TP yields better acceleration compared to a dense model. See the benchmarks above for more results.
Pretty cool implementation! I do feel this PR will perform better compared to my implementation in #2293. Wondering if you have benchmarked that yet?
Hi, thank you very much for all the great work!
MoE requires more optimization to achieve the same effect as dense models (dense models are mostly dense matrix multiplication, and cuBLAS can almost reach the hardware's peak computation speed). For instance, the current mistral8x7b is also slower than a 14b dense model. Additionally, MoE models involve many more kernel computations, such as the gate (softmax) and topk. If these kernels are further fused together, there is potential for additional speed improvements.
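To make that point concrete, below is a minimal sketch of the unfused routing path being described: the gate GEMM, softmax, top-k, and renormalization each launch separate kernels today, which is what leaves room for further fusion. Shapes and the Mixtral-style renormalization are assumptions for illustration, not this PR's exact code.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states: torch.Tensor, gate_weight: torch.Tensor, top_k: int):
    # hidden_states: [num_tokens, hidden], gate_weight: [num_experts, hidden]
    router_logits = hidden_states @ gate_weight.t()                        # gate GEMM (separate kernel)
    routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)  # softmax kernel
    topk_weights, topk_ids = torch.topk(routing_weights, top_k, dim=-1)    # topk kernel
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)   # renormalization kernel
    return topk_weights, topk_ids
```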
This looks really good!
This is also applicable to mistral8x7b. You only need to modify the parameter names in Mistral (packing parameters from different experts together; you can see ...
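For reference, a rough sketch of the "packing parameters from different experts together" step for a Mixtral-style expert list (HF Mixtral w1/w2/w3 naming assumed; this mirrors the test script later in this thread and is illustrative, not the PR's loader code):

```python
import torch

def pack_expert_weights(experts):
    # experts: iterable of Mixtral-style expert MLPs with w1 (gate), w3 (up), w2 (down) projections
    ws = torch.stack([
        torch.cat([e.w1.weight.data, e.w3.weight.data], dim=0)  # [2 * intermediate, hidden]
        for e in experts
    ])                                                           # [E, 2 * intermediate, hidden]
    w2s = torch.stack([e.w2.weight.data for e in experts])       # [E, hidden, intermediate]
    return ws, w2s
```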
@zwd003 I have a question related to quantization. How can we apply this optimization to quantized models? Do we need to dequantize weights before/during running the kernel to achieve the speedup?
It needs a new kernel supporting quantized matmul (we would need to rewrite the kernel in CUDA/C++, but the main idea stays the same: compute each block for the corresponding expert).
I do believe we have Triton kernels for both GPTQ and AWQ. Perhaps you would need to create separate quantized Triton kernels based on the linked kernels for quantized matmul. If this is of interest to DeepSeek, I could implement AWQ quantization of the DeepSeek-MoE model.
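To illustrate the idea above at the PyTorch level (not an actual kernel): each weight block belongs to exactly one expert, so a quantized variant would dequantize that expert's block and run the same block-wise matmul. The int8-with-per-channel-scales scheme below is an assumption for illustration; GPTQ/AWQ layouts differ.

```python
import torch

def quant_expert_matmul_reference(x, qweight, scales, expert_ids):
    # x: [M, K] activations; qweight: [E, N, K] int8; scales: [E, N]; expert_ids: [M]
    out = x.new_zeros(x.shape[0], qweight.shape[1])
    for e in expert_ids.unique().tolist():
        rows = (expert_ids == e).nonzero(as_tuple=True)[0]
        # Dequantize only this expert's weight block, then do the block-wise matmul.
        w = qweight[e].to(x.dtype) * scales[e].to(x.dtype).unsqueeze(-1)
        out[rows] = x[rows] @ w.t()
    return out
```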
```python
    intermediate_size: int,
    hidden_act: str,
    linear_method: Optional[LinearMethodBase] = None,
    reduce_results=True,
```
It would be a helpful way to remove ambiguity if we define another `DeepseekExpertMLP` without reducing results; then we can remove this `reduce_results` parameter. If we keep this, adding a type annotation for `reduce_results` looks nicer. Both choices are OK.
OK, you're right, I have fixed it.
Even with the current code, I found that it is indeed 1.6 times faster than the dense 7b model. The performance bottleneck in the test results from the table is not in the model's computation. Please see the new test results.
That's so great to hear, thanks!
Unfortunately, there seem to be some correctness issues with the kernel (this was prompted by the user feedback in #2542 (comment)). It happens in all kinds of configurations in smaller ways but can be reproduced in a pretty major way with the following diff:

```diff
diff --git a/tests/kernels/test_fused_moe.py b/tests/kernels/test_fused_moe.py
index f68e84f4f9..1b8e6321e6 100644
--- a/tests/kernels/test_fused_moe.py
+++ b/tests/kernels/test_fused_moe.py
@@ -22,11 +22,11 @@ def torch_moe(a, w1, w2, topk_weight, topk_ids):
             topk_weight.view(B, -1, 1)).sum(dim=1)
-@pytest.mark.parametrize("m", [512, 222, 33, 1])
-@pytest.mark.parametrize("n", [2048, 256, 1024])
-@pytest.mark.parametrize("k", [128, 511, 1024])
-@pytest.mark.parametrize("e", [8, 64])
-@pytest.mark.parametrize("topk", [2, 6])
+@pytest.mark.parametrize("m", [1])
+@pytest.mark.parametrize("n", [2048, 256, 1024, 8192])
+@pytest.mark.parametrize("k", [4096])
+@pytest.mark.parametrize("e", [8])
+@pytest.mark.parametrize("topk", [2])
 @pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
 def test_fused_moe(
     m: int,
@@ -46,4 +46,4 @@ def test_fused_moe(
     triton_output = fused_moe(a, w1, w2, topk_weight, topk_ids, False)
     torch_output = torch_moe(a, w1, w2, topk_weight, topk_ids)
-    assert torch.allclose(triton_output, torch_output, atol=1e-2, rtol=0)
+    assert torch.allclose(triton_output, torch_output, atol=1e-3, rtol=0), torch.max(abs(triton_output - torch_output))
```

This gives an error of ...
This issue might be due to numerical precision; after switching to fp32, the difference is very small.

```python
# assert hidden_states.dtype in [torch.float16, torch.bfloat16]  # disable type assert
accumulator += tl.dot(a, b, allow_tf32=False)  # disable tf32 in the kernel
# accumulator = accumulator.to(compute_type)  # disable conversion to bf16 or fp16
```

`test_fused_moe(1, 8192, 4096, 8, 2, torch.float32)` outputs:
@zwd003 You are right, thanks a lot for checking. I only set the dtype to float32 OR used the ...
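For illustration of why `allow_tf32` matters here (this snippet is not from the PR): with TF32 enabled, fp32 matmuls on Ampere-class GPUs round the mantissa to 10 bits, so comparing against a full-precision reference shows differences well above the 1e-3 tolerance when K is on the order of 4096.

```python
import torch

# On a pre-Ampere GPU both paths are plain fp32 and the difference is ~0.
a = torch.randn(1, 4096, device="cuda")
b = torch.randn(4096, 8192, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True
out_tf32 = a @ b          # may use TF32 tensor cores

torch.backends.cuda.matmul.allow_tf32 = False
out_fp32 = a @ b          # full fp32 accumulation

# The max absolute difference is typically far larger than 1e-3 at this K.
print((out_tf32 - out_fp32).abs().max())
```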
```cpp
size_t numel) {
const size_t tokens_per_thread = CEILDIV(numel, blockDim.x);
const size_t start_idx = threadIdx.x * tokens_per_thread;
__shared__ int32_t tokens_cnts[NUM_MAX_EXPERTS + 1][NUM_MAX_EXPERTS];
```
nit: Actually, this part doesn't need to be static. The shared memory size can be configured dynamically at kernel launch time. However, I think we can fix this in a later PR.
When will this merge be pushed to the latest openai docker image? Thank you so much for the work!
In my tests, https://github.com/casper-hansen/AutoAWQ/blob/mixtral_fused/tests/test_fused_moe.py ...
Thanks @casper-hansen, I'm digging into this some more and I'm also planning to add a test to the repo about it :) In the test you posted (which is very nice btw), I see that all the ...
I figured out what is going on now, I think. There were two adaptations I needed to make so your script can be adapted to the Mixtral MoE: Load the gate, and also set ...

```python
import torch
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)
import time
from vllm.model_executor.models.mixtral import MixtralMoE
from transformers.models.mixtral.configuration_mixtral import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
config = MixtralConfig()
block = MixtralSparseMoeBlock(config).float().to("cuda")
fused = MixtralMoE(
num_experts=config.num_local_experts,
top_k=config.num_experts_per_tok,
hidden_size=config.hidden_size,
intermediate_size=config.intermediate_size,
)
fused.gate.linear_weights["weight"][:] = block.gate.weight.data
for i in range(config.num_local_experts):
fused.ws[i][:] = torch.cat((
block.experts[i].w1.weight.data,
block.experts[i].w3.weight.data,
), dim=0).to("cuda")
fused.w2s[i][:] = block.experts[i].w2.weight.data
def _run_profile(fn, inputs, rounds=2):
start_time = time.perf_counter()
torch.cuda.synchronize()
for _ in range(rounds):
states, router_logits = fn(inputs)
torch.cuda.synchronize()
end_time = time.perf_counter()
return (end_time - start_time) / rounds, states, router_logits
# [batch_size, seq_len, hidden_dim]
inputs = torch.randn((1, 64, config.hidden_size)).to("cuda")
block_time, states_block, router_block = _run_profile(block.forward, inputs)
fused_time, states_fused, router_fused = _run_profile(fused.forward, inputs)
print(block_time, fused_time, block_time / fused_time)
print("states_fused", states_fused)
print("states_block", states_block)
print("diff1", (states_fused - states_block).mean().abs())
print("diff2", (states_fused - states_block).abs().max()) And this is the diff to the repo (mostly to make sure the MoE layer can run in the same process and also make sure it doesn't use lower precision tensor core arithmetic): diff --git a/vllm/model_executor/layers/fused_moe.py b/vllm/model_executor/layers/fused_moe.py
index 998062d82d..99d8de7ccb 100644
--- a/vllm/model_executor/layers/fused_moe.py
+++ b/vllm/model_executor/layers/fused_moe.py
@@ -105,7 +105,7 @@ def fused_moe_kernel(
mask=offs_k[:, None] < K - k * BLOCK_SIZE_K,
other=0.0)
# We accumulate along the K dimension.
- accumulator += tl.dot(a, b)
+ accumulator += tl.dot(a, b, allow_tf32=False)
# Advance the ptrs to the next K block.
a_ptrs += BLOCK_SIZE_K * stride_ak
b_ptrs += BLOCK_SIZE_K * stride_bk
@@ -235,7 +235,7 @@ def fused_moe(hidden_states: torch.Tensor,
assert hidden_states.is_contiguous(), "Hidden_states must be contiguous"
assert w1.is_contiguous(), "Expert weights1 must be contiguous"
assert w2.is_contiguous(), "Expert weights2 must be contiguous"
- assert hidden_states.dtype in [torch.float16, torch.bfloat16]
+ assert hidden_states.dtype in [torch.float16, torch.bfloat16, torch.float32]
M, _ = hidden_states.shape
E, N, _ = w1.shape
diff --git a/vllm/model_executor/models/mixtral.py b/vllm/model_executor/models/mixtral.py
index f36c35fd27..480de2b8bf 100644
--- a/vllm/model_executor/models/mixtral.py
+++ b/vllm/model_executor/models/mixtral.py
@@ -72,7 +72,7 @@ class MixtralMoE(nn.Module):
params_dtype: Optional[torch.dtype] = None,
):
super().__init__()
- tp_size = get_tensor_model_parallel_world_size()
+ tp_size = 1 # get_tensor_model_parallel_world_size()
self.num_total_experts = num_experts
self.top_k = top_k
self.hidden_size = hidden_size
@@ -93,13 +93,15 @@ class MixtralMoE(nn.Module):
2 * self.intermediate_size,
self.hidden_size,
device="cuda",
- dtype=self.params_dtype))
+ dtype=self.params_dtype),
+ requires_grad=False)
self.w2s = nn.Parameter(
torch.empty(self.num_total_experts,
self.hidden_size,
self.intermediate_size,
device="cuda",
- dtype=self.params_dtype))
+ dtype=self.params_dtype),
+ requires_grad=False)
set_weight_attrs(self.ws, {
"weight_loader": self.weight_loader,
@@ -139,13 +141,13 @@ class MixtralMoE(nn.Module):
self.w2s,
routing_weights,
selected_experts,
- inplace=True)
+ inplace=False)
- final_hidden_states = tensor_model_parallel_all_reduce(
- final_hidden_states)
+ # final_hidden_states = tensor_model_parallel_all_reduce(
+ # final_hidden_states)
return final_hidden_states.view(batch_size, sequence_length,
- hidden_size)
+ hidden_size), None
class MixtralAttention(nn.Module):
@@ -160,7 +162,7 @@ class MixtralAttention(nn.Module):
sliding_window: Optional[int] = None) -> None:
super().__init__()
self.hidden_size = hidden_size
- tp_size = get_tensor_model_parallel_world_size()
+ tp_size = 1 # get_tensor_model_parallel_world_size()
self.total_num_heads = num_heads
assert self.total_num_heads % tp_size == 0
         self.num_heads = self.total_num_heads // tp_size
```

With those modifications, the results are very accurate even if I set the number of rounds to a high value; for example, for rounds = 100, I'm getting ...
I'll convert this to a test that can be committed into the repo next! Thanks for looking into this, I'm similarly interested in making sure the model quality is as high as possible :)
@casper-hansen Unit test added in #2677.
Co-authored-by: roy <[email protected]>
I'm wondering if this commit really applies expert parallelism?
Adding support for DeepseekMoE as described here.
This work was partly done by @esmeetu and DeepSeek-AI.
We have fixed some bugs in @esmeetu's code and added support for expert parallelism and a fused MoE kernel.
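For readers skimming the thread, the sketch below is an unfused PyTorch reference of what `fused_moe(a, w1, w2, topk_weight, topk_ids)` computes, looping over experts instead of using the Triton kernel; it is not the test code below. It assumes the SwiGLU layout used in this PR, where `w1` packs the gate and up projections.

```python
import torch
import torch.nn.functional as F

def unfused_moe(a, w1, w2, topk_weight, topk_ids):
    # a: [M, K] tokens; w1: [E, 2N, K] packed gate/up proj; w2: [E, K, N] down proj
    # topk_weight, topk_ids: [M, top_k] routing weights and expert indices
    out = torch.zeros_like(a)
    for e in range(w1.shape[0]):
        token_idx, slot_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        h = a[token_idx] @ w1[e].t()             # [m_e, 2N]
        gate, up = h.chunk(2, dim=-1)
        h = (F.silu(gate) * up) @ w2[e].t()      # SwiGLU, then down projection -> [m_e, K]
        out[token_idx] += topk_weight[token_idx, slot_idx].unsqueeze(-1).to(h.dtype) * h
    return out
```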
Test code:
Output:
Update 2024.01.19
Performance Benchmarking for Fused MoE Improvements
This PR introduces significant performance enhancements when compared to the current method in Mistral and other baseline methods. Below is a summary of the benchmarks conducted:
Benchmark Details:
Results Summary:
*Updated on 2024-01-20: Added comparison with PR #2293.
Updated on 2024.01.21: Implemented the `align_block_size` function in C++ to achieve a 10% performance improvement. Now deepseek-moe-16b can achieve speeds almost identical to the 7b dense model.
Updated on 2024.01.22: I found that it has greatly exceeded the speed of llama2-7b, with deepseekmoe-16b at 16587.82 tokens/s and llama2-7b at 10978.67 tokens/s. The speed bottleneck in the results from the table above (8s vs 9s) is not due to the model's computation speed. The following code was used to test this:
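The original test snippet is not reproduced here; a rough sketch of that kind of offline throughput measurement with vLLM's `LLM` API (model name, prompts, and sampling settings below are placeholders, not the exact settings used) could look like this:

```python
import time
from vllm import LLM, SamplingParams

# Placeholder settings; the actual benchmark above used its own prompts and config.
llm = LLM(model="deepseek-ai/deepseek-moe-16b-base", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Hello, my name is"] * 256

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens / elapsed:.2f} generated tokens/s")
```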
Future Works
I believe it is possible to achieve higher performance and surpass the speed of the 7b-dense model; we might need to do the following things:
This work may be done by the community in the future (in another PR).