[Model] Introduce CUDA Graph support for DeepSeek v3 #12204
Conversation
Signed-off-by: Lu Fang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: …
🚀

wow that's amazing!

@houseroad Before Woosuk takes a look at this PR, …
LGTM, and thanks for the fix! (I'm accepting but don't have a system to run DeepSeek v3 so can't verify the fix -- changes look good anyway)
```diff
- if (num_experts >= 256) {
+ if (!use_shared_memory) {
```
Why was this change needed for the fix BTW?
```diff
-  const int32_t mem_tokens_cnts =
-      ((num_experts + 1) * num_experts) * sizeof(int32_t);
-  const int32_t mem_cumsum = (num_experts + 1) * sizeof(int32_t);
-  // allocate global memory
-  int32_t* tokens_cnts;
-  int32_t* cumsum;
-  cudaMalloc(&tokens_cnts, mem_tokens_cnts);
-  cudaMalloc(&cumsum, mem_cumsum);
+  torch::Tensor token_cnts =
+      torch::empty({(num_experts + 1) * num_experts},
+                   torch::TensorOptions()
+                       .dtype(torch::kInt)
+                       .device(topk_ids.device()));
+  torch::Tensor cumsum =
+      torch::empty({num_experts + 1}, torch::TensorOptions()
+                                          .dtype(torch::kInt)
+                                          .device(topk_ids.device()));
```
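For readers skimming the hunk: the fix swaps raw cudaMalloc calls for workspace tensors from PyTorch's caching allocator. Below is a minimal, self-contained sketch of that pattern; the helper name `alloc_moe_workspace` is hypothetical, not from this PR:

```cpp
#include <torch/torch.h>
#include <utility>

// Hypothetical helper illustrating the pattern in the hunk above: workspace
// buffers come from the caching allocator (torch::empty) on the same device
// as topk_ids, instead of from raw cudaMalloc.
std::pair<torch::Tensor, torch::Tensor> alloc_moe_workspace(
    const torch::Tensor& topk_ids, int64_t num_experts) {
  auto opts =
      torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device());
  // (num_experts + 1) * num_experts per-expert token counters, plus a
  // cumulative-sum buffer of size num_experts + 1, matching the hunk.
  torch::Tensor token_cnts =
      torch::empty({(num_experts + 1) * num_experts}, opts);
  torch::Tensor cumsum = torch::empty({num_experts + 1}, opts);
  return {token_cnts, cumsum};
}
```

The kernel then receives `token_cnts.data_ptr<int32_t>()` and `cumsum.data_ptr<int32_t>()`, and the tensors' lifetimes keep the memory valid for the duration of the launch, with no explicit cudaFree needed.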
Makes sense to me: during CUDA graph capture, some actions, such as cudaMalloc, may be unsafe.
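To make that constraint concrete, here is a small standalone CUDA sketch (my own illustration, not code from this PR). Allocation happens before cudaStreamBeginCapture, because a cudaMalloc issued while a stream is capturing in global mode is not a stream-ordered, capturable operation and invalidates the capture:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int32_t* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

int main() {
  const int n = 1024;
  cudaStream_t stream;
  cudaStreamCreate(&stream);  // capture requires a non-default stream

  // Safe: allocate *before* capture begins; the captured kernel only uses
  // the pointer, so the allocation itself is not part of the graph.
  int32_t* buf;
  cudaMalloc(&buf, n * sizeof(int32_t));

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

  // Unsafe (left commented out): a cudaMalloc here would invalidate the
  // capture, since it is not a capturable stream operation.
  // int32_t* bad; cudaMalloc(&bad, n * sizeof(int32_t));

  fill<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);

  if (cudaStreamEndCapture(stream, &graph) != cudaSuccess) {
    printf("capture failed\n");
    return 1;
  }

  cudaGraphExec_t exec;
  cudaGraphInstantiateWithFlags(&exec, graph, 0);
  cudaGraphLaunch(exec, stream);  // replay the captured work
  cudaStreamSynchronize(stream);
  printf("graph replayed OK\n");

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return 0;
}
```

As I understand it, torch::empty sidesteps the problem because vLLM drives capture through torch.cuda.graph, and PyTorch's caching allocator then serves capture-time requests from a graph-private memory pool rather than issuing a fresh cudaMalloc.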
Nice doc pointer, thanks
LGTM, thank you!
It is recommended to merge main, and use …

I actually think this PR #12222 has a better implementation of this optimization; could you please help review, @houseroad?
Considering this PR instead: #12222
Kudos to @jianyuh, who introduced CUDA Graph support for DeepSeek v3. Overall throughput almost doubled in testing.