[Model] Introduce CUDA Graph support for DeepSeek v3 #12204
Conversation
Signed-off-by: Lu Fang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: …
🚀

wow that's amazing!

@houseroad Before Woosuk takes a look at this PR, …
LGTM, and thanks for the fix! (I'm accepting but don't have a system to run DeepSeek v3 so can't verify the fix -- changes look good anyway)
```diff
- if (num_experts >= 256) {
+ if (!use_shared_memory) {
```
Why was this change needed for the fix BTW?
```diff
-  const int32_t mem_tokens_cnts =
-      ((num_experts + 1) * num_experts) * sizeof(int32_t);
-  const int32_t mem_cumsum = (num_experts + 1) * sizeof(int32_t);
-  // allocate global memory
-  int32_t* tokens_cnts;
-  int32_t* cumsum;
-  cudaMalloc(&tokens_cnts, mem_tokens_cnts);
-  cudaMalloc(&cumsum, mem_cumsum);
+  torch::Tensor token_cnts =
+      torch::empty({(num_experts + 1) * num_experts},
+                   torch::TensorOptions()
+                       .dtype(torch::kInt)
+                       .device(topk_ids.device()));
+  torch::Tensor cumsum =
+      torch::empty({num_experts + 1}, torch::TensorOptions()
+                                          .dtype(torch::kInt)
+                                          .device(topk_ids.device()));
```
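For readers skimming the hunk: the fix swaps raw cudaMalloc calls for workspace tensors from PyTorch's caching allocator. Below is a minimal, self-contained sketch of that pattern; the helper name `alloc_moe_workspace` is hypothetical, not from this PR:

```cpp
#include <torch/torch.h>
#include <utility>

// Hypothetical helper illustrating the pattern in the hunk above: workspace
// buffers come from the caching allocator (torch::empty) on the same device
// as topk_ids, instead of from raw cudaMalloc.
std::pair<torch::Tensor, torch::Tensor> alloc_moe_workspace(
    const torch::Tensor& topk_ids, int64_t num_experts) {
  auto opts =
      torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device());
  // (num_experts + 1) * num_experts per-expert token counters, plus a
  // cumulative-sum buffer of size num_experts + 1, matching the hunk.
  torch::Tensor token_cnts =
      torch::empty({(num_experts + 1) * num_experts}, opts);
  torch::Tensor cumsum = torch::empty({num_experts + 1}, opts);
  return {token_cnts, cumsum};
}
```

The kernel then receives `token_cnts.data_ptr<int32_t>()` and `cumsum.data_ptr<int32_t>()`, and the tensors' lifetimes keep the memory valid for the duration of the launch, with no explicit cudaFree needed.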
Makes sense to me: during CUDA graph capture, some actions, such as cudaMalloc, may be unsafe.
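To make that constraint concrete, here is a small standalone CUDA sketch (my own illustration, not code from this PR). Allocation happens before cudaStreamBeginCapture, because a cudaMalloc issued while a stream is capturing in global mode is not a stream-ordered, capturable operation and invalidates the capture:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int32_t* buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = i;
}

int main() {
  const int n = 1024;
  cudaStream_t stream;
  cudaStreamCreate(&stream);  // capture requires a non-default stream

  // Safe: allocate *before* capture begins; the captured kernel only uses
  // the pointer, so the allocation itself is not part of the graph.
  int32_t* buf;
  cudaMalloc(&buf, n * sizeof(int32_t));

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

  // Unsafe (left commented out): a cudaMalloc here would invalidate the
  // capture, since it is not a capturable stream operation.
  // int32_t* bad; cudaMalloc(&bad, n * sizeof(int32_t));

  fill<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);

  if (cudaStreamEndCapture(stream, &graph) != cudaSuccess) {
    printf("capture failed\n");
    return 1;
  }

  cudaGraphExec_t exec;
  cudaGraphInstantiateWithFlags(&exec, graph, 0);
  cudaGraphLaunch(exec, stream);  // replay the captured work
  cudaStreamSynchronize(stream);
  printf("graph replayed OK\n");

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(buf);
  cudaStreamDestroy(stream);
  return 0;
}
```

As I understand it, torch::empty sidesteps the problem because vLLM drives capture through torch.cuda.graph, and PyTorch's caching allocator then serves capture-time requests from a graph-private memory pool rather than issuing a fresh cudaMalloc.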
Nice doc pointer, thanks
LGTM, thank you!
It is recommended to merge main, and use …

I actually think this PR #12222 has a better implementation of this optimization; could you please help review, @houseroad?
Considering this PR instead: #12222
Kudos to @jianyuh, who introduced CUDA Graph support for DeepSeek v3. Overall throughput almost doubled in testing.