[Feature] DeepSeek V3 optimization #2591

Open · 11 of 18 tasks
zhyncs opened this issue Dec 26, 2024 · 31 comments

Labels: enhancement, high priority, performance, quant
zhyncs (Member) commented Dec 26, 2024

Checklist

Adoption

SGLang adoption for DeepSeek V3 and R1

Usage

User Guide for Existing System (Installation & Launch)

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Please use the latest version, v0.4.2.post4, and prefer the Docker image: docker pull lmsysorg/sglang:latest
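For reference, a single-node Docker launch looks roughly like the sketch below (based on the benchmark README linked above; the cache mount, host, and port are assumptions to adapt to your environment):

docker pull lmsysorg/sglang:latest
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000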

For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM

Features

Related resources

No response

zhyncs added the enhancement, performance, and quant labels Dec 26, 2024
zhyncs pinned this issue Dec 26, 2024
libratiger (Contributor) commented

Very quick response! I understand that the overlap scheduler is model-independent and a general optimization that should be supported by default. Are some model-specific optimizations still needed on top of it?

merrymercy (Contributor) commented Dec 26, 2024

The overlap scheduler is model-independent, but it is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.

fengyang95 commented Dec 26, 2024

Is the memory sufficient on an 8-GPU instance? The model is very large.

zhyncs (Member, Author) commented Dec 26, 2024

> Is the memory sufficient on an 8-GPU instance? The model is very large.

671B works on 8× H200 with FP8 (671 GB of weights < 141 GB × 8).

zhyncs (Member, Author) commented Dec 26, 2024

Hi @fengyang95, you can also consider multi-node deployment.

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1, help 2).
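For example, a two-node TP16 launch could look like the following sketch (node addresses are placeholders; the flags follow the multi-node examples in the benchmark README and may differ across versions):

# Node 0 (assume its address is 10.0.0.1)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code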

zhyncs (Member, Author) commented Dec 26, 2024

FYI, due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of them, feel free to join the SGLang Slack for discussion or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.

zhyncs (Member, Author) commented Dec 27, 2024

Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3; please use the latest version.

pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
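If CUDA Graph capture causes trouble on a particular setup, it can be turned off at launch, along these lines (a hedged example; see SGLang's server arguments for the authoritative flag list):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
    --trust-remote-code --disable-cuda-graph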

zhyncs (Member, Author) commented Dec 29, 2024

Update: SGLang v0.4.1.post2 supports FP8 GEMM tuning for DeepSeek V3; please use the latest version.

pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

zhyncs (Member, Author) commented Dec 30, 2024

ref #2647

CSEEduanyu commented
Any plan to support MTP (multi-token prediction)?

zhyncs (Member, Author) commented Jan 6, 2025

> Any plan to support MTP (multi-token prediction)?

It's on the roadmap and it's named nextn. We'll support it soon.

lixiaolx commented Jan 8, 2025

@zhyncs @Ying1123 @merrymercy Hello, regarding the TP+DP Attention item mentioned above (@Ying1123), I have two questions. Could you help answer them?

1. Can TP and DP be decoupled after this implementation? Can we configure a scenario where DP is not equal to TP?

2. Is there a detailed schedule for the work mentioned above? Are there any related design documents that can be shared?

Mutinifni (Contributor) commented

I had another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches under TP. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.

+1 for shared design docs, if possible.

pipul commented Jan 13, 2025

> I had another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches under TP. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.
>
> +1 for shared design docs, if possible.

@zhyncs @Mutinifni

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
"num_key_value_heads": 128,
DeepSeek-V3 has 128 KV heads??

min-xu-et (Contributor) commented

Are there any data on inference-time batch size and token imbalance between experts? What's the total throughput like for an 8× H200 node?

CSEEduanyu commented
Has there been any progress on NextN support?

lambert0312 commented
The overlap scheduler with DP attention cannot be used on 4× A800; it always OOMs.

MtFitzRoy commented
Is there a plan to support TP + SP attention?

The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"

wuyaoxuehun commented
Can you support the DeepSeek R1 Q4_K_M GGUF file? https://huggingface.co/unsloth/DeepSeek-R1-GGUF

Xu-Chen (Contributor) commented Jan 27, 2025

Any progress on NextN speculative decoding?

Neo9061 commented Feb 3, 2025

Hi @ispobock, I wonder if you have a timeline to share for when NextN (speculative decoding) will be supported? Thanks.

ispobock (Collaborator) commented Feb 4, 2025

We have finished the spec module refactor and will support NextN in the next 1~2 weeks.

Neo9061 commented Feb 4, 2025

> We have finished the spec module refactor and will support NextN in the next 1~2 weeks.

Thanks! I wonder if your implementation will include a mechanism to report the acceptance rate of the MTP head?

lambert0312 commented
"DeepSeek MTP spec decode" (vllm-project/vllm#12755) implements "Implement DeepSeek MTP" (vllm-project/vllm#12181) to support DeepSeek MTP layers for next-n prediction.

yukavio (Collaborator) commented Feb 5, 2025

Does SGLang now support DeepSeek V3 inference with EP > 1? When I added --enable-ep-moe to the command to start the service, the process hung. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

jianglan89 commented
> Does SGLang now support DeepSeek V3 inference with EP > 1? When I added --enable-ep-moe to the command to start the service, the process hung. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

Me too; it seems EP MoE is not supported in 0.4.2.
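For context, the flag under discussion is passed at launch roughly as below (a sketch only; with --enable-ep-moe the expert-parallel size follows the TP size, and whether this works for DeepSeek V3 in a given release is exactly the open question here):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
    --enable-ep-moe --trust-remote-code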


01lin commented Feb 7, 2025

When will MTP (NextN speculative decoding) be supported?

lambert0312 commented
This is https://github.com/CentML's implementation of the DeepSeek MTP modules that enables speculative decoding for DeepSeek-R1: vllm-project/vllm#12915

ltm920716 commented Feb 8, 2025

Hi, I am new to SGLang and need help deploying DeepSeek on two nodes:
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker

I have two H100 nodes; what are the best parameter settings for throughput?

Thanks

zhyncs (Member, Author) commented Feb 10, 2025

FYI, you can check out the latest progress of DeepSeek V3/R1 NextN here: #3472
I believe the majority of the difficult tasks have been completed recently, and we will be launching this feature in the coming days. Please stay tuned.
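Based on SGLang's speculative-decoding server arguments, enabling NextN once released is expected to look roughly like the sketch below (the flag values and the <nextn-draft-weights> path are assumptions, not the final interface; check #3472 for the authoritative usage):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
    --speculative-algorithm NEXTN --speculative-draft-model-path <nextn-draft-weights> \
    --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4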
