[Feature] DeepSeek V3 optimization #2591

Open · 11 of 18 tasks
zhyncs opened this issue Dec 26, 2024 · 31 comments

Labels: enhancement, high priority, performance, quant
zhyncs (Member) commented Dec 26, 2024

Checklist

Adoption

SGLang adoption for DeepSeek V3 and R1

Usage

User Guide for Existing System (Installation & Launch)

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

Please use the latest version, v0.4.2.post4, and prefer the Docker image: docker pull lmsysorg/sglang:latest
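For reference, a single-node Docker launch looks roughly like the sketch below (based on the benchmark README linked above; the cache mount, host, and port are assumptions to adapt to your environment):

docker pull lmsysorg/sglang:latest
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 --trust-remote-code --host 0.0.0.0 --port 30000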

For running on AMD MI300X, use this as a reference: Running DeepSeek-R1 on a single NDv5 MI300X VM

Features

Related resources

No response

zhyncs added the enhancement, performance, and quant labels Dec 26, 2024
zhyncs pinned this issue Dec 26, 2024
libratiger (Contributor) commented

Very quick response! I understand that the overlap scheduler is model-independent and a general optimization that should be supported by default. Are some model-specific optimizations still needed on top of it?

merrymercy (Contributor) commented Dec 26, 2024

The overlap scheduler is model-independent, but it is not yet supported when using DP attention. We have a private branch for this and will upstream it soon.

fengyang95 commented Dec 26, 2024

Is the memory sufficient on an 8-GPU instance? The model is very large.

zhyncs (Member, Author) commented Dec 26, 2024

> Is the memory sufficient on an 8-GPU instance? The model is very large.

671B works on 8× H200 with FP8 (671 GB of weights < 141 GB × 8).

zhyncs (Member, Author) commented Dec 26, 2024

Hi @fengyang95, you can also consider multi-node deployment.

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1, help 2).
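For example, a two-node TP16 launch could look like the following sketch (node addresses are placeholders; the flags follow the multi-node examples in the benchmark README and may differ across versions):

# Node 0 (assume its address is 10.0.0.1)
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code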

zhyncs (Member, Author) commented Dec 26, 2024

FYI, due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of them, feel free to join the SGLang Slack for discussion or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.

zhyncs (Member, Author) commented Dec 27, 2024

Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3; please use the latest version.

pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
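If CUDA Graph capture causes trouble on a particular setup, it can be turned off at launch, along these lines (a hedged example; see SGLang's server arguments for the authoritative flag list):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
    --trust-remote-code --disable-cuda-graph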

zhyncs (Member, Author) commented Dec 29, 2024

Update: SGLang v0.4.1.post2 supports FP8 GEMM tuning for DeepSeek V3; please use the latest version.

pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

zhyncs (Member, Author) commented Dec 30, 2024

ref #2647

CSEEduanyu commented
Any plan to support MTP (multi-token prediction)?

zhyncs (Member, Author) commented Jan 6, 2025

> Any plan to support MTP (multi-token prediction)?

It's on the roadmap and it's named nextn. We'll support it soon.

lixiaolx commented Jan 8, 2025

@zhyncs @Ying1123 @merrymercy Hello, regarding the TP+DP Attention item mentioned above (@Ying1123), I have two questions. Could you help answer them?

1. Can TP and DP be decoupled after this implementation? Can we configure a scenario where DP is not equal to TP?

2. Is there a detailed schedule for the work mentioned above? Are there any related design documents that can be shared?

Mutinifni (Contributor) commented

I had another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches under TP. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.

+1 for shared design docs, if possible.

pipul commented Jan 13, 2025

> I had another question regarding DP attention. The SGLang blog mentions that DP attention is effective because MLA has only 1 KV head, which causes unnecessary duplication of KV caches under TP. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches when just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.
>
> +1 for shared design docs, if possible.

@zhyncs @Mutinifni

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
"num_key_value_heads": 128,
DeepSeek-V3 has 128 KV heads??

min-xu-et (Contributor) commented

Are there any data on inference-time batch size and token imbalance between experts? What's the total throughput like for an 8× H200 node?

CSEEduanyu commented
Has there been any progress on NextN support?

lambert0312 commented
The overlap scheduler with DP attention cannot be used on 4× A800; it always OOMs.

MtFitzRoy commented
Is there a plan to support TP + SP attention?

The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"

wuyaoxuehun commented
Can you support the DeepSeek R1 Q4_K_M GGUF file? https://huggingface.co/unsloth/DeepSeek-R1-GGUF

Xu-Chen (Contributor) commented Jan 27, 2025

Any progress on NextN speculative decoding?

Neo9061 commented Feb 3, 2025

Hi @ispobock, I wonder if you have a timeline to share for when NextN (speculative decoding) will be supported? Thanks.

ispobock (Collaborator) commented Feb 4, 2025

We have finished the spec module refactor and will support NextN in the next 1~2 weeks.

Neo9061 commented Feb 4, 2025

> We have finished the spec module refactor and will support NextN in the next 1~2 weeks.

Thanks! I wonder if your implementation will include a mechanism to report the acceptance rate of the MTP head?

lambert0312 commented
"DeepSeek MTP spec decode" (vllm-project/vllm#12755) implements "Implement DeepSeek MTP" (vllm-project/vllm#12181) to support DeepSeek MTP layers for next-n prediction.

yukavio (Collaborator) commented Feb 5, 2025

Does SGLang now support DeepSeek V3 inference with EP > 1? When I added --enable-ep-moe to the command to start the service, the process hung. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

jianglan89 commented
> Does SGLang now support DeepSeek V3 inference with EP > 1? When I added --enable-ep-moe to the command to start the service, the process hung. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

Me too; it seems EP MoE is not supported in 0.4.2.
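For context, the flag under discussion is passed at launch roughly as below (a sketch only; with --enable-ep-moe the expert-parallel size follows the TP size, and whether this works for DeepSeek V3 in a given release is exactly the open question here):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
    --enable-ep-moe --trust-remote-code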


01lin commented Feb 7, 2025

When will MTP (NextN speculative decoding) be supported?

lambert0312 commented
This is https://github.com/CentML's implementation of the DeepSeek MTP modules that enables speculative decoding for DeepSeek-R1: vllm-project/vllm#12915

ltm920716 commented Feb 8, 2025

Hi, I am new to SGLang and need help deploying DeepSeek on two nodes:
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker

I have two H100 nodes; what are the best parameter settings for throughput?

Thanks

zhyncs (Member, Author) commented Feb 10, 2025

FYI, you can check out the latest progress of DeepSeek V3/R1 NextN here: #3472
I believe the majority of the difficult tasks have been completed recently, and we will be launching this feature in the coming days. Please stay tuned.
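Based on SGLang's speculative-decoding server arguments, enabling NextN once released is expected to look roughly like the sketch below (the flag values and the <nextn-draft-weights> path are assumptions, not the final interface; check #3472 for the authoritative usage):

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
    --speculative-algorithm NEXTN --speculative-draft-model-path <nextn-draft-weights> \
    --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4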
