
DeepSeek V3 Support #35425

Open

casper-hansen opened this issue Dec 26, 2024 · 9 comments · May be fixed by #35926

Comments

@casper-hansen commented Dec 26, 2024

Model description

Transformer model

DeepSeek V3 is a Transformer model that utilizes Mixture of Experts (similar to Qwen2 MoE) and Multi-head Latent Attention (MLA).

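To make the architecture description above more concrete, here is a minimal PyTorch sketch of the routed-plus-shared-expert pattern, similar in spirit to Qwen2 MoE. All class and parameter names are illustrative assumptions, not DeepSeek's actual implementation (which, for example, uses sigmoid gating and auxiliary-loss-free load balancing).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleExpert(nn.Module):
    """One feed-forward expert (SwiGLU-style MLP)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SimpleMoEBlock(nn.Module):
    """Illustrative top-k routed MoE block with an always-on shared expert."""
    def __init__(self, hidden_size=64, intermediate_size=128,
                 num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SimpleExpert(hidden_size, intermediate_size) for _ in range(num_experts)]
        )
        self.shared_expert = SimpleExpert(hidden_size, intermediate_size)

    def forward(self, hidden_states):
        # hidden_states: (num_tokens, hidden_size)
        scores = self.router(hidden_states).softmax(dim=-1)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        routed_out = torch.zeros_like(hidden_states)
        for expert_id, expert in enumerate(self.experts):
            # tokens that routed to this expert, and in which top-k slot
            token_pos, slot = (top_idx == expert_id).nonzero(as_tuple=True)
            if token_pos.numel() == 0:
                continue
            weight = top_scores[token_pos, slot].unsqueeze(-1)
            routed_out[token_pos] += weight * expert(hidden_states[token_pos])

        # The shared expert is applied to every token, on top of routed experts.
        return routed_out + self.shared_expert(hidden_states)

tokens = torch.randn(10, 64)
print(SimpleMoEBlock()(tokens).shape)  # torch.Size([10, 64])
```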

Multi-token Prediction

The model is able to predict multiple tokens sequentially at each step through the MTP modules. The first token is generated by the causal LM, which feeds the output token into what I would describe as a "Transformer head" to generate additional tokens for the current step. DeepSeek notes in their release that "MTP support is currently under active development within the community, and we welcome your contributions and feedback." (i.e. the code for this is not released).

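Since the MTP code is not released, the following is only a rough, hypothetical sketch of the idea described above: an extra prediction module that combines the main model's hidden state with the embedding of the token just produced and predicts one additional token for the same step. The module name, layer choices, and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMTPModule(nn.Module):
    """Hypothetical multi-token-prediction head: fuses the main model's last
    hidden state with the embedding of the token it just produced, runs one
    extra transformer block, and predicts one additional token."""
    def __init__(self, hidden_size=64, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, last_hidden, prev_token_ids):
        # last_hidden: (batch, seq, hidden) from the main causal LM
        # prev_token_ids: (batch, seq) tokens predicted by the previous head
        fused = self.proj(torch.cat([last_hidden, self.embed(prev_token_ids)], dim=-1))
        hidden = self.block(fused)
        return self.lm_head(hidden), hidden  # logits for the extra token

# Usage sketch: the main LM produces hidden states and a first token,
# then the MTP module predicts a second token for the same position.
batch, seq, hidden_size, vocab = 2, 5, 64, 1000
main_hidden = torch.randn(batch, seq, hidden_size)
first_tokens = torch.randint(0, vocab, (batch, seq))
mtp = TinyMTPModule(hidden_size, vocab)
extra_logits, _ = mtp(main_hidden, first_tokens)
second_tokens = extra_logits.argmax(dim=-1)
```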

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Transformers Code: https://huggingface.co/deepseek-ai/DeepSeek-V3
GitHub Code (minimal implementation): https://github.com/deepseek-ai/DeepSeek-V3/tree/main/inference
Paper: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

@Qubitium
Contributor

Historical perspective on DeepSeek: DeepSeek V2 support was never added by the DeepSeek team. There is a community V2 PR that never made it out of the draft phase.

#31976

Unless it is added by the OSS community or HF, history shows DeepSeek will not proactively add HF support, as their priority is SGLang, LMDeploy, and others.

Let's hope someone in the community or HF picks up the ball on this, as this is not a simple model to support.

@ArthurZucker

@casper-hansen
Author

The DeepSeek v3 code is mostly available already though. They put an MIT license on the code in their repository. So a PR mostly needs a multi-token prediction implementation.

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py

@Nottlespike

> The DeepSeek v3 code is mostly available already though. They put an MIT license on the code in their repository. So a PR mostly needs a multi-token prediction implementation.
>
> https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py

This isn't quite correct, right? DeepSeek-V2/V2.5 never got natively implemented, as per #31976, and there is modeling code for V2 as well: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/modeling_deepseek.py

@Nottlespike

Looking at ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md from the repo, I'm willing to try my hand at a native DeepSeek-V3 implementation, but since that README was last updated 10 months ago: is the mentor system still valid, and is this the up-to-date way to go about it?

@LysandreJik
Member

cc @Cyrilvallez on the last comment above regarding guiding model integrations ^

@Nottlespike

> cc @Cyrilvallez on the last comment above regarding guiding model integrations ^

Thanks! Would love to do this as per protocol as I am dying to use transformers tools on this model!

@IYoreI

IYoreI commented Jan 2, 2025

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/modeling_deepseek.py#L439
What does this mean? It seems that this code can only run inference.

@Nottlespike

Bumping for @Cyrilvallez's advisement. We (well, mostly @fairydreaming, not really me) effectively skipped transformers in favor of llama.cpp, but having this would be very much appreciated for those who want to try formats like EXL2 or AWQ.

@Cyrilvallez
Member

Hey @Nottlespike! Very nice that you want to tackle this! Sorry for the late answer, I was still on vacation. Regarding model integration rules, you can also check here and the modular rules.
What you want to do is isolate the small individual differences between your model (deepseek) and existing models in the library (e.g. mixtral, qwen2 moe, deepseek v1...), then create a modular file based on those differences. You can check e.g. this past model integration for an example of a model addition with modular.
The modeling code already on the Hub should provide a strong starting point.
Let me know if you have any questions 🤗
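For anyone unfamiliar with the modular approach mentioned above, here is a rough sketch of what a modular_deepseek_v3.py could look like: classes inherit from existing models in the library and only override what is different. The class names and the choice of Qwen2 MoE / LLaMA as parents are assumptions for illustration; the actual PR would diverge wherever DeepSeek V3 differs (MLA, sigmoid gating, etc.).

```python
# Hypothetical sketch of a transformers "modular" file. Names and parent
# choices are assumptions, not the actual PR.
from transformers.models.llama.modeling_llama import LlamaRMSNorm
from transformers.models.qwen2_moe.modeling_qwen2_moe import (
    Qwen2MoeDecoderLayer,
    Qwen2MoeForCausalLM,
    Qwen2MoeModel,
)


class DeepseekV3RMSNorm(LlamaRMSNorm):
    pass  # identical to LLaMA's RMSNorm


class DeepseekV3DecoderLayer(Qwen2MoeDecoderLayer):
    # Override only the pieces that differ, e.g. swap in Multi-head Latent
    # Attention and DeepSeek's gating instead of the Qwen2 MoE defaults.
    pass


class DeepseekV3Model(Qwen2MoeModel):
    pass


class DeepseekV3ForCausalLM(Qwen2MoeForCausalLM):
    pass
```

The converter then generates the full modeling file from this, so reviewers only have to look at the actual differences from already-supported models.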
