Feature Request: add DeepSeek-v3 support #10981
Comments
The sigmoid routing thing or whatever is a bit different, but the rest of the arch is largely the same as DeepSeek 2.5, just larger. There's no PR yet in HF transformers; it looks like they've built this atop transformers 4.33, so that will be quite a merge to get in properly, I guess. |
In case it helps: transformers 4.46.3 is listed here https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/requirements.txt |
What's missing to get this to work, and can one do anything to help? |
|
Can a dev help break down for us what would be required in |
|
@fairydreaming : How much more work is needed before you can accept collaborators and testers on your branch? I see on localllama that you have at least a PoC running. |
I still have to add a new pre-tokenizer regex and test the tokenization. I'm not sure how many weird regex quirks I'll encounter along the way, but I estimate it will take a few days at most. Edit: Also, I don't have MTP implemented, but it can be added later. |
You can do this without official HF transformers support without |
My DeepSeek-V3 branch is here: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v3 To convert the model to GGUF you need a dequantized DeepSeek V3. You can download it from HF (there are several BF16 DeepSeek V3 models available, but I didn't test any of them) or run the inference/fp8_cast_bf16.py script from the original model repo to convert it to BF16 (that's what I did). Note that it uses triton, so I think you need a GPU for this. In case you experience CUDA out-of-memory errors during conversion, check this: https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17 There are some minor tokenization differences compared to the original model, but I think it's usable. |
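For anyone following the same path, the rough sequence is sketched below. This is only an outline under assumptions: the paths are placeholders, and the fp8_cast_bf16.py flag names should be double-checked against the script in the DeepSeek-V3 repo before running.

```bash
# 1) Dequantize the original FP8 checkpoint to BF16 (uses triton, so a GPU is assumed).
#    Flag names here are assumptions; check the script's own argument parser.
python inference/fp8_cast_bf16.py \
    --input-fp8-hf-path /models/DeepSeek-V3 \
    --output-bf16-hf-path /models/DeepSeek-V3-bf16

# 2) Convert the BF16 HF checkpoint to GGUF with the conversion script from the branch.
python convert_hf_to_gguf.py /models/DeepSeek-V3-bf16 \
    --outfile /models/deepseek-v3-bf16.gguf --outtype bf16

# 3) Quantize the GGUF (e.g. Q8_0 or Q4_K_S) so it fits in RAM.
./llama-quantize /models/deepseek-v3-bf16.gguf /models/deepseek-v3-q8_0.gguf Q8_0
```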
Some initial perplexity values over wiki.test.raw (not a full run) with Q4_K_S quantized model:
|
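For anyone wanting to reproduce a perplexity measurement like this, a minimal sketch with a placeholder model path (thread count should be tuned to the machine):

```bash
# Perplexity over the wikitext-2 test set using llama.cpp's perplexity tool
./llama-perplexity -m /models/deepseek-v3-q4_k_s.gguf -f wiki.test.raw -t 32
```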
THANKS! Will begin running https://github.com/EleutherAI/lm-evaluation-harness on it ASAP! |
I ran farel-bench locally on the model, looks good! (first two are via OpenRouter, third is local)
|
What is your rig, specs-wise? |
@Nottlespike Epyc 9374F, 384GB RAM. It took almost 5 hours to run all 450 prompts. |
No GPUs? I've got 4x 3090 Ti FEs linked together with the hacked P2P driver, plus a Threadripper Pro with 8 channels of 128GB DDR4, so I should be able to run it MUCH faster! I've seen your work before and REALLY appreciate your contributions! Any way we can get in contact? I know @bartowski1182 very well if they have a way to contact you. |
@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model? Regarding the contact I'm active on Reddit (mostly on r/LocalLLaMA) with the same username. |
I have been informed I am "unpopular to hated" on r/LocalLLaMA... given I am basically using a "server" with 4 of the best consumer GPUs on the market, and I called the tinybox a grift at best and a scam at worst. |
@fairydreaming Am I reading your PR correctly and you DON'T NEED |
@Nottlespike AFAIK llama.cpp conversion scripts only use HF transformers AutoTokenizer class and DeepSeek V3 has no custom tokenizer class implementation, so I guess there is no need for |
@fairydreaming This is elegant.... props. The previous HF transformers "implementation" forced |
EDIT: Ignore below, simple user error. @fairydreaming, I'm running your convert_hf_to_gguf_update.py file to create a GGUF after dequantizing the model, but when I run the script, I get an error. Any advice on what I'm doing wrong?
It always gives the same error, no matter what I run: Excited to replicate what you've done! Great work. |
@etafund that's the script for updating the conversion script; use the one without the _update suffix. |
Thanks, @fairydreaming! Your updated conversion script is working perfectly going from BF16 to q8_0. I'll update with inference results once the quanting finishes and I have a chance to run it through its paces. |
What are your speeds with the 4090? |
|
@fairydreaming https://manifold.markets/Kearm20/will-i-be-able-to-run-deepseekv3-10 It is now. |
I have it running on my dual-socket Genoa rig now. First result is 8.83 t/s cpu-only inference. @fairydreaming: is there anything I can do to assist in implementation? I'd need to pull and requant, but don't mind doing so if I can be of use. Thanks for getting this model working! |
I got it to work also. Similar path as @RodriMora.
Specs: Hint for anyone trying to replicate: make sure to run with --no-context-shift, CUDA build with -ngl 0. Real-life test on the q8_0 GGUF: |
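As a concrete illustration of the hint above (paths and thread count are placeholders, not the exact command used here):

```bash
# Keep all layers on the CPU (-ngl 0) and disable context shifting, as suggested above.
./llama-cli -m /models/deepseek-v3-q8_0.gguf \
    -ngl 0 --no-context-shift -t 32 \
    -p "Hello, introduce yourself."
```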
@cpumaxx, running at q8_0 or q4? |
I quanted to q8_0. |
just curious, how much RAM does this use to run?
|
@arthurwolf, 680238 MiB is the model buffer size on my rig, so about 664 GiB of RAM. |
llama-server process is using 711GB on my rig |
@cpumaxx I've tried a few different commands here and am stuck at 4-5 tokens/second. Mind posting the command you're running that gets you to 8-9 tokens/second on your dual Genoa setup? Thanks so much!
|
100%, -ngl should not be 0. 7 should work with an A6000. |
@Nottlespike Agree with you - just trying to figure out the Epyc Genoa CPU issue first, then will layer in the GPU to improve performance. |
Remove the numactl command and --no-mmap flag. Everything else is the same as what I'm using. |
Yo if y'all want to collaborate in real time I'm livestreaming this all over X at https://x.com/i/broadcasts/1MnxnDkgaMyGO |
But I'm also a maniac who is doing this from BF16 |
@etafund Try to limit the number of threads to 32 or 48 (-t 32) |
Still testing at about 5.5 tokens/second on a dual Epyc Genoa system. If anyone has advice on how to get this closer to 8-9 tokens/second, let me know. Thanks @fairydreaming and @cpumaxx and @Nottlespike for all the help so far.
|
@etafund What performance do you have on a CPU-only llama.cpp build (compiled without CUDA)? Edit: Also, you can try increasing the number of threads to 48 or 64 and check when the performance starts decreasing; I'm not sure what the right value is for your CPU. |
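One way to find the thread-count sweet spot is to sweep it with llama-bench; a rough sketch, with a placeholder model path and a thread list that should match the CPU:

```bash
# Compare prompt processing (-p) and generation (-n) speed at several thread counts.
./llama-bench -m /models/deepseek-v3-q8_0.gguf -t 16,32,48,64 -p 512 -n 128
```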
@cpumaxx If you have time please repeat the whole conversion and model testing process with the current code to confirm that it still works without problems (and that old DeepSeek V2 and V2.5 still work in case you have them). I just finished dealing with the llama.cpp file explosion caused by #10902, time to get some rest. |
Sure. I should be done by end of day if the last conversion was any barometer. |
I can now confirm that the re-quanted model works with the new code (and that the old model doesn't) |
@cpumaxx Can you post the error you get with the old model? |
Sorry, I meant the old v3 quant doesn't work with the new code (I tried as a sanity check to make sure I was on the new code and that the new quant was really different) |
OK, that was expected since a tensor name changed. |
Same here, new quants working fine with the latest commit. Gives this error with the old quants:
So everything is working as expected. I also ran the MMLU-PRO computer science benchmark and got really good results: |
Great! BTW how did you run the benchmark, did you use llama-server? I also tried to run this bench today out of curiosity (via OpenAI-compatible endpoint of llama-server) but experienced llama-server token generation speed gradually getting slower and slower (at the beginning it was over 9 t/s, but around question 180 only around 5 t/s) so I started investigating why it does that. I don't know, maybe the prompt cache started to grow too big and caused parts of the model to be removed from RAM. Did you notice a similar behavior? |
Same, running llama-server with the OAI API endpoint as the backend and ollama-mmlu-pro for the benchmark. These are the report results I got at the end:
So 2t/s average at the end. I didn't monitor it as I left it overnight and already closed the ssh session so I don't have the llama-server output. I'll test again and monitor RAM. But I think it may be just the longer questions? I get 5t/s on super low context (under 200 tokens). 4t/s at 500-1000 context. And 2t/s at 2000-3000 context from some quick test. |
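For reference, benchmarks like this just talk to llama-server's OpenAI-compatible endpoint; a minimal sketch of that setup (host, port, paths and thread count are placeholders):

```bash
# Serve the model with the OpenAI-compatible API
./llama-server -m /models/deepseek-v3-q8_0.gguf -ngl 0 --no-context-shift \
    --host 0.0.0.0 --port 8080 -t 32

# Sanity-check the endpoint with a single chat completion
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":32}'
```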
@RodriMora I'm not sure, there is definitely large variance caused by different length of prompts/generated token sequences (and different sets of activated experts), but values close to 8 t/s are only at the beginning. Edit: tomorrow I'm going to disable prompt caching and run it again, will see if it changes anything. |
I know this is closed/merged, but as a datapoint: deepseek 2.5 didn't show any regressions. |
Great, thanks for checking! |
Prerequisites
Feature Description
Add support for DeepSeek-v3
https://huggingface.co/deepseek-ai/DeepSeek-V3
Currently not supported:
ERROR:hf-to-gguf:Model DeepseekV3ForCausalLM is not supported
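For context, that error is what the stock conversion script prints when pointed at the DeepSeek-V3 checkout, e.g. (path is a placeholder):

```bash
python convert_hf_to_gguf.py /models/DeepSeek-V3 --outfile deepseek-v3.gguf
# ERROR:hf-to-gguf:Model DeepseekV3ForCausalLM is not supported
```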
Motivation
DeepSeek-v3 is a big MoE model of 685B params; support would be great, as offloading to RAM will be a must on most systems.
Possible Implementation
There is no model card or technical report yet, so I don't know how different it is from v2.
Edit: they have uploaded the model card and paper:
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/README.md