[Usage]: DeepSeek V3 cannot set tensor_parallel_size=16 and pipeline-parallel-size=2 on L20 #12256

Closed
xwz-ol opened this issue Jan 21, 2025 · 2 comments
Labels: usage (How to use vllm)


xwz-ol commented Jan 21, 2025

Your current environment

The output of `python collect_env.py`

(RayWorkerWrapper pid=5057, ip=10.121.129.5) Cache shape torch.Size([163840, 64]) [repeated 30x across cluster]
(RayWorkerWrapper pid=5849, ip=10.121.129.12) INFO 01-21 00:46:19 model_runner.py:1099] Loading model weights took 18.9152 GB [repeated 7x across cluster]
(RayWorkerWrapper pid=5148, ip=10.121.129.13) INFO 01-21 00:46:25 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 8x across cluster]

(RayWorkerWrapper pid=5050, ip=10.121.129.5) INFO 01-21 00:47:24 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 8x across cluster]
(RayWorkerWrapper pid=5054, ip=10.121.129.5) WARNING 01-21 00:47:31 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json
(RayWorkerWrapper pid=5053, ip=10.121.129.5) INFO 01-21 00:47:24 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 7x across cluster]
WARNING 01-21 00:47:34 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] Memory profiling takes 14.78 seconds
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] the current vLLM instance can use total_gpu_memory (44.42GiB) x gpu_memory_utilization (0.70) = 31.10GiB
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] model weights take 21.41GiB; non_torch_memory takes 0.40GiB; PyTorch activation peak memory takes 0.39GiB; the rest of the memory reserved for KV Cache is 8.89GiB.
(RayWorkerWrapper pid=5856, ip=10.121.129.12) WARNING 01-21 00:47:34 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json [repeated 30x across cluster]
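
For context on the worker.py lines above: the KV-cache figure is just the usable per-GPU budget minus the other reservations. A quick back-of-the-envelope check, with the numbers copied from the log (this is not vLLM's actual profiling code):

```python
# Rough arithmetic behind the memory report above, per L20 GPU.
# All values are copied from the worker.py log lines.
total_gpu_memory = 44.42        # GiB available on an L20
gpu_memory_utilization = 0.70   # utilization cap used in this run
model_weights = 21.41           # GiB of weights per GPU after sharding
non_torch_memory = 0.40         # GiB
activation_peak = 0.39          # GiB of PyTorch activation peak

usable = total_gpu_memory * gpu_memory_utilization
kv_cache = usable - model_weights - non_torch_memory - activation_peak
print(f"usable ~= {usable:.2f} GiB, KV cache ~= {kv_cache:.2f} GiB")
# -> usable ~= 31.09 GiB, KV cache ~= 8.89 GiB (the log rounds the first to 31.10)
```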

How would you like to use vllm

I want to run inference of DeepSeek V3 with vLLM on L20 GPUs across multiple nodes (via Ray), with tensor_parallel_size=16 and pipeline-parallel-size=2, but this configuration does not work.
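
For reference, a minimal sketch of the configuration being attempted, using the offline LLM API; the model id, the explicit Ray backend, and the generation call are assumptions inferred from the title and the logs above, not the reporter's exact command:

```python
# Sketch only: DeepSeek V3 sharded over 32 GPUs as TP=16 x PP=2 via Ray.
# The checkpoint id and backend are assumptions, not taken from the report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",      # assumed checkpoint
    tensor_parallel_size=16,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",   # multi-node, matching the Ray workers in the logs
    gpu_memory_utilization=0.70,          # matches the 0.70 shown in the memory profile
    trust_remote_code=True,
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```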

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
xwz-ol added the usage (How to use vllm) label on Jan 21, 2025
drikster80 (Contributor) commented

Can confirm there is an issue when using both TP and PP together on MoE models. They work independently, but not at the same time. Can also confirm that they work together fine on non-MoE models (e.g. tested on QWEN with TP=2 & PP=2).

Maybe a result of #12222 that added CUDA Graphs for MoE?
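
For concreteness, the non-MoE sanity check described above might look like the sketch below; the exact Qwen checkpoint and the executor backend are assumptions (the comment only says "QWEN" with TP=2 and PP=2):

```python
# Sketch of the working non-MoE case: TP=2 and PP=2 combined (4 GPUs total).
# The checkpoint id and executor backend are assumptions.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",    # assumed model id
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",  # assumed backend
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```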

drikster80 (Contributor) commented

Just pulled from main and it appears this one has been fixed in the past 24-48 hours. I'm able to run with both TP=4 and PP=3 with Deepseek-R1. I think this can be closed.
