[Bug] Performance on DeepSeek-V2 #2890
Checklist
Describe the bug
I previously tested DeepSeek-V2 on SGLang 0.3.5. Recently I benchmarked the same model again on SGLang 0.4.1.post5 and found that the MLA attention kernel (__fwd_kernel) is much faster in the prefill phase.
However, the __fwd_kernel Triton operator itself does not appear to have changed between the two versions. What makes this kernel faster, or is there some difference in the benchmark test case?
SGLang 0.3.5
Warmup ...
Prefill. latency: 3.69200 s, throughput: 1974.00 token/s
Decode. latency: 0.06813 s, throughput: 14.68 token/s
Decode. latency: 0.04793 s, throughput: 20.87 token/s
Decode. latency: 0.04779 s, throughput: 20.93 token/s
Decode. latency: 0.04779 s, throughput: 20.93 token/s
Decode. latency: 0.04796 s, throughput: 20.85 token/s
Decode. median latency: 0.04796 s, median throughput: 20.85 token/s
Total. latency: 4.048 s, throughput: 1802.30 token/s
Benchmark ...
Prefill. latency: 1.63664 s, throughput: 4453.03 token/s
Decode. latency: 0.04823 s, throughput: 20.73 token/s
Decode. latency: 0.04784 s, throughput: 20.90 token/s
Decode. latency: 0.04782 s, throughput: 20.91 token/s
Decode. latency: 0.04794 s, throughput: 20.86 token/s
Decode. latency: 0.04793 s, throughput: 20.86 token/s
Decode. median latency: 0.04877 s, median throughput: 20.50 token/s
Total. latency: 11.365 s, throughput: 658.86 token/s
SGLang 0.4.1
Warmup ...
Prefill. latency: 2.43144 s, throughput: 2997.39 token/s
Decode. latency: 1.64079 s, throughput: 0.61 token/s
Decode. latency: 0.02459 s, throughput: 40.66 token/s
Decode. latency: 0.02415 s, throughput: 41.41 token/s
Decode. latency: 0.02414 s, throughput: 41.43 token/s
Decode. latency: 0.02417 s, throughput: 41.37 token/s
Decode. median latency: 0.02415 s, median throughput: 41.41 token/s
Total. latency: 4.217 s, throughput: 1730.07 token/s
Benchmark ...
Prefill. latency: 0.60336 s, throughput: 12078.95 token/s
Decode. latency: 0.02449 s, throughput: 40.83 token/s
Decode. latency: 0.02446 s, throughput: 40.87 token/s
Decode. latency: 0.02423 s, throughput: 41.28 token/s
Decode. latency: 0.02417 s, throughput: 41.37 token/s
Decode. latency: 0.02407 s, throughput: 41.54 token/s
Decode. median latency: 0.02371 s, median throughput: 42.17 token/s
Total. latency: 5.328 s, throughput: 1405.34 token/s
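For what it's worth, the reported prefill throughputs in both versions are consistent with the same 7288-token input, assuming throughput is computed as input_len / prefill latency (my assumption about how the bench script reports it, not verified against its source). So the difference looks like a real prefill speedup rather than a change in how throughput is counted. Quick sanity check:

```python
# Sanity check on the numbers copied from the logs above.
# Assumption: prefill throughput = input_len / prefill latency.
input_len = 7288

prefill_latency = {
    "SGLang 0.3.5 (benchmark)": 1.63664,  # s, reported ~4453 token/s
    "SGLang 0.4.1 (benchmark)": 0.60336,  # s, reported ~12079 token/s
}

for name, latency in prefill_latency.items():
    print(f"{name}: {input_len / latency:.2f} token/s")
```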
Reproduction
sglang 0.3.5
python -m sglang.bench_latency --model-path Deepseek-v2 --tensor-parallel-size 4 --quantization fp8 --batch-size 1 --input-len 7288 --output-len 200 --trust-remote-code
sglang 0.4.1
python -m sglang.bench_one_batch --model-path Deepseek-V2 --tensor-parallel-size 4 --quantization fp8 --batch-size 1 --input-len 7288 --output-len 200 --trust-remote-code
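To tell whether __fwd_kernel itself became faster or whether the prefill path around it changed (e.g., fewer launches or different tiling), one option is to capture per-kernel CUDA timings for the prefill step in each environment and compare. Below is a minimal torch.profiler sketch; the matmul is only a placeholder workload standing in for the real prefill/extend forward pass, not sglang code:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def workload():
    # Placeholder GPU work; in the real comparison this would be the
    # prefill (extend) forward pass that launches __fwd_kernel.
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    return a @ b

workload()  # warm up once so one-time compilation/caching is excluded
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
    workload()
    torch.cuda.synchronize()

# Sort by total CUDA time; in a real run, look for the __fwd_kernel entry
# in each SGLang version and compare its duration and launch count.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```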
Environment
SGLang 0.3.5
/opt/conda/envs/sglang_0.3.5/lib/python3.10/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 565.57.01
PyTorch: 2.4.0+cu121
sglang: 0.3.5
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.48.0
requests: 2.32.3
tqdm: 4.67.1
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
packaging: 24.2
PIL: 10.4.0
psutil: 6.1.1
pydantic: 2.10.5
uvicorn: 0.34.0
uvloop: 0.21.0
zmq: 26.2.0
vllm: 0.6.3.post1
multipart: 0.0.20
openai: 1.59.7
anthropic: 0.42.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PHB PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
Hypervisor vendor: KVM
ulimit soft: 1048576
SGLang 0.4.1
/opt/conda/envs/sglang_0.4.1/lib/python3.10/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.85
CUDA Driver Version: 565.57.01
PyTorch: 2.4.0+cu121
sglang: 0.4.1.post5
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.0
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.3.post1
openai: 1.59.7
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 SYS PHB PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS PHB PHB PHB PHB 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PHB PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PHB SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
Hypervisor vendor: KVM
ulimit soft: 1048576