
[Feature] add disable-custom-all-reduce #1148

Merged: 4 commits merged into sgl-project:main from add-disable_custom_all_reduce on Aug 20, 2024

Conversation

Xu-Chen (Contributor) commented Aug 19, 2024

Motivation

Sometimes we need to turn off custom all-reduce, especially on A800 GPUs with tensor parallelism, to avoid timeout problems caused by NCCL communication. The error looks like vllm-project/vllm#6614. Custom all-reduce may be the cause (this is not confirmed), but after setting disable-custom-all-reduce the problem no longer occurs.
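As a minimal sketch (assuming the standard sglang launch entry point; the model path and `--tp 4` are illustrative placeholders for a 4× A800 node):

```bash
# Launch with the flag added by this PR; model path and --tp value are placeholders.
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --tp 4 \
    --disable-custom-all-reduce
```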

Modification

Checklist

  • Before submitting a PR for review, make sure it has at least passed verification in your local development environment.
  • Ensure `pre-commit run --all-files` or other linting tools are used to fix potential lint issues.
  • Confirm that modifications are covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • Modify documentation as needed, such as docstrings or example tutorials.


Xu-Chen force-pushed the add-disable_custom_all_reduce branch from c0d9374 to 5e065b3 on August 19, 2024 at 10:24
zhyncs (Member) commented Aug 19, 2024

@Xu-Chen Could you run `python3 -m sglang.check_env` for the A800 environment?

zhyncs self-assigned this on Aug 19, 2024
Xu-Chen (Contributor, Author) commented Aug 19, 2024

> @Xu-Chen Could you run `python3 -m sglang.check_env` for the A800 environment?

4× A800:

Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA A800-SXM4-80GB
GPU 0,1,2,3 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 525.125.06
PyTorch: 2.4.0+cu121
sglang: 0.2.12
flashinfer: 0.1.4+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.0
hf_transfer: Module Not Found
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.4
multipart: 0.0.9
openai: 1.40.6
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	CPU Affinity	NUMA Affinity
GPU0	 X 	NV8	NV8	NV8	NODE	PXB	PXB	NODE	NODE	SYS	SYS	SYS	SYS	0-31,64-95	0
GPU1	NV8	 X 	NV8	NV8	SYS	SYS	SYS	SYS	SYS	PXB	PXB	NODE	NODE	32-63,96-127	1
GPU2	NV8	NV8	 X 	NV8	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	32-63,96-127	1
GPU3	NV8	NV8	NV8	 X 	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PXB	32-63,96-127	1
NIC0	NODE	SYS	SYS	SYS	 X 	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS
NIC1	PXB	SYS	SYS	SYS	NODE	 X 	PIX	NODE	NODE	SYS	SYS	SYS	SYS
NIC2	PXB	SYS	SYS	SYS	NODE	PIX	 X 	NODE	NODE	SYS	SYS	SYS	SYS
NIC3	NODE	SYS	SYS	SYS	NODE	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS
NIC4	NODE	SYS	SYS	SYS	NODE	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS
NIC5	SYS	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE
NIC6	SYS	PXB	NODE	NODE	SYS	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE
NIC7	SYS	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX
NIC8	SYS	NODE	PXB	PXB	SYS	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


ulimit soft: 1024

Ying1123 previously approved these changes on Aug 20, 2024
zhyncs (Member) commented Aug 20, 2024

@Xu-Chen Could you try using `--enable-p2p-check` instead of `--disable-custom-all-reduce` in your environment? Does it work?
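For reference, a sketch of that alternative under the same illustrative assumptions as above:

```bash
# Keep custom all-reduce enabled but add the peer-to-peer access check instead.
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --tp 4 \
    --enable-p2p-check
```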

Xu-Chen (Contributor, Author) commented Aug 20, 2024

> @Xu-Chen Could you try using `--enable-p2p-check` instead of `--disable-custom-all-reduce` in your environment? Does it work?

@zhyncs Unfortunately, this does not solve the problem; timeout problems still occur when there are too many requests.

merrymercy merged commit ff2cfdb into sgl-project:main on Aug 20, 2024 (1 of 5 checks passed)
merrymercy (Contributor) commented

@Xu-Chen Thanks for the contribution. It is merged.

Xu-Chen deleted the add-disable_custom_all_reduce branch on August 20, 2024 at 15:52
m0g1cian commented

@Xu-Chen I am wondering what the performance drop is after disabling custom_all_reduce?

Xu-Chen (Contributor, Author) commented Sep 20, 2024

> @Xu-Chen I am wondering what the performance drop is after disabling custom_all_reduce?

About 5% to 10%.
