
[Core][Distributed] add same-node detection #5369

Merged
merged 13 commits into vllm-project:main from single_host
Jun 11, 2024

Conversation

youkaichao
Member

This PR adds a function to detect whether all processes inside a process group live on the same node. It should take over #4903.

The idea is to test whether all processes can access the same shared memory segment.

Our CI can only run on a single node for now, so I manually tested correctness for the multi-node case. In the future, we should add a multi-node test.
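
For illustration, here is a minimal sketch of the idea (hypothetical code, not the exact implementation in this PR; it assumes a CPU-capable backend such as gloo and that the group's rank 0 is also global rank 0):

```python
from multiprocessing import shared_memory

import torch
import torch.distributed as dist


def is_in_the_same_node(pg: dist.ProcessGroup) -> bool:
    """Sketch: True iff every rank in pg can attach to rank 0's segment."""
    rank = dist.get_rank(group=pg)
    ok = torch.ones(1, dtype=torch.int32)
    shm = None
    try:
        if rank == 0:
            # create a named shared-memory segment on this node
            shm = shared_memory.SharedMemory(create=True, size=128)
            name = [shm.name]
        else:
            name = [None]
        # share the segment name with every rank in the group
        dist.broadcast_object_list(name, src=0, group=pg)
        if rank != 0:
            try:
                peer = shared_memory.SharedMemory(name=name[0])
                peer.close()
            except OSError:
                # expected failure: this rank is on a different node
                ok[0] = 0
        # same node only if every rank attached successfully
        dist.all_reduce(ok, op=dist.ReduceOp.MIN, group=pg)
    finally:
        if shm is not None:
            shm.close()
            shm.unlink()
    return bool(ok.item())
```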

@youkaichao youkaichao requested review from esmeetu and removed request for esmeetu June 10, 2024 03:38
@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager? Or is that not compatible with torch.distributed?

Edit: Also, we might want to log the exceptions that are being raised instead of silently ignoring them.

@youkaichao
Member Author

> I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind to.

> we might want to log the exceptions that are being raised

The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

> Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind to.

Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

> we might want to log the exceptions that are being raised

> The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

Is there any way to only suppress that specific class of exceptions? Otherwise I guess the current way is fine.

@youkaichao
Member Author

> Is there any way to only suppress that specific class of exceptions?

Limited to OSError now.

@youkaichao
Member Author

> Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

Can you elaborate on this? I don't get it.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

> Can you elaborate on this? I don't get it.

From my understanding, you said that we need to use the address parameter of SharedMemoryManager; otherwise, the memory cannot be shared over the network, which is required in the multi-node case. However, there is no mention of this in the docs for SharedMemory. So I think we can't use SharedMemory in the multi-node case; the reason your test still succeeded might be that you caught the exception raised due to this, which has nothing to do with torch.distributed.

If we are relying on this behaviour for the test, then we can also just omit the address parameter for SharedMemoryManager.
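
For context, binding the manager to a network address would look roughly like this (hypothetical host, port, and authkey values; this is the port-selection burden mentioned earlier in the thread):

```python
from multiprocessing.managers import SharedMemoryManager

# hypothetical address and authkey, just to show the required setup
smm = SharedMemoryManager(address=("0.0.0.0", 50000), authkey=b"secret")
smm.start()
try:
    shm = smm.SharedMemory(size=128)  # segment lives on the manager's node
finally:
    smm.shutdown()
```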

@youkaichao
Member Author

If I understand correctly, SharedMemoryManager starts a server process on the node it lives on, and whenever a connection comes in requesting shared memory, it creates a shared memory segment on that node and returns a wrapper/proxy of that memory. Given that, I don't think it can be used to test whether all processes live on the same node.

We can only test this by manually sharing the name of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.
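
For reference, plain SharedMemoryManager usage looks like this (a sketch illustrating the server model described above):

```python
from multiprocessing.managers import SharedMemoryManager

# the manager's server process runs on the node where it is started,
# so every segment it allocates lives on that node -- it cannot tell
# us where the requesting clients run
with SharedMemoryManager() as smm:
    shm = smm.SharedMemory(size=128)
    shm.buf[:5] = b"hello"
```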

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> We can only test this by manually sharing the name of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.

Alright, I understand your intention more clearly now. In that case, it's probably not worth the effort to use SharedMemoryManager.

Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/finally to ensure that the cleanup work is done even when unexpected exceptions occur.

@youkaichao
Member Author

> Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/finally to ensure that the cleanup work is done even when unexpected exceptions occur.

Added.

@DarkLight1337
Member

> Could you suppress errors only for the relevant parts of the code so that the above becomes clear?

By this I mean that you should minimize the number of lines suppressed. (I assume that the error only occurs when constructing the SharedMemory object?)

> Added.

How about the code involving unlink?

@youkaichao
Member Author

> I assume that the error only occurs when constructing the SharedMemory object?

It is difficult to make that assumption; a Python exception might be raised for some unrelated reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.

> How about the code involving unlink?

The unlink part does not need a try/finally; it is already a cleanup step. If unlink fails, we don't need to unlink again.
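
For illustration, the attach attempt then looks roughly like this (hypothetical helper, not the PR's actual code):

```python
import contextlib
from multiprocessing import shared_memory


def can_attach(name: str) -> bool:
    # only the expected cross-node failure (OSError) is swallowed;
    # any other exception still propagates to the caller
    with contextlib.suppress(OSError):
        peer = shared_memory.SharedMemory(name=name)
        peer.close()
        return True
    return False
```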

@DarkLight1337
Member

> I assume that the error only occurs when constructing the SharedMemory object?

> It is difficult to make that assumption; a Python exception might be raised for some unrelated reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.

We can add an except clause to the try/finally block to handle those errors (e.g., by logging them).

> How about the code involving unlink?

> The unlink part does not need a try/finally; it is already a cleanup step. If unlink fails, we don't need to unlink again.

I see, that is fine then.

@youkaichao
Member Author

> We can add an except clause to the try/finally block to handle those errors (e.g., by logging them).

Added in 72961a5, please take a look.
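
(For readers following the thread: the structure under discussion presumably looks something like the sketch below; see the commit above for the actual change.)

```python
import logging
from multiprocessing import shared_memory

logger = logging.getLogger(__name__)

shm = None
try:
    shm = shared_memory.SharedMemory(create=True, size=128)
    # ... perform the same-node check ...
except OSError as e:
    # log the error instead of dropping it silently
    logger.warning("same-node detection hit an OSError: %s", e)
finally:
    # cleanup runs even when an unexpected exception occurs
    if shm is not None:
        shm.close()
        shm.unlink()
```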

@DarkLight1337
Member

Looks good, let's get this merged then.

@youkaichao youkaichao enabled auto-merge (squash) June 11, 2024 03:26
@simon-mo simon-mo disabled auto-merge June 11, 2024 17:53
@simon-mo simon-mo merged commit c4bd03c into vllm-project:main Jun 11, 2024
101 of 103 checks passed
@youkaichao youkaichao deleted the single_host branch June 11, 2024 18:00
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 12, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024