
[Core][Distributed] add same-node detection #5369

Merged
merged 13 commits into vllm-project:main from single_host
Jun 11, 2024

Conversation

youkaichao
Member

This PR adds a function to detect whether all processes inside a process group live on the same node. It should take over #4903.

The idea is to test whether all processes can access the same shared memory segment.

Our CI can only run on a single node for now, so I manually tested correctness for the multi-node case. In the future, we should add a multi-node test.
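
For illustration, here is a minimal sketch of the idea (hypothetical code, not the exact implementation in this PR; it assumes a CPU-capable backend such as gloo and that the group's rank 0 is also global rank 0):

```python
from multiprocessing import shared_memory

import torch
import torch.distributed as dist


def is_in_the_same_node(pg: dist.ProcessGroup) -> bool:
    """Sketch: True iff every rank in pg can attach to rank 0's segment."""
    rank = dist.get_rank(group=pg)
    ok = torch.ones(1, dtype=torch.int32)
    shm = None
    try:
        if rank == 0:
            # create a named shared-memory segment on this node
            shm = shared_memory.SharedMemory(create=True, size=128)
            name = [shm.name]
        else:
            name = [None]
        # share the segment name with every rank in the group
        dist.broadcast_object_list(name, src=0, group=pg)
        if rank != 0:
            try:
                peer = shared_memory.SharedMemory(name=name[0])
                peer.close()
            except OSError:
                # expected failure: this rank is on a different node
                ok[0] = 0
        # same node only if every rank attached successfully
        dist.all_reduce(ok, op=dist.ReduceOp.MIN, group=pg)
    finally:
        if shm is not None:
            shm.close()
            shm.unlink()
    return bool(ok.item())
```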

@youkaichao youkaichao requested review from esmeetu and removed request for esmeetu June 10, 2024 03:38
@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager? Or is that not compatible with torch.distributed?

Edit: Also, we might want to log the exceptions that are being raised instead of silently ignoring them.

@youkaichao
Member Author

> I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind to.

> we might want to log the exceptions that are being raised

The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

> Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind to.

Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

> we might want to log the exceptions that are being raised

> The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

Is there any way to only suppress that specific class of exceptions? Otherwise I guess the current way is fine.

@youkaichao
Member Author

> Is there any way to only suppress that specific class of exceptions?

Limited to OSError now.

@youkaichao
Member Author

> Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

Can you elaborate on this? I don't get it.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

> Can you elaborate on this? I don't get it.

From my understanding, you said that we need to use the address parameter of SharedMemoryManager; otherwise, the memory cannot be shared over the network, which is required in the multi-node case. However, there is no mention of this in the docs for SharedMemory. So I think we can't use SharedMemory in the multi-node case; the reason your test still succeeded might be that you caught the exception raised due to this, which has nothing to do with torch.distributed.

If we are relying on this behaviour for the test, then we can also just omit the address parameter for SharedMemoryManager.
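
For context, binding the manager to a network address would look roughly like this (hypothetical host, port, and authkey values; this is the port-selection burden mentioned earlier in the thread):

```python
from multiprocessing.managers import SharedMemoryManager

# hypothetical address and authkey, just to show the required setup
smm = SharedMemoryManager(address=("0.0.0.0", 50000), authkey=b"secret")
smm.start()
try:
    shm = smm.SharedMemory(size=128)  # segment lives on the manager's node
finally:
    smm.shutdown()
```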

@youkaichao
Member Author

If I understand correctly, SharedMemoryManager starts a server process on the node it lives on, and whenever a connection comes in requesting shared memory, it creates a shared memory segment on that node and returns a wrapper/proxy of that memory. Given that, I don't think it can be used to test whether all processes live on the same node.

We can only test this by manually sharing the name of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.
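
For reference, plain SharedMemoryManager usage looks like this (a sketch illustrating the server model described above):

```python
from multiprocessing.managers import SharedMemoryManager

# the manager's server process runs on the node where it is started,
# so every segment it allocates lives on that node -- it cannot tell
# us where the requesting clients run
with SharedMemoryManager() as smm:
    shm = smm.SharedMemory(size=128)
    shm.buf[:5] = b"hello"
```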

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

> We can only test this by manually sharing the name of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.

Alright, I understand your intention more clearly now. In that case, it's probably not worth the effort to use SharedMemoryManager.

Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/finally to ensure that the cleanup work is done even when unexpected exceptions occur.

@youkaichao
Member Author

> Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/finally to ensure that the cleanup work is done even when unexpected exceptions occur.

Added.

@DarkLight1337
Member

> Could you suppress errors only for the relevant parts of the code so that the above becomes clear?

By this I mean that you should minimize the number of lines suppressed. (I assume that the error only occurs when constructing the SharedMemory object?)

> Added.

How about the code involving unlink?

@youkaichao
Member Author

> I assume that the error only occurs when constructing the SharedMemory object?

It is difficult to make that assumption; a Python exception might be raised for some unrelated reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.

> How about the code involving unlink?

The unlink part does not need a try/finally; it is already a cleanup step. If unlink fails, we don't need to unlink again.
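
For illustration, the attach attempt then looks roughly like this (hypothetical helper, not the PR's actual code):

```python
import contextlib
from multiprocessing import shared_memory


def can_attach(name: str) -> bool:
    # only the expected cross-node failure (OSError) is swallowed;
    # any other exception still propagates to the caller
    with contextlib.suppress(OSError):
        peer = shared_memory.SharedMemory(name=name)
        peer.close()
        return True
    return False
```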

@DarkLight1337
Member

> I assume that the error only occurs when constructing the SharedMemory object?

> It is difficult to make that assumption; a Python exception might be raised for some unrelated reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.

We can add an except clause to the try/finally block to handle those errors (e.g., by logging them).

> How about the code involving unlink?

> The unlink part does not need a try/finally; it is already a cleanup step. If unlink fails, we don't need to unlink again.

I see, that is fine then.

@youkaichao
Member Author

> We can add an except clause to the try/finally block to handle those errors (e.g., by logging them).

Added in 72961a5, please take a look.
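
(For readers following the thread: the structure under discussion presumably looks something like the sketch below; see the commit above for the actual change.)

```python
import logging
from multiprocessing import shared_memory

logger = logging.getLogger(__name__)

shm = None
try:
    shm = shared_memory.SharedMemory(create=True, size=128)
    # ... perform the same-node check ...
except OSError as e:
    # log the error instead of dropping it silently
    logger.warning("same-node detection hit an OSError: %s", e)
finally:
    # cleanup runs even when an unexpected exception occurs
    if shm is not None:
        shm.close()
        shm.unlink()
```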

@DarkLight1337
Member

Looks good, let's get this merged then.

@youkaichao youkaichao enabled auto-merge (squash) June 11, 2024 03:26
@simon-mo simon-mo disabled auto-merge June 11, 2024 17:53
@simon-mo simon-mo merged commit c4bd03c into vllm-project:main Jun 11, 2024
101 of 103 checks passed
@youkaichao youkaichao deleted the single_host branch June 11, 2024 18:00
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 12, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024