-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][Distributed] add same-node detection #5369
Conversation
I think the code may be simplified a bit using Edit: Also, we might want to log the exceptions that are being raised instead of silently ignoring them. |
do you have any idea on how to use it? i checked the doc, and it seems quite difficult to use. We have to select a port to bind.
the exception is about one process in another node cannot access the shared memory segment, which is how we test if processes are within the same node. we don't need to expose this expected "exception" to users. |
Hmm, now that you've mentioned it, I don't see how the low-level
Is there any way to only suppress that specific class of exceptions? Otherwise I guess the current way is fine. |
limited to oserror now. |
can you elaborate on this? I don't get it. |
From my understanding, you said that we need to use If we are relying on this behaviour for the test, then we can also just omit the |
If I understand correctly, We can only test this via manually sharing the filename of |
Alright, I understand your intention more clearly now. In that case, it's probably not worth the effort to use Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a |
added. |
By this I mean that you should minimize the number of lines suppressed. (I assume that the error only occurs when constructing the
How about the code involving |
it is difficult to assume. Python exception might appear from some random reason. I prefer to use
The |
We can add an
I see, that is fine then. |
Added in 72961a5, please take a look. |
Looks good, let's get this merged then. |
This PR adds a function to detect if all processes inside a process group lives in the same node. It should take over #4903 .
The idea is to test if all processes can access the same part of shared memory.
Our CI can only run in the same node now. I manually tested the correctness for multi-node. In the future, we should add this test for multi-node.