MPI comm_spawn is hanging on disconnect #4542
So I tried the latest Open MPI. Bottom line: [...] (note I pushed 3b4b3bb in order to fix [...]).
FWIW, I set a breakpoint in [...]. To me, that suggests something is not correctly refcount'ed. I will resume investigations tomorrow if needed.
My bad about the refcount thing, [...]
Sounds like it is the result of the new connect/disconnect code in PMIx. I'll take a look. Thanks for the detective work!
Yep, and I do not understand the new logic here. With [...]
Odd - I'm surprised that the pmix component would split the two like that. Let me look and see what is going on.
Well, with PMIx v3, the pmix component splits the disconnect into 6 invocations of [...], but in PMIx v2.1, [...]
Yeah, I know - I'm working on the problem now. The issue is that PMIx v3 assigns a new nspace to the collection, and we aren't tracking it since OMPI doesn't assign a new jobid to the combined procs. We just need to track things correctly.
First let me state that I haven't looked at the code yet, but I wanted to make sure we have a correct implementation of MPI_COMM_DISCONNECT, because it is a very special function. In most cases it behaves almost like MPI_COMM_FREE (plus it waits for the completion of all pending operations), except when it severs the last "connection" between sets of processes from different MPI_COMM_WORLDs. In that particular case it allows these processes to clean up all information about the other set of processes, or in OMPI terms to clean up the modex of the peers. Thus, MPI_COMM_DISCONNECT should not directly call pmix_server_disconnect, because there might be extra connections to the processes in the other MPI_COMM_WORLD. As an example, it is absolutely correct to have the following sequence:

MPI_COMM_SPAWN( ..., &scomm);
MPI_COMM_DUP(scomm, &ncomm);
MPI_COMM_DISCONNECT(&scomm);
/* what about ncomm ? */

Instead, if we are unable to compute the transitive relationship between connected processes, we should simply replace MPI_COMM_DISCONNECT by a function that waits for all pending communications and then calls MPI_COMM_FREE.
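To make that scenario concrete, here is a minimal sketch of the parent side, assuming a hypothetical child executable ./worker that performs the matching MPI_Comm_dup, MPI_Recv and MPI_Comm_disconnect calls on its parent intercommunicator. It only illustrates why ncomm must remain usable after scomm is disconnected; it is not taken from the OMPI test suite.

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm scomm, ncomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Spawn one worker ("./worker" is a placeholder child program). */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &scomm, MPI_ERRCODES_IGNORE);

    /* Duplicate the intercommunicator: ncomm now also "connects" us
     * to the spawned world.  (The worker calls the matching dup.) */
    MPI_Comm_dup(scomm, &ncomm);

    /* Disconnecting scomm must NOT drop the peer (modex) information,
     * because ncomm still references the same remote processes. */
    MPI_Comm_disconnect(&scomm);

    /* Communication over ncomm must still be legal here. */
    if (0 == rank) {
        int token = 42;
        MPI_Send(&token, 1, MPI_INT, 0, 0, ncomm);
    }

    MPI_Comm_disconnect(&ncomm);
    MPI_Finalize();
    return 0;
}
```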
I believe we have always relied on the code in the MPI layer (specifically, ompi_mpi_disconnect in ompi/dpm/dpm.c) to provide that level of protection. What the pmix code does is simply communicate the disconnect command to the procs passed down to it.
The ompi_mpi_disconnect does little to ensure anything MPI related. It simply does either a fence or informs PMIx that we are disconnecting. We are not in finalize but in the middle of a running MPI application, in a blocking call in the MPI library, so for all practical purposes a simple barrier would have been enough. Moreover, MPI_COMM_DISCONNECT is supposed to wait for all pending operations to complete, and we are failing that part as well. In other terms, our current implementation of MPI_COMM_DISCONNECT is unsatisfactory from the MPI standard perspective (and thus broken). I propose we replace it with MPI_COMM_FREE.
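A minimal sketch of that proposal, expressed at the MPI level with a hypothetical helper: complete whatever operations are still pending, synchronize so the call stays collective, then release the communicator exactly as MPI_COMM_FREE would. Passing the pending requests in explicitly is an assumption made for illustration; a real implementation would track them internally.

```c
#include <mpi.h>

/* Hypothetical illustration only: "disconnect" reduced to completing
 * pending traffic plus a free. */
static int disconnect_as_free(MPI_Comm *comm, MPI_Request reqs[], int nreqs)
{
    /* 1. Wait for all communication still in flight on this communicator. */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

    /* 2. A barrier is enough to make the call collective:
     *    nobody leaves before everyone is done. */
    MPI_Barrier(*comm);

    /* 3. Release the communicator, exactly as MPI_Comm_free would. */
    return MPI_Comm_free(comm);
}
```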
I defer to you as I honestly don't know. Can someone please resolve that problem? I'm just trying to ensure that opal_pmix.disconnect itself operates correctly - and it isn't quite correct right now.
I understand, but I still wonder what the meaning of opal_pmix.disconnect is. I see you are building a list of all processes in the communicator (from both the local and remote groups) and calling opal_pmix.disconnect on it, but I can't figure out what the expected outcome is. Btw, construct_peers is of quadratic complexity, as before adding a peer it searches for that peer's existence in the peer list. This is unnecessary, as the peers are coming from the groups associated with the same MPI communicator, and those are guaranteed to be non-overlapping.
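On the construct_peers point, a sketch of the linear-time construction under the stated assumption that the local and remote groups of one communicator never overlap. The group_size(), group_member_name() and append_peer() helpers (and their types) are hypothetical stand-ins for the actual group-lookup and list-append code in ompi/dpm.

```c
/* Hypothetical placeholders for the real group and peer-list types/helpers. */
typedef struct my_group  my_group_t;   /* stands in for the communicator group   */
typedef struct peer_list peer_list_t;  /* stands in for the list handed to PMIx  */

int   group_size(const my_group_t *grp);
void *group_member_name(const my_group_t *grp, int idx);
void  append_peer(peer_list_t *peers, void *proc_name);

/* Walk the local group, then the remote group, appending each member once.
 * No membership search is needed, so construction is O(n) instead of O(n^2). */
static void construct_peers_linear(const my_group_t *local,
                                   const my_group_t *remote,
                                   peer_list_t *peers)
{
    for (int i = 0; i < group_size(local); i++) {
        append_peer(peers, group_member_name(local, i));
    }
    for (int i = 0; NULL != remote && i < group_size(remote); i++) {
        /* The remote group cannot contain local members, so no lookup here. */
        append_peer(peers, group_member_name(remote, i));
    }
}
```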
Okay, let me walk you through PMIx_Connect/Disconnect. When a client calls connect, it passes the array of all participants (wildcards etc. are accepted). This gets passed to the PMIx server, which caches it until all local participants have called connect (so it is a collective operation). When that point is reached, the server passes the request to the host RM. The host RM is responsible for ensuring that all participants (across all nodes) have called "connect" before proceeding. It then assigns a new "nspace" to this collection, and a new rank to each of the participants based on their position in the specified array. This allows the host RM to (if it so chooses) track the collection as a unique entity, and is designed to support the future MPI sessions work. This information is then passed back to the local PMIx server, which conveys it to each participating client.

On disconnect, the client code simply passes the disconnect request to the server for the provided nspace - this indicates that this client wishes to "disconnect" from that nspace. This is indeed a collective operation, and so the local PMIx server once again caches the requests until all local participants have called "disconnect" from that nspace. It then passes the request to the host RM. The host RM enforces the collective operation across all the participants, and then tells the local PMIx server that the operation is complete. The PMIx server then notifies each local participant, and the client library returns from the disconnect call.

So the bottom line for OMPI is that we mainly use these calls to provide an effective "fence" (as we don't try to track them as a new jobid), and we realistically could replace the call in ompi/dpm with just a fence if that is what we want to do. However, that would potentially cause issues for RMs that do want to separately track the collection, and it would likely cause problems for MPI sessions (where we will need to update ORTE to track these collections), so I'm leery of doing so. I'm happy if someone optimizes things like construct_peers 😄
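For reference, a client-side sketch of that connect/disconnect sequence using the public PMIx API; the nspace strings and the use of wildcard ranks are illustrative assumptions, and error handling is omitted.

```c
#include <pmix.h>

int main(void)
{
    pmix_proc_t me;
    pmix_proc_t parties[2];

    PMIx_Init(&me, NULL, 0);

    /* Every participant names the full set of parties, here two whole
     * nspaces via wildcard ranks (the nspace strings are made up). */
    PMIX_PROC_LOAD(&parties[0], "parent-nspace", PMIX_RANK_WILDCARD);
    PMIX_PROC_LOAD(&parties[1], "child-nspace",  PMIX_RANK_WILDCARD);

    /* Collective: the local PMIx server caches the request until all local
     * participants have called connect, then hands it to the host RM. */
    PMIx_Connect(parties, 2, NULL, 0);

    /* ... application traffic ... */

    /* Also collective: completes only once every participant has called it. */
    PMIx_Disconnect(parties, 2, NULL, 0);

    PMIx_Finalize(NULL, 0);
    return 0;
}
```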
This PMIx mechanism only works if the processes that call disconnect have previously called connect together. In MPI things are more convoluted, as you can disconnect any communicator, not only the ones resulting from a spawn or a connect/accept. As an example, one can merge multiple spawned worlds into a single communicator and then call disconnect on it. This will lead to badness, as connect was never called on the total group of processes that belong to this communicator, so there is no common nspace. Let me make clear that I am not arguing about the good or the bad of the current PMIx capabilities for connect/disconnect. I just want to pinpoint that in OMPI we cannot use them for disconnecting a communicator, and that as a result our current implementation of MPI_COMM_DISCONNECT is incorrect.
I understand where you are coming from - no issues. I would only point out that merging multiple spawned "worlds" into a single communicator can and should involve calling PMIx_Connect as this ensures that all the participants have access to the job-level info of the other participants. This was part of the intended design. Otherwise, you are operating in a vacuum.
The main problem is that there are constructs in MPI where we don't know that we are manipulating previously unconnected processes. So we have addressed this problem by doing the modex exchange directly between MPI processes. As an example, for MPI_Intercomm_create each side of the intercomm prepares a packed version of the modex (using ompi_proc_pack into an opal_dss buffer) to send to the remote peers. Once the info from the remote group has been received, we call ompi_proc_unpack to unpack the new modex info and populate the local modex, and then we call pml_add_proc on the list of remote peers. If you want more info take a look at ompi_comm_get_rprocs in communicator/comm.c.
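A rough, hypothetical outline of that exchange, shown only to make the sequence of steps explicit; pack_local_procs(), exchange_with_remote_leader() and unpack_remote_procs() are placeholder names for ompi_proc_pack, the point-to-point traffic between the two sides, and ompi_proc_unpack, whose real signatures differ from this sketch.

```c
/* Hypothetical placeholder types and helpers; the real logic lives in
 * ompi_comm_get_rprocs (communicator/comm.c). */
typedef struct buffer buffer_t;  /* stands in for an opal_dss buffer */
typedef struct proc   proc_t;    /* stands in for ompi_proc_t        */

buffer_t *pack_local_procs(void);                        /* ~ ompi_proc_pack   */
buffer_t *exchange_with_remote_leader(buffer_t *mine);   /* point-to-point     */
int       unpack_remote_procs(buffer_t *theirs,          /* ~ ompi_proc_unpack */
                              proc_t ***rprocs, int *nrprocs);
int       add_procs_to_pml(proc_t **rprocs, int nrprocs); /* ~ pml_add_proc    */

/* Sequence used when connecting previously unconnected worlds: swap packed
 * proc info, rebuild the remote procs locally, then let the PML wire up
 * the newly discovered peers. */
static int exchange_modex_with_remote_group(void)
{
    buffer_t *mine   = pack_local_procs();
    buffer_t *theirs = exchange_with_remote_leader(mine);

    proc_t **rprocs;
    int      nrprocs;
    int rc = unpack_remote_procs(theirs, &rprocs, &nrprocs);
    if (0 != rc) {
        return rc;
    }
    return add_procs_to_pml(rprocs, nrprocs);
}
```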
So let me offer an alternative solution. MPI doesn't care how the underlying RTE tracks things, so why don't you just have intercomm create and friends call "connect"? We can make it smart enough to see that these procs are already "connected" and ignore the duplicate call, or we can "refcount" the call - the behavior is something we can, after all, control. You could then drop that manual modex code. Not saying it is necessary - just trying to point out that we could find a way to avoid the special case if we think it advantageous. Manually passing around modex info just seems like a scaling problem - e.g., we have gone to some length to ensure that job-level info is always available. Hate to see it passed around "under the covers" as that means we might later be unnecessarily duplicating effort.
One further thought: there might be additional benefit(s) to calling connect in such cases. When you create things like an inter-communicator, you are implicitly saying "treat these procs as a group". What "connect" does is assign a tracking ID to that collection so you can refer to it as a unique entity. For example, you can then generate an event and specify that only members of that group are to receive it, or you can issue a job control command for that group. It also allows the RM/RTE to "know" that this collection should be treated as a group for things like notifying members when one of their members unexpectedly dies, or terminates with non-zero status. Not pushing any of this - just saying that we may want to spend a little time thinking about how we take fuller advantage of pmix. All we've wired in so far has been based on simple replacement of what was already there - a fresh look might yield some interesting new perspectives.
Per the discussion with @bosilca, I simply replaced the call to opal_pmix.disconnect with a call to opal_pmix.fence for now. I'll fix the connect/disconnect code later as I think we might want to utilize it down the road. I'll add this to the developer's meeting agenda so we can kick this around some more. I've applied to come - if it isn't approved, we'll come up with an alternative way to continue these thoughts.
Seeing this on the current master (using orte/test/mpi/simple_spawn.c): [...]
I'm not sure when this started. @ggouaillardet Would you have a chance to take a peek? If it is PMIx related, please let me know and I'll dive into it.