
MPI comm_spawn is hanging on disconnect #4542

Closed
rhc54 opened this issue Nov 27, 2017 · 21 comments · Fixed by #4549

@rhc54
Contributor

rhc54 commented Nov 27, 2017

Seeing this on the current master (using orte/test/mpi/simple_spawn.c):

$ mpirun -H rhc001:24 -n 3 ./simple_spawn
[17760257:0 pid 176346] starting up on node rhc001!
[17760257:1 pid 176347] starting up on node rhc001!
[17760257:2 pid 176348] starting up on node rhc001!
0 completed MPI_Init
2 completed MPI_Init
Parent [pid 176348] about to spawn!
Parent [pid 176346] about to spawn!
1 completed MPI_Init
Parent [pid 176347] about to spawn!
[17760258:0 pid 176355] starting up on node rhc001!
[17760258:1 pid 176356] starting up on node rhc001!
[17760258:2 pid 176357] starting up on node rhc001!
Parent done with spawn
Parent sending message to child
Parent done with spawn
Parent done with spawn
2 completed MPI_Init
Hello from the child 2 of 3 on host rhc001 pid 176357
0 completed MPI_Init
Hello from the child 0 of 3 on host rhc001 pid 176355
1 completed MPI_Init
Hello from the child 1 of 3 on host rhc001 pid 176356
Parent disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
Child 0 received msg: 38
<hang forever>

I'm not sure when this started. @ggouaillardet Would you have a chance to take a peek? If it is PMIx related, please let me know and I'll dive into it.

@ggouaillardet
Contributor

git bisect points to c696e04 (move from PMIx v2 to v3).
I will try both the v2 and v3 external components with the latest master now.

@ggouaillardet
Contributor

So I tried the latest Open MPI master, with external PMIx from both the latest master and the v2.1 branch.

Bottom line: pmix/ext2x works fine, but pmix/ext3x hangs (most of the time).

(Note that I pushed 3b4b3bb in order to fix pmix/ext3x.)

@ggouaillardet
Contributor

FWIW,

I set a breakpoint in MPI_Comm_disconnect() in the 3 parents and let the children run.
When the parents end up executing MPI_Comm_disconnect(), mpirun executes pmix_server_disconnect, but at that time pmix_server_globals.nspaces contains only 2 namespaces (job 0, aka mpirun, and job 1, aka the parents), which means job 2 (aka the children) has already been removed.
So when the parents try to PMIx_Disconnect() from job 2, the PMIX_ERR_NOT_FOUND error is raised because there is no such namespace in pmix_server_globals.nspaces.

To me, that suggests something is not correctly refcounted.

I will resume the investigation tomorrow if needed.

@ggouaillardet
Contributor

My bad about the refcount thing: orte_state_base_check_all_complete() invokes PMIx_server_deregister_nspace().

@rhc54
Contributor Author

rhc54 commented Nov 28, 2017

Sounds like it is the result of the new connect/disconnect code in PMIx. I'll take a look.

Thanks for the detective work!

@ggouaillardet
Contributor

Yep, and I do not understand the new logic here.

With ext2x, a disconnect was a single collective operation involving the 6 MPI tasks (i.e., 3 parents and 3 children).
But with ext3x, a disconnect is now two collective operations: the first one involves the 3 parents, and the second one involves the 3 children.
And from what I can observe, the 3 children do not even wait for the parents before exiting.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

Odd - I'm surprised that the pmix component would split the two like that. Let me look and see what is going on.

@ggouaillardet
Contributor

Well, with PMIx v3, the pmix component splits the disconnect into 6 invocations of PMIx_Disconnect(), and then on the server side these requests are "re-assembled" with all the procs from the same namespace.
In this case, we end up with 2 namespaces of 6 tasks each.
from pmix_server_disconnect(), in PMIx v3

        if (NULL == (trk = new_tracker(&proc, 1, PMIX_DISCONNECTNB_CMD))) ...
        [...]
        trk->nlocal = nspace->nlocalprocs;

nspace->nlocalprocs is 6

but in PMIx v2.1, PMIx_Disconnect() is invoked only once with the 6 procs (coming from both namespaces)
from pmix_server_connect(), in PMIx v2.1

        if (NULL == (trk = new_tracker(procs, nprocs, type))) {

nprocs is received from the MPI app and is 6

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

yeah, i know - i'm working on the problem now. The issue is that PMIx v3 assigns a new nspace to the collection, and we aren't tracking it since OMPI doesn't assign a new jobid to the combined procs. We just need to track things correctly.

@bosilca
Member

bosilca commented Nov 29, 2017

First, let me state that I haven't looked at the code yet, but I wanted to make sure we have a correct implementation of MPI_COMM_DISCONNECT, because it is a very special function. In most cases it behaves almost like MPI_COMM_FREE (plus it waits for the completion of all pending operations), except when it severs the last "connection" between sets of processes from different MPI_COMM_WORLDs. In that particular case it allows these processes to clean up all information about the other set of processes, or, in OMPI terms, to clean the modex of the peers.

Thus, MPI_COMM_DISCONNECT should not directly call pmix_server_disconnect because there might be extra connections to the processes in the other MPI_COMM_WORLD. As an example it is absolutely correct to have the following sequence:

MPI_Comm_spawn(..., &scomm);
MPI_Comm_dup(scomm, &ncomm);
MPI_Comm_disconnect(&scomm);
/* what about ncomm ? */

Instead, if we are unable to compute the transitive relationship between connected processes, we should simply replace MPI_COMM_DISCONNECT with a function that waits for all pending communications and then calls MPI_COMM_FREE.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

I believe we have always relied on the code in the MPI layer (specifically, ompi_mpi_disconnect in ompi/dpm/dpm.c) to provide that level of protection. What the pmix code does is simply communicate the disconnect command to the procs passed down to it.

@bosilca
Member

bosilca commented Nov 29, 2017

ompi_mpi_disconnect does little to ensure anything MPI-related. It simply does either a fence or informs PMIx to disconnect us. We are not in finalize but in the middle of a running MPI application, in a blocking call inside the MPI library, so for all purposes a simple barrier would have been enough. Moreover, MPI_COMM_DISCONNECT is supposed to wait for all pending operations to complete; we are failing that part as well. In other terms, our current implementation of MPI_COMM_DISCONNECT is unsatisfactory from the MPI standard's perspective (and thus broken). I propose we replace it with MPI_COMM_FREE.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

I defer to you as I honestly don't know. Can someone please resolve that problem? I'm just trying to ensure that opal_pmix.disconnect itself operates correctly - and it isn't quite correct right now.

@bosilca
Member

bosilca commented Nov 29, 2017

I understand, but I still wonder what the meaning of opal_pmix.disconnect is. I see you are building a list of all processes in the communicator (from both the local and remote groups) and calling opal_pmix.disconnect on it, but I can't figure out what the expected outcome is.

Btw, construct_peers is of quadratic complexity, as before adding a peer it searches for the peer's existence in the peer list. This is unnecessary, as the peers are coming from the groups associated with the same MPI communicator, and those are guaranteed non-overlapping.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

Okay, let me walk you through PMIx_Connect/Disconnect. When a client calls connect, it passes the array of all participants (wildcards etc. are accepted). This gets passed to the PMIx server, which caches it until all local participants have called connect (so it is a collective operation). When that point is reached, the server passes the request to the host RM.

The host RM is responsible for ensuring that all participants (across all nodes) have called "connect" before proceeding. It then assigns a new "nspace" to this collection, and a new rank to each of the participants based on their position in the specified array. This allows the host RM to (if it so chooses) track the collection as a unique entity, and is designed to support the future MPI sessions work. This information is then passed back to the local PMIx server, which conveys it to each participating client.

On disconnect, the client code simply passes the disconnect request to the server for the provided nspace - this indicates that this client wishes to "disconnect" from that nspace. This is indeed a collective operation, and so the local PMIx server once again caches the requests until all local participants have called "disconnect" from that nspace. It then passes the request to the host RM.

The host RM enforces the collective operation across all the participants, and then tells the local PMIx server that the operation is complete. The PMIx server then notifies each local participant, and the client library returns from the disconnect call.

So bottom line for OMPI is that we mainly use these to provide an effective "fence" (as we don't try to track them as a new jobid), and we realistically could replace the call in ompi/dpm with just a fence if that is what we want to do. However, that would potentially cause issues for RMs that do want to separately track the collection, and it would likely cause problems for MPI sessions (where we will need to update ORTE to track these collections), so I'm leery of doing so.

I'm happy if someone optimizes things like construct_peers 😄
HTH

@bosilca
Member

bosilca commented Nov 29, 2017

This PMIx mechanism only works if the processes that call disconnect have previously called connect together. In MPI, things are more convoluted, as you can disconnect any communicator, not only the ones resulting from a spawn or a connect/accept. As an example, one can merge multiple spawned worlds into a single communicator and then call disconnect on it. This will lead to badness, as a connect on the total group of processes that belong to this communicator has never been called, so there is no common nspace.

Let me make clear that I am not arguing about the good or the bad of current PMIX capabilities for connect/disconnect. I just want to pinpoint that in OMPI we cannot use them for disconnecting a communicator, and that as a result our current implementation of MPI_COMM_DISCONNECT is incorrect.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

I understand where you are coming from - no issues. I would only point out that merging multiple spawned "worlds" into a single communicator can and should involve calling PMIx_Connect as this ensures that all the participants have access to the job-level info of the other participants. This was part of the intended design. Otherwise, you are operating in a vacuum.

@bosilca
Member

bosilca commented Nov 29, 2017

The main problem is that there are constructs in MPI where we don't know that we are manipulating previously unconnected processes. So we have addressed this problem by doing the modex exchange directly between MPI processes. As an example, for MPI_Intercomm_create each side of the intercomm prepares a packed version of the modex (using ompi_proc_pack into an opal_dss buffer) to send to the remote peers. Once the info from the remote group has been received, we call ompi_proc_unpack to unpack the new modex info and populate the local modex, and then we call pml_add_proc on the list of remote peers. If you want more info, take a look at ompi_comm_get_rprocs in communicator/comm.c.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

So let me offer an alternative solution. MPI doesn't care how the underlying RTE tracks things, so why don't you just have intercomm create and friends call "connect"? We can make it smart enough to see that these procs are already "connected" and ignore the duplicate call, or we can "refcount" the call - the behavior is something we can, after all, control. You could then drop that manual modex code.

Not saying it is necessary - just trying to point out that we could find a way to avoid the special case if we think it advantageous. Manually passing around modex info just seems like a scaling problem - e.g., we have gone to some length to ensure that job-level info is always available. Hate to see it passed around "under the covers" as that means we might later be unnecessarily duplicating effort.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

One further thought: there might be additional benefit(s) to calling connect in such cases. When you create things like an inter-communicator, you are implicitly saying "treat these procs as a group". What "connect" does is assign a tracking ID to that collection so you can refer to it as a unique entity. For example, you can then generate an event and specify that only members of that group are to receive it, or you can issue a job control command for that group.

It also allows the RM/RTE to "know" that this collection should be treated as a group for things like notifying members when one of their members unexpectedly dies, or terminates with non-zero status.

Not pushing any of this - just saying that we may want to spend a little time thinking about how we take fuller advantage of pmix. All we've wired in so far has been based on simple replacement of what was already there - a fresh look might yield some interesting new perspectives.

@rhc54
Contributor Author

rhc54 commented Nov 29, 2017

Per the discussion with @bosilca, I simply replaced the call to opal_pmix.disconnect with a call to opal_pmix.fence for now. I'll fix the connect/disconnect code later as I think we might want to utilize it down the road.

I'll add this to the developer's meeting agenda so we can kick this around some more. I've applied to come - if it isn't approved, we'll come up with an alternative way to continue these thoughts.
