can opal_progress() call opal_progress()? (test hanging in MPI_COMM_DUP) #2025
It is difficult to see the problem from this stack trace alone. Moreover, recursively calling opal_progress is not forbidden, especially in this particular context where we need to progress non-blocking non-PML requests.
Yeah, this was written before I knew about the callback. I will update this soon.
@hjelmn You mentioned that you were going to update the request callback function. Have you had a chance to do so?
I am seeing an issue in v3.0.x, v3.1.x, and master with a stack trace resembling the one in this issue; however, this one involves multi-threaded support. When performing MPI_Comm_dup in a loop with threads, the problem is almost instantly reproducible. The stack trace looks as follows:
Rank 0:
Rank 1:
In opal/threads/wait_sync.c, it looks like we are stuck in the loop waiting. A gist of the test case can be found at: https://gist.github.com/sam6258/8233342cee7c0acc93d2c9ffbf18ba35
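For reference, a test of that shape (several threads repeatedly duplicating MPI_COMM_WORLD) might look like the minimal sketch below. This is an illustrative reconstruction, not the actual gist; NUM_THREADS and NUM_ITERS are made-up values.
/* Minimal sketch (not the actual gist): each thread repeatedly duplicates
 * MPI_COMM_WORLD. With MPI_THREAD_MULTIPLE this is the pattern reported to
 * hang almost instantly. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_ITERS   100

static void *dup_loop(void *arg) {
    for (int i = 0; i < NUM_ITERS; i++) {
        MPI_Comm dup;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);   /* collective on COMM_WORLD */
        MPI_Comm_free(&dup);
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, dup_loop, NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    MPI_Finalize();
    return 0;
}
As discussed further down in the thread, whether this usage is even legal hinges on the threads agreeing on a global posting order of the collectives hidden inside MPI_Comm_dup.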
Recap:
It looks like we're using the single-threaded/fast one everywhere now. That may be a mistake. We may need to go back to:
@jsquyres As far as I understand, we now only use the thread-safe, iterative, and slow algorithm.
It is possible that these two issues are not related. The original issue has not been updated for months (years), so I will assume it has been fixed. Let me scale back this example to try to explain what I think is the problem. Duplicating a communicator involves two allreduces: one with MPI_MAX (to find a suitable cid) and one with MPI_MIN (to agree on it). These reductions are done sequentially on the same communicator. In the single-threaded case everything goes smoothly: the global order of these two allreduces is ALWAYS respected. In multi-threaded cases, starting multiple duplications at the same time will basically post a tuple of (max, min) allreduces sequentially per dup, on the same communicator on each process. MPI requires that collectives are posted in the same order globally, but in a distributed system we have no way to enforce a global order between the tuples of allreduces from the dup operations. The current algorithm works just fine if the global order is respected, but fails when this is not the case, which is exactly the reported behavior: random deadlocks. For a toy example that tries to mimic the behavior of multiple duplications, replace the runfunc from the previous gist with the following:
void *runfunc(void * args) {
MPI_Request req;
MPI_Status status;
int myrank, mysize;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &mysize);
int old_value = myrank, new_value;
int *sbuf = &old_value, *rbuf = &new_value;
MPI_Iallreduce( sbuf, rbuf, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD, &req );
MPI_Wait(&req, &status);
assert( new_value == (mysize - 1) );
MPI_Iallreduce( sbuf, rbuf, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD, &req );
MPI_Wait(&req, &status);
assert( new_value == 0 );
pthread_exit(NULL);
}
So the problem with our cid selection in multi-threaded cases is that we mix the reduces in a totally nondeterministic way. I can see a few solutions, but none of them is trivial.
I think @ggouaillardet is correct. I will have to re-evaluate the test case from which we derived this small test case to verify that it is not violating the standard. I added the necessary synchronization in runfunc to ensure two threads were not calling the same collective on COMM_WORLD at the same time, and the test case no longer hangs. Thanks Gilles! And thanks to everyone for looking into this.
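For illustration, the synchronization described here might look like the following sketch (a hypothetical rework of the toy runfunc above, not the actual test code): a process-wide pthread mutex keeps each thread's (MAX, MIN) allreduce pair from interleaving with another thread's pair.
/* Sketch only (assumed, not the actual modified test): serialize each
 * thread's (MAX, MIN) allreduce tuple so every rank posts the collectives
 * on MPI_COMM_WORLD in the same order. */
#include <assert.h>
#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t coll_lock = PTHREAD_MUTEX_INITIALIZER;

void *runfunc(void *args) {
    MPI_Request req;
    int myrank, mysize;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &mysize);
    int old_value = myrank, new_value;

    pthread_mutex_lock(&coll_lock);   /* hold the lock across the whole tuple */
    MPI_Iallreduce(&old_value, &new_value, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    assert(new_value == (mysize - 1));
    MPI_Iallreduce(&old_value, &new_value, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    assert(new_value == 0);
    pthread_mutex_unlock(&coll_lock);

    pthread_exit(NULL);
}
Note that a per-process lock is only sufficient here because every thread posts the exact same (MAX, MIN) tuple; in general, MPI requires the application to guarantee one globally consistent posting order of collectives on a communicator.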
Sounds like we can close this issue, then...?
I will verify with our tester tomorrow and close this if I see a violation of the standard in our original test case. Thanks!
This particular example is indeed incorrect, but replace the blocking dup with a non-blocking dup and we are back to a valid yet non-functional example.
@bosilca In order to avoid any confusion/misunderstanding, would you mind posting such a valid, non-functional example?
What we need to emphasize is the non-determinism between the internal steps of the operation itself. We can achieve this by changing the runfunc in the gist example to the following:
/* Thread function */
void *runfunc(void * args) {
MPI_Comm comms[2];
MPI_Request reqs[1] = {MPI_REQUEST_NULL};
MPI_Comm_idup( MPI_COMM_WORLD, &comms[0], &reqs[0] );
MPI_Waitall(1, reqs, MPI_STATUSES_IGNORE);
MPI_Comm_free(&comms[0]);
pthread_exit(NULL);
}
The question I couldn't find an answer to in the MPI standard is whether we are allowed to post two idups on the same communicator at the same time.
Well, I ran into Bill Gropp at SC'16 and asked him a related question. He replied that communicator management operations are considered collective operations. So my view is that if
I think Bill was referring to Section 5.12 of the MPI standard. While this section clarifies some corner cases, I do not see anything in there that forbids the scenario we are talking about. In fact, here are the important bits from Section 5.12:
According to this, it should be legal to have multiple outstanding idups on the same communicator, because from the MPI standard's viewpoint an idup is considered a single operation. Unfortunately, from the OMPI perspective this is not the case.
I am still a bit puzzled, but anyway... Would updating
I am puzzled as well. I would be in favor of asking the Forum, either by sending an email or by volunteering @jsquyres (who is currently attending the MPI Forum) to ask around what the Forum thinks about having multiple outstanding idups on the same communicator.
Fun fact: I'm not actually at the Forum this week 😲 |
Well, regardless of whether this is valid or not, the following snippet (no need for multiple threads) might lead to incorrect results: it does not hang, but a given communicator might end up with several CIDs. The issue can be evidenced after a few runs on 4 MPI tasks.
MPI_Comm_idup(MPI_COMM_WORLD, comm, req);
MPI_Comm_idup(MPI_COMM_WORLD, comm+1, req+1);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
As I suggested earlier, progressing the second request only after the first request is complete (since they both operate on the same communicator) would address this. Here are attached two examples and a proof of concept.
With blocking MPI_Comm_dup:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
MPI_Comm comm[3];
int rank, size;
int mycids[2];
int cids[8];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (4 < size) {
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (0 == rank) {
MPI_Comm_dup(MPI_COMM_SELF, comm);
MPI_Comm_dup(MPI_COMM_SELF, comm+2);
MPI_Comm_free(comm);
} else {
MPI_Comm_dup(MPI_COMM_SELF, comm+2);
}
MPI_Comm_dup(MPI_COMM_WORLD, comm);
MPI_Comm_dup(MPI_COMM_WORLD, comm+1);
/* peek at each communicator's context id through a hard-coded structure
 * offset; fragile, for demonstration only */
mycids[0] = ((int *)comm[0])[0x138/4];
mycids[1] = ((int *)comm[1])[0x138/4];
MPI_Gather(mycids, 2, MPI_INT, cids, 2, MPI_INT, 0, MPI_COMM_WORLD);
if (0 == rank) {
int i;
for (i=1; i<size; i++) {
if (cids[0] != cids[2*i]) {
fprintf(stderr, "mismatch for comm 0 on rank %d, got %d but has %d\n", i, cids[0], cids[2*i]);
MPI_Abort(MPI_COMM_WORLD, 2);
}
if (cids[1] != cids[2*i+1]) {
fprintf(stderr, "mismatch for comm 1 on rank %d, got %d but has %d\n", i, cids[1], cids[2*i+1]);
MPI_Abort(MPI_COMM_WORLD, 2);
}
}
printf ("OK\n");
}
MPI_Finalize();
return 0;
}
With non-blocking MPI_Comm_idup:
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
MPI_Comm comm[3];
MPI_Request req[2];
int rank, size;
int mycids[2];
int cids[8];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (4 < size) {
MPI_Abort(MPI_COMM_WORLD, 1);
}
if (0 == rank) {
MPI_Comm_dup(MPI_COMM_SELF, comm);
MPI_Comm_dup(MPI_COMM_SELF, comm+2);
MPI_Comm_free(comm);
} else {
MPI_Comm_dup(MPI_COMM_SELF, comm+2);
}
MPI_Comm_idup(MPI_COMM_WORLD, comm, req);
MPI_Comm_idup(MPI_COMM_WORLD, comm+1, req+1);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
/* same context-id peek as in the previous example */
mycids[0] = ((int *)comm[0])[0x138/4];
mycids[1] = ((int *)comm[1])[0x138/4];
MPI_Gather(mycids, 2, MPI_INT, cids, 2, MPI_INT, 0, MPI_COMM_WORLD);
if (0 == rank) {
int i;
for (i=1; i<size; i++) {
if (cids[0] != cids[2*i]) {
fprintf(stderr, "mismatch for comm 0 on rank %d, got %d but has %d\n", i, cids[0], cids[2*i]);
MPI_Abort(MPI_COMM_WORLD, 2);
}
if (cids[1] != cids[2*i+1]) {
fprintf(stderr, "mismatch for comm 1 on rank %d, got %d but has %d\n", i, cids[1], cids[2*i+1]);
MPI_Abort(MPI_COMM_WORLD, 2);
}
}
printf ("OK\n");
}
MPI_Finalize();
return 0;
}
Here is a proof of concept that seems to fix the issue:
diff --git a/ompi/communicator/comm.c b/ompi/communicator/comm.c
index 228abae..3a7bece 100644
--- a/ompi/communicator/comm.c
+++ b/ompi/communicator/comm.c
@@ -1071,6 +1071,7 @@ static int ompi_comm_idup_internal (ompi_communicator_t *comm, ompi_group_t *gro
if (NULL == request) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
+ request->contextid = comm->c_contextid;
context = OBJ_NEW(ompi_comm_idup_with_info_context_t);
if (NULL == context) {
diff --git a/ompi/communicator/comm_request.c b/ompi/communicator/comm_request.c
index 272fc33..6dbdd39 100644
--- a/ompi/communicator/comm_request.c
+++ b/ompi/communicator/comm_request.c
@@ -22,6 +22,7 @@
static opal_free_list_t ompi_comm_requests;
static opal_list_t ompi_comm_requests_active;
+static opal_list_t ompi_comm_requests_queue;
static opal_mutex_t ompi_comm_request_mutex;
bool ompi_comm_request_progress_active = false;
bool ompi_comm_request_initialized = false;
@@ -44,6 +45,7 @@ void ompi_comm_request_init (void)
NULL, 0, NULL, NULL, NULL);
OBJ_CONSTRUCT(&ompi_comm_requests_active, opal_list_t);
+ OBJ_CONSTRUCT(&ompi_comm_requests_queue, opal_list_t);
ompi_comm_request_progress_active = false;
OBJ_CONSTRUCT(&ompi_comm_request_mutex, opal_mutex_t);
ompi_comm_request_initialized = true;
@@ -64,6 +66,7 @@ void ompi_comm_request_fini (void)
opal_mutex_unlock (&ompi_comm_request_mutex);
OBJ_DESTRUCT(&ompi_comm_request_mutex);
OBJ_DESTRUCT(&ompi_comm_requests_active);
+ OBJ_DESTRUCT(&ompi_comm_requests_queue);
OBJ_DESTRUCT(&ompi_comm_requests);
}
@@ -141,9 +144,17 @@ static int ompi_comm_request_progress (void)
/* if the request schedule is empty then the request is complete */
if (0 == opal_list_get_size (&request->schedule)) {
+ ompi_comm_request_t *req;
opal_list_remove_item (&ompi_comm_requests_active, (opal_list_item_t *) request);
request->super.req_status.MPI_ERROR = (OMPI_SUCCESS == rc) ? MPI_SUCCESS : MPI_ERR_INTERN;
ompi_request_complete (&request->super, true);
+ OPAL_LIST_FOREACH(req, &ompi_comm_requests_queue, ompi_comm_request_t) {
+ if (request->contextid == req->contextid) {
+ opal_list_remove_item(&ompi_comm_requests_queue, (opal_list_item_t *)req);
+ opal_list_append(&ompi_comm_requests_active, (opal_list_item_t *)req);
+ break;
+ }
+ }
}
}
@@ -161,8 +172,21 @@ static int ompi_comm_request_progress (void)
void ompi_comm_request_start (ompi_comm_request_t *request)
{
+ ompi_comm_request_t *req;
+ bool queued = false;
opal_mutex_lock (&ompi_comm_request_mutex);
- opal_list_append (&ompi_comm_requests_active, (opal_list_item_t *) request);
+ if (MPI_UNDEFINED != request->contextid) {
+ OPAL_LIST_FOREACH(req, &ompi_comm_requests_active, ompi_comm_request_t) {
+ if (request->contextid == req->contextid) {
+ opal_list_append(&ompi_comm_requests_queue, (opal_list_item_t *)request);
+ queued = true;
+ break;
+ }
+ }
+ }
+ if (!queued) {
+ opal_list_append (&ompi_comm_requests_active, (opal_list_item_t *) request);
+ }
/* check if we need to start the communicator request progress function */
if (!ompi_comm_request_progress_active) {
@@ -230,6 +254,8 @@ static void ompi_comm_request_construct (ompi_comm_request_t *request)
request->super.req_cancel = ompi_comm_request_cancel;
OBJ_CONSTRUCT(&request->schedule, opal_list_t);
+
+ request->contextid = MPI_UNDEFINED;
}
static void ompi_comm_request_destruct (ompi_comm_request_t *request)
diff --git a/ompi/communicator/comm_request.h b/ompi/communicator/comm_request.h
index 43082d6..cbc680e 100644
--- a/ompi/communicator/comm_request.h
+++ b/ompi/communicator/comm_request.h
@@ -23,6 +23,7 @@ typedef struct ompi_comm_request_t {
opal_object_t *context;
opal_list_t schedule;
+ uint32_t contextid;
} ompi_comm_request_t;
OBJ_CLASS_DECLARATION(ompi_comm_request_t);
Nice, I thought about this case but I couldn't make it break. I see your trick: forcing a mismatch on the cid by creating one additional communicator on the root. Your solution basically serializes all dups on the same communicator. Sounds reasonable.
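For readers skimming the diff, a simplified stand-alone sketch of that serialization idea might look like the following. The type and function names are invented for illustration and are not Open MPI's (the real patch uses opal_list_t, ompi_comm_request_t, and a mutex around the lists).
/* Illustrative sketch only (invented names, not Open MPI code): at most one
 * communicator request per parent context id is active; later requests on the
 * same parent sit in a pending queue until the active one completes. */
#include <stdbool.h>
#include <stddef.h>

struct comm_req {
    unsigned int contextid;             /* context id of the parent communicator */
    struct comm_req *next;
};

static struct comm_req *active_head;    /* requests currently being progressed */
static struct comm_req *pending_head;   /* requests waiting for their turn */

static bool context_active(unsigned int cid) {
    for (struct comm_req *r = active_head; r; r = r->next)
        if (r->contextid == cid) return true;
    return false;
}

static void push(struct comm_req **head, struct comm_req *req) {
    req->next = *head;
    *head = req;
}

static void list_remove(struct comm_req **head, struct comm_req *req) {
    for (struct comm_req **pp = head; *pp; pp = &(*pp)->next)
        if (*pp == req) { *pp = req->next; req->next = NULL; return; }
}

/* start: defer the request if another request on the same parent is in flight */
void request_start(struct comm_req *req) {
    if (context_active(req->contextid))
        push(&pending_head, req);
    else
        push(&active_head, req);
}

/* completion: retire the request and promote one pending request, if any,
 * that targets the same parent communicator */
void request_complete(struct comm_req *done) {
    list_remove(&active_head, done);
    for (struct comm_req *r = pending_head; r; r = r->next) {
        if (r->contextid == done->contextid) {
            list_remove(&pending_head, r);
            push(&active_head, r);
            break;
        }
    }
}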
So far, I haven't found anything in our large test case that violates the standard. I will try applying the code @ggouaillardet posted above to see if that makes a difference in our situation.
I would like to bring this issue up again, since there has been no activity for quite some time.
Does that mean that threads issue collective operations (
No, of course they have to obey a strict order for all collectives. I have a multi-task runtime that has to duplicate MPI communicators on demand to allow collective patterns for a non-deterministic task execution. The application is closed source, so I tried to break it down to a minimal reproducer; however, it is only single-threaded for now:
#include <mpi.h>
int main(int argc, char *argv[]) {
MPI_Request request[3];
MPI_Comm newComm[2];
int data = 1;
int flag;
MPI_Init(&argc, &argv);
MPI_Comm_idup(MPI_COMM_WORLD, &newComm[0], &request[0]);
MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &request[1]);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Test(&request[0], &flag, MPI_STATUS_IGNORE);
MPI_Wait(&request[1], MPI_STATUS_IGNORE);
MPI_Comm_idup(MPI_COMM_WORLD, &newComm[1], &request[2]);
MPI_Wait(&request[2], MPI_STATUS_IGNORE);
MPI_Wait(&request[0], MPI_STATUS_IGNORE);
MPI_Finalize();
return 0;
}
We actually do not use
I think we are going back to the discussion we had last year: because MPI_Comm_idup is implemented using multiple non-blocking collectives, we cannot allow other non-blocking collectives to be interposed between an idup and its corresponding completion. At least not without @ggouaillardet's patch above.
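As a workaround without the patch, the earlier single-threaded reproducer could be restructured so that each MPI_Comm_idup is completed before any other non-blocking collective is posted on the same communicator. A sketch of that ordering (which gives up the overlap the original code was after, and assumes the application can tolerate it) might be:
/* Sketch of the conservative ordering discussed above: finish the idup on
 * MPI_COMM_WORLD before posting any other non-blocking collective on it. */
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Request request;
    MPI_Comm newComm[2];
    int data = 1;

    MPI_Init(&argc, &argv);

    MPI_Comm_idup(MPI_COMM_WORLD, &newComm[0], &request);
    MPI_Wait(&request, MPI_STATUS_IGNORE);   /* complete before the bcast */

    MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    MPI_Comm_idup(MPI_COMM_WORLD, &newComm[1], &request);
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    MPI_Comm_free(&newComm[0]);
    MPI_Comm_free(&newComm[1]);
    MPI_Finalize();
    return 0;
}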
As I already mentioned, the patch by @ggouaillardet did not help in the case of my application.
@jeorsch Here is an updated version of my patch. Long story short, only one
diff --git a/ompi/communicator/comm.c b/ompi/communicator/comm.c
index 50b19ee..ce76610 100644
--- a/ompi/communicator/comm.c
+++ b/ompi/communicator/comm.c
@@ -1074,6 +1074,7 @@ static int ompi_comm_idup_internal (ompi_communicator_t *comm, ompi_group_t *gro
if (NULL == request) {
return OMPI_ERR_OUT_OF_RESOURCE;
}
+ request->contextid = comm->c_contextid;
context = OBJ_NEW(ompi_comm_idup_with_info_context_t);
if (NULL == context) {
diff --git a/ompi/communicator/comm_cid.c b/ompi/communicator/comm_cid.c
index 3833d3a..f2cdea5 100644
--- a/ompi/communicator/comm_cid.c
+++ b/ompi/communicator/comm_cid.c
@@ -645,7 +645,7 @@ static int ompi_comm_allreduce_intra_nb (int *inbuf, int *outbuf, int count, str
ompi_communicator_t *comm = context->comm;
return comm->c_coll->coll_iallreduce (inbuf, outbuf, count, MPI_INT, op, comm,
- req, comm->c_coll->coll_iallreduce_module);
+ req, (struct mca_coll_base_module_2_3_0_t *)((ptrdiff_t)comm->c_coll->coll_iallreduce_module | 0x1));
}
/* Non-blocking version of ompi_comm_allreduce_inter */
diff --git a/ompi/communicator/comm_request.c b/ompi/communicator/comm_request.c
index 4e3288e..97b2c49 100644
--- a/ompi/communicator/comm_request.c
+++ b/ompi/communicator/comm_request.c
@@ -22,6 +22,7 @@
static opal_free_list_t ompi_comm_requests;
static opal_list_t ompi_comm_requests_active;
+static opal_list_t ompi_comm_requests_queue;
static opal_mutex_t ompi_comm_request_mutex;
bool ompi_comm_request_progress_active = false;
bool ompi_comm_request_initialized = false;
@@ -44,6 +45,7 @@ void ompi_comm_request_init (void)
NULL, 0, NULL, NULL, NULL);
OBJ_CONSTRUCT(&ompi_comm_requests_active, opal_list_t);
+ OBJ_CONSTRUCT(&ompi_comm_requests_queue, opal_list_t);
ompi_comm_request_progress_active = false;
OBJ_CONSTRUCT(&ompi_comm_request_mutex, opal_mutex_t);
ompi_comm_request_initialized = true;
@@ -64,6 +66,7 @@ void ompi_comm_request_fini (void)
opal_mutex_unlock (&ompi_comm_request_mutex);
OBJ_DESTRUCT(&ompi_comm_request_mutex);
OBJ_DESTRUCT(&ompi_comm_requests_active);
+ OBJ_DESTRUCT(&ompi_comm_requests_queue);
OBJ_DESTRUCT(&ompi_comm_requests);
}
@@ -148,9 +151,17 @@ static int ompi_comm_request_progress (void)
/* if the request schedule is empty then the request is complete */
if (0 == opal_list_get_size (&request->schedule)) {
+ ompi_comm_request_t *req, *nreq;
opal_list_remove_item (&ompi_comm_requests_active, (opal_list_item_t *) request);
request->super.req_status.MPI_ERROR = (OMPI_SUCCESS == rc) ? MPI_SUCCESS : rc;
ompi_request_complete (&request->super, true);
+ OPAL_LIST_FOREACH_SAFE(req, nreq, &ompi_comm_requests_queue, ompi_comm_request_t) {
+ if (request->contextid == req->contextid) {
+ opal_list_remove_item(&ompi_comm_requests_queue, (opal_list_item_t *)req);
+ opal_list_append(&ompi_comm_requests_active, (opal_list_item_t *)req);
+ break;
+ }
+ }
}
}
@@ -168,8 +179,21 @@ static int ompi_comm_request_progress (void)
void ompi_comm_request_start (ompi_comm_request_t *request)
{
+ ompi_comm_request_t *req;
+ bool queued = false;
opal_mutex_lock (&ompi_comm_request_mutex);
- opal_list_append (&ompi_comm_requests_active, (opal_list_item_t *) request);
+ if (MPI_UNDEFINED != request->contextid) {
+ OPAL_LIST_FOREACH(req, &ompi_comm_requests_active, ompi_comm_request_t) {
+ if (request->contextid == req->contextid) {
+ opal_list_append(&ompi_comm_requests_queue, (opal_list_item_t *)request);
+ queued = true;
+ break;
+ }
+ }
+ }
+ if (!queued) {
+ opal_list_append (&ompi_comm_requests_active, (opal_list_item_t *) request);
+ }
/* check if we need to start the communicator request progress function */
if (!ompi_comm_request_progress_active) {
@@ -238,6 +262,8 @@ static void ompi_comm_request_construct (ompi_comm_request_t *request)
request->super.req_cancel = ompi_comm_request_cancel;
OBJ_CONSTRUCT(&request->schedule, opal_list_t);
+
+ request->contextid = MPI_UNDEFINED;
}
static void ompi_comm_request_destruct (ompi_comm_request_t *request)
diff --git a/ompi/communicator/comm_request.h b/ompi/communicator/comm_request.h
index 43082d6..cbc680e 100644
--- a/ompi/communicator/comm_request.h
+++ b/ompi/communicator/comm_request.h
@@ -23,6 +23,7 @@ typedef struct ompi_comm_request_t {
opal_object_t *context;
opal_list_t schedule;
+ uint32_t contextid;
} ompi_comm_request_t;
OBJ_CLASS_DECLARATION(ompi_comm_request_t);
diff --git a/ompi/mca/coll/base/coll_tags.h b/ompi/mca/coll/base/coll_tags.h
index f40f029..ec1eb1a 100644
--- a/ompi/mca/coll/base/coll_tags.h
+++ b/ompi/mca/coll/base/coll_tags.h
@@ -42,7 +42,8 @@
#define MCA_COLL_BASE_TAG_SCATTER -25
#define MCA_COLL_BASE_TAG_SCATTERV -26
#define MCA_COLL_BASE_TAG_NONBLOCKING_BASE -27
-#define MCA_COLL_BASE_TAG_NONBLOCKING_END ((-1 * INT_MAX/2) + 1)
+#define MCA_COLL_BASE_TAG_NONBLOCKING_END ((-1 * INT_MAX/2) + 2)
+#define MCA_COLL_BASE_TAG_NONBLOCKING_DUP ((-1 * INT_MAX/2) + 1)
#define MCA_COLL_BASE_TAG_HCOLL_BASE (-1 * INT_MAX/2)
#define MCA_COLL_BASE_TAG_HCOLL_END (-1 * INT_MAX)
#endif /* MCA_COLL_BASE_TAGS_H */
diff --git a/ompi/mca/coll/libnbc/nbc.c b/ompi/mca/coll/libnbc/nbc.c
index 171f5a3..60c8f9e 100644
--- a/ompi/mca/coll/libnbc/nbc.c
+++ b/ompi/mca/coll/libnbc/nbc.c
@@ -683,15 +683,19 @@ int NBC_Schedule_request(NBC_Schedule *schedule, ompi_communicator_t *comm,
return OMPI_ERR_OUT_OF_RESOURCE;
}
- /* update the module->tag here because other processes may have operations
- * and they may update the module->tag */
- OPAL_THREAD_LOCK(&module->mutex);
- tmp_tag = module->tag--;
- if (tmp_tag == MCA_COLL_BASE_TAG_NONBLOCKING_END) {
- tmp_tag = module->tag = MCA_COLL_BASE_TAG_NONBLOCKING_BASE;
- NBC_DEBUG(2,"resetting tags ...\n");
+ if ((ptrdiff_t)module & 0x1) {
+ tmp_tag = MCA_COLL_BASE_TAG_NONBLOCKING_DUP;
+ } else {
+ /* update the module->tag here because other processes may have operations
+ * and they may update the module->tag */
+ OPAL_THREAD_LOCK(&module->mutex);
+ tmp_tag = module->tag--;
+ if (tmp_tag == MCA_COLL_BASE_TAG_NONBLOCKING_END) {
+ tmp_tag = module->tag = MCA_COLL_BASE_TAG_NONBLOCKING_BASE;
+ NBC_DEBUG(2,"resetting tags ...\n");
+ }
+ OPAL_THREAD_UNLOCK(&module->mutex);
}
- OPAL_THREAD_UNLOCK(&module->mutex);
OBJ_RELEASE(schedule);
free(tmpbuf);
@@ -712,18 +716,27 @@ int NBC_Schedule_request(NBC_Schedule *schedule, ompi_communicator_t *comm,
/******************** Do the tag and shadow comm administration ... ***************/
- OPAL_THREAD_LOCK(&module->mutex);
- tmp_tag = module->tag--;
- if (tmp_tag == MCA_COLL_BASE_TAG_NONBLOCKING_END) {
- tmp_tag = module->tag = MCA_COLL_BASE_TAG_NONBLOCKING_BASE;
- NBC_DEBUG(2,"resetting tags ...\n");
- }
+ if ((ptrdiff_t)module & 0x1) {
+ tmp_tag = MCA_COLL_BASE_TAG_NONBLOCKING_DUP;
+ module = (ompi_coll_libnbc_module_t *)((ptrdiff_t)module & ~(ptrdiff_t)0x1);
+ if (true != module->comm_registered) {
+ module->comm_registered = true;
+ need_register = true;
+ }
+ } else {
+ OPAL_THREAD_LOCK(&module->mutex);
+ tmp_tag = module->tag--;
+ if (tmp_tag == MCA_COLL_BASE_TAG_NONBLOCKING_END) {
+ tmp_tag = module->tag = MCA_COLL_BASE_TAG_NONBLOCKING_BASE;
+ NBC_DEBUG(2,"resetting tags ...\n");
+ }
- if (true != module->comm_registered) {
- module->comm_registered = true;
- need_register = true;
+ if (true != module->comm_registered) {
+ module->comm_registered = true;
+ need_register = true;
+ }
+ OPAL_THREAD_UNLOCK(&module->mutex);
}
- OPAL_THREAD_UNLOCK(&module->mutex);
handle->tag = tmp_tag;
@ggouaillardet Unfortunately, your updated patch also does not help. However, my application once succeeded with two ranks, which it never did with the old patch.
An Intel test is hanging in MPI_COMM_DUP (MPI_Keyval1_c and MPI_Keyval1_f), and the backtrace from one of the hung processes is a bit strange:
A snapshot backtrace from a hung process is:
Notes:
- ompi_comm_dup_with_info() is essentially waiting on a request that never completes. This seems to be the real issue. The only communication this test does is duping communicators.
- Is opal_progress() allowed to call opal_progress()?
- These 2 tests (the C and Fortran versions) are not hanging on the v2.x branch.
@bosilca @hjelmn