Out-of-sequence (OOS) messages #2067
Comments
I marked this as a blocker bug -- just so that we don't miss it when talking about v2.1.0. I'm not 100% sure that it's actually a blocker, but it does seem like an important performance issue with THREAD_MULTIPLE scenarios containing lots of sending threads. We can figure out whether this is a blocker for v2.1.0 over time.
One scenario that @bosilca cited to me earlier today is:
In this scenario, message X is received after message (X+1). @bosilca Are similar scenarios happening in other BTLs?
Yes, @thananon has confirmed that a small number of OOS messages exist with vader. Going back to your example, this is indeed how things unfold in the single-threaded scenario. For the multi-threaded case it doesn't have to be MPI_Isend. The OOS issue only kicks in for the matching fragments, but if you push small messages from multiple threads we have this issue for every send below the eager limit, and for up to the number of sending threads for larger messages.
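For concreteness, here is a minimal sketch of the injection-rate pattern being described (this is not Artem's actual benchmark; the window size of 256 and message size of 64 bytes follow the numbers quoted later in this thread, and error checking is omitted):

```c
/* Sketch of the pattern under discussion: several threads on rank 0 each
 * post windows of small, below-eager-limit non-blocking sends to rank 1
 * on the same communicator.  Not the actual benchmark used in this issue. */
#include <mpi.h>
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100
#define WINDOW   256
#define MSG_SIZE 64

static void *sender(void *arg)
{
    (void) arg;
    char buf[MSG_SIZE] = {0};
    MPI_Request reqs[WINDOW];

    for (int iter = 0; iter < NITERS; iter++) {
        for (int i = 0; i < WINDOW; i++)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t threads[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, sender, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
    } else if (1 == rank) {
        char buf[MSG_SIZE];
        for (long n = 0; n < (long) NTHREADS * NITERS * WINDOW; n++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Each sending thread acquires per-peer sequence numbers interleaved with the others, so any delay between sequence-number assignment and injection shows up as OOS traffic on rank 1.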
@bosilca @thananon I would like to summarize my understanding of the issue. Please correct me where needed.
Can this affect other PMLs? @yosefe @gpaulsen please be advised.
@jladd-mlnx no, no and no. I clearly stated in my original post that this has nothing to do with the performance degradation between 1.8 and 2.x, for which we have already identified the root cause. Let me try to be even more clear on the technical details. In a single-threaded case, in an injection-rate scenario, some of the messages are delivered out of order. This small number of OOS messages has little impact on the performance (and let me stress this again: in the single-threaded case). That being said, in a normal run there is absolutely no reason to have OOS messages, so this might be an indicator of some subtle issue in our communication layers. We identified this issue on master, but we have no plan to look further into the other releases. Multi-threaded scenarios are something new, and we cannot run any test to completion in 1.8, so this part of the discussion is moot. However, we have highlighted the fact that in a multi-threaded scenario the number of OOS messages is extremely high, and they are certainly responsible for a significant part of the measured performance degradation. At this point it is difficult to quantify this impact, but we are working on it. Again, our efforts are entirely focused on master, and we have no plan to pursue this topic outside this scope.
Below is the data I tested on vader. It has a huge impact on the injection rate. Results from Artem's benchmark: vader on arc00 with window size = 256 and message size = 64 bytes.
2 Threads
4 Threads
8 Threads
Additional information after some tests in the single-threaded case: posting 500 non-blocking messages.
Is this really a blocker for 2.1.0? Is anyone working on a fix? It would need to come soon or else we'll need to reset the milestone.
If we want 2.1.0 to be efficient with multithreading, we might need it; as you can see, the performance is affected a lot. I will keep on reporting and let you guys decide on the milestone. Right now the data is pointing towards the individual BTLs. I'm investigating further. The solution might be as easy as disabling inline send. Will report back.
@thananon I expect some OOS messages from vader when switching protocols. Might be worth setting the fast box limit to 1 and measuring after the transition. It should be 0, and if not, I need to fix something.
Hmm, I do see a case I am not properly checking which can lead to OOS messages under heavy load. Working on a patch for you to test.
@thananon See if this makes a difference for vader:

```diff
diff --git a/opal/mca/btl/vader/btl_vader_module.c b/opal/mca/btl/vader/btl_vader_module.c
index f54b407..f2d8f66 100644
--- a/opal/mca/btl/vader/btl_vader_module.c
+++ b/opal/mca/btl/vader/btl_vader_module.c
@@ -497,7 +497,8 @@ static struct mca_btl_base_descriptor_t *vader_prepare_src (struct mca_btl_base_
 #endif
 
     /* inline send */
-    if (OPAL_LIKELY(MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags)) {
+    if (OPAL_LIKELY((MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags) &&
+                    (MCA_BTL_NO_ORDER == order || 0 == opal_list_get_size (&endpoint->pending_frags)))) {
         /* try to reserve a fast box for this transfer only if the
          * fragment does not belong to the caller */
         fbox = mca_btl_vader_reserve_fbox (endpoint, total_size);
```
I tested my patch on a Mac and it improved the 1-byte message injection rate by ~50%. There is a problem with larger messages (1k-16k). I am working on a fix for that. Hoping to have the fix ready some time tomorrow. It doesn't address the multi-threaded OOS problem but it will help somewhat.
Removing the blocker label for this per discussion at the 10/4/16 devel con call.
Turning off the inline send with these MCA parameters doesn't reduce the number of OOS messages on either vader or IB (in both THREAD_SINGLE and THREAD_MULTIPLE).
As per @hjelmn, I will wait until you finish your patch and try again. I ran @artpol84's benchmark with a message size of 64 bytes for all tests.
I have the "problem" fixed. The vader inline send is meaningless without xpmem, so I wouldn't expect it to have an effect. I should really condition it on xpmem support. For IB it is likely due to coalescing. Set btl_openib_use_message_coalescing to 0.
I will open a PR tomorrow for the vader fix.
@hjelmn With btl_openib_use_message_coalescing set to 0 I still see some OOS messages, but the number is lower. I haven't run the test enough times to say that with statistical confidence; this is just an eyeball observation.
Ok, the remainder might be due to the transition to eager RDMA. Try setting btl_openib_use_eager_rdma to 0.
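For anyone reproducing this, both knobs discussed above can be passed straight to mpirun; a rough sketch of the invocation (the benchmark binary name is only a placeholder):

```sh
mpirun -np 2 \
    --mca btl_openib_use_message_coalescing 0 \
    --mca btl_openib_use_eager_rdma 0 \
    ./injection_rate_benchmark
```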
With btl_openib_use_eager_rdma set to 0, we found the issue for IB. As you might expect, the performance drops if we set that flag, but at least we know where to look now. :)
Hey @thananon, how can I reproduce the single-threaded OOS issue using openib? Specifically, is there some statistic or counter that shows the OOS count? Or do you have a patch? Thanks in advance!
Looking at mca_btl_openib_endpoint_credit_acquire(), I see that if the parameter queue_frag is true, then a fragment is queued when there aren't resources available, even though OPAL_ERR_OUT_OF_RESOURCE is returned. And I see that mca_btl_openib_endpoint_post_send() calls mca_btl_openib_endpoint_credit_acquire() with queue_frag == true. Perhaps this is where things get out of order?
mca_btl_openib_endpoint_send_eager_rdma() calls mca_btl_openib_endpoint_send(), which can queue the frag. The frag is then processed for resend in progress_no_credits_pending_frags(), called by btl_openib_handle_incoming() and btl_openib_handle_incoming_completion(). So if mca_btl_openib_endpoint_send_eager_rdma() is called and queues a frag, and is then called again before progress_no_credits_pending_frags() runs, and this second call does not queue because resources are now available, we get an OOS message sent...
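To make that suspected interleaving concrete, here is a deliberately simplified, self-contained toy model; the names and structure are invented for illustration and do not match the actual openib BTL code:

```c
/* Toy model only: invented names, not the openib BTL.  It shows how
 * "queue when there are no credits, send immediately once credits return"
 * can invert the order in which two fragments reach the wire. */
#include <stdio.h>

typedef struct {
    int credits;        /* send credits currently available      */
    int pending[8];     /* fragments queued for lack of credits  */
    int npending;
} endpoint_t;

static void send_or_queue(endpoint_t *ep, int frag_seq)
{
    if (ep->credits > 0) {
        ep->credits--;
        printf("posted fragment %d to the wire\n", frag_seq);
    } else {
        ep->pending[ep->npending++] = frag_seq;
        printf("queued fragment %d (no credits)\n", frag_seq);
    }
}

static void progress_pending(endpoint_t *ep)
{
    /* assumes enough credits to drain the whole queue (true in this toy run) */
    for (int i = 0; i < ep->npending && ep->credits > 0; i++, ep->credits--)
        printf("posted queued fragment %d to the wire\n", ep->pending[i]);
    ep->npending = 0;
}

int main(void)
{
    endpoint_t ep = { .credits = 0 };

    send_or_queue(&ep, 1);   /* no credits: fragment 1 is queued            */
    ep.credits = 2;          /* a completion returns credits...             */
    send_or_queue(&ep, 2);   /* ...and fragment 2 goes straight to the wire */
    progress_pending(&ep);   /* fragment 1 is posted last -> OOS at peer    */
    return 0;
}
```

The receiver ends up seeing fragment 2 before fragment 1, which is exactly the out-of-sequence arrival being measured in this issue.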
@larrystevenwise why isn't mca_btl_openib_endpoint_send_eager_rdma checking whether there are pending messages and sending them prior to sending the current fragment? The issue is that OOS affects the injection rate, especially on the receiver side, where we must match the messages in FIFO order.
@larrystevenwise I used an internal version of PAPI (another project here) to obtain the number. It is not in a releasable state and is a little too messy to share. The code where we add the counter is:
Thank you @thananon!
Are there directions on how to try running with the integrated PAPI? How integrated is it? I'm interested in learning more.
We are using PAPI capabilities that are not in any release branch that I know of. It will be difficult to share all the code necessary for this with you. But we might be able to expose the counters differently, directly in the PML module. I'll get back to you.
Thanks.
Per Jan 2017 F2F discussion: the only single-threaded issue that still needs to remain open is #2161 (OOS issues in openib). In the multi-threaded case, fixing the performance will be... challenging. And will remain challenging. 😄 The next MPI spec is (very) likely to include an info key that allows relaxing the ordering of matching, which is the user-level workaround for this issue.
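For later readers: the assertion that eventually standardized for this purpose (in MPI-4.0) is the mpi_assert_allow_overtaking info key. Assuming an MPI library that honors it, the user-level workaround would look roughly like this sketch:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* "mpi_assert_allow_overtaking" tells the library that this program does
     * not rely on the non-overtaking message-ordering rule; it was still a
     * proposal at the time of this discussion, and support is not guaranteed
     * in any given MPI implementation. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");
    MPI_Comm_set_info(MPI_COMM_WORLD, info);
    MPI_Info_free(&info);

    /* ... communication that does not depend on message ordering ... */

    MPI_Finalize();
    return 0;
}
```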
Short version: Out-of-sequence messages exist in a single-link scenario.
Cause: Intermediary buffering at different layers in the software stack allows messages to be delivered out of sequence.
Target: Most of the BTLs, with a particular emphasis on vader and IB.
Long version: This issue was raised during the discussion about the performance degradation seen between 1.8 and what will eventually become 3.x. While we identified the built-in atomics as being the main culprit, it turns out that enabling multi-threading raised a set of additional issues, not necessarily visible outside this particular usage.
Having multiple threads inject messages into the PML in the context of a single communicator leads to a significant number of out-of-sequence messages. The reason is that the per-peer sequence number is taken very early in the software stack (an optimization that makes sense for single-threaded scenarios). Thus, between the moment when a thread acquires the sequence number and the moment when its message is pushed into the network, there are many opportunities for another thread to bypass it and reach the network first. From the receiver's perspective this is seen as an out-of-sequence message, and it will be kept in linear structures and copied multiple times before it becomes in-sequence and can be delivered to the matching logic. There are multiple ways to mitigate this, but that discussion is outside the scope of this particular issue.
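As a hedged illustration of that receiver-side cost (hypothetical code, not the actual ob1 matching logic), every fragment that arrives ahead of the expected per-peer sequence number has to be parked and linearly re-scanned until the gap is filled:

```c
/* Hypothetical sketch, not the ob1 implementation: per-peer in-order
 * delivery with a linear structure for fragments that arrive early. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_EARLY 64

typedef struct {
    unsigned expected_seq;        /* next sequence number we can match  */
    unsigned early[MAX_EARLY];    /* fragments that arrived too early   */
    int      nearly;
} peer_state_t;

static void deliver(unsigned seq) { printf("matched fragment %u\n", seq); }

static void handle_fragment(peer_state_t *p, unsigned seq)
{
    if (seq != p->expected_seq) {            /* out of sequence: park it */
        p->early[p->nearly++] = seq;
        return;
    }
    deliver(seq);
    p->expected_seq++;

    /* linear re-scan for parked fragments that are now in sequence */
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int i = 0; i < p->nearly; i++) {
            if (p->early[i] == p->expected_seq) {
                deliver(p->early[i]);
                p->expected_seq++;
                p->early[i] = p->early[--p->nearly];   /* drop the entry */
                progressed = true;
            }
        }
    }
}

int main(void)
{
    peer_state_t peer = { .expected_seq = 0 };
    unsigned arrival_order[] = { 1, 3, 0, 2 };   /* X+1 arrives before X */

    for (int i = 0; i < 4; i++)
        handle_fragment(&peer, arrival_order[i]);
    return 0;
}
```

The more threads race on the sender side, the deeper these linear structures get, which is why the receiver-side match rate (and therefore the injection rate) collapses.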
More worrisome is the fact that we observe out-of-sequence messages using a single link and supposedly ordered BTLs, even when each thread is using a unique communicator. Logically, in this case no out-of-sequence messages should be seen. At this point we assume that the immediate send optimization, without a proper implementation in the BTLs, is allowing messages to bypass other messages waiting in the PML/BTL queues.