Out-of-sequence (OOS) messages #2067
Comments
I marked this as a blocker bug -- just so that we don't miss it when talking about v2.1.0. I'm not 100% sure that it's actually a blocker, but it does seem like an important performance issue with THREAD_MULTIPLE scenarios containing lots of sending threads. We can figure out whether this is a blocker for v2.1.0 over time.
One scenario that @bosilca cited to me earlier today is:
In this scenario, message X is received after message (X+1). @bosilca Are similar scenarios happening in other BTLs?
Yes, @thananon has confirmed that a small number of OOS messages exist with vader. Going back to your example, this is indeed how things unfold in the single-threaded scenario. For the multi-threaded case it doesn't have to be MPI_Isend. The OOS issue only kicks in for the matching fragments, but if you push small messages from multiple threads we have this issue for every send below the eager limit, and for up to the number of sending threads for larger messages.
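For concreteness, here is a minimal sketch of the injection-rate pattern being described (this is not Artem's actual benchmark; the window size of 256 and message size of 64 bytes follow the numbers quoted later in this thread, and error checking is omitted):

```c
/* Sketch of the pattern under discussion: several threads on rank 0 each
 * post windows of small, below-eager-limit non-blocking sends to rank 1
 * on the same communicator.  Not the actual benchmark used in this issue. */
#include <mpi.h>
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100
#define WINDOW   256
#define MSG_SIZE 64

static void *sender(void *arg)
{
    (void) arg;
    char buf[MSG_SIZE] = {0};
    MPI_Request reqs[WINDOW];

    for (int iter = 0; iter < NITERS; iter++) {
        for (int i = 0; i < WINDOW; i++)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t threads[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, sender, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
    } else if (1 == rank) {
        char buf[MSG_SIZE];
        for (long n = 0; n < (long) NTHREADS * NITERS * WINDOW; n++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Each sending thread acquires per-peer sequence numbers interleaved with the others, so any delay between sequence-number assignment and injection shows up as OOS traffic on rank 1.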
@bosilca @thananon I would like to summarize my understanding of the issue. Please correct me where needed.
Can this affect other PMLs? @yosefe @gpaulsen please be advised.
@jladd-mlnx no, no and no. I clearly stated in my original post that this has nothing to do with the performance degradation between 1.8 and 2.x, for which we have already identified the root cause. Let me try to be even more clear on the technical details. In a single-threaded case, in an injection-rate scenario, some of the messages are delivered out of order. This small number of OOS messages has little impact on the performance (and let me stress this again: in the single-threaded case). That being said, in a normal run there is absolutely no reason to have OOS messages, so this might be an indicator of some subtle issue in our communication layers. We identified this issue on master, but we have no plan to look further into the other releases. Multi-threaded scenarios are something new, and we cannot run any test to completion in 1.8, so this part of the discussion is moot. However, we have highlighted the fact that in a multi-threaded scenario the number of OOS messages is extremely high, and they are certainly responsible for a significant part of the measured performance degradation. At this point it is difficult to quantify this impact, but we are working on it. Again, our efforts are entirely focused on master, and we have no plan to pursue this topic outside this scope.
Below is the data I tested on vader. It has a huge impact on the injection rate. Results from Artem's benchmark: vader on arc00 with window size = 256 and message size = 64 bytes.
2 Threads
4 Threads
8 Threads
Additional information after some tests in the single-threaded case: posting 500 non-blocking messages.
Is this really a blocker for 2.1.0? Is anyone working on a fix? It would need to come soon or else we'll need to reset the milestone.
If we want 2.1.0 to be efficient with multithreading, we might need it; as you can see, the performance is affected a lot. I will keep on reporting and let you guys decide on the milestone. Right now the data is pointing towards the individual BTLs. I'm investigating further. The solution might be as easy as disabling inline send. Will report back.
@thananon I expect some OOS messages from vader when switching protocols. Might be worth setting the fast box limit to 1 and measuring after the transition. It should be 0, and if not, I need to fix something.
Hmm, I do see a case I am not properly checking which can lead to OOS messages under heavy load. Working on a patch for you to test.
@thananon See if this makes a difference for vader:

```diff
diff --git a/opal/mca/btl/vader/btl_vader_module.c b/opal/mca/btl/vader/btl_vader_module.c
index f54b407..f2d8f66 100644
--- a/opal/mca/btl/vader/btl_vader_module.c
+++ b/opal/mca/btl/vader/btl_vader_module.c
@@ -497,7 +497,8 @@ static struct mca_btl_base_descriptor_t *vader_prepare_src (struct mca_btl_base_
 #endif
 
     /* inline send */
-    if (OPAL_LIKELY(MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags)) {
+    if (OPAL_LIKELY((MCA_BTL_DES_FLAGS_BTL_OWNERSHIP & flags) &&
+                    (MCA_BTL_NO_ORDER == order || 0 == opal_list_get_size (&endpoint->pending_frags)))) {
         /* try to reserve a fast box for this transfer only if the
          * fragment does not belong to the caller */
         fbox = mca_btl_vader_reserve_fbox (endpoint, total_size);
```
I tested my patch on a Mac and it improved the 1-byte message injection rate by ~50%. There is a problem with larger messages (1k-16k). I am working on a fix for that. Hoping to have the fix ready some time tomorrow. It doesn't address the multi-threaded OOS problem but it will help somewhat.
Removing the blocker label for this per discussion at the 10/4/16 devel con call.
Turning off the inline send with these MCA parameters doesn't reduce the number of OOS messages on either vader or IB (in both THREAD_SINGLE and THREAD_MULTIPLE).
As per @hjelmn, I will wait until you finish your patch and try again. I ran @artpol84's benchmark with a message size of 64 bytes for all tests.
I have the "problem" fixed. The vader inline send is meaningless without xpmem, so I wouldn't expect it to have an effect. I should really condition it on xpmem support. For IB it is likely due to coalescing. Set btl_openib_use_message_coalescing to 0.
I will open a PR tomorrow for the vader fix.
@hjelmn With btl_openib_use_message_coalescing set to 0 I still see some OOS messages, but the number is lower. I haven't run the test enough times to say that with statistical confidence; this is just an eyeball observation.
Ok, the remainder might be due to the transition to eager RDMA. Try setting btl_openib_use_eager_rdma to 0.
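For anyone reproducing this, both knobs discussed above can be passed straight to mpirun; a rough sketch of the invocation (the benchmark binary name is only a placeholder):

```sh
mpirun -np 2 \
    --mca btl_openib_use_message_coalescing 0 \
    --mca btl_openib_use_eager_rdma 0 \
    ./injection_rate_benchmark
```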
With btl_openib_use_eager_rdma set to 0, we found the issue for IB. As you might expect, the performance drops if we set that flag, but at least we know where to look now. :)
Hey @thananon, how can I reproduce the single-threaded OOS issue using openib? Specifically, is there some statistic or counter that shows the OOS count? Or do you have a patch? Thanks in advance!
Looking at mca_btl_openib_endpoint_credit_acquire(), I see that if the parameter queue_frag is true, then a fragment is queued when there aren't resources available, even though OPAL_ERR_OUT_OF_RESOURCE is returned. And I see that mca_btl_openib_endpoint_post_send() calls mca_btl_openib_endpoint_credit_acquire() with queue_frag == true. Perhaps this is where things get out of order?
mca_btl_openib_endpoint_send_eager_rdma() calls mca_btl_openib_endpoint_send(), which can queue the frag. The frag is then processed for resend in progress_no_credits_pending_frags(), called by btl_openib_handle_incoming() and btl_openib_handle_incoming_completion(). So if mca_btl_openib_endpoint_send_eager_rdma() is called and queues a frag, and is then called again before progress_no_credits_pending_frags() runs, and this second call does not queue because resources are now available, we get an OOS message sent...
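To make that suspected interleaving concrete, here is a deliberately simplified, self-contained toy model; the names and structure are invented for illustration and do not match the actual openib BTL code:

```c
/* Toy model only: invented names, not the openib BTL.  It shows how
 * "queue when there are no credits, send immediately once credits return"
 * can invert the order in which two fragments reach the wire. */
#include <stdio.h>

typedef struct {
    int credits;        /* send credits currently available      */
    int pending[8];     /* fragments queued for lack of credits  */
    int npending;
} endpoint_t;

static void send_or_queue(endpoint_t *ep, int frag_seq)
{
    if (ep->credits > 0) {
        ep->credits--;
        printf("posted fragment %d to the wire\n", frag_seq);
    } else {
        ep->pending[ep->npending++] = frag_seq;
        printf("queued fragment %d (no credits)\n", frag_seq);
    }
}

static void progress_pending(endpoint_t *ep)
{
    /* assumes enough credits to drain the whole queue (true in this toy run) */
    for (int i = 0; i < ep->npending && ep->credits > 0; i++, ep->credits--)
        printf("posted queued fragment %d to the wire\n", ep->pending[i]);
    ep->npending = 0;
}

int main(void)
{
    endpoint_t ep = { .credits = 0 };

    send_or_queue(&ep, 1);   /* no credits: fragment 1 is queued            */
    ep.credits = 2;          /* a completion returns credits...             */
    send_or_queue(&ep, 2);   /* ...and fragment 2 goes straight to the wire */
    progress_pending(&ep);   /* fragment 1 is posted last -> OOS at peer    */
    return 0;
}
```

The receiver ends up seeing fragment 2 before fragment 1, which is exactly the out-of-sequence arrival being measured in this issue.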
@larrystevenwise why isn't mca_btl_openib_endpoint_send_eager_rdma checking whether there are pending messages and sending them prior to sending the current fragment? The issue is that OOS affects the injection rate, especially on the receiver side, where we must match the messages in FIFO order.
@larrystevenwise I used an internal version of PAPI (another project here) to obtain the number. It is not in a releasable state and is a little too messy to share. The code where we add the counter is:
Thank you @thananon!
Are there directions on how to try running with the integrated PAPI? How integrated is it? I'm interested in learning more.
We are using PAPI capabilities that are not in any release branch that I know of. It will be difficult to share all the code necessary for this with you. But we might be able to expose the counters differently, directly in the PML module. I'll get back to you.
Thanks.
Per Jan 2017 F2F discussion: the only single-threaded issue that still needs to remain open is #2161 (OOS issues in openib). In the multi-threaded case, fixing the performance will be... challenging. And will remain challenging. 😄 The next MPI spec is (very) likely to include an info key that allows relaxing the ordering of matching, which is the user-level workaround for this issue.
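For later readers: the assertion that eventually standardized for this purpose (in MPI-4.0) is the mpi_assert_allow_overtaking info key. Assuming an MPI library that honors it, the user-level workaround would look roughly like this sketch:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* "mpi_assert_allow_overtaking" tells the library that this program does
     * not rely on the non-overtaking message-ordering rule; it was still a
     * proposal at the time of this discussion, and support is not guaranteed
     * in any given MPI implementation. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");
    MPI_Comm_set_info(MPI_COMM_WORLD, info);
    MPI_Info_free(&info);

    /* ... communication that does not depend on message ordering ... */

    MPI_Finalize();
    return 0;
}
```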
Short version: Out-of-sequence messages exist in a single-link scenario.
Cause: Intermediary buffering at different layers in the software stack allows messages to be delivered out of sequence.
Target: Most of the BTLs, with a particular emphasis on vader and IB.
Long version: This issue was raised during the discussion about the performance degradation seen between 1.8 and what will eventually become 3.x. While we identified the built-in atomics as being the main culprit, it turns out that enabling multi-threading raised a set of additional issues, not necessarily visible outside this particular usage.
Having multiple threads inject messages into the PML in the context of a single communicator leads to a significant number of out-of-sequence messages. The reason is that the per-peer sequence number is taken very early in the software stack (an optimization that makes sense for single-threaded scenarios). Thus, between the moment when a thread acquires the sequence number and the moment when its message is pushed into the network, there are many opportunities for another thread to bypass it and reach the network first. From the receiver's perspective this is seen as an out-of-sequence message, and it will be kept in linear structures and copied multiple times before it becomes in-sequence and can be delivered to the matching logic. There are multiple ways to mitigate this, but that discussion is outside the scope of this particular issue.
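As a hedged illustration of that receiver-side cost (hypothetical code, not the actual ob1 matching logic), every fragment that arrives ahead of the expected per-peer sequence number has to be parked and linearly re-scanned until the gap is filled:

```c
/* Hypothetical sketch, not the ob1 implementation: per-peer in-order
 * delivery with a linear structure for fragments that arrive early. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_EARLY 64

typedef struct {
    unsigned expected_seq;        /* next sequence number we can match  */
    unsigned early[MAX_EARLY];    /* fragments that arrived too early   */
    int      nearly;
} peer_state_t;

static void deliver(unsigned seq) { printf("matched fragment %u\n", seq); }

static void handle_fragment(peer_state_t *p, unsigned seq)
{
    if (seq != p->expected_seq) {            /* out of sequence: park it */
        p->early[p->nearly++] = seq;
        return;
    }
    deliver(seq);
    p->expected_seq++;

    /* linear re-scan for parked fragments that are now in sequence */
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (int i = 0; i < p->nearly; i++) {
            if (p->early[i] == p->expected_seq) {
                deliver(p->early[i]);
                p->expected_seq++;
                p->early[i] = p->early[--p->nearly];   /* drop the entry */
                progressed = true;
            }
        }
    }
}

int main(void)
{
    peer_state_t peer = { .expected_seq = 0 };
    unsigned arrival_order[] = { 1, 3, 0, 2 };   /* X+1 arrives before X */

    for (int i = 0; i < 4; i++)
        handle_fragment(&peer, arrival_order[i]);
    return 0;
}
```

The more threads race on the sender side, the deeper these linear structures get, which is why the receiver-side match rate (and therefore the injection rate) collapses.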
More worrisome is the fact that we observe out-of-sequence messages using a single link and supposedly ordered BTLs, even when each thread is using a unique communicator. Logically, in this case no out-of-sequence messages should be seen. At this point we assume that the immediate send optimization, without a proper implementation in the BTLs, is allowing messages to bypass other messages waiting in the PML/BTL queues.