Re-enable all op segments when in batch mode #9055
Conversation
Force-pushed from 9d949c4 to 8bc24d1.
Thanks for taking the time to investigate this. This is indeed a regression - I didn't review the refactoring PR carefully enough. We should set up benchmarks for inference to make sure no performance regression happens.
src/executor/graph_executor.cc (outdated)
num_nodes_threshold = std::numeric_limits<size_t>::max();
// Bulk the whole graph for inference
cached_seg_opr_[0] = this->CreateCachedSegOpr(0, num_forward_nodes_);
return;
}

if (prefer_bulk_exec) {
We create just one segment because kLocal and kCrossDeviceCopy ops should not be included in bulk for inference. We still need to visit all nodes in the graph, but for inference we don't need to create a new segment if node->is_variable().

Regarding your question about CPU performance: it will mostly not be affected by bulk execution, since the executor doesn't explicitly sync the stream anyway (that is done implicitly by OpenMP).
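To make the rule above concrete, here is a small standalone sketch. It is not MXNet's actual BulkInferenceOpSegs, just a simplified model of the loop under review with made-up Node/ExecType stand-ins: variables are skipped, and a segment boundary is only introduced around kLocal and kCrossDeviceCopy ops.

#include <cstdio>
#include <utility>
#include <vector>

// Simplified stand-ins for the executor types that matter here.
enum class ExecType { kSync, kAsync, kLocal, kCrossDeviceCopy };
struct Node {
  bool is_variable;
  ExecType exec_type;
};

// Split the forward nodes into bulked segments. Variables are skipped; a
// kLocal / kCrossDeviceCopy op closes the current segment and is itself
// executed outside any bulk.
std::vector<std::pair<size_t, size_t>> InferenceSegments(const std::vector<Node>& nodes) {
  std::vector<std::pair<size_t, size_t>> segments;  // [start, end) ranges
  size_t topo_start = 0;
  for (size_t nid = 0; nid < nodes.size(); ++nid) {
    const Node& node = nodes[nid];
    if (node.is_variable) continue;
    if (node.exec_type == ExecType::kLocal ||
        node.exec_type == ExecType::kCrossDeviceCopy) {
      if (topo_start < nid) segments.emplace_back(topo_start, nid);
      topo_start = nid + 1;  // do not include the special op in a segment
    }
  }
  // The last segment covers everything after the final split point.
  if (topo_start < nodes.size()) segments.emplace_back(topo_start, nodes.size());
  return segments;
}

int main() {
  std::vector<Node> nodes = {
      {true, ExecType::kSync},              // variable, skipped
      {false, ExecType::kSync},
      {false, ExecType::kSync},
      {false, ExecType::kCrossDeviceCopy},  // forces a split
      {false, ExecType::kSync},
  };
  for (auto seg : InferenceSegments(nodes)) {
    std::printf("segment [%zu, %zu)\n", seg.first, seg.second);
  }
  return 0;
}

In the real executor each [start, end) range would be handed to CreateCachedSegOpr; the sketch only prints the boundaries, giving [0, 3) and [4, 5) for the example graph.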
@eric-haibin-lin "We should ..."
Force-pushed from 31b4299 to b53c50a.
Force-pushed from 9a7ee13 to 2065684.
Rebased. It would also be great if you could have a look, @larroy.
LGTM in general. Thanks for the fix
src/executor/graph_executor.cc (outdated)
// the last segmenet
if (topo_start != num_forward_nodes_) {
// the last segment
if (topo_start != num_forward_nodes_) {
nit: indentation for line 1384 & 1393
src/executor/graph_executor.cc (outdated)
// required for kLocal and kCrossDeviceCopy operations.
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
auto &node = graph_.indexed_graph()[nid].source;
nit: two space indentation
src/executor/graph_executor.cc (outdated)
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
// create forward segments for training
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
nit: we use two space indentation
OK, don't merge just yet; I'll fix the indentation. Edit: should be fixed.
Force-pushed from 2065684 to 87d7e1f.
src/executor/graph_executor.cc (outdated)
if (node->is_variable()) continue;

if (op_node.exec->exec_type() == ExecType::kLocal ||
    op_node.exec->exec_type() == ExecType::kCrossDeviceCopy) {
op_node.exec->exec_type() != ExecType::kSync
Right. I see we still wouldn't be calling Stream.Wait() for kAsync operators, and we'd get better profiler visibility. Updated.
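In terms of the simplified sketch above, the suggested change amounts to swapping the split condition. This is a paraphrase of the review suggestion rather than a verbatim copy of the final diff:

// Before: split only around kLocal / kCrossDeviceCopy ops.
bool split_old = node.exec_type == ExecType::kLocal ||
                 node.exec_type == ExecType::kCrossDeviceCopy;

// After: split around anything that is not a plain synchronous op, so kAsync
// operators also stay out of bulked segments (and remain individually visible
// to the profiler) without adding a Stream.Wait() for them.
bool split_new = node.exec_type != ExecType::kSync;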
Force-pushed from 87d7e1f to 12a3e5b.
I believe this one should be ready to merge. What do you think, @piiswrong and @eric-haibin-lin? Any more changes you'd like to see?
src/executor/graph_executor.cc (outdated)
return;
void GraphExecutor::BulkInferenceOpSegs() {
// Attempt to bulk the whole graph for inference. We will only create new segments when
// required for kLocal and kCrossDeviceCopy operations.
nit: the comment is not accurate. New segments are required for non-kSync operations. Do you mind updating it?
Good point, thanks. Updated.
Force-pushed from 12a3e5b to e9df8bd.
Please try to retrigger CI.
Done. You can also trigger a build yourself by logging in with your GitHub account, @piiswrong.
As suggested by Haibin, this segments after kLocal and kCrossDeviceCopy ops.
Force-pushed from e9df8bd to e435043.
@KellenSunderland Thanks a lot for your fix. After reading the discussion above, I tested your mxnet fork (branch: batched_op_perf_regression) with a script from here. Although the number of cudaStreamSynchronize calls returned to the same small count as the old mxnet version (i.e. 0.10.0), I also noticed that the average time spent in this function increased a little. In my case, the network runs faster than the original mxnet 1.0.0 but a bit slower than mxnet 0.10.0. At the same time, GPU usage increased from 83% to 95%, though the GPU is still not fully utilized. The details are below:
Software env: Ubuntu 16.04
It looks interesting that we are creating more streams and that we spend so much time creating them.
* Re-enable all op segments when in batch mode
* Split training/inference logic, split when required during inference. As suggested by Haibin, this segments after kLocal and kCrossDeviceCopy ops.
Description
This patch fixes a performance regression that causes a roughly 20% slowdown in some inference scenarios. I haven't investigated in depth how widespread the issue is, but at a minimum it affects inference with a ResNet model and a batch size of 1.
I'm not totally familiar with the commit that introduced the change, but it looks to me like the regression came from a refactor in the graph executor that improves readability. The performance problem arises because we introduce a number of cudaStreamSynchronize calls, which lower our overall GPU utilization (likely due to a restricted execution plan that leads to lower overall SM occupancy). It's not clear to me whether this causes any issues when running inference in a non-GPU environment.
It is possible that the change in operator segmentation behaviour was intended and has other useful implications. If so, we should be able to modify the code so that we get both the intended benefits and the faster performance.
Investigation
After verifying the regression on a demonstration model, we did some high-level measurements to see if any trends emerged. Two statistics helped diagnose the issue: first, GPU usage during inference dropped from ~95% to ~85%; second, there was a large increase in the number of cudaStreamSynchronize calls.
0.9.5 cuda calls:
0.12 cuda calls:
Note that the number of cudaStreamSynchronize calls increases from 2401 to 48601. Once this PR is applied, our models return to exactly 2401 calls.
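As a rough illustration of why the extra synchronization hurts, here is a toy CUDA-runtime program. It is not MXNet code, and the op count and buffer size are arbitrary; it simply times the same queue of asynchronous operations with a sync after every op (the regressed per-op behaviour) versus a single sync at the end of the "segment" (the bulked behaviour).

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int kOps = 1000;          // number of async ops in the "segment"
  const size_t kBytes = 1 << 16;  // arbitrary payload per op
  void *host = nullptr, *dev = nullptr;
  cudaMallocHost(&host, kBytes);  // pinned host memory so the copies stay async
  cudaMalloc(&dev, kBytes);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  auto run = [&](bool sync_every_op) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kOps; ++i) {
      cudaMemcpyAsync(dev, host, kBytes, cudaMemcpyHostToDevice, stream);
      if (sync_every_op) cudaStreamSynchronize(stream);  // per-op sync
    }
    cudaStreamSynchronize(stream);  // one sync for the whole segment
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
  };

  std::printf("per-op sync: %.2f ms\n", run(true));
  std::printf("single sync: %.2f ms\n", run(false));

  cudaStreamDestroy(stream);
  cudaFree(dev);
  cudaFreeHost(host);
  return 0;
}

On real hardware the per-op variant is dominated by host/device round trips, which mirrors the utilization gaps visible in the profiles below.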
We verified with further profiling that the cudaStreamSynchronize calls were responsible for the low GPU utilization. The following images show a timeline view of one inference call through the model. A timespan is highlighted at the top of the timeline to show relative performance, and gaps in the compute row show the relative utilization of the GPU. I've also added a little instrumentation that adds our existing profiling names to the NVIDIA tools' timeline (a very small change; I will PR it separately after Chris's awesome profiling work is merged). This also highlights the difference in behaviour when we're segmenting operators: the 0.9.5 and 1.0 (with this PR) builds both group operators into a single segment for a single inference, while the 0.12 build breaks operators apart into several segments.
0.9.5:
0.12:
1.0 with PR applied:
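The timeline instrumentation mentioned above is not part of this PR. As a hedged sketch of the general idea (the helper and segment name here are invented), wrapping a segment's execution in an NVTX range is enough to make it show up as a labelled block in NVIDIA's visual profiler:

#include <nvToolsExt.h>  // NVTX; link with -lnvToolsExt

// Hypothetical helper: run one op segment inside a named NVTX range so it
// appears as a labelled block on the profiler timeline.
void RunSegmentWithNvtx(const char* segment_name) {
  nvtxRangePushA(segment_name);  // open a named range
  // ... launch the segment's kernels / async work here ...
  nvtxRangePop();                // close the range
}

int main() {
  RunSegmentWithNvtx("bulked_segment_0");
  return 0;
}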
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments