
Re-enable all op segments when in batch mode #9055

Merged: 2 commits merged into apache:master on Jan 15, 2018

Conversation

@KellenSunderland (Contributor) commented Dec 13, 2017

Description

This patch fixes a performance regression that causes a roughly 20% slowdown in some inference scenarios. I haven't investigated in depth how widespread the issue is, but at a minimum it affects inference with a ResNet model at batch size 1.

I'm not totally familiar with the commit that introduced the change, but it looks to me like the regression came from a refactor in the graph executor that improves readability. The performance problem occurs because we introduce a number of cudaStreamSynchronize calls, which lowers our overall GPU utilization (likely due to a restricted execution plan that leads to lower overall SM occupancy). It's not clear to me whether this would cause any issues when running inference in a non-GPU environment.

It is possible that the change in operator segmentation behaviour was intended and has other useful implications. If so, we should be able to modify the code so that we get both the intended benefits and the faster performance.

Investigation

After verifying the regression on a demonstration model, we did some high-level measurements to see if any trends emerged. Two statistics helped diagnose the issue. First, GPU usage during inference dropped from ~95% to ~85%. Second, there was a large increase in the number of cudaStreamSynchronize calls.

0.9.5 cuda calls:

Time(%)      Time     Calls       Avg       Min       Max  Name
 30.59%  16.6780s      2401  6.9463ms  5.7920us  31.420ms  cudaStreamSynchronize
 22.59%  12.3190s    105146  117.16us  22.368us  12.287ms  cudaLaunch
 17.10%  9.32475s        16  582.80ms  195.55us  4.66099s  cudaStreamCreateWithFlags

0.12 cuda calls:

==2726== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 46.13%  37.5542s     48601  772.70us  5.1200us  5.5922ms  cudaStreamSynchronize
 17.72%  14.4214s    102946  140.09us  22.432us  9.2845ms  cudaLaunch
 11.32%  9.21538s        16  575.96ms  222.94us  4.60608s  cudaStreamCreateWithFlags
...

Note that the number of cudaStreamSynchronize calls increases from 2401 to 48601. Once this PR is applied, our models return to exactly 2401 calls.
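
To make the mechanism concrete, here is an illustrative sketch (not MXNet engine code; Segment, Op, and launch_async are hypothetical stand-ins): each executed segment ends with one blocking stream synchronization, so one bulked segment per forward pass costs one sync, while one segment per operator costs roughly one sync per operator.

// Illustrative sketch only (not MXNet engine code). Segment, Op, and
// launch_async are hypothetical stand-ins; the point is that each executed
// segment ends with one blocking cudaStreamSynchronize, so splitting a
// single bulked segment into per-op segments multiplies the sync count.
#include <cuda_runtime.h>
#include <vector>

struct Op { /* operator payload elided */ };
struct Segment { std::vector<Op> ops; };

void launch_async(const Op& /*op*/, cudaStream_t /*stream*/) {
  // stand-in for the asynchronous kernel launches belonging to one operator
}

void RunSegments(const std::vector<Segment>& segments, cudaStream_t stream) {
  for (const Segment& seg : segments) {
    for (const Op& op : seg.ops) {
      launch_async(op, stream);      // kernels are queued without blocking
    }
    cudaStreamSynchronize(stream);   // one blocking sync per *segment*
  }
  // One segment per forward pass          => ~1 sync per inference.
  // One segment per operator (0.12-style) => ~1 sync per operator per inference.
}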

We verified with further profiling that the cudaStreamSynchronize calls were responsible for the low GPU utilization. The following images show a timeline view of one inference call through the model. A timespan is highlighted at the top of the timeline to show relative performance, and gaps in the compute row show the relative utilization of the GPU. I've also added a little instrumentation that surfaces our existing profiling names on NVIDIA's profiler timeline (a very small change; I will PR it separately after Chris's awesome profiling work is merged). This also highlights the difference in behaviour when we're segmenting operators: the 0.9.5 and 1.0 (with this PR) builds both group operators into a single segment for a single inference, while the 0.12 build breaks operators apart into several segments.
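
For context on the kind of instrumentation referred to above (this is not the actual patch, just a minimal sketch assuming the NVIDIA Tools Extension (NVTX) API, which nvprof and the Visual Profiler pick up; NvtxScopedRange and RunSegment are hypothetical names):

#include <nvToolsExt.h>  // NVTX; link with -lnvToolsExt

// Hypothetical RAII helper: pushes a named range when a segment starts
// executing and pops it when the segment finishes, so the MXNet profiler
// name appears as a labelled span on the GPU timeline.
class NvtxScopedRange {
 public:
  explicit NvtxScopedRange(const char* name) { nvtxRangePushA(name); }
  ~NvtxScopedRange() { nvtxRangePop(); }
};

// Usage (illustrative): wrap the execution of a segment.
//   NvtxScopedRange range(opr_name.c_str());
//   RunSegment(...);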

0.9.5: [timeline screenshot: 0 9 5_with_tracing]

0.12: [timeline screenshot: 0 12_with_tracing]

1.0 with PR applied: [timeline screenshot: 1 0_with_fix_and_tracing]

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Performance regression fix. While refactoring this function, it appears we introduced an unexpected side effect: a number of cudaStreamSynchronize calls that caused a performance slowdown.

Comments

  • This needs to be reviewed by someone familiar with the graph executor. Tests are passing, and the performance tests I've run show a roughly 20% speedup, but I would like to have a discussion around the performance implications of this change. Could it negatively affect some aspect of MXNet performance (for example CPU inference, large-batch prediction, etc.)?
  • This change also groups several operators into the same reported segment in the profiler. Is this the right thing to do from the user's perspective?

@KellenSunderland force-pushed the batched_op_perf_regression branch from 9d949c4 to 8bc24d1 on December 13, 2017 14:22
@KellenSunderland changed the title from "WIP: Re-enable all op segments when in batch mode" to "Re-enable all op segments when in batch mode" on Dec 13, 2017
@eric-haibin-lin (Member) left a comment:

Thanks for taking the time to investigate this. This is indeed a regression - I didn't review the refactoring PR carefully enough. We should set up benchmarks for inference to make sure no performance regression happens.

num_nodes_threshold = std::numeric_limits<size_t>::max();
// Bulk the whole graph for inference
cached_seg_opr_[0] = this->CreateCachedSegOpr(0, num_forward_nodes_);
return;
}

if (prefer_bulk_exec) {
Review comment (Member) on the snippet above:
We create just one segment because kLocal and kCrossDeviceCopy ops should not be included in bulk for inference. We still need to visit all nodes in the graph, but for inference we don't need to create a new segment if node->is_variable().
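
The following standalone toy (not the MXNet source; Node and ExecType are simplified stand-ins for types appearing in the diff snippets) mirrors the rule described above: visit every forward node, skip variables without breaking the bulk, and split the bulked segment around kLocal / kCrossDeviceCopy ops.

// Standalone toy (not the MXNet source; Node/ExecType are simplified
// stand-ins) illustrating the segmentation rule described above.
#include <cstdio>
#include <vector>

enum class ExecType { kSync, kAsync, kLocal, kCrossDeviceCopy };

struct Node {
  bool is_variable;
  ExecType exec_type;
};

int main() {
  // Toy forward graph: an input variable, some sync ops, one copy op.
  std::vector<Node> nodes = {
      {true,  ExecType::kSync},             // variable: no segment break
      {false, ExecType::kSync},
      {false, ExecType::kSync},
      {false, ExecType::kCrossDeviceCopy},  // must not be bulked
      {false, ExecType::kSync},
      {false, ExecType::kSync},
  };

  size_t topo_start = 0;
  for (size_t nid = 0; nid < nodes.size(); ++nid) {
    if (nodes[nid].is_variable) continue;   // variables never force a split
    bool must_split = nodes[nid].exec_type == ExecType::kLocal ||
                      nodes[nid].exec_type == ExecType::kCrossDeviceCopy;
    if (must_split) {
      if (topo_start != nid)
        std::printf("bulked segment [%zu, %zu)\n", topo_start, nid);
      std::printf("standalone op %zu\n", nid);
      topo_start = nid + 1;
    }
  }
  if (topo_start != nodes.size())           // the last segment
    std::printf("bulked segment [%zu, %zu)\n", topo_start, nodes.size());
}

For the toy graph this prints two bulked segments with the copy op standing alone between them, rather than one segment per operator.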

@eric-haibin-lin (Member) commented Dec 13, 2017

Regarding your question about CPU performance: it will mostly not be affected by bulk execution, since the executor doesn't explicitly sync the stream anyway (this is handled implicitly by OpenMP).

@marcoabreu (Contributor) commented:

@eric-haibin-lin "We should set up benchmarks for inference to make sure no performance regression happens." This is my next task after finishing the work on CI.

@eric-haibin-lin self-assigned this Dec 15, 2017
@KellenSunderland force-pushed the batched_op_perf_regression branch 4 times, most recently from 31b4299 to b53c50a on December 21, 2017 10:31
@KellenSunderland force-pushed the batched_op_perf_regression branch 2 times, most recently from 9a7ee13 to 2065684 on January 4, 2018 16:25
@KellenSunderland (Contributor, Author) commented:

Rebased.

It would also be great if you could have a look, @larroy.

@eric-haibin-lin (Member) left a comment:

LGTM in general. Thanks for the fix

// the last segmenet
if (topo_start != num_forward_nodes_) {
// the last segment
if (topo_start != num_forward_nodes_) {
Review comment (Member) on the snippet above:
nit: indentation for line 1384 & 1393

// required for kLocal and kCrossDeviceCopy operations.
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
auto &node = graph_.indexed_graph()[nid].source;
Review comment (Member) on the snippet above:
nit: two space indentation

for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
// create forward segments for training
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
Review comment (Member) on the snippet above:
nit: we use two space indentation

@KellenSunderland (Contributor, Author) commented Jan 4, 2018:

Ok, don't merge just yet, I'll fix indentation.

Edit: should be fixed.

@KellenSunderland force-pushed the batched_op_perf_regression branch from 2065684 to 87d7e1f on January 4, 2018 18:49
if (node->is_variable()) continue;

if (op_node.exec->exec_type() == ExecType::kLocal ||
    op_node.exec->exec_type() == ExecType::kCrossDeviceCopy) {
Review comment (Contributor) on the snippet above:
op_node.exec->exec_type() != ExecType::kSync

@KellenSunderland (Author) replied:
Right. I see we still wouldn't be calling Stream.Wait() for kAsync operators, and we'd get better profiler visibility. Updated.
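
In other words, the check from the diff context above changes roughly as follows (a sketch of the discussed edit, not the exact merged diff):

// Before (diff context above): only kLocal and kCrossDeviceCopy broke the bulk.
// if (op_node.exec->exec_type() == ExecType::kLocal ||
//     op_node.exec->exec_type() == ExecType::kCrossDeviceCopy) {

// After (per the review suggestion): any non-kSync op breaks the bulk, so
// kAsync ops are also kept out of bulked segments (no Stream->Wait() for
// them) and remain individually visible in the profiler.
if (op_node.exec->exec_type() != ExecType::kSync) {
  // close the current bulked segment before this op and start a new one after it
}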

@KellenSunderland force-pushed the batched_op_perf_regression branch from 87d7e1f to 12a3e5b on January 6, 2018 09:57
@KellenSunderland (Contributor, Author) commented:

I believe this one should be ready to merge. What do you think, @piiswrong and @eric-haibin-lin? Any more changes you'd like to see?

return;
void GraphExecutor::BulkInferenceOpSegs() {
// Attempt to bulk the whole graph for inference. We will only create new segments when
// required for kLocal and kCrossDeviceCopy operations.
Review comment (Member) on the snippet above:
nit: the comment is not accurate. New segments are required for non-kSync operations. Do you mind updating it?

@KellenSunderland (Author) replied:
Good point, thanks. Updated.

@piiswrong (Contributor) commented:

Please try to retrigger CI

@marcoabreu (Contributor) commented:

Done. You can also trigger a build yourself by logging in with your GitHub account, @piiswrong.

@KellenSunderland force-pushed the batched_op_perf_regression branch from e9df8bd to e435043 on January 14, 2018 19:48
@nicklhy (Contributor) commented Jan 15, 2018:

@KellenSunderland Thanks a lot for your fix. After reading the discussion above, I tested your MXNet fork (branch: batched_op_perf_regression) with a script from here. Although the number of cudaStreamSynchronize calls returned to the same small count as the old MXNet version (i.e. 0.10.0), I also noticed that the average time cost of this call increased a little. In my case, the network runs faster than the original MXNet 1.0.0 but a bit slower than MXNet 0.10.0. At the same time, GPU usage increased from 83% to 95%, but the GPU is still not fully utilized. The details are below:

# mxnet 0.10.0
# GPU usage: 99~100%
==21421== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 52.04%  7.76526s      3691  2.1038ms  1.4510us  9.5652ms  cudaStreamSynchronize
 21.31%  3.17943s    659205  4.8230us  3.3450us  1.2354ms  cudaLaunch
 11.92%  1.77868s         4  444.67ms  60.328us  889.60ms  cudaStreamCreate

# mxnet 1.0.0 (batched_op_perf_regression)
# GPU usage: 95%
==3718== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 53.11%  8.31733s      3691  2.2534ms  1.5270us  9.8319ms  cudaStreamSynchronize
 19.48%  3.05095s    631908  4.8280us  2.8860us  1.6870ms  cudaLaunch
  8.01%  1.25366s        16  78.354ms  7.7330us  627.02ms  cudaStreamCreateWithFlags
  4.66%  729.40ms       291  2.5065ms     314ns  178.82ms  cudaFree

Software env: Ubuntu 16.04
GPU: GTX 1080 Ti

@larroy (Contributor) commented Jan 15, 2018:

It is interesting that we are creating more streams and that we spend so much time creating them.

@piiswrong merged commit 9cc8ea3 into apache:master on Jan 15, 2018
Commits referencing this pull request were later pushed to the following forks, each with the same commit message (shown once below):

  • CodingCat pushed a commit to CodingCat/mxnet (Jan 16, 2018)
  • larroy pushed a commit to larroy/mxnet (Jan 18, 2018)
  • yuxiangw pushed a commit to yuxiangw/incubator-mxnet (Jan 25, 2018)
  • rahul003 pushed a commit to rahul003/mxnet (Jun 4, 2018)
  • zheng-da pushed a commit to zheng-da/incubator-mxnet (Jun 28, 2018)

Commit message:

* Re-enable all op segments when in batch mode

* Split training/inference logic, split when required during inference.

As suggested by Haibin, this segments after kLocal and kCrossDeviceCopy ops.