Re-enable all op segments when in batch mode #9055
Conversation
Force-pushed from 9d949c4 to 8bc24d1.
Thanks for taking the time to investigate this. This is indeed a regression - I didn't review the refactoring PR carefully enough. We should set up benchmarks for inference to make sure no performance regression happens.
src/executor/graph_executor.cc (outdated)
num_nodes_threshold = std::numeric_limits<size_t>::max();
// Bulk the whole graph for inference
cached_seg_opr_[0] = this->CreateCachedSegOpr(0, num_forward_nodes_);
return;
}

if (prefer_bulk_exec) {
We create just one segment because kLocal and kCrossDeviceCopy ops should not be included in bulk for inference. We still need to visit all nodes in the graph, but for inference we don't need to create a new segment if node->is_variable().

Regarding your question about CPU performance: it will mostly not be affected by bulk execution, since the executor doesn't explicitly sync the stream anyway (that is done implicitly by OpenMP).
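To make the rule above concrete, here is a small standalone sketch. It is not MXNet's actual BulkInferenceOpSegs, just a simplified model of the loop under review with made-up Node/ExecType stand-ins: variables are skipped, and a segment boundary is only introduced around kLocal and kCrossDeviceCopy ops.

#include <cstdio>
#include <utility>
#include <vector>

// Simplified stand-ins for the executor types that matter here.
enum class ExecType { kSync, kAsync, kLocal, kCrossDeviceCopy };
struct Node {
  bool is_variable;
  ExecType exec_type;
};

// Split the forward nodes into bulked segments. Variables are skipped; a
// kLocal / kCrossDeviceCopy op closes the current segment and is itself
// executed outside any bulk.
std::vector<std::pair<size_t, size_t>> InferenceSegments(const std::vector<Node>& nodes) {
  std::vector<std::pair<size_t, size_t>> segments;  // [start, end) ranges
  size_t topo_start = 0;
  for (size_t nid = 0; nid < nodes.size(); ++nid) {
    const Node& node = nodes[nid];
    if (node.is_variable) continue;
    if (node.exec_type == ExecType::kLocal ||
        node.exec_type == ExecType::kCrossDeviceCopy) {
      if (topo_start < nid) segments.emplace_back(topo_start, nid);
      topo_start = nid + 1;  // do not include the special op in a segment
    }
  }
  // The last segment covers everything after the final split point.
  if (topo_start < nodes.size()) segments.emplace_back(topo_start, nodes.size());
  return segments;
}

int main() {
  std::vector<Node> nodes = {
      {true, ExecType::kSync},              // variable, skipped
      {false, ExecType::kSync},
      {false, ExecType::kSync},
      {false, ExecType::kCrossDeviceCopy},  // forces a split
      {false, ExecType::kSync},
  };
  for (auto seg : InferenceSegments(nodes)) {
    std::printf("segment [%zu, %zu)\n", seg.first, seg.second);
  }
  return 0;
}

In the real executor each [start, end) range would be handed to CreateCachedSegOpr; the sketch only prints the boundaries, giving [0, 3) and [4, 5) for the example graph.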
@eric-haibin-lin "We should ..."
Force-pushed from 31b4299 to b53c50a.
Force-pushed from 9a7ee13 to 2065684.
Rebased. It would also be great if you could have a look, @larroy.
LGTM in general. Thanks for the fix
src/executor/graph_executor.cc (outdated)
// the last segmenet
if (topo_start != num_forward_nodes_) {
// the last segment
if (topo_start != num_forward_nodes_) {
nit: indentation for line 1384 & 1393
src/executor/graph_executor.cc (outdated)
// required for kLocal and kCrossDeviceCopy operations.
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
auto &node = graph_.indexed_graph()[nid].source;
nit: two space indentation
src/executor/graph_executor.cc (outdated)
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
// create forward segments for training
size_t topo_start = 0;
for (size_t nid = 0; nid < num_forward_nodes_; nid++) {
nit: we use two space indentation
OK, don't merge just yet; I'll fix the indentation. Edit: should be fixed.
Force-pushed from 2065684 to 87d7e1f.
src/executor/graph_executor.cc (outdated)
if (node->is_variable()) continue;

if (op_node.exec->exec_type() == ExecType::kLocal ||
    op_node.exec->exec_type() == ExecType::kCrossDeviceCopy) {
op_node.exec->exec_type() != ExecType::kSync
Right. I see we still wouldn't be calling Stream.Wait() for kAsync operators, and we'd get better profiler visibility. Updated.
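In terms of the simplified sketch above, the suggested change amounts to swapping the split condition. This is a paraphrase of the review suggestion rather than a verbatim copy of the final diff:

// Before: split only around kLocal / kCrossDeviceCopy ops.
bool split_old = node.exec_type == ExecType::kLocal ||
                 node.exec_type == ExecType::kCrossDeviceCopy;

// After: split around anything that is not a plain synchronous op, so kAsync
// operators also stay out of bulked segments (and remain individually visible
// to the profiler) without adding a Stream.Wait() for them.
bool split_new = node.exec_type != ExecType::kSync;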
Force-pushed from 87d7e1f to 12a3e5b.
I believe this one should be ready to merge. What do you think, @piiswrong and @eric-haibin-lin? Any more changes you'd like to see?
src/executor/graph_executor.cc (outdated)
return;
void GraphExecutor::BulkInferenceOpSegs() {
// Attempt to bulk the whole graph for inference. We will only create new segments when
// required for kLocal and kCrossDeviceCopy operations.
nit: the comment is not accurate. New segments are required for non-kSync operations. Do you mind updating it?
Good point, thanks. Updated.
Force-pushed from 12a3e5b to e9df8bd.
Please try to retrigger CI.
Done. You can also trigger a build yourself by logging in with your GitHub account, @piiswrong.
As suggested by Haibin, this segments after kLocal and kCrossDeviceCopy ops.
Force-pushed from e9df8bd to e435043.
@KellenSunderland Thanks a lot for your fix. After reading the discussion above, I tested your mxnet fork (branch: batched_op_perf_regression) with a script from here. Although the number of cudaStreamSynchronize calls returned to the same small count as the old mxnet version (i.e. 0.10.0), I also noticed that the average time spent in this function increased a little. In my case, the network runs faster than the original mxnet 1.0.0 but a bit slower than mxnet 0.10.0. At the same time, GPU usage increased from 83% to 95%, though the GPU is still not fully utilized. The details are below:
Software env: Ubuntu 16.04
It looks interesting that we are creating more streams and that we spend so much time creating them.
* Re-enable all op segments when in batch mode
* Split training/inference logic, split when required during inference. As suggested by Haibin, this segments after kLocal and kCrossDeviceCopy ops.
Description
This patch fixes a performance regression that causes a roughly 20% slowdown in some inference scenarios. I haven't investigated in depth how widespread the issue is, but at a minimum it affects inference with a ResNet model and a batch size of 1.
I'm not totally familiar with the commit that introduced the change, but it looks to me like the regression came from a refactor in the graph executor that improves readability. The performance problem arises because we introduce a number of cudaStreamSynchronize calls, which lower our overall GPU utilization (likely due to a restricted execution plan that leads to lower overall SM occupancy). It's not clear to me whether this causes any issues when running inference in a non-GPU environment.
It is possible that the change in operator segmentation behaviour was intended and has other useful implications. If so, we should be able to modify the code so that we get both the intended benefits and the faster performance.
Investigation
After verifying the regression on a demonstration model, we did some high-level measurements to see if any trends emerged. Two statistics helped diagnose the issue: first, GPU usage during inference dropped from ~95% to ~85%; second, there was a large increase in the number of cudaStreamSynchronize calls.
0.9.5 cuda calls:
0.12 cuda calls:
Note that the number of cudaStreamSynchronize calls increases from 2401 to 48601. Once this PR is applied, our models return to exactly 2401 calls.
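As a rough illustration of why the extra synchronization hurts, here is a toy CUDA-runtime program. It is not MXNet code, and the op count and buffer size are arbitrary; it simply times the same queue of asynchronous operations with a sync after every op (the regressed per-op behaviour) versus a single sync at the end of the "segment" (the bulked behaviour).

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int kOps = 1000;          // number of async ops in the "segment"
  const size_t kBytes = 1 << 16;  // arbitrary payload per op
  void *host = nullptr, *dev = nullptr;
  cudaMallocHost(&host, kBytes);  // pinned host memory so the copies stay async
  cudaMalloc(&dev, kBytes);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  auto run = [&](bool sync_every_op) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kOps; ++i) {
      cudaMemcpyAsync(dev, host, kBytes, cudaMemcpyHostToDevice, stream);
      if (sync_every_op) cudaStreamSynchronize(stream);  // per-op sync
    }
    cudaStreamSynchronize(stream);  // one sync for the whole segment
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
  };

  std::printf("per-op sync: %.2f ms\n", run(true));
  std::printf("single sync: %.2f ms\n", run(false));

  cudaStreamDestroy(stream);
  cudaFree(dev);
  cudaFreeHost(host);
  return 0;
}

On real hardware the per-op variant is dominated by host/device round trips, which mirrors the utilization gaps visible in the profiles below.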
We verified with further profiling that the cudaStreamSynchronize calls were responsible for the low GPU utilization. The following images show a timeline view of one inference call through the model. A timespan is highlighted at the top of the timeline to show relative performance, and gaps in the compute row show the relative utilization of the GPU. I've also added a little instrumentation that adds our existing profiling names to the NVIDIA tools' timeline (a very small change; I will PR it separately after Chris's awesome profiling work is merged). This also highlights the difference in behaviour when we're segmenting operators: the 0.9.5 and 1.0 (with this PR) builds both group operators into a single segment for a single inference, while the 0.12 build breaks operators apart into several segments.
0.9.5:
0.12:
1.0 with PR applied:
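The timeline instrumentation mentioned above is not part of this PR. As a hedged sketch of the general idea (the helper and segment name here are invented), wrapping a segment's execution in an NVTX range is enough to make it show up as a labelled block in NVIDIA's visual profiler:

#include <nvToolsExt.h>  // NVTX; link with -lnvToolsExt

// Hypothetical helper: run one op segment inside a named NVTX range so it
// appears as a labelled block on the profiler timeline.
void RunSegmentWithNvtx(const char* segment_name) {
  nvtxRangePushA(segment_name);  // open a named range
  // ... launch the segment's kernels / async work here ...
  nvtxRangePop();                // close the range
}

int main() {
  RunSegmentWithNvtx("bulked_segment_0");
  return 0;
}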
Checklist
Essentials
Passed code style checking (make lint)
Changes
Comments