Faster pointwise fusion graph pass #19269

ptrendx · 2020-10-01T23:32:08Z

Description

This PR greatly increases the speed of pointwise fusion graph pass. As a test case I used XLNet from Gluon-NLP which did expose slowness in this graph pass before: #17105 - the total time of fwd+bwd fusion after the fix from that issue was ~8s, after this PR the time is ~11 ms, with 3 ms of that taken by IndexedGraph construction from the original graph (so the actual fusion graph pass takes ~8ms for that case - about 1000x improvement).

The motivation of this PR is getting the pointwise fusion graph pass to be lightweight enough to possibly be run every time the shapes change in the graph, enabling the fusion to be shape-aware. This is important as in 2.0 NumPy semantics make multiple operations (like add, sub etc.) broadcast by default, even if in the end they are simple elementwise operations. This PR does not fully get us there (I would much like it being <5ms for the network like XLNet to be sure it is lightweight enough to not be bottleneck for all usecases), although there are some parts that would not be needed anymore if we could make this pass after infershape is already done (e.g. we would not need to insert FusedOpHelper/FusedOpOutHelper, which currently takes a little over 1ms). That said, it is a big step in that direction.

The main problem that the fusion graph pass needs to solve is not allowing the cycles to be formed: nodes which both consume the output of and provide the input to a single subgraph. The original pointwise fusion graph pass' algorithm to avoid cycles was to first construct a mapping, which nodes are excluded to be in the same subgraph with a given node. The construction of the mapping was simple but inefficient (and then improved in #17114, but still pretty slow):

for each node n that does not qualify to be in a subgraph:
    outputs = nodes reached by DFS from n in the direction of n's outputs
    inputs = nodes reached by DFS from n in the direction of n's inputs
    for each output in outputs:
        for each input in inputs:
            put (input, output) and (output, input) pair into the exclusion mapping

This was O(n^3) algorithm (improved in #17114 to be O(n^2) on average) and required a separate BidirectionalGraph datastructure that enabled DFS in both directions.

The new algorithm for pointwise fusion graph pass is designed based on 2 observations:

most of the entries in the exclusion mappings are not useful, as those nodes are not considered to be part of the same subgraph
we can traverse the graph in topological order, which the original algorithm did not take advantage of

In the new algorithm the graph is traversed in the topological order (and so we are sure that all the inputs of the current node were already processed) and each node has its own exclusion set of subgraphs that it can't be a part of. The exclusion set of node is constructed as the union of the exclusion sets of its inputs + all the subsets that its inputs are part of if the node itself is ineligible to be in a subgraph. Because subsets can merge (e.g. when you have operation a + b where a is part of subgraph s1 and b is part of subgraph s2, then s1 and s2 need to be merged into a single subgraph containing everything from s1, s2 and the + operator), the mapping is maintained to know which other subsets the current subset is merged with.

There are number of additional optimizations:

because of the topological ordering of the graph traversal, subset ids in the exclusion set are typically consecutive number (or a small group of consecutive numbers) -> therefore in the exclusion set we actually keep intervals of numbers instead of the numbers themselves
in most cases the union of exclusion sets is equal to one of the sets -> in order to avoid costly unnecessary memory allocations, we share the exclusion sets between nodes
the fwd and bwd fusions are done together in a single pass to avoid overheads

The second part of the PR is the overhaul of the actual subgraph substitution - previously it was done 1 subgraph at a time, with multiple DFSVisit calls per subgraph. Unfortunately DFSVisit is costly and most of that work was wasted (as the DFS over the entire graph was needed for the substitution of a few nodes. In the new approach a new graph is created based on the subgraph assignment generated in the previous part, which requires only a single pass over the graph to apply all subgraphs.

@samskalicky @Caenorst @mk-61

Checklist

Essentials

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

mxnet-bot · 2020-10-01T23:32:12Z

Hey @ptrendx , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, sanity, windows-cpu, edge, unix-gpu, windows-gpu, centos-gpu, miscellaneous, unix-cpu, centos-cpu, clang]

Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ptrendx · 2020-10-02T00:21:43Z

Seems cpplint that we use does not like structured binding from C++17.

ptrendx · 2020-10-06T15:51:32Z

Ok, issue #19264 came at a perfect time - it actually exposed a corner case that breaks both the original and the new algorithm. The minimal example looks like this:

where a0, a1, b0 and b1 are fusable, while xa and xb are not. According to the original algorithm a0 and a1 can be fused together and b0 and b1 can be fused together. But if we do that we get:

It happens because the algorithm does not take into account that the exclusion sets of the nodes connected to a subgraph need to contain the excluded nodes from all nodes in that subgraph, even if they were added later in the subgraph creation process.

I am working on a solution that will fix that.

ptrendx · 2020-10-10T00:10:22Z

I fixed the problem with the cycle - every time when I add a new node to a subset, I check whether it's exclusion set is fully included in the future fusion node exclusion set. If so, I update all descendants of the fusion node (with the assumption that typically fusion subsets are not very big and their nodes are quite close together in the indexed graph, so the cost of that is not very high). Still, the check is quite expensive compared to the rest of the algorithm and now the cost of the graph pass for XLNet is ~10-11 ms (still a huge improvement compared to the previous algorithm).

mseth10 · 2020-10-12T11:11:01Z

Thanks for the optimization @ptrendx . A couple of questions based on the PR description:

Because subsets can merge (...), the mapping is maintained to know which other subsets the current subset is merged with.

What do you mean by subsets? Is it the same as subgraphs?
What mapping are you referring to here?

mseth10 · 2020-10-12T11:37:18Z

@ptrendx In regards to this edge case #19269 (comment), considering a0 -> xa -> b0 -> b1 -> xb -> a1 as the topological graph, before reaching node xb, we should have already combined b0 and b1 into subset b and updated exclusion set of b to contain a0. If that's the case, a0 should automatically be included in the exclusion set for xb. Why then do we need to "update all descendants of the fusion node"?

ptrendx · 2020-10-12T16:29:21Z

Hi @mseth10, I think all of your 3 questions are quite connected. I agree that the nomenclature here is slightly confusing. I use "subsets" because in this algorithm there are actually 2 parts:

the first one just finds the sets of nodes that can be fused together but does not do any actual fusion
the second one turns those subsets into the actual subgraphs and puts them inside the fusion node
About the mapping - consider a following graph:

where a and b are the outputs of some previously identified subgraphs (let's call those subgraphs A and B). Node c (which can't be fused) has A in its exclusion list. Then + node is being considered, which can be fused. It merges subgraphs A and B together under one of them (let's say B). Then subgraph A needs to have a mapping that says that it is a part of B now (and B should also have an inverse mapping that it contains subgraphs A and B in order to be able to check against nodes like c which only saw A but did not see B).

ptrendx · 2020-10-13T20:41:14Z

@mxnet-bot run ci [unix-cpu, unix-gpu]

mxnet-bot · 2020-10-13T20:41:20Z

Jenkins CI successfully triggered : [unix-cpu, unix-gpu]

ptrendx · 2020-10-13T22:42:25Z

Unix-gpu failed due to test_countsketch (issue #10988)
Unix-cpu failed due to instability in test_np_average (issue #19071)

@mxnet-bot run ci [unix-gpu, unix-cpu]

mxnet-bot · 2020-10-13T22:42:32Z

Jenkins CI successfully triggered : [unix-cpu, unix-gpu]

szha · 2020-10-14T00:55:49Z

@mseth10 any more comments on the PR?

src/imperative/simple_partition_pass.cc

src/imperative/pointwise_fusion_pass.cc

mseth10 · 2020-10-14T12:13:43Z

src/imperative/pointwise_fusion_pass.cc

+                node_id));
+        new_nodes[i]->control_deps.insert(new_nodes[i]->control_deps.begin() + dep_num,
+                                          std::move(helper_node));
+      } else if (their_subgraph_id != subgraph_id &&


can subgraph_id be -1 here? or should we add that check here?

It can. The role of FusedOpHelper is to get information (for shape/type inference) from inside the FusedOp to the outside world.

So if your dependency is going to be fused, then you need to have this helper instead.

Makes sense. In general, what are the control flow dependencies of a node? Is it just the dependency of backward op node on its corresponding forward op node?

Yeah, that is the dependency we are changing as it is used for infershape/infertype. Going forward I would like to get rid of this and never do the inferattr on the fused graph (instead do it on the original graph and map the results to the fused one), but it is outside the scope of this PR.

ptrendx · 2020-10-16T21:16:40Z

@mxnet-bot run ci [all]

mxnet-bot · 2020-10-16T21:16:47Z

Jenkins CI successfully triggered : [windows-gpu, unix-gpu, sanity, edge, windows-cpu, miscellaneous, centos-cpu, clang, website, unix-cpu, centos-gpu]

ptrendx · 2020-10-16T22:10:27Z

@ChaiBapchya FYI - there seem to be some connection issues in CI - first sanity test failed because of connection reset, then centos-cpu test failed with

[2020-10-16T21:36:45.894Z] ERROR: for centos7_cpu  ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

(https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/centos-cpu/job/PR-19269/8/display/redirect)

@mxnet-bot run ci [centos-cpu]

mxnet-bot · 2020-10-16T22:10:29Z

Undefined action detected.
Permissible actions are : run ci [all], run ci [job1, job2]
Example : @mxnet-bot run ci [all]
Example : @mxnet-bot run ci [centos-cpu, clang]

ptrendx · 2020-10-16T22:11:13Z

@mxnet-bot run ci [centos-cpu]

mxnet-bot · 2020-10-16T22:11:17Z

Jenkins CI successfully triggered : [centos-cpu]

mseth10

LGTM

* Faster pointwise fusion graph pass * Fix lint * Fix lint 2 * Fixes * Fixing slice parameter handling in fusion * Fixing the slice fix * Fix the cycle bug * Added test * Fix lint * Fix merging of subgraphs * Fixes from review

* Faster pointwise fusion graph pass (#19269) * Faster pointwise fusion graph pass * Fix lint * Fix lint 2 * Fixes * Fixing slice parameter handling in fusion * Fixing the slice fix * Fix the cycle bug * Added test * Fix lint * Fix merging of subgraphs * Fixes from review * Use std::tie instead of C++17 structured binding * More fixes for lack of c++17 * Fix

…che#19413) * Faster pointwise fusion graph pass (apache#19269) * Faster pointwise fusion graph pass * Fix lint * Fix lint 2 * Fixes * Fixing slice parameter handling in fusion * Fixing the slice fix * Fix the cycle bug * Added test * Fix lint * Fix merging of subgraphs * Fixes from review * Use std::tie instead of C++17 structured binding * More fixes for lack of c++17 * Fix

Faster pointwise fusion graph pass

08a0b37

ptrendx added the pr-work-in-progress PR is still work in progress label Oct 1, 2020

ptrendx added 2 commits October 1, 2020 17:27

Fix lint

adba037

Fix lint 2

fe11c99

szha requested a review from mseth10 October 3, 2020 21:13

ptrendx added 2 commits October 5, 2020 09:51

Fixes

898922b

Fixing slice parameter handling in fusion

3c7e3d2

ptrendx mentioned this pull request Oct 5, 2020

Same Network can hybridize on CPU but can not hybridize on GPU. #19264

Closed

ptrendx added 4 commits October 6, 2020 10:00

Fixing the slice fix

a9d1c7f

Fix the cycle bug

c56de4c

Added test

800f8aa

Fix lint

bb131e4

ptrendx added pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress labels Oct 9, 2020

Fix merging of subgraphs

01e22a0

ptrendx linked an issue Oct 12, 2020 that may be closed by this pull request

Same Network can hybridize on CPU but can not hybridize on GPU. #19264

Closed

ptrendx mentioned this pull request Oct 12, 2020

Gluoncv HRNet Code raise error in mxnet-1.6.0 gpu hybridize mode #18624

Closed

ptrendx linked an issue Oct 12, 2020 that may be closed by this pull request

Gluoncv HRNet Code raise error in mxnet-1.6.0 gpu hybridize mode #18624

Closed

mseth10 reviewed Oct 14, 2020

View reviewed changes

Fixes from review

db20ba6

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020

lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 16, 2020

mseth10 approved these changes Oct 18, 2020

View reviewed changes

sxjscience merged commit a0fd1fe into apache:master Oct 19, 2020

ptrendx mentioned this pull request Oct 22, 2020

[1.x] Backport Faster pointwise fusion graph pass (#19269) #19413

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster pointwise fusion graph pass #19269

Faster pointwise fusion graph pass #19269

ptrendx commented Oct 1, 2020

mxnet-bot commented Oct 1, 2020

ptrendx commented Oct 2, 2020

ptrendx commented Oct 6, 2020

ptrendx commented Oct 10, 2020 •

edited

Loading

mseth10 commented Oct 12, 2020

mseth10 commented Oct 12, 2020

ptrendx commented Oct 12, 2020

ptrendx commented Oct 13, 2020

mxnet-bot commented Oct 13, 2020

ptrendx commented Oct 13, 2020

mxnet-bot commented Oct 13, 2020

szha commented Oct 14, 2020

mseth10 Oct 14, 2020

ptrendx Oct 16, 2020

ptrendx Oct 16, 2020

mseth10 Oct 17, 2020 •

edited

Loading

ptrendx Oct 17, 2020

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

mseth10 left a comment

Faster pointwise fusion graph pass #19269

Faster pointwise fusion graph pass #19269

Conversation

ptrendx commented Oct 1, 2020

Description

Checklist

Essentials

Comments

mxnet-bot commented Oct 1, 2020

ptrendx commented Oct 2, 2020

ptrendx commented Oct 6, 2020

ptrendx commented Oct 10, 2020 • edited Loading

mseth10 commented Oct 12, 2020

mseth10 commented Oct 12, 2020

ptrendx commented Oct 12, 2020

ptrendx commented Oct 13, 2020

mxnet-bot commented Oct 13, 2020

ptrendx commented Oct 13, 2020

mxnet-bot commented Oct 13, 2020

szha commented Oct 14, 2020

mseth10 Oct 14, 2020

Choose a reason for hiding this comment

ptrendx Oct 16, 2020

Choose a reason for hiding this comment

ptrendx Oct 16, 2020

Choose a reason for hiding this comment

mseth10 Oct 17, 2020 • edited Loading

Choose a reason for hiding this comment

ptrendx Oct 17, 2020

Choose a reason for hiding this comment

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

ptrendx commented Oct 16, 2020

mxnet-bot commented Oct 16, 2020

mseth10 left a comment

Choose a reason for hiding this comment

ptrendx commented Oct 10, 2020 •

edited

Loading

mseth10 Oct 17, 2020 •

edited

Loading