Partitioning Gluon HybridBlocks #15969

samskalicky · 2019-08-22T01:43:39Z

Description

Adds partitioning support for Gluon HybridBlocks. This is a continuation of the partitioning support for Symbol in #15886.

Design

In Gluon, a HybridBlock contains a Symbol after hybridizing and executing a forward pass. The Symbol is contained and managed within the block. The partitioning logic will be integrated into the hybridize flow.

There are many ways to create a Gluon HybridBlock; once it is created, users call the hybridize() function to start the flow. We add two new arguments to support partitioning: backend, a string naming the subgraph_backend, and opt_args, a map of arguments to pass to the subgraph_property during partitioning. These values are stored until they are used during the first inference call. Here's an example specifying these new arguments:

net = create()
net.hybridize(backend='default', opt_args={'excluded_ops': ['BatchNorm']})

Notice that in the above example, the new arguments have the same value as the example in #15886. These arguments will ultimately be passed to a call to the optimize_for API.
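To illustrate "stored until used", here is a minimal sketch of a block that only records the two arguments when hybridize() is called. SketchBlock and its attribute names are hypothetical stand-ins written for this example, not MXNet code.

```python
class SketchBlock:
    """Hypothetical stand-in for a HybridBlock; not part of MXNet."""

    def __init__(self):
        self._active = False
        self._backend = None
        self._opt_args = {}

    def hybridize(self, active=True, backend=None, opt_args=None):
        # Only record the request here; nothing is partitioned yet.
        self._active = active
        self._backend = backend
        # Copy so later changes to the caller's dict do not leak in.
        self._opt_args = dict(opt_args) if opt_args else {}


net = SketchBlock()
net.hybridize(backend='default', opt_args={'excluded_ops': ['BatchNorm']})
```

In this sketch the stored values would be read later, when the cached graph is first built.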

In Gluon, the hybridize flow starts before the first inference. The Symbol object is created in the _build_cache function:
https://github.com/apache/incubator-mxnet/blob/bd67723da96e6d36e72c9a42535a4fe68f234a71/python/mxnet/gluon/block.py#L933-L934
We'll add a new line of code to partition it and pass the new arguments from the hybridize call:

def _build_cache(self, *args):
    data, out = self._get_graph(*args)
    # Partition only when a backend was passed to hybridize().
    if self.backend:
        out = out.optimize_for(self.backend, **self.opt_args)

This supports the partitioning flow without shape/type propagation. Some backends do not need shapes and types so there is no reason to require it for all backends. Other backends will require shapes and types in order to partition the model correctly (examples being backends that only support float16 and not float32, or only support small shapes and not large ones).
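The branch without shape/type propagation can be exercised without MXNet using a small stand-in. FakeSymbol and build_cache below are hypothetical names used only for illustration; the real Symbol.optimize_for takes more arguments than this sketch models.

```python
class FakeSymbol:
    """Hypothetical stand-in for a Symbol; records how it was partitioned."""

    def __init__(self, name, backend=None, options=None):
        self.name = name
        self.backend = backend
        self.options = options or {}

    def optimize_for(self, backend, **kwargs):
        # Return a new object, mirroring how partitioning yields a new graph.
        return FakeSymbol(self.name, backend=backend, options=kwargs)


def build_cache(symbol, backend=None, opt_args=None):
    out = symbol
    # No backend requested: keep the graph unchanged.
    if backend:
        out = out.optimize_for(backend, **(opt_args or {}))
    return out


plain = build_cache(FakeSymbol('net'))
partitioned = build_cache(FakeSymbol('net'), 'default',
                          {'excluded_ops': ['BatchNorm']})
```

Here plain is returned untouched, while partitioned carries the backend name and options it was given.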

For partitioning with shape/type propagation, we can get the args to the model from the parameters in the Gluon block. By default, the initialization of Gluon parameters may be deferred. If the parameters are not initialized yet, we'll continue with the flow shown in the code snippet above, which does not infer shapes/types.

In Gluon, users can force initialization (see this guide). If all parameters are initialized after calling hybridize and setting the backend name, we will pass the arguments from the Gluon parameters into the optimize_for API to infer shapes/types before partitioning. This gives users the same control over partitioning that they have with the Symbol API. Here's a code snippet that builds the arg array and passes it to optimize_for:

arg_array = []
try:
    # Collect arguments in the order the symbol lists them.
    for name in out.list_arguments():
        if name in data_names:
            # Graph input: take the value passed to the forward call.
            arg_array.append(args[data_names[name]])
        else:
            arg_array.append(params.get(name))
except (DeferredInitializationError, RuntimeError):
    # Parameters are not initialized yet: partition without shape/type inference.
    arg_array = None
out = out.optimize_for(self._backend, arg_array, ctx, **self._backend_args)

The context will be gathered from the inputs to the model like this:

ctx = args[0].context

Context is required to infer storage types.
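The fallback described above (use the parameter values when they exist, otherwise pass no arg array) can be sketched in isolation. Everything here is a stand-in: the exception class only imitates MXNet's DeferredInitializationError, and FakeParam and collect_arg_array are hypothetical helpers, so this shows the control flow rather than the merged implementation.

```python
class DeferredInitializationError(Exception):
    """Stand-in for MXNet's exception of the same name."""


class FakeParam:
    """Hypothetical parameter whose data may not be available yet."""

    def __init__(self, value=None):
        self._value = value

    def data(self):
        if self._value is None:
            raise DeferredInitializationError('parameter not initialized')
        return self._value


def collect_arg_array(arg_names, data_names, args, params):
    """Return arguments in graph order, or None if any parameter is deferred."""
    try:
        return [args[data_names[name]] if name in data_names
                else params[name].data()
                for name in arg_names]
    except (DeferredInitializationError, RuntimeError):
        return None


params = {'weight': FakeParam([0.5]), 'bias': FakeParam()}
data_names = {'data': 0}   # maps input name -> position in args
args = [[1.0, 2.0]]

ready = collect_arg_array(['data', 'weight'], data_names, args, params)
pending = collect_arg_array(['data', 'weight', 'bias'], data_names, args, params)
```

In this sketch ready holds the input followed by the weight, while pending is None because bias has no data yet, which corresponds to partitioning without shape/type inference.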

Note

Partitioning is done as part of the hybridize flow, when building the CachedOp. So if shapes change between inference calls, the graph is not re-partitioned.
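A small sketch of why a shape change does not trigger re-partitioning: the cache is built once, on the first call, and reused afterwards. CachingBlock and its members are hypothetical and only model the caching behaviour described in this note.

```python
class CachingBlock:
    """Hypothetical block that partitions once, when the cache is built."""

    def __init__(self, backend):
        self._backend = backend
        self._cached_op = None
        self.partition_calls = 0

    def _build_cache(self, shape):
        # Partitioning happens here, using the shape seen on the first call.
        self.partition_calls += 1
        self._cached_op = ('partitioned', self._backend, shape)

    def forward(self, shape):
        if self._cached_op is None:
            self._build_cache(shape)
        # Later calls reuse the cached result regardless of shape.
        return self._cached_op


blk = CachingBlock('default')
first = blk.forward((1, 3))
second = blk.forward((8, 3))
```

Both calls return the same cached object, and the partition counter stays at one even though the second input shape differs.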

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Refactors the subgraph tests to return input names and shapes (needed for Gluon). Improves the test architecture to be clearer about what we're testing (subgraph API, optimize_for, Gluon).

leezu · 2019-08-22T12:21:42Z

Should this be run prior to training or prior to exporting the HybridBlock? Could/Should it be run automatically?

Edit: Based on offline discussion, automatic optimization could be run if the backend can be detected automatically. We would not want to automatically export an optimized symbol.

samskalicky · 2019-08-23T19:13:03Z

Waiting on #15886 to be merged so we can re-use the optimize_for API call on Symbol.

anirudhacharya · 2019-08-26T16:22:12Z

@mxnet-label-bot add [pr-awaiting-review]

python/mxnet/gluon/block.py

samskalicky · 2020-01-23T00:26:46Z

Thanks @guanxinq for the latest update! I think we need to call optimize_for again here too when we create a SymbolBlock, otherwise the partitioning won't happen:
https://github.com/apache/incubator-mxnet/blob/3d18974fdc990b7def2401fae8e46fb0b030442f/python/mxnet/gluon/block.py#L1340
Because once self._cached_graph is set, the previous code won't be executed in the _get_graph function.

samskalicky · 2020-01-23T00:44:10Z

@eric-haibin-lin we shouldn't partition inside _get_graph since it's also called here:
https://github.com/apache/incubator-mxnet/blob/3d18974fdc990b7def2401fae8e46fb0b030442f/python/mxnet/gluon/block.py#L1068

python/mxnet/gluon/block.py

mseth10

Looks good to me. Good job @guanxinq @samskalicky

mseth10 · 2020-02-05T00:05:43Z

@leezu your comments have been addressed. Can you please review again?

leezu

Changes in python/mxnet/gluon/block.py LGTM

guanxinq

Looks good to me.

eric-haibin-lin

Request for documentation. Otherwise looks good to me

eric-haibin-lin · 2020-02-05T19:05:31Z

python/mxnet/gluon/block.py

@@ -1040,7 +1052,12 @@ def register_child(self, block, name=None):
        super(HybridBlock, self).register_child(block, name)
        self._clear_cached_op()

-    def hybridize(self, active=True, **kwargs):
+    def hybridize(self, active=True, backend=None, backend_args=None, **kwargs):


hmmm. Is this specific to HybridBlock? Can we add documentation?

leezu · 2020-02-05

Based on prior discussion with @samskalicky, documentation should describe what happens if input shapes change in subsequent forward calls (i.e., currently no repartitioning is triggered).

guanxinq · 2020-02-05

Added the description for HybridBlock hybridize().

eric-haibin-lin · 2020-02-05

Don't see it - did you push?

guanxinq · 2020-02-05

Just pushed. Could you help review the description?

eric-haibin-lin · 2020-02-06

The concepts of SubgraphBackendRegistry, PostPartition, etc. are new and not very straightforward for users. Is it possible to also add a link to a tutorial that teaches users how to register a subgraph backend?

mseth10 · 2020-02-06

As discussed offline, we plan to add a tutorial as part of our next PR and link it to the example. I have put together the TODO list for the next PR in this GitHub issue: #17532.

tests/python/unittest/test_subgraph_op.py

python/mxnet/gluon/block.py

rondogency · 2020-02-05T20:06:22Z

python/mxnet/gluon/block.py

+        if backend_args is None:
+            self._backend_args = {}
+        else:
+            self._backend_args = backend_args


We need to ensure that users pass a dictionary (a user may pass a string), so we need to add a check below before assigning it to _backend_args.

samskalicky · 2020-02-05

Let's add something like

if isinstance(backend_args, dict):

guanxinq · 2020-02-05

Fixed. Thanks.

mseth10 · 2020-02-05

We still need an else block.

else: self._backend_args = {}

mseth10 · 2020-02-05

Okay, we don't need else as it is initialized to {}.
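The outcome of this thread, as it reads here, is to start from an empty dict and overwrite it only when a dict is passed. A minimal sketch of that pattern follows; normalize_backend_args is a hypothetical helper name, not a function in block.py.

```python
def normalize_backend_args(backend_args=None):
    """Return backend_args if it is a dict, otherwise an empty dict."""
    # Initialize to {} first, so no else branch is needed.
    result = {}
    if isinstance(backend_args, dict):
        result = backend_args
    return result


from_none = normalize_backend_args(None)
from_string = normalize_backend_args('excluded_ops=BatchNorm')
from_dict = normalize_backend_args({'excluded_ops': ['BatchNorm']})
```

Note that in this form a non-dict value such as a string is silently ignored; raising a TypeError instead would be a stricter alternative, which the thread does not discuss.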

mseth10 · 2020-02-05T22:46:22Z

python/mxnet/gluon/block.py

+            Whether to turn hybrid on or off.
+        backend : str
+            The name of backend, as registered in `SubgraphBackendRegistry`, default None
+        backend_args : dict of optional arguments, optional


nit: optional twice

mseth10 · 2020-02-05T23:42:04Z

python/mxnet/gluon/block.py

-            but slower.
-        """
+        """ Please refer description of HybridBlock hybridize().
+        """ 


can you get rid of trailing whitespaces

agree to address doc issue in a future PR

* stub for optimizing Gluon block
* Init commit for Gluon hybridblocks partition(sample test included)
* Added tests for Gluon and refactored tests
* call optimize_for in _build_cache
* Pass in 4 paras for gluon optimize_for
* Fixed auxiliary state issue, args issue and added 2 tests.
* Fixed auxiliary state issue, args issue and added 2 tests.
* changed parameter check
* refactored param init since needed for partitioning
* fixed whitespace
* fixed flattened args
* fixed sanity & updated tests
* fixed whitespace
* added context support in tests
* Fix python2 errors
* clean code remove cargs
* Add hybridblock hybridize() description

Co-authored-by: guanxinq <[email protected]>

stub for optimizing Gluon block

1967b76

samskalicky requested a review from szha as a code owner August 22, 2019 01:43

marcoabreu added the pr-awaiting-review PR is waiting for code review label Aug 26, 2019

samskalicky mentioned this pull request Jan 7, 2020

Dynamic subgraph property #17034

Merged

4 tasks

Merge remote-tracking branch 'upstream/master' into cached_op_partition

2c50232

samskalicky requested review from anirudh2290 and eric-haibin-lin as code owners January 17, 2020 18:57

guanxinq force-pushed the cached_op_partition branch from 3a64283 to abea125 Compare January 17, 2020 21:16

Merge remote-tracking branch 'upstream/master' into cached_op_partition

e2803b2

guanxinq force-pushed the cached_op_partition branch from abea125 to f90b9ab Compare January 17, 2020 23:32

samskalicky commented Jan 18, 2020

samskalicky commented Jan 18, 2020

guanxinq force-pushed the cached_op_partition branch from f90b9ab to d286167 Compare January 21, 2020 18:47

Init commit for Gluon hybridblocks partition(sample test included)

4b3d076

guanxinq force-pushed the cached_op_partition branch from d286167 to 4b3d076 Compare January 21, 2020 18:50

samskalicky commented Jan 21, 2020

samskalicky commented Jan 21, 2020

Added tests for Gluon and refactored tests

3d18974

leezu reviewed Jan 23, 2020

guanxinq added 2 commits January 23, 2020 18:07

call optimize_for in _build_cache

4622878

Pass in 4 paras for gluon optimize_for

434c3c7

guanxinq force-pushed the cached_op_partition branch 2 times, most recently from b6636da to 7228343 Compare January 29, 2020 22:55

samskalicky commented Jan 30, 2020

samskalicky commented Jan 30, 2020

samskalicky and others added 4 commits February 4, 2020 07:23

fixed sanity & updated tests

f7e27b6

fixed whitespace

4449755

added context support in tests

d827dd5

Fix python2 errors

b037872

guanxinq reviewed Feb 4, 2020

mseth10 reviewed Feb 4, 2020

mseth10 reviewed Feb 4, 2020

clean code remove cargs

80dfaed

guanxinq force-pushed the cached_op_partition branch from 90daf39 to 80dfaed Compare February 4, 2020 21:54

mseth10 approved these changes Feb 4, 2020

leezu approved these changes Feb 5, 2020

guanxinq approved these changes Feb 5, 2020

eric-haibin-lin previously requested changes Feb 5, 2020

rondogency reviewed Feb 5, 2020

rondogency reviewed Feb 5, 2020

guanxinq force-pushed the cached_op_partition branch from 71c88a7 to 5aaff0b Compare February 5, 2020 22:37

mseth10 reviewed Feb 5, 2020

guanxinq force-pushed the cached_op_partition branch 2 times, most recently from b332edd to 966a383 Compare February 5, 2020 23:08

mseth10 reviewed Feb 5, 2020

Add hybridblock hybridize() description

14dcc14

guanxinq force-pushed the cached_op_partition branch from 966a383 to 14dcc14 Compare February 5, 2020 23:47

mseth10 mentioned this pull request Feb 6, 2020

[RFC] Partitioning for a given backend #17532

Closed

eric-haibin-lin merged commit 9993738 into apache:master Feb 6, 2020

samskalicky mentioned this pull request Feb 13, 2020

Dynamic subgraph property doc #17585

Merged

4 tasks

samskalicky mentioned this pull request Feb 21, 2020

Dynamic subgraph compile support #17623

Merged

5 tasks
