This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

TorchModuleGraph: graph construct error when the model has shared layers #2485

Closed
zheng-ningxin opened this issue May 25, 2020 · 2 comments · Fixed by #2524
@zheng-ningxin
Contributor

Short summary about the issue/question:
TorchModuleGraph builds its index of the nodes in the jit.trace graph based on the names of the layers.
When the model reuses a layer (such as a relu) in different places, jit.trace creates two nodes for the same layer; these two nodes have the same name but different inputs and outputs. So the layer name is not a globally unique identifier for a node.

How to reproduce it:
For example, if I trace resnet18 with jit.trace and print the traced graph, we can find two nodes called layer1.0.relu. This is caused by reusing the same relu layer in the code, which is common.
```
%input.8 : Float(1, 64, 56, 56) = aten::relu_(%input.7), scope: __module.layer1/__module.layer1.0/__module.layer1.0.relu # /home/core/anaconda3/envs/znx/lib/python3.6/site-packages/torch/nn/functional.py:912:0
%input.11 : Float(1, 64, 56, 56) = aten::relu_(%input.10), scope: __module.layer1/__module.layer1.0/__module.layer1.0.relu # /home/core/anaconda3/envs/znx/lib/python3.6/site-packages/torch/nn/functional.py:912:0
```

[image: picture of the traced graph]

When I traverse to the "layer1.1.relu" at the bottom of the picture, it has the same name as the "layer1.1.relu" at the top, so calling find_successor to find the next nodes of the bottom "layer1.1.relu" also returns "layer1.1.conv2", which is actually a successor of the top "layer1.1.relu".
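The conflation can be illustrated without torch at all. Below is a minimal plain-Python sketch (the node tuples and helper functions are hypothetical, not NNI's actual data structures) showing that a successor lookup keyed by layer name alone mixes up the two uses of the shared relu, while keying the lookup by the node object itself gives the right answer:

```python
# Plain-Python sketch of the indexing bug; node names mimic the scope
# names jit.trace produces for a block that reuses one relu module.
from collections import defaultdict

# Each node: (name, input values, output values). Both relu nodes share
# a name because the model reuses the same relu module.
nodes = [
    ("layer1.1.conv1", ["x"], ["a"]),
    ("layer1.1.relu",  ["a"], ["b"]),    # first use of the shared relu
    ("layer1.1.conv2", ["b"], ["c"]),
    ("layer1.1.relu",  ["c"], ["out"]),  # second use, same name
]

by_name = defaultdict(list)
for name, ins, outs in nodes:
    by_name[name].append((ins, outs))

def find_successors_by_name(name):
    # Buggy lookup: merges the outputs of every node that shares `name`.
    outs = {o for _, node_outs in by_name[name] for o in node_outs}
    return [n for n, ins, _ in nodes if set(ins) & outs]

def find_successors_of(node):
    # Correct lookup: keyed by the node itself, not its name.
    _, _, outs = node
    return [n for n, ins, _ in nodes if set(ins) & set(outs)]

# The bottom relu has no successors at all, but the name-based lookup
# reports layer1.1.conv2, which really follows the top relu.
print(find_successors_by_name("layer1.1.relu"))  # ['layer1.1.conv2']
print(find_successors_of(nodes[3]))              # []
```

Any per-node unique key (for example, the layer name combined with the node's outputs) avoids the conflation.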

nni Environment:

  • nni version: the latest
  • nni mode(local|pai|remote):
  • OS:
  • python version: 3.6
  • is conda or virtualenv used?: yes
  • is running in docker?: No

zheng-ningxin commented Jun 2, 2020

Found another bug: we cannot merge the nodes based only on the scope name. For example, there are many nodes whose scope name is empty; the following code tries to merge them into several NodeGroups.
```python
for tname, nodes in func_to_nodes.items():
    print('###', tname)
    print(len(nodes))
    used = set()
    # extract non prim:: nodes
    non_prim_nodes = list()
    for node in nodes:
        if not node.kind().startswith('prim::'):
            non_prim_nodes.append(node)
    # for each non prim node, expand it
    for node in non_prim_nodes:
        node_group = self._expand_non_prim_node(node, nodes, input_to_node, output_to_node)
        used.update(node_group.node_cpps)
        nodes_py.nodes_op.append(node_group)
        # get shape info for view (aten::view) func
        if node_group.op_type in ['aten::view', 'aten::flatten']:
            node_group.auxiliary = self._extract_shape_info(node)
    print(len(set(nodes) - used))
    print(set(nodes) - used)
```
However, most of the prim:: nodes actually belong to the module nodes, so there are quite a few prim:: nodes that are not merged into the graph.
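This second problem can be sketched the same way. In the toy example below (the node tuples are illustrative, not NNI's real structures), every prim:: node reports an empty scope, so grouping by scope name alone drops them all into one anonymous bucket instead of the module group that actually produced them:

```python
# Plain-Python sketch: grouping trace nodes by scope name strands the
# prim:: nodes, whose scope is often empty, in a single "" bucket.
from collections import defaultdict

# (kind, scope name, outputs) tuples mimicking jit.trace nodes.
trace_nodes = [
    ("aten::conv2d",        "__module.layer1.0.conv1", ["a"]),
    ("prim::ListConstruct", "",                        ["sizes"]),
    ("aten::view",          "__module.fc",             ["flat"]),
    ("prim::GetAttr",       "",                        ["w"]),
]

scope_to_nodes = defaultdict(list)
for kind, scope, outs in trace_nodes:
    scope_to_nodes[scope].append(kind)

# Both prim:: nodes land under the empty scope, detached from the
# modules they belong to.
print(scope_to_nodes[""])  # ['prim::ListConstruct', 'prim::GetAttr']
```

To attach such nodes to the right group, the merge has to follow the graph's input/output edges rather than rely on the scope name alone.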

@zheng-ningxin
Contributor Author

#2524

@QuanluZhang QuanluZhang linked a pull request Jun 10, 2020 that will close this issue
@chicm-ms chicm-ms mentioned this issue Jul 1, 2020