Failing Test Cases #125

Open
vieting opened this issue Sep 15, 2022 · 7 comments

Comments

@vieting
Contributor

vieting commented Sep 15, 2022

The test cases fail for the last commit on the main branch. Since that commit only makes a minor change to the README, the failures are very likely caused by recent RETURNN updates.

See linked tests in aaa35f9

@vieting
Contributor Author

vieting commented Nov 17, 2022

As this came up again, I had a look at the issues. They occur when running from the network dict; all runs up to that point (including the net dict creation) work.

In RandIntLayer.get_out_data_from_opts(), the dim tags inferred from shape are

>>> dim_tags
[Dim{B}, Dim{'3*time:data'[B]}]
>>> dim_tags[1].dyn_size
<tf.Tensor 'mul_randint/mul:0' shape=(?,) dtype=int32>

However, the dyn_size is removed in get_for_batch_ctx.

>>> dim_tags[1].get_for_batch_ctx(batch, ctx).dyn_size
None

I'm not familiar enough with the details there to have a clear idea why this is the case. It subsequently leads to the error in RandIntLayer.__init__() when .get_dim_value() is called on that dim tag. @albertz, do you have an idea how to fix this?

The other two failing tests are related.
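
To compare the working and the failing run side by side, the check above could be wrapped in a small helper. This is only a sketch: `describe_dim_for_batch` is a hypothetical name, and it relies solely on what appears in this thread (`.batch`, `.dyn_size`, and `.get_for_batch_ctx(batch, ctx)`). The `_FakeDim` class is a stand-in so the snippet runs without RETURNN; its toy lookup rule is an assumption and not the real implementation.

```python
def describe_dim_for_batch(dim, batch, ctx):
    """Report whether dyn_size survives get_for_batch_ctx (hypothetical helper)."""
    resolved = dim.get_for_batch_ctx(batch, ctx)
    return {
        "dim": repr(dim),
        "dim.batch": repr(getattr(dim, "batch", None)),
        "same_object": resolved is dim,
        "dyn_size_before": dim.dyn_size is not None,
        "dyn_size_after": resolved.dyn_size is not None,
    }


class _FakeDim:
    """Stand-in for a dim tag, only for demonstration (NOT RETURNN code)."""

    def __init__(self, name, batch=None, dyn_size=None):
        self.name, self.batch, self.dyn_size = name, batch, dyn_size

    def __repr__(self):
        return "Dim{%r}" % self.name

    def get_for_batch_ctx(self, batch, ctx):
        # Assumed toy rule: reuse self if no batch is set or the batch matches,
        # otherwise return a fresh variant that has no dyn_size yet.
        if self.batch is None or self.batch is batch:
            return self
        return _FakeDim(self.name, batch=batch, dyn_size=None)


batch_a, batch_b = object(), object()
dim = _FakeDim("3*time:data", batch=batch_a, dyn_size="sizes")
report_same = describe_dim_for_batch(dim, batch_a, None)
report_other = describe_dim_for_batch(dim, batch_b, None)
print(report_same)
print(report_other)
```

With a real dim tag, the same helper could be called once in each run to see where `dyn_size_after` flips.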

@albertz
Member

albertz commented Nov 18, 2022

I cannot see the tests anymore. Can you post the relevant exceptions here?

What is batch and ctx in your example?

Probably some complete_dyn_size is missing here.

@vieting
Contributor Author

vieting commented Nov 18, 2022

You can currently see the tests here from the last commit in main.

>>> ctx
None
>>> batch
BatchInfo{B}
>>> vars(batch)
{'_descendants_by_beam_name': {},
 '_dim': None,
 '_global_beam_dims_by_beam_name': {},
 '_global_descendants_by_virtual_dims': {(GlobalBatchDim{B},): BatchInfo{B}},
 '_global_padded_dims_by_dim_tag': {},
 '_packed_dims_by_dim_tag': {},
 'base': None,
 'descendants': [],
 'virtual_dims': [GlobalBatchDim{B}]}

I cannot see a difference regarding these between the working _run_torch_returnn_drop_in() and the failing _run_returnn_standalone_net_dict().

@albertz
Member

albertz commented Nov 18, 2022

Error:

ERROR: test_layers.test_randint_dynamic
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/.local/lib/python3.8/site-packages/nose/case.py", line 198, in TestBase.runTest
    line: self.test(*self.arg)
    locals:
      self = <local> test_layers.test_randint_dynamic
      self.test = <local> <function test_randint_dynamic at 0x7f79b0d533a0>
      self.arg = <local> ()
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/tests/test_layers.py", line 46, in test_randint_dynamic
    line: verify_torch_and_convert_to_returnn(model_func, inputs=x, inputs_data_kwargs={
            "shape": (None, n_feat), "batch_dim_axis": 0, "time_dim_axis": 1, "feature_dim_axis": 2})
    locals:
      verify_torch_and_convert_to_returnn = <global> <function verify_torch_and_convert_to_returnn at 0x7f7990de6430>
      model_func = <local> <function test_randint_dynamic.<locals>.model_func at 0x7f7990cdf430>
      inputs = <not found>
      x = <local> array([[[ 0.49671414, -0.1382643 ,  0.64768857,  1.5230298 ,
                           -0.23415338, -0.23413695,  1.5792128 ],
                          [ 0.7674347 , -0.46947438,  0.54256004, -0.46341768,
                           -0.46572974,  0.24196227, -1.9132802 ],
                          [-1.7249179 , -0.5622875 , -1.0128311 ,  0.31424734,
                           -0.9080...
      inputs_data_kwargs = <not found>
      n_feat = <local> 7
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 436, in verify_torch_and_convert_to_returnn
    line: converter.run()
    locals:
      converter = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>
      converter.run = <local> <bound method Converter.run of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 143, in Converter.run
    line: self._run_returnn_standalone_net_dict()
    locals:
      self = <local> <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>
      self._run_returnn_standalone_net_dict = <local> <bound method Converter._run_returnn_standalone_net_dict of <pytorch_to_returnn.converter.converter.Converter object at 0x7f7990d39250>>
  File "/home/runner/work/pytorch-to-returnn/pytorch-to-returnn/pytorch_to_returnn/converter/converter.py", line 353, in Converter._run_returnn_standalone_net_dict
    line: network.construct_from_dict(self._returnn_net_dict)
...
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/network.py", line 1189, in TFNetwork._create_layer
    line: layer = layer_class(**layer_desc)
    locals:
      layer = <not found>
      layer_class = <local> <class 'returnn.tf.layers.basic.RandIntLayer'>
      layer_desc = <local> {'shape': (Dim{B}, Dim{'3*time:data'[B]}), 'maxval': <CastLayer 'mul_randint_Cast' out_type=Data{[], dtype='int64'}>, 'minval': 0, 'dtype': 'int64', '_network': <TFNetwork 'root' train=False>, '_name': 'mul_randint', 'sources': [<SourceLayer 'data' out_type=Data{[B,T|'time:data'[B],F|F'feature:da..., len = 10
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/basic.py", line 2735, in RandIntLayer.__init__
    line: shape_ = [
            d.get_for_batch_ctx(batch, self.network.control_flow_ctx).get_dim_value()
            for d in self.output.dim_tags]
    locals:
      shape_ = <not found>
      d = <not found>
      d.get_for_batch_ctx = <not found>
      batch = <local> BatchInfo{B}
      self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}>
      self.network = <local> <TFNetwork 'root' train=False>
      self.network.control_flow_ctx = <local> None
      get_dim_value = <not found>
      self.output = <local> Data{'mul_randint_output', [B,T|'3*time:data'[B]], dtype='int64'}
      self.output.dim_tags = <local> (Dim{B}, Dim{'3*time:data'[B]})
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/layers/basic.py", line 2736, in <listcomp>
    line: d.get_for_batch_ctx(batch, self.network.control_flow_ctx).get_dim_value()
    locals:
      d = <local> Dim{'3*time:data'[B]}
      d.get_for_batch_ctx = <local> <bound method Dim.get_for_batch_ctx of Dim{'3*time:data'[B]}>
      batch = <local> BatchInfo{B}
      self = <local> <RandIntLayer 'mul_randint' out_type=Data{[B,T|'3*time:data'[B]], dtype='int64'}>
      self.network = <local> <TFNetwork 'root' train=False>
      self.network.control_flow_ctx = <local> None
      get_dim_value = <not found>
  File "/home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/data.py", line 1191, in Dim.get_dim_value
    line: raise Exception('%s: need placeholder, self.dimension or self.dyn_size for dim value' % self)
    locals:
      Exception = <builtin> <class 'Exception'>
      self = <local> Dim{'3*time:data'[B]}
Exception: Dim{'3*time:data'[B]}: need placeholder, self.dimension or self.dyn_size for dim value

@albertz
Member

albertz commented Nov 18, 2022

I wonder, in get_dim_value we already call complete_dyn_size, so why is it not available? It would maybe be helpful to debug-step through it.

@vieting
Contributor Author

vieting commented Nov 19, 2022

In the first run, dim_tags[1].batch = None, while in the second run we get dim_tags[1].batch = BatchInfo{B}. So when calling dim_tags[1].get_for_batch_ctx(batch, ctx) in the second run, batch == dim_tags[1].batch is True, which leads to different behavior in get_for_batch_ctx.

The differences in get_for_batch_ctx() are:

This is executed in the second run:

    315       self._validate_in_current_graph()
--> 316       self._maybe_update()

Then later, same_base.batch == batch evaluates to False in the second run because their virtual_dims are not the same.

>>> same_base.batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> batch.virtual_dims[0].size
<tf.Tensor 'extern_data/placeholders/batch_dim:0' shape=() dtype=int32>
>>> same_base.batch.virtual_dims[0].size == batch.virtual_dims[0].size
False

So same_base is not returned, unlike in the first run.

The difference in .batch is already present in the input shape, which comes from the network dict.
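
The comparison result above is consistent with an identity-based comparison of two distinct objects that merely print the same. A minimal pure-Python sketch of that effect follows; `FakeGraph` and `FakeTensor` are hypothetical stand-ins, not TensorFlow classes, and the assumption that the two sizes live in different graphs is only an illustration of one way this can happen.

```python
class FakeGraph:
    """Stand-in for a computation graph (hypothetical, for illustration)."""


class FakeTensor:
    """Stand-in for a graph tensor; equality falls back to object identity."""

    def __init__(self, graph, name):
        self.graph, self.name = graph, name

    def __repr__(self):
        return "<FakeTensor %r>" % self.name


# Two placeholders with the same name, created in two separate graphs.
g1, g2 = FakeGraph(), FakeGraph()
size_1 = FakeTensor(g1, "extern_data/placeholders/batch_dim:0")
size_2 = FakeTensor(g2, "extern_data/placeholders/batch_dim:0")

same_repr = repr(size_1) == repr(size_2)   # the printed form is identical
same_obj = size_1 == size_2                # default equality is identity
same_graph = size_1.graph is size_2.graph  # they belong to different graphs
print(same_repr, same_obj, same_graph)
```

So an identical printed name alone does not show that two sizes refer to the same object; comparing the owning graph as well can narrow it down.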

@vieting
Contributor Author

vieting commented Nov 21, 2022

rwth-i6/returnn@4978ecb changes the errors here, which may help to track the issue down further; see the test cases of the latest commit here.

For test_randint_dynamic and test_contrastive_loss, we now get

ValueError: Tensor("mul_randint/Max:0", shape=(), dtype=int32) must be from the same graph as Tensor("extern_data/placeholders/batch_dim:0", shape=(), dtype=int32) (graphs are <tensorflow.python.framework.ops.Graph object at 0x7f5fcaaaa100> and <tensorflow.python.framework.ops.Graph object at 0x7f5fca768160>)

which is the same as observed in rwth-i6/returnn#1224.
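
To find out which tensors end up in which graph, they could be grouped by their owning graph while debugging. The following is a sketch under assumptions: `group_by_graph` is a hypothetical helper that only reads the `.graph` and `.name` attributes (which graph-mode TensorFlow tensors provide), and the `SimpleNamespace` objects are stand-ins so the snippet runs without TensorFlow.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_graph(tensors):
    """Map id(graph) -> list of tensor names (hypothetical debugging helper)."""
    groups = defaultdict(list)
    for tensor in tensors:
        groups[id(tensor.graph)].append(tensor.name)
    return dict(groups)


# Stand-ins for the two tensors named in the error message above.
graph_a, graph_b = object(), object()
tensors = [
    SimpleNamespace(name="mul_randint/Max:0", graph=graph_a),
    SimpleNamespace(name="extern_data/placeholders/batch_dim:0", graph=graph_b),
]
groups = group_by_graph(tensors)
for graph_id, names in groups.items():
    print(hex(graph_id), names)
```

More than one key in the result means the tensors do not all share a graph, which matches the situation the ValueError describes.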

For test_index_merged_dim, it is

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder tensor 'extern_data/placeholders/data/data_dim0_size' with dtype int32 and shape [?]
	 [[node extern_data/placeholders/data/data_dim0_size (defined at home/runner/.local/lib/python3.8/site-packages/returnn/tf/util/data.py:5801) ]]
