
[Feature Request] Reduce ops with keepdim=False are not supported #13361

Open
mrakitaTT opened this issue Oct 2, 2024 · 14 comments
Assignees: bbradelTT
Labels: feature-request External feature request forge P2

@mrakitaTT (Contributor)

Currently, ttnn reduce ops throw an error when called with keepdim=False. There is a check in tt-metal/ttnn/cpp/ttnn/operations/reduction/generic/generic_reductions.cpp, in the reduce_impl function, that detects this case and throws.

This is an issue for us in the tt-mlir compiler, because we then always have to manually reshape the returned tensor after calling reduce ops with keepdim=True.
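
For illustration, here is roughly what that workaround looks like on the Python side (a minimal sketch, assuming `ttnn.sum` accepts `dim`/`keepdim` keyword arguments and that `ttnn.reshape` accepts a plain list of dims):

```python
import torch
import ttnn

# Sketch of the workaround: reduce with keepdim=True, then reshape away the kept dim.
device = ttnn.open_device(device_id=0)
x = ttnn.from_torch(torch.ones([128, 10], dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)

# Desired: y = ttnn.sum(x, dim=1, keepdim=False)  # -> shape [128], currently throws
y = ttnn.sum(x, dim=1, keepdim=True)   # shape [128, 1]
y = ttnn.reshape(y, [128])             # manual squeeze of the reduced dimension
ttnn.close_device(device)
```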

Do you have any plans to support this feature in the near future?

@tt-mpantic

P0 since it is blocking JAX MNIST.

@mrakitaTT (Contributor, Author)

> Do you have any plans to support this feature in the near future?

Hi @ntarafdar, could you please provide some info on this?

@ntarafdar (Contributor)

Hey @mrakitaTT, can you ask @bbradelTT? Currently the reduction ops have no owner, but I think Jasmina mentioned he might be able to take them.

@mrakitaTT (Contributor, Author)

> Do you have any plans to support this feature in the near future?

Hi @bbradelTT, could you please provide some info on this? :)

@bbradelTT self-assigned this Nov 4, 2024
@bbradelTT (Contributor)

I'll take a look.

@bbradelTT (Contributor)

@mrakitaTT can you please provide all of the calls that you are interested in (including the shapes of the input tensors, etc.)?

With recent changes, depending on the shape, sometimes there are downstream fatals, sometimes there aren't. Depending on the shapes you are interested in, it may be possible to unblock you sooner.

@mrakitaTT added P1 and removed P0 labels Nov 5, 2024
@mrakitaTT (Contributor, Author)

Reducing to P1: even though this is blocking the MNIST model, we have a way to work around the issue in the compiler if the fix will take a long time on the Metal side.

@mrakitaTT (Contributor, Author)

@bbradelTT Sure, here are the traces for the MNIST model:

| Reduce op | dimArg | Input tensor shape | Output tensor shape |
| --- | --- | --- | --- |
| `ttnn::max` | 1 | `tensor<128x10xf32>` | `tensor<128xf32>` |
| `ttnn::sum` | 1 | `tensor<128x10xf32>` | `tensor<128xf32>` |
| `ttnn::sum` | 1 | `tensor<128x1xf32>` | `tensor<128xf32>` |
| `ttnn::sum` | 0 | `tensor<128x10xf32>` | `tensor<10xf32>` |
| `ttnn::sum` | 0 | `tensor<1x10xf32>` | `tensor<10xf32>` |
| `ttnn::sum` | 0 | `tensor<128x512xf32>` | `tensor<512xf32>` |
| `ttnn::sum` | 0 | `tensor<1x512xf32>` | `tensor<512xf32>` |
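
For reference, the `keepdim=False` semantics we expect match PyTorch's; a quick sketch for a couple of rows of the table above, using plain PyTorch only:

```python
import torch

x = torch.ones(128, 10)

print(torch.max(x, dim=1, keepdim=False).values.shape)  # torch.Size([128])
print(torch.sum(x, dim=1, keepdim=False).shape)         # torch.Size([128])
print(torch.sum(x, dim=0, keepdim=False).shape)         # torch.Size([10])
```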

@bbradelTT (Contributor)

I'll try out the shapes tomorrow. From my own investigation, it looks like the problem is in reshape, which seems to be called when keepdim=False.

```python
import torch
import ttnn

device_id = 0
device = ttnn.open_device(device_id=device_id)
# Rank-4 tile-layout tensor, reshaped down to rank 1.
e = ttnn.from_torch(torch.ones([1, 1, 32, 1], dtype=torch.bfloat16), layout=ttnn.TILE_LAYOUT, device=device)
s = ttnn.Shape([32], [32])
f = ttnn.reshape(e, s)  # TT_FATAL is raised here
ttnn.close_device(device)
exit()
```

results in

```
RuntimeError: TT_FATAL @ ../ttnn/cpp/ttnn/tensor/types.cpp:211: normalized_index >= 0 and normalized_index < rank
info:
Index is out of bounds for the rank, should be between 0 and 0 however is 18446744073709551615
backtrace:
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x1a5c7cb) [0x7f05fab907cb]
 --- tt::tt_metal::LegacyShape::get_normalized_index(long) const
 --- tt::tt_metal::LegacyShape::operator[](long) const
 --- ttnn::types::Shape::operator[](long) const
 --- ttnn::operations::data_movement::ReshapeViewOperation::invoke(tt::tt_metal::Tensor const&, ttnn::types::Shape const&)
...
```
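
For reference, 18446744073709551615 is 2**64 - 1, i.e. what -1 looks like when stored in an unsigned 64-bit integer, so it appears a -1 (or wrapped-around) index is being normalized against the rank-1 shape:

```python
# 18446744073709551615 == 2**64 - 1, i.e. -1 reinterpreted as uint64.
print(2**64 - 1)      # 18446744073709551615
print((-1) % 2**64)   # 18446744073709551615 -- how -1 wraps in unsigned 64-bit arithmetic
```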

I need to figure out a way to avoid this error.

@bbradelTT (Contributor)

Unfortunately there are a lot of issues with padding.

E.g. some internal shapes that are showing up in my tests are (where `1[32]` appears to denote a logical dimension of 1 padded out to 32):

[128, 1[32], 10[32], 32]
[128, 1, 10[32], 32]
[1[32], 128, 10[32], 32]

The new tensor shape work should make all of the padding irrelevant. We'll have to wait for that to be done, and then I'll revisit this issue.

cc @TT-BrianLiu @ayerofieiev-tt

@prajaramanTT

Keeping Priority label in sync with Priority field - P2

bbradelTT added a commit that referenced this issue Dec 20, 2024
### Ticket
Link to Github Issue
#12662
#14898
#13361
#12170

### Problem description
- padding caused issues for max
- keepdim=False errored out

### What's changed
- remove the erroring out on keepdim=False and adjust the code to handle keepdim=False properly
- adding padding within min/max to ensure that it's set up properly has
been pushed back to a future PR

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12432801168
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12423085751
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12423092106 same as
main
https://github.com/tenstorrent/tt-metal/actions/runs/12422179419/job/34683976776
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12423088573
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests pass
- [x] New/Existing tests provide coverage for changes
@bbradelTT (Contributor)

@mrakitaTT could you please test with the newest tt-metal main and let me know what you find?

You may need to test with a release build to avoid some asserts.
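
For a quick check on the newest main, something along these lines should be enough (a minimal sketch, assuming `ttnn.sum` accepts `dim`/`keepdim` keyword arguments):

```python
import torch
import ttnn

# Minimal keepdim=False smoke test (sketch only).
device = ttnn.open_device(device_id=0)
x = ttnn.from_torch(torch.ones([128, 10], dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)
y = ttnn.sum(x, dim=1, keepdim=False)  # previously raised TT_FATAL; expected shape [128]
print(y.shape)
ttnn.close_device(device)
```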

mrakitaTT added a commit to tenstorrent/tt-mlir that referenced this issue Dec 20, 2024
This PR adds TTNN workarounds for these Metal issues:
- tenstorrent/tt-metal#13361 - By decomposing
`reduce(keepDim=false)` into `reduce(keepDim=true) + reshape`
- tenstorrent/tt-metal#16118 - By nulling out the dimensions argument when all dims are being reduced

As part of this work I've also:
- Enabled conversion of `stablehlo.reduce` op with multiple reduce
dimensions
- Added reduce ops verifiers in TTIR
- Added a separate function in TTNNWorkarounds to run rewrite patterns
for decomposition and layout workarounds
- Added lots of unit tests for reduce ops to cover conversions and
verifiers
- Added lots of silicon tests for reduce ops

Opened issue #1624 on
myself to revert these workarounds once Metal issues are fixed.

Closes #805, #848

After implementing these workarounds and running tests, I encountered [another Metal issue](tenstorrent/tt-metal#16104), this time in the `reshape` op. I've debugged it and have a local fix; I will send a PR to fix it in the Metal repo, as confirmed with the reshape op owners. I've opened issue #1640 on myself to enable the reduce ops silicon tests after this fix is uplifted.

Another issue I encountered while working on this: when the workaround pass decompositions change the shapes of the ops' tensors, their layouts need to change too, but the layout pass runs before the workaround pass. I've managed to solve this by reusing the layout of the input tensor, but I'm not sure that is a good solution; maybe we need to repeat some of the layout logic after the workaround decompositions. FYI @sdjordjevicTT

Here is the example TTNN IR before the workarounds:
```
%3 = "ttnn.sum"(%2) <{dim_arg = [0: i32, 1 : i32, 2: i32], keep_dim = false}> : (tensor<128x32x4xf32, #ttnn_layout2>) -> tensor<1xf32, #ttnn_layout2>
```

and after the workarounds:
```
%3 = "ttnn.sum"(%2) <{keep_dim = true}> : (tensor<128x32x4xf32, #ttnn_layout2>) -> tensor<1x1x1xf32, #ttnn_layout2>
%4 = "ttnn.reshape"(%3) <{shape = [1 : i32]}> : (tensor<1x1x1xf32, #ttnn_layout2>) -> tensor<1xf32, #ttnn_layout3>
```
@mmanzoorTT

@bbradelTT Thanks for the patch. It solves the issue except for the case of reducing over the entire tensor. If we apply a reduction over an entire rank-4 tensor (e.g. ttnn.sum(input)), the output has shape 1x1x1x1xf32, which is inconsistent with PyTorch and other packages. The expected shape is 1xf32.

@bbradelTT (Contributor)

@mmanzoorTT I'm glad to hear the other cases are solved. I need to wait until #15416 is fixed before looking at the 1xf32 case.
