Remove padding along non height/width dims for transpose #14308

Closed · TT-BrianLiu opened this issue Oct 25, 2024 · 10 comments

@TT-BrianLiu (Contributor)

Need to update transpose hc (and other variants that break tiles apart) so that they natively output non-padded new batch dims.

@bbradelTT (Contributor)

@ntarafdar
This will block a lot of reduce work. Could you please triage this issue and provide an estimate for when it could be completed?

@sjameelTT (Contributor) commented Nov 7, 2024

Is reduce reliant on transpose at the moment? I am starting to look at this issue today.

Edit: also, is the expectation that the new padding is always all 0s?

@bbradelTT (Contributor)

@sjameelTT yes, reduce relies on transpose. There are calls to ttnn::transpose in ttnn/cpp/ttnn/operations/reduction/generic/generic_reductions.cpp

@sjameelTT (Contributor)

I see. Are the dimensions currently mismatching, or is PCC bad in reduce? I'm curious because a user of ttnn::transpose on main right now should not see any unnecessary padding. Unless there's a bug, they should see the correct shape on the output and shouldn't need to do any processing on the input.

Input shape for transpose_hc: [N, C, H[H_padded], W[W_padded]]
Output shape: [N, H, C[C_padded], W[W_padded]]

If this isn't true, then that's a bug.
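
As an illustration of the expected behaviour (a minimal sketch based on the shapes above; the tensor sizes and device handling are placeholders, not taken from this thread):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Logical shape [N, C, H, W] = [1, 10, 5, 32]; in TILE_LAYOUT the last two
# dims are padded to multiples of 32, so the padded shape is [1, 10, 32, 32].
x = ttnn.from_torch(
    torch.randn(1, 10, 5, 32), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
)

# Swapping dims 1 and 2 (C and H) should give logical shape [1, 5, 10, 32],
# with padding regenerated only along the new height/width dims.
y = ttnn.transpose(x, 1, 2)
print(y.shape)

ttnn.close_device(device)
```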

@bbradelTT (Contributor)

I'm not sure of the exact problem.

I added logging:

--- a/ttnn/cpp/ttnn/operations/data_movement/transpose/transpose.cpp
+++ b/ttnn/cpp/ttnn/operations/data_movement/transpose/transpose.cpp
@@ -171,6 +171,7 @@ ttnn::Tensor ExecuteTranspose::invoke(
     const int64_t& dim1,
     const int64_t& dim2,
     const std::optional<MemoryConfig>& memory_config_arg) {
+    tt::log_debug(tt::LogOp, "tranpose it {} dim1 {} dim2 {}", input_tensor, dim1, dim2);

and the input parameters that are giving me errors are:

tranpose it Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([1[32], 128, 10[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4)) dim1 0 dim2 1

It looks like transpose calls reshape:

                     Op | DEBUG    | Started   C++ ttnn operation: ttnn::reshape
                     Op | DEBUG    | it Tensor(storage=DeviceStorage(memory_config=MemoryConfig(memory_layout=TensorMemoryLayout::INTERLEAVED,buffer_type=BufferType::DRAM,shard_spec=std::nullopt)),shape=ttnn.Shape([128, 1[32], 10[32], 32]),dtype=DataType::BFLOAT16,layout=Layout::TILE,tile=Tile(tile_shape={32, 32},face_shape={16, 16},num_faces=4)) ns ttnn.Shape([128, 1, 10[32], 32])

The error and stack are:

                 Always | FATAL    | 4194304 != 131072
Traceback (most recent call last):
  File "/home/bbradel/reduce.py", line 18, in <module>
    y4=ttnn.sum(x1,0)
  File "/localdev/bbradel/tt-metal/ttnn/ttnn/decorators.py", line 329, in __call__
    return self.function(*function_args, **function_kwargs)
RuntimeError: TT_ASSERT @ ../ttnn/cpp/ttnn/tensor/tensor_ops.cpp:345: input_tensor.volume() == new_padded_shape.volume()
info:
4194304 != 131072
backtrace:
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x685e5b) [0x7fed91644e5b]
 --- void tt::assert::tt_assert<unsigned int, unsigned long>(char const*, int, char const*, bool, char const*, fmt::v11::basic_format_string<char, fmt::v11::type_identity<unsigned int const&>::type, fmt::v11::type_identity<unsigned long const&>::type>, unsigned int const&, unsigned long const&)
 --- tt::tt_metal::tensor_ops::tensor_reshape(tt::tt_metal::Tensor const&, ttnn::types::Shape const&)
 --- tt::tt_metal::Tensor::reshape(ttnn::types::Shape const&) const
 --- ttnn::operations::data_movement::ReshapeViewOperation::invoke(tt::tt_metal::Tensor const&, ttnn::types::Shape const&)
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x80940c) [0x7fed917c840c]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x808ed0) [0x7fed917c7ed0]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x8054b2) [0x7fed917c44b2]
 --- ttnn::operations::data_movement::ExecuteTranspose::invoke(unsigned char, tt::tt_metal::Tensor const&, long const&, long const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&)
 --- ttnn::operations::data_movement::ExecuteTranspose::invoke(tt::tt_metal::Tensor const&, long const&, long const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&)
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x93a3fe) [0x7fed918f93fe]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x93a360) [0x7fed918f9360]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x93a257) [0x7fed918f9257]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x93a060) [0x7fed918f9060]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x939fd7) [0x7fed918f8fd7]
 --- /localdev/bbradel/tt-metal/ttnn/ttnn/_ttnn.so(+0x90dd31) [0x7fed918ccd31]
 --- ttnn::operations::data_movement::detail::permute_impl(tt::tt_metal::Tensor const&, tt::tt_metal::SmallVector<unsigned int, 8ul> const&, tt::tt_metal::MemoryConfig const&)
...
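
For reference, the numbers in the assertion line up with the padded batch dim; a quick arithmetic check using the padded shapes from the log above:

```python
# Input to reshape: ttnn.Shape([1[32], 128, 10[32], 32]) -> padded dims (32, 128, 32, 32)
input_padded_volume = 32 * 128 * 32 * 32   # 4194304
# Reshape target:   ttnn.Shape([128, 1, 10[32], 32])     -> padded dims (128, 1, 32, 32)
target_padded_volume = 128 * 1 * 32 * 32   # 131072
# The batch dim is padded to 32 in the input but stays 1 in the target, so
# input_tensor.volume() != new_padded_shape.volume() and the TT_ASSERT fires.
print(input_padded_volume, target_padded_volume)  # 4194304 131072
```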

@sjameelTT (Contributor)

I think that's because the input N is padded, which we don't expect in transpose. That should be fixed independently of this change, though I'm curious about the use case for padded N.

@bbradelTT (Contributor)

Is there an issue where that would be fixed?

It's for a reduce call. I'm hoping most of the padding can go away with the new tensor layout work.

@sjameelTT (Contributor)

Do you have a Python test I could use to test this out? I am curious as to how we even get to a tensor with padded N. What is the op call that creates a padded N value?

@bbradelTT (Contributor)

> Do you have a Python test I could use to test this out? I am curious as to how we even get to a tensor with padded N. What is the op call that creates a padded N value?

I don't. I was trying to fix reduce with the inputs provided in #13361 (comment), but there were too many issues, and I postponed the effort until the new tensor layout work is far enough along.

sjameelTT added a commit that referenced this issue Nov 21, 2024
…ding value parameter (#15224)

### Ticket
#14308 
#14790 
#11650 

### Problem description

- Transpose along height and channel for tiled layout is currently padding-unaware, meaning that we need to pad channels to a multiple of TILE_HEIGHT, transpose, then slice off any excess on the new channels/height (a torch-level sketch of this dance follows the list)
- This results in excess copies along both channel and height that can be optimized out
- In addition, the choice of padding value depends on the subsequent ops: some ops can skip padding with specific values to save on performance, and others may want other values padded (-inf, +inf, 1)
- Blackhole transpose HC on tiled layout requires a scratch pad to deal with the 64B alignment requirement for noc_async_reads, which doesn't align with the size of each face line for normal tiles (32B)
- Transpose also depends on reshape to account for what is padding and what is actual data after padding and then transposing
- BFLOAT8 transpose is unnecessarily typecast even though BFLOAT8 is supported for both transpose CN and transpose WH
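
For illustration, the pad/transpose/slice dance from the first bullet, mimicked on a plain torch tensor (a sketch only, not the actual ttnn kernel flow):

```python
import torch

TILE_HEIGHT = 32

def transpose_hc_padding_unaware(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the old flow on an [N, C, H, W] tensor."""
    n, c, h, w = x.shape
    padded_c = -(-c // TILE_HEIGHT) * TILE_HEIGHT  # round C up to a tile multiple
    # 1. Pad the channel dim so it is tile-aligned once it becomes the new height
    #    (one extra copy over padded-out rows).
    x = torch.nn.functional.pad(x, (0, 0, 0, 0, 0, padded_c - c))
    # 2. Transpose channel and height on the fully padded data.
    x = x.transpose(1, 2)
    # 3. Slice off the excess on the new channel dim (another extra copy).
    return x[:, :, :c, :]

out = transpose_hc_padding_unaware(torch.randn(1, 10, 5, 32))
print(out.shape)  # torch.Size([1, 5, 10, 32])
```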

### What's changed

- Add a new multicore transpose hc tiled interleaved kernel that is padding-aware. This skips extra copies on the padded H values and generates new padding for the channel dimension once it becomes the tile height.
- The new height padding defaults to 0, but you can use the new pad_value parameter to set other values. If you explicitly set it to None, it will not write a specific value and will just copy the real data into the buffer (a usage sketch follows this list).
- Shift the tile-creation workload to the writer kernel to avoid the need for a scratch pad on Blackhole (take each tile and write out each value to its new position, rather than read in a bunch of values to create each new tile that's written out)
- Remove direct reshape call in transpose
- Re-enable BFLOAT8 transpose WH/CN 
- Add some unit tests for community test cases (these are old issues
that got solved unintentionally a few weeks ago)
- Add tricky unit tests
- Make the same changes to permute
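
A rough usage sketch of the new parameter (assuming pad_value is exposed as a keyword argument on ttnn.transpose, as described in the list above; the shapes and device handling are placeholders):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(
    torch.randn(1, 10, 5, 32), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device
)

# Default: any new padding generated by the transpose is filled with 0.
y_zero = ttnn.transpose(x, 1, 2)

# Fill the new padding with -inf instead, e.g. ahead of a max-style reduction.
y_neg_inf = ttnn.transpose(x, 1, 2, pad_value=float("-inf"))

# pad_value=None: don't write any particular value, just copy the real data.
y_none = ttnn.transpose(x, 1, 2, pad_value=None)

ttnn.close_device(device)
```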

### TODO

- In theory these changes should work with tiny tiles due to the way I've written them, but I haven't tested that yet.

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/11918161415
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11916983288/job/33211841677
(failure matches main)
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11902351038/job/33207292238
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11918150453/job/33215940950
(matches failure on main)
- [x] New/Existing tests provide coverage for changes

@ntarafdar (Contributor)

@sjameelTT is this completed with your merge?
