Segformer model optimization #14790
Comments
@mbahnasTT, commenting on behalf of the TMs team here. Can you comment on the shard spec (height, width, or block sharded; shard shape; core grid; shard orientation; and tensor shape) required for transpose? Once you give us that, we can get it to work in a unit test. @sjameelTT can look at adding that.
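The fields requested above can be collected in one small structure. Below is a hypothetical Python sketch for gathering that information; the names mirror the sharding vocabulary used in the comment but are not ttnn's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class ShardStrategy(Enum):
    HEIGHT = "height"   # shards split along the height dimension
    WIDTH = "width"     # shards split along the width dimension
    BLOCK = "block"     # 2D blocks distributed across the core grid

class ShardOrientation(Enum):
    ROW_MAJOR = "row_major"
    COL_MAJOR = "col_major"

@dataclass
class ShardSpecRequest:
    """Hypothetical container for the info requested above; not ttnn's API."""
    strategy: ShardStrategy
    shard_shape: tuple        # (shard_height, shard_width) in elements
    core_grid: tuple          # (rows, cols) of cores holding shards
    orientation: ShardOrientation
    tensor_shape: tuple       # full logical tensor shape

# Example request (values are illustrative, not from the perf sheet):
spec = ShardSpecRequest(
    strategy=ShardStrategy.HEIGHT,
    shard_shape=(32, 64),
    core_grid=(8, 8),
    orientation=ShardOrientation.ROW_MAJOR,
    tensor_shape=(1, 1, 2048, 64),
)
```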
Hi @ntarafdar, @sjameelTT, would you like to have a brief working session? The input height-sharded spec is shown in the perf sheet.
Hey @mbahnasTT, the bfloat8_b transpose on WH should work soon; it was accidentally disabled in a previous patch. I will push a fix for that. The sharded transpose will need some additional work, I believe.
@mbahnasTT shall we set up a meeting to discuss the sharded transpose spec?
…ding value parameter (#15224)

### Ticket
#14308 #14790 #11650

### Problem description
- Transpose along height and channel for tiled layout is currently padding-unaware: we pad channels to a multiple of TILE_HEIGHT, transpose, then slice off any excess on the new channels/height.
- This results in excess copies along both channel and height that can be optimized out.
- The choice of padding value also depends on the subsequent ops: some ops can skip padding with specific values to save on performance, and others may want other values padded (-inf, +inf, 1).
- Blackhole transpose HC on tiled layout requires a scratch pad to deal with the 64B alignment requirement for noc_async_reads, which doesn't match the size of each face line for normal tiles (32B).
- Transpose also depends on reshape to be able to account for what is padding and what is actual data after padding and then transposing.
- BFLOAT8 transpose is unnecessarily typecast even though BFLOAT8 is supported for both transpose CN and transpose WH.

### What's changed
- Add a new multicore transpose HC tiled interleaved kernel that is padding-aware. It skips extra copies on the padded H values and generates new padding for the channel dimension once it becomes the tile height.
- The new height is padded to 0 by default, but the new pad_value parameter can set it to other values. If you explicitly set it to None, no specific value is written and only the real data is copied into the buffer.
- Shift the tile-creation workload to the writer kernel to avoid the need for a scratch pad on Blackhole (take each tile and write out each value to its new position, rather than read in a batch of values to create each new tile that's written out).
- Remove the direct reshape call in transpose.
- Re-enable BFLOAT8 transpose WH/CN.
- Add unit tests for community test cases (these are old issues that were solved unintentionally a few weeks ago).
- Add tricky unit tests.
- Make the same changes to permute.

### TODO
- In theory these changes should work with tiny tiles due to the way the kernel is written, but that hasn't been tested yet.

### Checklist
- [x] Post commit CI passes https://github.com/tenstorrent/tt-metal/actions/runs/11918161415
- [x] Blackhole Post commit (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/11916983288/job/33211841677 (failure matches main)
- [x] Model regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/11902351038/job/33207292238
- [x] Device performance regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/11918150453/job/33215940950 (matches failure on main)
- [x] New/Existing tests provide coverage for changes
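To make the padding-unaware vs. padding-aware distinction concrete, here is a host-side numpy sketch of the two data paths for transpose HC. This models only the copies being discussed, not the device kernel; TILE_HEIGHT = 32 and the function names are assumptions for illustration:

```python
import numpy as np

TILE_HEIGHT = 32  # assumed tile height for the tiled layout

def transpose_hc_padding_unaware(x):
    """Old path: pad C to a tile multiple, transpose H<->C, then slice
    off the excess -- the extra copies the PR optimizes out."""
    n, c, h, w = x.shape
    c_padded = -(-c // TILE_HEIGHT) * TILE_HEIGHT       # round C up to a tile
    padded = np.zeros((n, c_padded, h, w), dtype=x.dtype)
    padded[:, :c] = x                                   # copy 1: pad channels
    t = padded.transpose(0, 2, 1, 3)                    # H <-> C
    return np.ascontiguousarray(t[:, :, :c])            # copy 2: slice excess

def transpose_hc_padding_aware(x, pad_value=0.0):
    """New idea: move only the real data and generate fresh padding (with a
    configurable pad_value) for the channel dim once it becomes tile height."""
    n, c, h, w = x.shape
    c_as_height = -(-c // TILE_HEIGHT) * TILE_HEIGHT
    out = np.full((n, h, c_as_height, w), pad_value, dtype=x.dtype)
    out[:, :, :c] = x.transpose(0, 2, 1, 3)             # single copy of real data
    return out
```

Both paths agree on the real-data region; the padding-aware path writes the padded rows exactly once, with the caller's chosen value.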
@sjameelTT Hey Saad, for the transpose ops in the Segformer encoder, the sharding spec of the input to transpose is:
Hey @ntarafdar and @nardoTT
Segformer encoder
Hey Jay, @jaykru-tt
Hey @bbradelTT, we'll need to add some features to improve the perf of the Segformer model. Please:
I can share more details.
@mbahnasTT could you please share details on the decoder also? I am looking at optimising upsample for the decoder.
@nkpatel-tt here are some details:
The input layout above is supported by the current untilize operation, and untilize has been tested for these inputs. The MLP operation is explicitly converting the tensor layout from sharded to interleaved in line 114 of ttnn_segformer_mlp.py. If this line is commented out, the to_layout operation in line 57 of ttnn_segformer_decode_head.py would execute the untilize operation in sharded mode. I don't think the issue you are facing is untilize-related.
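For readers unfamiliar with the layout change being discussed: untilize reorders a tiled buffer back to row-major, and tilize is its inverse. A host-side numpy sketch under the assumption of 32x32 tiles (this models the data reordering only, not the device kernel or its sharded/interleaved modes):

```python
import numpy as np

TILE = 32  # assumed tile side for the tiled layout

def tilize(x):
    """Row-major (h, w) -> (h/32, w/32, 32, 32) tile buffer,
    tiles laid out in row-major tile order."""
    h, w = x.shape
    return x.reshape(h // TILE, TILE, w // TILE, TILE).transpose(0, 2, 1, 3)

def untilize(tiles, h, w):
    """Inverse: tile buffer back to a row-major (h, w) tensor.
    Assumes h and w are tile multiples, as they are after padding."""
    return tiles.transpose(0, 2, 1, 3).reshape(h, w)
```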
…d do a minor refactor of ttnn::permute (#15881)

### Ticket
- #14790 add transpose wh sharded implementation when shard shape < height dimension
- #15165 add N-d permute with width dimension
- #15589 correct permute dimensionality when less than 4D
- #15750 remove the composite flag from permute
- #12550 re-enable some permute tests for blackhole
- #12349 re-enable working transpose tests for blackhole
- #16066 disable test uniform as it's stochastic

### Problem description
This PR addresses several permute and transpose problems at once:
- Transpose WH sharded does not currently work when the shard shape is less than the height.
- Permute on more than 4 dimensions does not work when moving the width dimension around (for both tiled and RM).
- The permute kernel is single core in the case where width doesn't change.
- Permute has an unclean API: a composite flag that is not generically applicable.
- Permute on fewer than 4 dimensions gets an incorrect output shape in cases where it's a no-op.
- Permute tests are disabled for BH due to LLK issues.
- Transpose tests are disabled for BH due to LLK issues.

### What's changed
- Add a transpose WH sharded implementation for when the shard shape is less than the height dim (outputs a block-sharded output).
- Add an N-d permute kernel that works generically on any row-major input. We have to call a global init on each loop of the compute kernel because transpose sets some registers that aren't cleared (there's no transpose_uninit), which results in bad PCC when there's more than one loop. For GS/BH, even the global init doesn't solve the problem, so the test is disabled. For tiled inputs, we need 5D untilize/tilize. This increases sweeps coverage for permute from **50%** to **86%**.
- For the optimized case where permute's width dimension is not shuffled, make the kernel multicore.
- Remove the composite flag that is set by default and makes permute non-generic. It has caused forge models to have bad PCC because they were not aware of this optional argument.
- Refactor ttnn::permute to add no-op checks and correct shape calculations.
- Re-enable permute and transpose tests for Blackhole.

When replacing variants of transpose with this RM permute kernel, a lot of tests on BH/GS failed, so I will address that in a follow-up; the LLK issues are causing pains there. If we get N-d untilize/tilize support and the LLK issues are fixed, permute should have the ability to be fully generic. The remaining issue for the pytorch 2.0 sweeps after the untilize/tilize fix is the CB overflow on transpose WH, which should be fixed out of the box when we replace the kernel in use (not done in this PR since it doesn't work for GS/BH at the moment).

### Checklist
- [x] Post commit CI passes https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782 (failing test is failing on main)
- [x] Blackhole Post commit (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12367175575
- [x] Model regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12357119737
- [x] Device performance regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12357115316
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes
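The no-op check and shape calculation mentioned above can be illustrated with a small host-side sketch. These helpers are hypothetical (not the ttnn source): a permute's output shape is `out[i] = in[dims[i]]` for any rank, and a permute leaves row-major data untouched exactly when only unit-extent axes move, even though the logical shape may still change:

```python
import numpy as np

def permuted_shape(shape, dims):
    """Output shape of permute for any rank: out[i] = in[dims[i]]."""
    assert sorted(dims) == list(range(len(shape)))
    return tuple(shape[d] for d in dims)

def is_data_nop(shape, dims):
    """Row-major data is unchanged by the permute iff the non-unit axes
    keep their relative order, i.e. only size-1 axes actually move.
    The output shape can still differ from the input shape."""
    nonunit = [d for d in dims if shape[d] > 1]
    return nonunit == sorted(nonunit)
```

For example, permuting a (1, 5, 3) tensor with dims (1, 0, 2) changes the shape to (5, 1, 3) but is a data no-op, which is exactly the case where a correct output shape must still be reported.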