
Segformer model optimization #14790

Open · 2 of 9 tasks
mbahnasTT opened this issue Nov 6, 2024 · 11 comments

Comments

@mbahnasTT
Contributor

mbahnasTT commented Nov 6, 2024

@ntarafdar
Contributor

ntarafdar commented Nov 12, 2024

@mbahnasTT, commenting on behalf of the TMs team here. Can you comment on the shard spec (height, width, or block sharded; shard shape; core grid; shard orientation; and tensor shape) required for Transpose?

Once you give us that we can get it to work in a unit test.

@sjameelTT can look at adding that.
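
For reference, those pieces of information map roughly onto a ttnn sharded memory config as sketched below (a hedged sketch assuming the public ttnn Python helpers; the values are illustrative placeholders, not the Segformer spec):

```python
import ttnn

# Illustrative placeholder values only -- the real numbers are what this comment asks for.
sharded_memory_config = ttnn.create_sharded_memory_config(
    shape=(32, 64),                               # shard shape (height, width)
    core_grid=ttnn.CoreGrid(y=2, x=4),            # core grid the shards are spread over
    strategy=ttnn.ShardStrategy.HEIGHT,           # HEIGHT, WIDTH, or BLOCK sharding
    orientation=ttnn.ShardOrientation.ROW_MAJOR,  # shard orientation
    use_height_and_width_as_shard_shape=True,
)
```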

@mbahnasTT
Contributor Author

mbahnasTT commented Nov 12, 2024

Hi @ntarafdar, @sjameelTT, would you like to have a brief working session? The input height-sharded spec is shown in the attached perf sheet.
[attachment] ops_perf_results_SEG_ENCODER_2024_11_01_04_39_36.xls

@sjameelTT
Contributor

sjameelTT commented Nov 12, 2024

Hey @mbahnasTT, the bfloat8_b transpose on WH should work soon; it was accidentally disabled in a previous patch. I will push a fix for that. The sharded transpose will need some additional work, I believe.

@sjameelTT
Contributor

@mbahnasTT shall we set up a meeting to discuss the sharded transpose spec?

sjameelTT added a commit that referenced this issue Nov 14, 2024
sjameelTT added a commit that referenced this issue Nov 18, 2024
sjameelTT added a commit that referenced this issue Nov 19, 2024
sjameelTT added a commit that referenced this issue Nov 20, 2024
sjameelTT added a commit that referenced this issue Nov 21, 2024
…ding value parameter (#15224)

### Ticket
#14308 
#14790 
#11650 

### Problem description

- Transpose along height and channel for tiled layout is currently
padding-unaware, meaning that we need to pad channels to a multiple of
TILE_HEIGHT, transpose, then slice off any excess on the new
channels/height
- This results in excess copies along both channel and height that can
be optimized out
- In addition, the choice of padding depends on the subsequent ops: some
ops can skip padding with specific values to improve performance, while
others may want other values padded (-inf, +inf, 1)
- Blackhole transpose HC on tiled requires a scratch pad to deal with
the 64B alignment requirement for noc_async_reads, which does not align
with the size of each face line for normal tiles (32B)
- Transpose also depends on reshape to be able to account for what is
padding and what is actual data after padding and then transposing
- BFLOAT8 transpose is unnecessarily typecast even though it is supported
for both transpose CN and transpose WH
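
As a rough illustration of the padding overhead described above, here is a minimal plain-Python sketch (the tile size and tensor shape are illustrative assumptions, not taken from this PR):

```python
TILE_HEIGHT = 32  # assumption: standard 32x32 tiles

def pad_up(dim: int, multiple: int = TILE_HEIGHT) -> int:
    """Round a dimension up to the next tile boundary."""
    return ((dim + multiple - 1) // multiple) * multiple

# Hypothetical tensor going through the pad -> transpose -> slice path.
n, c, h, w = 1, 3, 300, 320
moved_by_pad_transpose_slice = n * pad_up(c) * pad_up(h) * w  # elements the padding-unaware path copies
real_data = n * c * h * w                                     # elements that actually need to move
print(f"excess elements copied: {moved_by_pad_transpose_slice - real_data}")
```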

### What's changed

- Add a new multicore transpose hc tiled interleaved kernel that is
padding-aware. This skips extra copies on the padded H values and
generates new padding for the channel dimension once it becomes the
tile-height dimension.
- The new height is padded to 0 by default, but you can use the new
pad_value parameter to set it to other values. If you explicitly set it
to None, it will not set a specific value and will just copy the real
data into the buffer (see the sketch after this list).
- Shift the tile creation workload to the writer kernel to avoid the
need for a scratch pad on Blackhole (take each tile and write out each
value to its new position, rather than read in a bunch of values to
create each new tile that's written out)
- Remove direct reshape call in transpose
- Re-enable BFLOAT8 transpose WH/CN
- Add some unit tests for community test cases (these are old issues
that got solved unintentionally a few weeks ago)
- Add tricky unit tests
- Make the same changes to permute
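
A minimal usage sketch of the new parameter (hedged: this assumes the Python binding of ttnn.transpose exposes pad_value as described above; the tensor shape and device setup are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(
    torch.randn(1, 3, 300, 320),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)

y_default = ttnn.transpose(x, 1, 2)                           # new padding defaults to 0
y_neg_inf = ttnn.transpose(x, 1, 2, pad_value=float("-inf"))  # e.g. when a reduction-style op follows
y_raw = ttnn.transpose(x, 1, 2, pad_value=None)               # copy only the real data, leave padding unset

ttnn.close_device(device)
```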

### TODO

- In theory these should work with tiny tiles due to the way I've
written it, but I haven't tested that yet.

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/11918161415
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11916983288/job/33211841677
(failure matches main)
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11902351038/job/33207292238
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11918150453/job/33215940950
(matches failure on main)
- [x] New/Existing tests provide coverage for changes
@mbahnasTT
Contributor Author

mbahnasTT commented Nov 27, 2024

@sjameelTT Hey Saad, for the transpose OPs in Segformer Encoder, the sharding spec of the input to transpose is:

  • input tensor = [256, 32], TILE, BFLOAT8_B , shard spec = [32, 32] on 8 cores, Height-sharded
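
A unit-test-style sketch of that configuration (hedged: it assumes the standard ttnn sharding helpers, and the transpose dims shown are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# [256, 32] input, TILE layout, BFLOAT8_B, height-sharded as [32, 32] shards on 8 cores.
memory_config = ttnn.create_sharded_memory_config(
    shape=(32, 32),
    core_grid=ttnn.CoreGrid(y=1, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 256, 32),
    dtype=ttnn.bfloat8_b,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=memory_config,
)
y = ttnn.transpose(x, -2, -1)  # WH transpose on the height-sharded input

ttnn.close_device(device)
```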

@mbahnasTT
Contributor Author

mbahnasTT commented Nov 27, 2024

Hey @ntarafdar and @nardoTT
Segformer decoder untilize sharding spec

  • input = [16384, 256] --> Height sharding on 32 cores --> spec = [512, 256]
  • input = [4096, 256] --> Height sharding on 32 cores --> spec = [128, 256]
  • input = [1024, 256] --> Height sharding on 64 cores --> spec = [32, 256]
  • input = [256, 256] --> Block sharding on 32 cores --> spec = [32, 32]

Segformer encoder untilize sharding spec

  • input = [16384, 32] --> Height sharding on 64 cores --> spec = [256, 32]
  • input = [16384, 128] --> Height sharding on 64 cores --> spec = [256, 128]
  • input = [4096, 64] --> Height sharding on 64 cores --> spec = [64, 64]
  • input = [4096, 256] --> Block sharding on 16 cores --> spec = [512, 128]
  • input = [1024, 160] --> Block sharding on 40 cores --> spec = [128, 32]
  • input = [1024, 640] --> Block sharding on 40 cores --> spec = [128, 128]
  • input = [256, 1024] --> Block sharding on 64 cores --> spec = [32, 128]
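
A sketch of how one of these cases could be exercised, using the encoder [16384, 32] entry (hedged: ttnn.to_layout is assumed to dispatch to untilize here, and the dtype is an illustrative choice):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Encoder case: input [16384, 32], height-sharded on 64 cores, shard spec [256, 32].
sharded_config = ttnn.create_sharded_memory_config(
    shape=(256, 32),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 16384, 32),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

# Untilize the sharded tensor to row-major without converting it to interleaved first.
x_rm = ttnn.to_layout(x, ttnn.ROW_MAJOR_LAYOUT)

ttnn.close_device(device)
```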

@mbahnasTT
Contributor Author

Hey Jay, @jaykru-tt
@ntarafdar asked me to share with you the concat use case in Segformer:

  • The concat op requires an RM/interleaved input
  • My input is height-sharded since it's the output from upsample
  • To the concat, there are 4 inputs, coming from 4 different upsample OPs with different sharding schemes
  • So instead of S2I for the 4 inputs, can I keep them RM/height-sharded?
  • input = 4x [1,128,128,256] and concat at dim=3
  • output = [1,128,128,1024]
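
For concreteness, a sketch of the concat in question (hedged: the interleaved setup below is only illustrative; whether the four height-sharded upsample outputs can be passed to ttnn.concat directly is exactly the question being asked):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Four [1, 128, 128, 256] branches standing in for the four upsample outputs.
branches = [
    ttnn.from_torch(
        torch.randn(1, 128, 128, 256),
        dtype=ttnn.bfloat16,
        layout=ttnn.ROW_MAJOR_LAYOUT,
        device=device,
    )
    for _ in range(4)
]

fused = ttnn.concat(branches, dim=3)  # expected output shape: [1, 128, 128, 1024]

ttnn.close_device(device)
```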

@mbahnasTT
Contributor Author

Hey @bbradelTT, we'll need to add some features to improve the perf of the Segformer model, please:

  • Softmax to support sharded input
  • LayerNorm to support height sharded input

I can share more details.
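
For concreteness, the desired usage would look roughly like the sketch below (hedged: ttnn.softmax and ttnn.layer_norm already exist; the request is that they accept a sharded input such as the hypothetical one here without a sharded-to-interleaved round trip, and the shard shape shown is illustrative, not an exact Segformer spec):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# A hypothetical height-sharded activation: [4096, 256] split into [128, 256] shards on 32 cores.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(128, 256),
    core_grid=ttnn.CoreGrid(y=4, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 4096, 256),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

probs = ttnn.softmax(x, dim=-1)            # request: consume the sharded input as-is
normed = ttnn.layer_norm(x, epsilon=1e-5)  # request: same for height-sharded layer norm

ttnn.close_device(device)
```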

@mbahnasTT mbahnasTT assigned nardoTT and unassigned eyonland Nov 27, 2024
@nkpatel-tt
Contributor

@mbahnasTT could you please share details on the decoder as well? I am looking at optimising upsample for the decoder.

@mbahnasTT
Contributor Author

@nkpatel-tt here are some details:

@nardoTT
Contributor

nardoTT commented Dec 10, 2024

> Hey @ntarafdar and @nardoTT
> Segformer decoder untilize sharding spec […]
> Segformer encoder untilize sharding spec […]

The input layouts above are supported by the current untilize operation, and untilize has been tested for these inputs. The MLP operation explicitly converts the tensor layout from sharded to interleaved at line 114 of ttnn_segformer_mlp.py. If that line is commented out, the to_layout operation at line 57 of ttnn_segformer_decode_head.py would execute the untilize operation in sharded mode. I don't think the issue you are facing is untilize-related.
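
A hedged sketch of the two paths being contrasted above (the helper name is hypothetical; ttnn.sharded_to_interleaved and ttnn.to_layout are assumed from the public Python API):

```python
import ttnn

def mlp_output_to_row_major(x, keep_sharded: bool):
    """Illustrative only: contrast the explicit S2I conversion with keeping the tensor sharded."""
    if not keep_sharded:
        # Path the model currently takes: convert sharded -> interleaved before untilizing.
        x = ttnn.sharded_to_interleaved(x, ttnn.L1_MEMORY_CONFIG)
    # With the conversion skipped, this to_layout runs the untilize in sharded mode.
    return ttnn.to_layout(x, ttnn.ROW_MAJOR_LAYOUT)
```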

sjameelTT added a commit that referenced this issue Dec 17, 2024
…d do a minor refactor of ttnn::permute (#15881)

### Ticket
#14790 add transpose wh sharded implementation when shard shape < height
dimension
#15165 add N-d permute with width dimension
#15589 correct permute dimensionality when less than 4D
#15750 remove the composite flag from permute
#12550 re-enable some permute tests for blackhole
#12349 re-enable working transpose tests for blackhole
#16066 disable test uniform as it's stochastic

### Problem description
This PR addresses several permute and transpose problems all at once

- Transpose WH sharded does not currently work when the shard shape is
less than the height
- Permute on greater than 4 dimensions does not work when moving width
around (for both tiled and RM)
- The Permute kernel is single core when the width dimension doesn't change
- Permute has an unclean API in which we have a composite flag that is
not generically applicable
- Permute on less than 4 dimensions gets an incorrect output shape in
cases where it's a no-op
- Permute tests are disabled for BH due to LLK issues
- Transpose tests are disabled for BH due to LLK issues

### What's changed
- Add transpose WH sharded implementation for when shard shape is less
than the height dim (outputs a block sharded output)
- Add an N-d permute kernel that works generically on any row major
input. We have to call a global init each loop of the compute kernel as
transpose sets some registers that aren't cleared (there's no
transpose_uninit). This results in bad pcc when there's more than one
loop. For GS/BH, even the global init doesn't solve the problem so the
test is disabled. For Tiled, we need 5D untilize/tilize. This increases
sweeps coverage for permute from **50%** to **86%**
- For the optimized case where Permute's width dimension is not
shuffled, make the kernel multicore
- Remove the composite flag that is set by default and makes permute
non-generic (see the usage sketch below). This has caused forge models
to have bad pcc as they were not aware of this optional argument.
- Refactor ttnn::permute to add nop checks and correct shape
calculations
- Re-enable permute and transpose tests for blackhole

When replacing variants of transpose with this RM permute kernel, a lot
of tests on BH/GS failed, so I will do that in a follow-up to address.
The LLK issues are causing pains there. If we get N-d untilize/tilize
support and once the LLK issues are fixed, permute should have the
ability to be generic. The remaining issues for the pytorch 2.0 sweeps
after the untilize/tilize fix are the CB overflow on transpose wh, which
should be fixed out of the box when we replace the kernel that is used
(which I am not doing in this PR since it doesn't work for GS/BH atm).
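
A brief usage sketch of the N-d row-major permute path described above (hedged: the tensor shape and permutation are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# 5-D row-major permute that moves the width dimension -- the case this PR enables.
x = ttnn.from_torch(
    torch.randn(2, 3, 4, 5, 6),
    dtype=ttnn.bfloat16,
    layout=ttnn.ROW_MAJOR_LAYOUT,
    device=device,
)
y = ttnn.permute(x, (0, 4, 2, 3, 1))  # no composite flag needed any more

ttnn.close_device(device)
```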

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782
(failing test is failing on main)
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12367175575
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357119737
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357115316
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes