
Segformer model optimization #14790

Open · 2 of 9 tasks
mbahnasTT opened this issue Nov 6, 2024 · 11 comments

Comments

@mbahnasTT
Contributor

mbahnasTT commented Nov 6, 2024

@ntarafdar
Contributor

ntarafdar commented Nov 12, 2024

@mbahnasTT, commenting on behalf of the TMs team here. Can you comment on the shard spec (height, width, or block sharded; shard shape; core grid; shard orientation; and tensor shape) required for Transpose?

Once you give us that we can get it to work in a unit test.

@sjameelTT can look at adding that.
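
For reference, those pieces of information map roughly onto a ttnn sharded memory config as sketched below (a hedged sketch assuming the public ttnn Python helpers; the values are illustrative placeholders, not the Segformer spec):

```python
import ttnn

# Illustrative placeholder values only -- the real numbers are what this comment asks for.
sharded_memory_config = ttnn.create_sharded_memory_config(
    shape=(32, 64),                               # shard shape (height, width)
    core_grid=ttnn.CoreGrid(y=2, x=4),            # core grid the shards are spread over
    strategy=ttnn.ShardStrategy.HEIGHT,           # HEIGHT, WIDTH, or BLOCK sharding
    orientation=ttnn.ShardOrientation.ROW_MAJOR,  # shard orientation
    use_height_and_width_as_shard_shape=True,
)
```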

@mbahnasTT
Contributor Author

mbahnasTT commented Nov 12, 2024

Hi @ntarafdar, @sjameelTT, would you like to have a brief working session? The input height-sharded spec is shown in the attached perf sheet.
[attachment] ops_perf_results_SEG_ENCODER_2024_11_01_04_39_36.xls

@sjameelTT
Contributor

sjameelTT commented Nov 12, 2024

Hey @mbahnasTT, the bfloat8_b transpose on WH should work soon; it was accidentally disabled in a previous patch. I will push a fix for that. The sharded transpose will need some additional work, I believe.

@sjameelTT
Contributor

@mbahnasTT shall we set up a meeting to discuss the sharded transpose spec?

sjameelTT added a commit that referenced this issue Nov 14, 2024
sjameelTT added a commit that referenced this issue Nov 18, 2024
sjameelTT added a commit that referenced this issue Nov 19, 2024
sjameelTT added a commit that referenced this issue Nov 20, 2024
sjameelTT added a commit that referenced this issue Nov 21, 2024
…ding value parameter (#15224)

### Ticket
#14308 
#14790 
#11650 

### Problem description

- Transpose along height and channel for tiled layout is currently
padding-unaware, meaning that we need to pad channels to a multiple of
TILE_HEIGHT, transpose, then slice off any excess on the new
channels/height
- This results in excess copies along both channel and height that can
be optimized out
- In addition, the choice of padding depends on the subsequent ops: some
ops can skip padding with specific values to improve performance, while
others may want other values padded (-inf, +inf, 1)
- Blackhole transpose HC on tiled requires a scratch pad to deal with
the 64B alignment requirement for noc_async_reads, which does not align
with the size of each face line for normal tiles (32B)
- Transpose also depends on reshape to be able to account for what is
padding and what is actual data after padding and then transposing
- BFLOAT8 transpose is unnecessarily typecast even though it is supported
for both transpose CN and transpose WH
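
As a rough illustration of the padding overhead described above, here is a minimal plain-Python sketch (the tile size and tensor shape are illustrative assumptions, not taken from this PR):

```python
TILE_HEIGHT = 32  # assumption: standard 32x32 tiles

def pad_up(dim: int, multiple: int = TILE_HEIGHT) -> int:
    """Round a dimension up to the next tile boundary."""
    return ((dim + multiple - 1) // multiple) * multiple

# Hypothetical tensor going through the pad -> transpose -> slice path.
n, c, h, w = 1, 3, 300, 320
moved_by_pad_transpose_slice = n * pad_up(c) * pad_up(h) * w  # elements the padding-unaware path copies
real_data = n * c * h * w                                     # elements that actually need to move
print(f"excess elements copied: {moved_by_pad_transpose_slice - real_data}")
```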

### What's changed

- Add a new multicore transpose hc tiled interleaved kernel that is
padding-aware. This skips extra copies on the padded H values and
generates new padding for the channel dimension once it becomes the
tile-height dimension.
- The new height is padded to 0 by default, but you can use the new
pad_value parameter to set it to other values. If you explicitly set it
to None, it will not set a specific value and will just copy the real
data into the buffer (see the sketch after this list).
- Shift the tile creation workload to the writer kernel to avoid the
need for a scratch pad on Blackhole (take each tile and write out each
value to its new position, rather than read in a bunch of values to
create each new tile that's written out)
- Remove direct reshape call in transpose
- Re-enable BFLOAT8 transpose WH/CN
- Add some unit tests for community test cases (these are old issues
that got solved unintentionally a few weeks ago)
- Add tricky unit tests
- Make the same changes to permute
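
A minimal usage sketch of the new parameter (hedged: this assumes the Python binding of ttnn.transpose exposes pad_value as described above; the tensor shape and device setup are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(
    torch.randn(1, 3, 300, 320),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)

y_default = ttnn.transpose(x, 1, 2)                           # new padding defaults to 0
y_neg_inf = ttnn.transpose(x, 1, 2, pad_value=float("-inf"))  # e.g. when a reduction-style op follows
y_raw = ttnn.transpose(x, 1, 2, pad_value=None)               # copy only the real data, leave padding unset

ttnn.close_device(device)
```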

### TODO

- In theory these should work with tiny tiles due to the way I've
written it, but I haven't tested that yet.

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/11918161415
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11916983288/job/33211841677
(failure matches main)
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11902351038/job/33207292238
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/11918150453/job/33215940950
(matches failure on main)
- [x] New/Existing tests provide coverage for changes
@mbahnasTT
Contributor Author

mbahnasTT commented Nov 27, 2024

@sjameelTT Hey Saad, for the transpose OPs in Segformer Encoder, the sharding spec of the input to transpose is:

  • input tensor = [256, 32], TILE, BFLOAT8_B , shard spec = [32, 32] on 8 cores, Height-sharded
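
A unit-test-style sketch of that configuration (hedged: it assumes the standard ttnn sharding helpers, and the transpose dims shown are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# [256, 32] input, TILE layout, BFLOAT8_B, height-sharded as [32, 32] shards on 8 cores.
memory_config = ttnn.create_sharded_memory_config(
    shape=(32, 32),
    core_grid=ttnn.CoreGrid(y=1, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 256, 32),
    dtype=ttnn.bfloat8_b,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=memory_config,
)
y = ttnn.transpose(x, -2, -1)  # WH transpose on the height-sharded input

ttnn.close_device(device)
```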

@mbahnasTT
Contributor Author

mbahnasTT commented Nov 27, 2024

Hey @ntarafdar and @nardoTT
Segformer decoder untilize sharding spec

  • input = [16384, 256] --> Height sharding on 32 cores --> spec = [512, 256]
  • input = [4096, 256] --> Height sharding on 32 cores --> spec = [128, 256]
  • input = [1024, 256] --> Height sharding on 64 cores --> spec = [32, 256]
  • input = [256, 256] --> Block sharding on 32 cores --> spec = [32, 32]

Segformer encoder untilize sharding spec

  • input = [16384, 32] --> Height sharding on 64 cores --> spec = [256, 32]
  • input = [16384, 128] --> Height sharding on 64 cores --> spec = [256, 128]
  • input = [4096, 64] --> Height sharding on 64 cores --> spec = [64, 64]
  • input = [4096, 256] --> Block sharding on 16 cores --> spec = [512, 128]
  • input = [1024, 160] --> Block sharding on 40 cores --> spec = [128, 32]
  • input = [1024, 640] --> Block sharding on 40 cores --> spec = [128, 128]
  • input = [256, 1024] --> Block sharding on 64 cores --> spec = [32, 128]
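
A sketch of how one of these cases could be exercised, using the encoder [16384, 32] entry (hedged: ttnn.to_layout is assumed to dispatch to untilize here, and the dtype is an illustrative choice):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Encoder case: input [16384, 32], height-sharded on 64 cores, shard spec [256, 32].
sharded_config = ttnn.create_sharded_memory_config(
    shape=(256, 32),
    core_grid=ttnn.CoreGrid(y=8, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 16384, 32),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

# Untilize the sharded tensor to row-major without converting it to interleaved first.
x_rm = ttnn.to_layout(x, ttnn.ROW_MAJOR_LAYOUT)

ttnn.close_device(device)
```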

@mbahnasTT
Contributor Author

Hey Jay, @jaykru-tt
@ntarafdar asked me to share with you the concat use case in Segformer:

  • The concat op requires an RM/interleaved input
  • My input is height-sharded since it's the output from upsample
  • To the concat, there are 4 inputs, coming from 4 different upsample OPs with different sharding schemes
  • So instead of S2I for the 4 inputs, can I keep them RM/height-sharded?
  • input = 4x [1,128,128,256] and concat at dim=3
  • output = [1,128,128,1024]
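
For concreteness, a sketch of the concat in question (hedged: the interleaved setup below is only illustrative; whether the four height-sharded upsample outputs can be passed to ttnn.concat directly is exactly the question being asked):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Four [1, 128, 128, 256] branches standing in for the four upsample outputs.
branches = [
    ttnn.from_torch(
        torch.randn(1, 128, 128, 256),
        dtype=ttnn.bfloat16,
        layout=ttnn.ROW_MAJOR_LAYOUT,
        device=device,
    )
    for _ in range(4)
]

fused = ttnn.concat(branches, dim=3)  # expected output shape: [1, 128, 128, 1024]

ttnn.close_device(device)
```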

@mbahnasTT
Contributor Author

Hey @bbradelTT, we'll need to add some features to improve the perf of the Segformer model, please:

  • Softmax to support sharded input
  • LayerNorm to support height sharded input

I can share more details.
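
For concreteness, the desired usage would look roughly like the sketch below (hedged: ttnn.softmax and ttnn.layer_norm already exist; the request is that they accept a sharded input such as the hypothetical one here without a sharded-to-interleaved round trip, and the shard shape shown is illustrative, not an exact Segformer spec):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# A hypothetical height-sharded activation: [4096, 256] split into [128, 256] shards on 32 cores.
sharded_config = ttnn.create_sharded_memory_config(
    shape=(128, 256),
    core_grid=ttnn.CoreGrid(y=4, x=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
    use_height_and_width_as_shard_shape=True,
)
x = ttnn.from_torch(
    torch.randn(1, 1, 4096, 256),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=sharded_config,
)

probs = ttnn.softmax(x, dim=-1)            # request: consume the sharded input as-is
normed = ttnn.layer_norm(x, epsilon=1e-5)  # request: same for height-sharded layer norm

ttnn.close_device(device)
```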

@mbahnasTT mbahnasTT assigned nardoTT and unassigned eyonland Nov 27, 2024
@nkpatel-tt
Contributor

@mbahnasTT could you please share details on the decoder as well? I am looking at optimising upsample for the decoder.

@mbahnasTT
Contributor Author

@nkpatel-tt here are some details:

@nardoTT
Contributor

nardoTT commented Dec 10, 2024

> Hey @ntarafdar and @nardoTT
> Segformer decoder untilize sharding spec […]
> Segformer encoder untilize sharding spec […]

The input layouts above are supported by the current untilize operation, and untilize has been tested for these inputs. The MLP operation explicitly converts the tensor layout from sharded to interleaved at line 114 of ttnn_segformer_mlp.py. If that line is commented out, the to_layout operation at line 57 of ttnn_segformer_decode_head.py would execute the untilize operation in sharded mode. I don't think the issue you are facing is untilize-related.
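
A hedged sketch of the two paths being contrasted above (the helper name is hypothetical; ttnn.sharded_to_interleaved and ttnn.to_layout are assumed from the public Python API):

```python
import ttnn

def mlp_output_to_row_major(x, keep_sharded: bool):
    """Illustrative only: contrast the explicit S2I conversion with keeping the tensor sharded."""
    if not keep_sharded:
        # Path the model currently takes: convert sharded -> interleaved before untilizing.
        x = ttnn.sharded_to_interleaved(x, ttnn.L1_MEMORY_CONFIG)
    # With the conversion skipped, this to_layout runs the untilize in sharded mode.
    return ttnn.to_layout(x, ttnn.ROW_MAJOR_LAYOUT)
```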

sjameelTT added a commit that referenced this issue Dec 17, 2024
…d do a minor refactor of ttnn::permute (#15881)

### Ticket
#14790 add transpose wh sharded implementation when shard shape < height
dimension
#15165 add N-d permute with width dimension
#15589 correct permute dimensionality when less than 4D
#15750 remove the composite flag from permute
#12550 re-enable some permute tests for blackhole
#12349 re-enable working transpose tests for blackhole
#16066 disable test uniform as it's stochastic

### Problem description
This PR addresses several permute and transpose problems all at once

- Transpose WH sharded does not currently work when the shard shape is
less than the height
- Permute on greater than 4 dimensions does not work when moving width
around (for both tiled and RM)
- The Permute kernel is single core when the width dimension doesn't change
- Permute has an unclean API in which we have a composite flag that is
not generically applicable
- Permute on less than 4 dimensions gets an incorrect output shape in
cases where it's a no-op
- Permute tests are disabled for BH due to LLK issues
- Transpose tests are disabled for BH due to LLK issues

### What's changed
- Add transpose WH sharded implementation for when shard shape is less
than the height dim (outputs a block sharded output)
- Add an N-d permute kernel that works generically on any row major
input. We have to call a global init each loop of the compute kernel as
transpose sets some registers that aren't cleared (there's no
transpose_uninit). This results in bad pcc when there's more than one
loop. For GS/BH, even the global init doesn't solve the problem so the
test is disabled. For Tiled, we need 5D untilize/tilize. This increases
sweeps coverage for permute from **50%** to **86%**
- For the optimized case where Permute's width dimension is not
shuffled, make the kernel multicore
- Remove the composite flag that is set by default and makes permute
non-generic (see the usage sketch below). This has caused forge models
to have bad pcc as they were not aware of this optional argument.
- Refactor ttnn::permute to add nop checks and correct shape
calculations
- Re-enable permute and transpose tests for blackhole

When replacing variants of transpose with this RM permute kernel, a lot
of tests on BH/GS failed, so I will do that in a follow-up to address.
The LLK issues are causing pains there. If we get N-d untilize/tilize
support and once the LLK issues are fixed, permute should have the
ability to be generic. The remaining issues for the pytorch 2.0 sweeps
after the untilize/tilize fix are the CB overflow on transpose wh, which
should be fixed out of the box when we replace the kernel that is used
(which I am not doing in this PR since it doesn't work for GS/BH atm).
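
A brief usage sketch of the N-d row-major permute path described above (hedged: the tensor shape and permutation are illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# 5-D row-major permute that moves the width dimension -- the case this PR enables.
x = ttnn.from_torch(
    torch.randn(2, 3, 4, 5, 6),
    dtype=ttnn.bfloat16,
    layout=ttnn.ROW_MAJOR_LAYOUT,
    device=device,
)
y = ttnn.permute(x, (0, 4, 2, 3, 1))  # no composite flag needed any more

ttnn.close_device(device)
```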

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782
(failing test is failing on main)
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12367175575
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357119737
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357115316
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes