Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_uniform.py::test_uniform fails occasionally #16066

Closed
sjameelTT opened this issue Dec 16, 2024 · 9 comments
Closed

test_uniform.py::test_uniform fails occasionally #16066

sjameelTT opened this issue Dec 16, 2024 · 9 comments
Assignees

Comments

@sjameelTT
Copy link
Contributor

I noticed this when it failed multiple times on my own change. There is a torch seed, but the ttnn uniform distribution creation is not seeded, so the results change every time. The rtol=0.5 is fairly high so it ends up passing the majority of times but it's flakey. If you run it several hundred times, even with the same seed it will fail at some point.

To reproduce:

  1. Checkout branch: sjameel/test_uniform_failure and then build
  2. pytest "tests/ttnn/unit_tests/operations/test_uniform.py::test_uniform[dtype=float32-rand_range=[-5.1, 1.2]-shape=[32, 32]]"

I printed out the outputs in the test. It changes every time, and generally fails at some point. If it doesn't fail the first try you should be able to rerun it until it does. We likely need to somehow add a seed to the ttnn api here to make the test work.

@tt-aho
Copy link
Contributor

tt-aho commented Dec 16, 2024

I've also seen similar issues with the bernoulli op. Torch is seeded but the ttnn op is not seeded https://github.com/tenstorrent/tt-metal/actions/runs/12014719074/job/33492436738

@sjameelTT
Copy link
Contributor Author

sjameelTT commented Dec 16, 2024

Yeah we need some sort of seed for the ttnn side as well, is this an existing capability for the device?

sjameelTT added a commit that referenced this issue Dec 16, 2024
@sjameelTT
Copy link
Contributor Author

Seems like rand_init has a seed that is always a random value.

@BuiChiTrung
Copy link
Contributor

BuiChiTrung commented Dec 17, 2024

I set the rtol pretty high (0.5) due to the fact that tensix instruction used to generate random TTI_SFPMOV(0, 9, p_sfpu::LREG0, 8); is not pseudo-random (which is reported here, followed up issue).

To ensure consistency in the test results, I could add a seed parameter to the Uniform and Bernoulli ops. If you guys agree with this workaround, I'll open a PR for implementation.

sjameelTT added a commit that referenced this issue Dec 17, 2024
@sjameelTT
Copy link
Contributor Author

sjameelTT commented Dec 17, 2024

Yes, a seed parameter would be good for consistency. I see there already is a seed set, it's just not exposed at the API and it's always randomly initialized. You can test several seeds in your tests.

sjameelTT added a commit that referenced this issue Dec 17, 2024
…d do a minor refactor of ttnn::permute (#15881)

### Ticket
#14790 add transpose wh sharded implementation when shard shape < height
dimension
#15165 add N-d permute with width dimension
#15589 correct permute dimensionality when less than 4D
#15750 remove the composite flag from permute
#12550 re-enable some permute tests for blackhole
#12349 re-enable working transpose tests for blackhole
#16066 disable test uniform as it's stochastic

### Problem description
This PR addresses several permute and transpose problems all at once

- Transpose WH sharded does not currently work when the shard shape is
less than the height
- Permute on greater than 4 dimensions does not work when moving width
around (for both tiled and RM)
- The Permute kernel when width doesn't change is single core
- Permute has an unclean API in which we have a composite flag that is
not generically applicable
- Permute on less than 4 dimensions gets an incorrect output shape in
cases where it's a no-op
- Permute tests are disabled for BH due to LLK issues
- Transpose tests are disabled for BH due to LLK issues

### What's changed
- Add transpose WH sharded implementation for when shard shape is less
than the height dim (outputs a block sharded output)
- Add an N-d permute kernel that works generically on any row major
input. We have to call a global init each loop of the compute kernel as
transpose sets some registers that aren't cleared (there's no
transpose_uninit). This results in bad pcc when there's more than one
loop. For GS/BH, even the global init doesn't solve the problem so the
test is disabled. For Tiled, we need 5D untilize/tilize. This increases
sweeps coverage for permute from **50%** to **86%**
- For the optimized case where Permute's width dimension is not
shuffled, make the kernel multicore
- Remove composite flag that is default set to to make permute
non-generic. This has caused forge models to have bad pcc as they were
not aware of this optional argument.
- Refactor ttnn::permute to add nop checks and correct shape
calculations
- Re-enable permute and transpose tests for blackhole

When replacing variants of transpose with this RM permute kernel, a lot
of tests on BH/GS failed, so I will do that in a follow-up to address.
The LLK issues are causing pains there. If we get N-d untilize/tilize
support and once the LLK issues are fixed, permute should have the
ability to be generic. The remaining issues for the pytorch 2.0 sweeps
after the untilize/tilize fix are the CB overflow on transpose wh, which
should be fixed out of the box when we replace the kernel that is used
(which I am not doing in this PR since it doesn't work for GS/BH atm).

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782
(failing test is failing on main)
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12367175575
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357119737
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12357115316
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
BuiChiTrung added a commit that referenced this issue Dec 19, 2024
BuiChiTrung added a commit that referenced this issue Dec 19, 2024
BuiChiTrung added a commit that referenced this issue Dec 19, 2024
BuiChiTrung added a commit that referenced this issue Dec 19, 2024
@BuiChiTrung
Copy link
Contributor

Here is the PR to fix this issue: #16179 . Plz help me ask TT reviewers to take a look at it.

BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 21, 2024
BuiChiTrung added a commit that referenced this issue Dec 24, 2024
BuiChiTrung added a commit that referenced this issue Dec 24, 2024
BuiChiTrung added a commit that referenced this issue Dec 24, 2024
BuiChiTrung added a commit that referenced this issue Dec 24, 2024
BuiChiTrung added a commit that referenced this issue Dec 24, 2024
@mrshaw01
Copy link
Contributor

@sjameelTT Would you ping your team review Trung's PR?

@sjameelTT
Copy link
Contributor Author

I reviewed it, lgtm

BuiChiTrung added a commit that referenced this issue Dec 27, 2024
BuiChiTrung added a commit that referenced this issue Dec 27, 2024
BuiChiTrung added a commit that referenced this issue Dec 27, 2024
BuiChiTrung added a commit that referenced this issue Dec 27, 2024
BuiChiTrung added a commit that referenced this issue Dec 27, 2024
BuiChiTrung added a commit that referenced this issue Dec 27, 2024
### Ticket
Link to Github Issue: #16066 

### Problem description
The uniform and bernoulli operation unit-tests behavior are
un-consistent because a random seed is created inside the program
factory.

### What's changed

- Add seed param to uniform and bernoulli operation, which allows
passing a seed to these ops in python side. If seed = 0 or not passed, a
random seed is generated in the program factory.
- Add a custom compute program hash function for these ops so that seed
param is not included in the the program hash.
- Update pytest for these operations

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12411880762
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [x] New/Existing tests provide coverage for changes

---------

Co-authored-by: Michael Chiou <[email protected]>
@BuiChiTrung
Copy link
Contributor

Resolved in this #16179

arikTT pushed a commit that referenced this issue Dec 27, 2024
### Ticket
Link to Github Issue: #16066 

### Problem description
The uniform and bernoulli operation unit-tests behavior are
un-consistent because a random seed is created inside the program
factory.

### What's changed

- Add seed param to uniform and bernoulli operation, which allows
passing a seed to these ops in python side. If seed = 0 or not passed, a
random seed is generated in the program factory.
- Add a custom compute program hash function for these ops so that seed
param is not included in the the program hash.
- Update pytest for these operations

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12411880762
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [x] New/Existing tests provide coverage for changes

---------

Co-authored-by: Michael Chiou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants