# test_uniform.py::test_uniform fails occasionally #16066
I've also seen similar issues with the bernoulli op. Torch is seeded, but the ttnn op is not: https://github.com/tenstorrent/tt-metal/actions/runs/12014719074/job/33492436738
Yeah, we need some sort of seed on the ttnn side as well. Is this an existing capability of the device?
It seems like `rand_init` has a seed, but it's always a random value.
I set the rtol pretty high (0.5) because the tensix instruction is used to generate random values. To ensure consistency in the test results, I could add a seed parameter to the Uniform and Bernoulli ops. If you agree with this workaround, I'll open a PR for the implementation.
Yes, a seed parameter would be good for consistency. I see there is already a seed set; it's just not exposed at the API level and is always randomly initialized. You can test several seeds in your tests.
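To make this concrete, here is a minimal sketch of what a parametrized, seeded test could look like once the seed is exposed. The `seed=` keyword on `ttnn.uniform` and the `device` fixture are assumptions based on the proposal above, not the confirmed final API:

```python
import pytest
import torch
import ttnn


@pytest.mark.parametrize("seed", [42, 1234, 2024])
def test_uniform_with_seed(device, seed):
    torch.manual_seed(0)
    input_tensor = ttnn.from_torch(
        torch.zeros(32, 32, dtype=torch.float32),
        layout=ttnn.TILE_LAYOUT,
        device=device,
    )
    # Assumed keyword: `seed=` is the parameter proposed above; the
    # exact name/position in the final API may differ.
    out_a = ttnn.to_torch(ttnn.uniform(input_tensor, -5.1, 1.2, seed=seed))
    out_b = ttnn.to_torch(ttnn.uniform(input_tensor, -5.1, 1.2, seed=seed))
    # With a fixed seed, repeated calls should produce identical samples.
    assert torch.equal(out_a, out_b)
```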
…d do a minor refactor of ttnn::permute (#15881)

### Ticket
- #14790 add transpose wh sharded implementation when shard shape < height dimension
- #15165 add N-d permute with width dimension
- #15589 correct permute dimensionality when less than 4D
- #15750 remove the composite flag from permute
- #12550 re-enable some permute tests for blackhole
- #12349 re-enable working transpose tests for blackhole
- #16066 disable test_uniform as it's stochastic

### Problem description
This PR addresses several permute and transpose problems at once:
- Transpose WH sharded does not currently work when the shard shape is less than the height
- Permute on more than 4 dimensions does not work when moving width around (for both tiled and RM)
- The permute kernel is single-core when the width dimension doesn't change
- Permute has an unclean API, with a composite flag that is not generically applicable
- Permute on fewer than 4 dimensions gets an incorrect output shape in cases where it's a no-op
- Permute tests are disabled for BH due to LLK issues
- Transpose tests are disabled for BH due to LLK issues

### What's changed
- Add a transpose WH sharded implementation for when the shard shape is less than the height dim (outputs a block-sharded output)
- Add an N-d permute kernel that works generically on any row-major input. We have to call a global init on each loop of the compute kernel, as transpose sets some registers that aren't cleared (there's no transpose_uninit); this results in bad PCC when there's more than one loop. For GS/BH, even the global init doesn't solve the problem, so the test is disabled. For tiled inputs, we need 5D untilize/tilize. This increases sweeps coverage for permute from **50%** to **86%**.
- For the optimized case where permute's width dimension is not shuffled, make the kernel multicore
- Remove the composite flag, which was set by default and made permute non-generic. This caused forge models to have bad PCC, as they were not aware of this optional argument.
- Refactor ttnn::permute to add no-op checks and correct shape calculations (see the sketch after this description)
- Re-enable permute and transpose tests for blackhole

When replacing variants of transpose with this RM permute kernel, a lot of tests on BH/GS failed, so I will address that in a follow-up; the LLK issues are causing pains there. Once we get N-d untilize/tilize support and the LLK issues are fixed, permute should be able to be fully generic. The remaining issue for the pytorch 2.0 sweeps after the untilize/tilize fix is the CB overflow on transpose WH, which should be fixed out of the box when we replace the kernel in use (which I am not doing in this PR since it doesn't work for GS/BH at the moment).

### Checklist
- [x] Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782 (failing test is failing on main)
- [x] Blackhole post commit (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12367175575
- [x] Model regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12357119737
- [x] Device performance regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12357115316
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests pass
- [ ] New/existing tests provide coverage for changes
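As a rough illustration of the no-op check and output-shape correction mentioned in the refactor bullet above (Python with torch for illustration only; the actual ttnn implementation is C++):

```python
import torch


def is_noop_permute(dims: tuple[int, ...]) -> bool:
    # An identity permutation maps every axis to itself.
    return all(i == d for i, d in enumerate(dims))


x = torch.randn(2, 3)  # rank-2 input, i.e. fewer than 4 dimensions
dims = (0, 1)          # identity permutation on a 2-D tensor

# A no-op permute must return the input shape unchanged; the bug
# described above produced an incorrect output shape in this case.
out = x if is_noop_permute(dims) else x.permute(dims)
assert out.shape == x.shape
```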
Here is the PR to fix this issue: #16179. Please help me ask TT reviewers to take a look at it.
@sjameelTT Would you ping your team to review Trung's PR?
I reviewed it, LGTM.
### Ticket
Link to GitHub issue: #16066

### Problem description
The uniform and bernoulli operation unit tests behave inconsistently because a random seed is created inside the program factory.

### What's changed
- Add a seed param to the uniform and bernoulli operations, which allows passing a seed to these ops from the Python side. If seed = 0 or not passed, a random seed is generated in the program factory.
- Add a custom compute program hash function for these ops so that the seed param is not included in the program hash (a conceptual sketch follows below)
- Update pytest for these operations

### Checklist
- [x] Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/12411880762
- [ ] Blackhole post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests pass
- [x] New/existing tests provide coverage for changes

Co-authored-by: Michael Chiou <[email protected]>
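To illustrate why the seed should be excluded from the program hash: if the seed were hashed, every fresh random seed would look like a new program and defeat the program cache. A minimal Python sketch of the idea (the real implementation is C++ inside tt-metal, and all names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class UniformOpAttributes:
    low: float   # lower bound of the distribution
    high: float  # upper bound of the distribution
    seed: int    # 0 means "generate a random seed in the program factory"


def compute_program_hash(attrs: UniformOpAttributes) -> int:
    # Deliberately leave attrs.seed out of the hash: two invocations that
    # differ only in seed reuse the same compiled program, with the seed
    # supplied as a runtime argument instead.
    return hash(("uniform", attrs.low, attrs.high))


# Same hash for different seeds -> the cached program is reused.
assert compute_program_hash(UniformOpAttributes(-5.1, 1.2, seed=42)) == \
       compute_program_hash(UniformOpAttributes(-5.1, 1.2, seed=7))
```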
Resolved in #16179.
I noticed this when it failed multiple times on my own change. There is a torch seed, but the ttnn uniform distribution creation is not seeded, so the results change every time. The `rtol=0.5` is fairly high, so it ends up passing the majority of the time, but it's flaky. If you run it several hundred times, even with the same torch seed it will fail at some point. To reproduce:
pytest "tests/ttnn/unit_tests/operations/test_uniform.py::test_uniform[dtype=float32-rand_range=[-5.1, 1.2]-shape=[32, 32]]"
I printed out the outputs in the test. They change every time, and the test generally fails at some point. If it doesn't fail on the first try, you should be able to rerun it until it does. We likely need to add a seed to the ttnn API here to make the test reliable.
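For convenience, here is a small hypothetical repro harness (not part of the repo) that reruns the test in a fresh process until it trips; the test path comes from the command above and the attempt budget is arbitrary:

```python
import subprocess
import sys

TEST = ("tests/ttnn/unit_tests/operations/test_uniform.py::"
        "test_uniform[dtype=float32-rand_range=[-5.1, 1.2]-shape=[32, 32]]")

for attempt in range(1, 301):
    # pytest exits with a nonzero return code on any test failure.
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", TEST])
    if result.returncode != 0:
        print(f"Failed on attempt {attempt}")
        break
else:
    print("No failure in 300 attempts")
```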