test_uniform.py::test_uniform fails occasionally #16066

sjameelTT · 2024-12-16T20:10:35Z

I noticed this when it failed multiple times on my own change. There is a torch seed, but the ttnn uniform distribution creation is not seeded, so the results change every time. The rtol=0.5 is fairly high so it ends up passing the majority of times but it's flakey. If you run it several hundred times, even with the same seed it will fail at some point.

To reproduce:

Checkout branch: sjameel/test_uniform_failure and then build
pytest "tests/ttnn/unit_tests/operations/test_uniform.py::test_uniform[dtype=float32-rand_range=[-5.1, 1.2]-shape=[32, 32]]"

I printed out the outputs in the test. It changes every time, and generally fails at some point. If it doesn't fail the first try you should be able to rerun it until it does. We likely need to somehow add a seed to the ttnn api here to make the test work.

The text was updated successfully, but these errors were encountered:

tt-aho · 2024-12-16T21:50:38Z

I've also seen similar issues with the bernoulli op. Torch is seeded but the ttnn op is not seeded https://github.com/tenstorrent/tt-metal/actions/runs/12014719074/job/33492436738

sjameelTT · 2024-12-16T21:52:03Z

Yeah we need some sort of seed for the ttnn side as well, is this an existing capability for the device?

sjameelTT · 2024-12-16T22:12:55Z

Seems like rand_init has a seed that is always a random value.

BuiChiTrung · 2024-12-17T03:29:16Z

I set the rtol pretty high (0.5) due to the fact that tensix instruction used to generate random TTI_SFPMOV(0, 9, p_sfpu::LREG0, 8); is not pseudo-random (which is reported here, followed up issue).

To ensure consistency in the test results, I could add a seed parameter to the Uniform and Bernoulli ops. If you guys agree with this workaround, I'll open a PR for implementation.

sjameelTT · 2024-12-17T16:25:41Z

Yes, a seed parameter would be good for consistency. I see there already is a seed set, it's just not exposed at the API and it's always randomly initialized. You can test several seeds in your tests.

…d do a minor refactor of ttnn::permute (#15881) ### Ticket #14790 add transpose wh sharded implementation when shard shape < height dimension #15165 add N-d permute with width dimension #15589 correct permute dimensionality when less than 4D #15750 remove the composite flag from permute #12550 re-enable some permute tests for blackhole #12349 re-enable working transpose tests for blackhole #16066 disable test uniform as it's stochastic ### Problem description This PR addresses several permute and transpose problems all at once - Transpose WH sharded does not currently work when the shard shape is less than the height - Permute on greater than 4 dimensions does not work when moving width around (for both tiled and RM) - The Permute kernel when width doesn't change is single core - Permute has an unclean API in which we have a composite flag that is not generically applicable - Permute on less than 4 dimensions gets an incorrect output shape in cases where it's a no-op - Permute tests are disabled for BH due to LLK issues - Transpose tests are disabled for BH due to LLK issues ### What's changed - Add transpose WH sharded implementation for when shard shape is less than the height dim (outputs a block sharded output) - Add an N-d permute kernel that works generically on any row major input. We have to call a global init each loop of the compute kernel as transpose sets some registers that aren't cleared (there's no transpose_uninit). This results in bad pcc when there's more than one loop. For GS/BH, even the global init doesn't solve the problem so the test is disabled. For Tiled, we need 5D untilize/tilize. This increases sweeps coverage for permute from **50%** to **86%** - For the optimized case where Permute's width dimension is not shuffled, make the kernel multicore - Remove composite flag that is default set to to make permute non-generic. This has caused forge models to have bad pcc as they were not aware of this optional argument. - Refactor ttnn::permute to add nop checks and correct shape calculations - Re-enable permute and transpose tests for blackhole When replacing variants of transpose with this RM permute kernel, a lot of tests on BH/GS failed, so I will do that in a follow-up to address. The LLK issues are causing pains there. If we get N-d untilize/tilize support and once the LLK issues are fixed, permute should have the ability to be generic. The remaining issues for the pytorch 2.0 sweeps after the untilize/tilize fix are the CB overflow on transpose wh, which should be fixed out of the box when we replace the kernel that is used (which I am not doing in this PR since it doesn't work for GS/BH atm). ### Checklist - [x] Post commit CI passes https://github.com/tenstorrent/tt-metal/actions/runs/12367177499/job/34547311782 (failing test is failing on main) - [x] Blackhole Post commit (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12367175575 - [x] Model regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12357119737 - [x] Device performance regression CI testing passes (if applicable) https://github.com/tenstorrent/tt-metal/actions/runs/12357115316 - [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes - [ ] New/Existing tests provide coverage for changes

BuiChiTrung · 2024-12-19T15:36:14Z

Here is the PR to fix this issue: #16179 . Plz help me ask TT reviewers to take a look at it.

mrshaw01 · 2024-12-26T18:12:43Z

@sjameelTT Would you ping your team review Trung's PR?

sjameelTT · 2024-12-26T22:29:25Z

I reviewed it, lgtm

### Ticket Link to Github Issue: #16066 ### Problem description The uniform and bernoulli operation unit-tests behavior are un-consistent because a random seed is created inside the program factory. ### What's changed - Add seed param to uniform and bernoulli operation, which allows passing a seed to these ops in python side. If seed = 0 or not passed, a random seed is generated in the program factory. - Add a custom compute program hash function for these ops so that seed param is not included in the the program hash. - Update pytest for these operations ### Checklist - [x] Post commit CI passes https://github.com/tenstorrent/tt-metal/actions/runs/12411880762 - [ ] Blackhole Post commit (if applicable) - [ ] Model regression CI testing passes (if applicable) - [ ] Device performance regression CI testing passes (if applicable) - [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes - [x] New/Existing tests provide coverage for changes --------- Co-authored-by: Michael Chiou <[email protected]>

BuiChiTrung · 2024-12-27T06:33:41Z

Resolved in this #16179

### Ticket Link to Github Issue: #16066 ### Problem description The uniform and bernoulli operation unit-tests behavior are un-consistent because a random seed is created inside the program factory. ### What's changed - Add seed param to uniform and bernoulli operation, which allows passing a seed to these ops in python side. If seed = 0 or not passed, a random seed is generated in the program factory. - Add a custom compute program hash function for these ops so that seed param is not included in the the program hash. - Update pytest for these operations ### Checklist - [x] Post commit CI passes https://github.com/tenstorrent/tt-metal/actions/runs/12411880762 - [ ] Blackhole Post commit (if applicable) - [ ] Model regression CI testing passes (if applicable) - [ ] Device performance regression CI testing passes (if applicable) - [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes - [x] New/Existing tests provide coverage for changes --------- Co-authored-by: Michael Chiou <[email protected]>

sjameelTT assigned BuiChiTrung Dec 16, 2024

sjameelTT added a commit that referenced this issue Dec 16, 2024

#0: skip test uniform due to #16066

8fb31d3

sjameelTT added a commit that referenced this issue Dec 16, 2024

#0: skip test uniform due to #16066

546d29d

sjameelTT mentioned this issue Dec 16, 2024

Add transpose WH sharded, generalize row major permute when N > 4, and do a minor refactor of ttnn::permute #15881

Merged

6 tasks

sjameelTT added a commit that referenced this issue Dec 16, 2024

#0: skip test uniform due to #16066

1b1fcf3

sjameelTT added a commit that referenced this issue Dec 17, 2024

#0: skip test uniform due to #16066

0401154

BuiChiTrung mentioned this issue Dec 19, 2024

#16066: Add seed param to uniform and bernoulli ops #16179

Merged

6 tasks

BuiChiTrung added a commit that referenced this issue Dec 19, 2024

#16066: rollback submodule ver

86070e0

BuiChiTrung added a commit that referenced this issue Dec 19, 2024

#16066: rollback tensor_ops

88dab0e

BuiChiTrung added a commit that referenced this issue Dec 19, 2024

#16066: refactor

48c5faf

BuiChiTrung added a commit that referenced this issue Dec 19, 2024

#16066: update test

c5c3748

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: rollback submodule ver

e387f14

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: rollback tensor_ops

8f76495

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: refactor

64421ea

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: update test

fc59551

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: added cmt to test

5f00382

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: rollback submodule ver

84d5a79

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: rollback tensor_ops

403b3a2

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: refactor

3da2643

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: update test

b3e2ac0

BuiChiTrung added a commit that referenced this issue Dec 21, 2024

#16066: added cmt to test

9059e7b

BuiChiTrung added a commit that referenced this issue Dec 24, 2024

#16066: rollback submodule ver

024c66c

BuiChiTrung added a commit that referenced this issue Dec 24, 2024

#16066: rollback tensor_ops

569005e

BuiChiTrung added a commit that referenced this issue Dec 24, 2024

#16066: refactor

fe8ae94

BuiChiTrung added a commit that referenced this issue Dec 24, 2024

#16066: update test

959e2d7

BuiChiTrung added a commit that referenced this issue Dec 24, 2024

#16066: added cmt to test

d1ad697

BuiChiTrung added a commit that referenced this issue Dec 27, 2024

#16066: rollback submodule ver

d859d02

BuiChiTrung added a commit that referenced this issue Dec 27, 2024

#16066: rollback tensor_ops

e378bf4

BuiChiTrung added a commit that referenced this issue Dec 27, 2024

#16066: refactor

aeb1a45

BuiChiTrung added a commit that referenced this issue Dec 27, 2024

#16066: update test

299e919

BuiChiTrung added a commit that referenced this issue Dec 27, 2024

#16066: added cmt to test

89db1f7

BuiChiTrung closed this as completed Dec 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_uniform.py::test_uniform fails occasionally #16066

test_uniform.py::test_uniform fails occasionally #16066

sjameelTT commented Dec 16, 2024

tt-aho commented Dec 16, 2024

sjameelTT commented Dec 16, 2024 •

edited

Loading

sjameelTT commented Dec 16, 2024

BuiChiTrung commented Dec 17, 2024 •

edited

Loading

sjameelTT commented Dec 17, 2024 •

edited

Loading

BuiChiTrung commented Dec 19, 2024

mrshaw01 commented Dec 26, 2024

sjameelTT commented Dec 26, 2024

BuiChiTrung commented Dec 27, 2024

test_uniform.py::test_uniform fails occasionally #16066

test_uniform.py::test_uniform fails occasionally #16066

Comments

sjameelTT commented Dec 16, 2024

tt-aho commented Dec 16, 2024

sjameelTT commented Dec 16, 2024 • edited Loading

sjameelTT commented Dec 16, 2024

BuiChiTrung commented Dec 17, 2024 • edited Loading

sjameelTT commented Dec 17, 2024 • edited Loading

BuiChiTrung commented Dec 19, 2024

mrshaw01 commented Dec 26, 2024

sjameelTT commented Dec 26, 2024

BuiChiTrung commented Dec 27, 2024

sjameelTT commented Dec 16, 2024 •

edited

Loading

BuiChiTrung commented Dec 17, 2024 •

edited

Loading

sjameelTT commented Dec 17, 2024 •

edited

Loading