Add a more meaningful check to make sure we are not merging blocks #4186

giuseros · 2024-06-21T16:21:38Z

This is a follow-up to #4176 (comment)

I am now counting the number of blocks with (17) and without (31) block merging. I double checked to make sure this does not pass when we use an aggressive region simplification strategy.

giuseros · 2024-06-21T16:21:48Z

cc @zhanglx13

zhanglx13 · 2024-06-21T16:50:27Z

Since you are counting blocks, can we reduce the test size as a minimal to just check that we are not merging any blocks at all?

giuseros · 2024-06-21T17:16:52Z

Since you are counting blocks, can we reduce the test size as a minimal to just check that we are not merging any blocks at all?

I tried to reduce the test so that if we move to aggressive we will see the explosion (i.e., a lot of redundancy in the block). Otherwise people might not see any reason to use a normal strategy

zhanglx13 · 2024-06-21T17:56:15Z

Since you are counting blocks, can we reduce the test size as a minimal to just check that we are not merging any blocks at all?

I tried to reduce the test so that if we move to aggressive we will see the explosion (i.e., a lot of redundancy in the block). Otherwise people might not see any reason to use a normal strategy

This test is not trying to persuade people to use normal or aggressive when it comes to simply-region. This test is just to confirm we disabled block merging during builtin-to-llvm pass. And it suffices to show that w/ and w/o the patch, there are 2 blocks vs 1 (or some minimal number). We don't need 31 vs 17.

giuseros · 2024-06-22T07:14:22Z

Done. I don't have strong opinions on tests, but we are disabling an optimization (and an important one, especially on GPUs), and I think we should also explain why. It's true that I explained this in the comments, but the original test was carefully written to show that if you try to move to aggressive simplification, the block parameters would exponentially increase.

zhanglx13 · 2024-06-22T15:03:38Z

Done. I don't have strong opinions on tests, but we are disabling an optimization (and an important one, especially on GPUs), and I think we should also explain why. It's true that I explained this in the comments, but the original test was carefully written to show that if you try to move to aggressive simplification, the block parameters would exponentially increase.

Correct me if I'm wrong, but if the pass happens during canonicalization, it should not be considered as an optimization.
On the other hand, the original problem is long compilation time (if not hang). Showing a huge list of block parameters does not tell anything about how long it takes to compile.

giuseros · 2024-06-22T17:39:19Z

Correct me if I'm wrong, but if the pass happens during canonicalization, it should not be considered as an optimization.

Well, yes and no. There is long discussion about exactly this topic: https://discourse.llvm.org/t/rfc-update-to-general-design-section-of-operation-canonicalizations-in-mlir/ . I would consider block merging an optimization and not a canonicalization, but I guess it is a gray area

Showing a huge list of block parameters does not tell anything about how long it takes to compile.

I guess this is also about personal opinion. The huge list of block parameters is the reason of why it takes so much time to compile. I mean if you run the (original) test with aggressive you would see exactly the issue we are dealing with. But it's not super important, we can add the test later, once we fix the root problem (i.e., we reduce the block parameter list)

…4186) This is a follow-up to #4176 (comment) I am now counting the number of blocks with (17) and without (31) block merging. I double checked to make sure this does not pass when we use an aggressive region simplification strategy.

Update Update Update Update Add a more meaningful check to make sure we are not merging blocks (#4186) This is a follow-up to #4176 (comment) I am now counting the number of blocks with (17) and without (31) block merging. I double checked to make sure this does not pass when we use an aggressive region simplification strategy. [AMD] Skip mfma layout in maybeDuplicate (#4170) The workaround introduced in #4048 "forgot" to skip mfma layout. [TEST] Merge duplicate `max_num_imprecise_acc` tests and improve code (#4191) [DOCS][NFC] Fix doc formatting problems (#4195) 1. f-string cannot be used as docstrings in Python. 2. URLs should follow the reStructuredText format. 3. Code snippets in a code block should be indented. Tested and passed on a local machine. [BACKEND] Fix regression in pipeliner pre-checks. (#4196) During some previous refactoring we changed the logic and started pipeling cases that had incompatible shared encoding. This was missed because one of the lit test had not been updated :( Remove tl.multiple_of call from tma persistent kernel (#4198) [AMD] Guard against null in `BypassEpilogueSMEM` (#4203) `val.getDefiningOp()` can return `nullptr`. In this case, we must fail the `BypassEpilogueSMEM` rewrite pass for the given op. This prevents run-time crashes. [FRONTEND][NFC] Fix type checking, conditional logic, and loop structures for improved readability and performance (#4208) Document TRITON_HOME (#4210) Document the existence of `TRITON_HOME` environment variable. The `TRITON_HOME` variable controls the location of the `.triton` directory that stores, among other things, the files downloaded during a `pip install -e python` virtualenv build. By default, this is located in the user's home directory, at `~/.triton`. I was trying to build Triton on my system on a large local disk, but with limited network home directory space, and the `pip` command kept failing with out of disk space errors. It turned out that during installation, large files were downloaded to the `~/.triton` directory causing failure. After checking that it was not `pip` doing this, I found the `TRITON_HOME` variable which allowed me to workaround the issue and build Triton successfully. After seconding #4007, I decided to contribute this documentation fix. Co-authored-by: sree <sree@buckyball> [BACKEND] Fix regression in i1 reduction (#4215) Recent refactoring broke i1 shared memory load. [BUILD] update URL for LLVM tarballs (#4216) [BACKEND] Fix divisibility analysis for shift ops (#4221) Divisibility does not ensure that a value is not 0 therefore we cannot use divisibility as a minimum shifted values. Support FP8 constant (#4222) To unblock the compilation of kernels like below which don't operate arithmetically on FP8. ``` @triton.jit def triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 400624 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel x0 = xindex % 784 x1 = (xindex // 784) x2 = xindex tmp0 = x0 tmp1 = tl.full([1], 769, tl.int64) tmp2 = tmp0 < tmp1 tmp3 = tl.load(in_ptr0 + (x0 + (769*x1)), tmp2 & xmask, other=0.0) tmp4 = tmp3.to(tl.float8e4nv) tmp5 = tl.full(tmp4.shape, 0.0, tmp4.dtype) tmp6 = tl.where(tmp2, tmp4, tmp5) tl.store(out_ptr0 + (x2), tmp6, xmask) ``` [INTERPRETER] Implement implicit tensor conversion for assignment operators (#4214) Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update

…riton-lang#4186) This is a follow-up to triton-lang#4176 (comment) I am now counting the number of blocks with (17) and without (31) block merging. I double checked to make sure this does not pass when we use an aggressive region simplification strategy.

Add a more meaningful check to make sure we are not merging blocks

28d7fc4

giuseros requested a review from ptillet as a code owner June 21, 2024 16:21

Address review feedback

7623334

giuseros force-pushed the better_block_merging_checks branch from 8dc3ed2 to 7623334 Compare June 22, 2024 07:17

zhanglx13 approved these changes Jun 22, 2024

View reviewed changes

zhanglx13 merged commit c7a37a9 into triton-lang:main Jun 22, 2024
6 checks passed

jlebar mentioned this pull request Jun 25, 2024

Pass repr=key when calling JITFunction.cache_hook #4207

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a more meaningful check to make sure we are not merging blocks #4186

Add a more meaningful check to make sure we are not merging blocks #4186

giuseros commented Jun 21, 2024

giuseros commented Jun 21, 2024

zhanglx13 commented Jun 21, 2024 •

edited

Loading

giuseros commented Jun 21, 2024 •

edited

Loading

zhanglx13 commented Jun 21, 2024

giuseros commented Jun 22, 2024

zhanglx13 commented Jun 22, 2024

giuseros commented Jun 22, 2024

Add a more meaningful check to make sure we are not merging blocks #4186

Add a more meaningful check to make sure we are not merging blocks #4186

Conversation

giuseros commented Jun 21, 2024

giuseros commented Jun 21, 2024

zhanglx13 commented Jun 21, 2024 • edited Loading

giuseros commented Jun 21, 2024 • edited Loading

zhanglx13 commented Jun 21, 2024

giuseros commented Jun 22, 2024

zhanglx13 commented Jun 22, 2024

giuseros commented Jun 22, 2024

zhanglx13 commented Jun 21, 2024 •

edited

Loading

giuseros commented Jun 21, 2024 •

edited

Loading