[BACKEND] Fix regression in i1 reduction #4215
Merged
Conversation
Recent refactoring broke i1 shared memory load.
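For context, here is a minimal, hypothetical sketch (not taken from this PR; kernel and helper names are invented) of the kind of kernel that exercises the affected path: reducing an i1 (boolean) tensor across a block that spans more than one warp, so that partial results typically travel through shared memory.

```
# Hypothetical repro sketch: reduce an i1 (boolean) tensor across a block.
# With BLOCK spanning several warps, the cross-warp step of the reduction
# typically passes partial results through shared memory -- the i1 load
# path this PR repairs.
import torch
import triton
import triton.language as tl


@triton.jit
def _logical_or(a, b):
    return a | b


@triton.jit
def any_positive_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0)
    pred = x > 0                              # i1 tensor
    result = tl.reduce(pred, 0, _logical_or)  # i1 reduction over the block
    tl.store(out_ptr, result.to(tl.int32))


x = torch.randint(0, 2, (1024,), device="cuda", dtype=torch.int32)
out = torch.empty(1, device="cuda", dtype=torch.int32)
any_positive_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
print(bool(out.item()))
```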
ptillet approved these changes on Jun 26, 2024.
Jokeren pushed a commit that referenced this pull request on Jul 1, 2024:
Recent refactoring broke i1 shared memory load.
Jokeren added a commit that referenced this pull request on Jul 3, 2024:
The squashed commit message lists the changes it rolls up:

- Add a more meaningful check to make sure we are not merging blocks (#4186). This is a follow-up to #4176 (comment): it now counts the number of blocks with (17) and without (31) block merging, and was double-checked to make sure it does not pass when an aggressive region simplification strategy is used.
- [AMD] Skip mfma layout in `maybeDuplicate` (#4170). The workaround introduced in #4048 "forgot" to skip the mfma layout.
- [TEST] Merge duplicate `max_num_imprecise_acc` tests and improve code (#4191).
- [DOCS][NFC] Fix doc formatting problems (#4195). f-strings cannot be used as docstrings in Python, URLs should follow the reStructuredText format, and code snippets in a code block should be indented. Tested and passed on a local machine.
- [BACKEND] Fix regression in pipeliner pre-checks (#4196). A previous refactoring changed the logic and started pipelining cases that had incompatible shared encodings; this was missed because one of the lit tests had not been updated.
- Remove `tl.multiple_of` call from the TMA persistent kernel (#4198).
- [AMD] Guard against null in `BypassEpilogueSMEM` (#4203). `val.getDefiningOp()` can return `nullptr`; in that case the `BypassEpilogueSMEM` rewrite must fail for the given op, which prevents run-time crashes.
- [FRONTEND][NFC] Fix type checking, conditional logic, and loop structures for improved readability and performance (#4208).
- Document TRITON_HOME (#4210). Documents the `TRITON_HOME` environment variable, which controls the location of the `.triton` directory that stores, among other things, the files downloaded during a `pip install -e python` virtualenv build. By default it is located in the user's home directory at `~/.triton`. The author was building Triton on a machine with a large local disk but limited network home-directory space, and `pip` kept failing with out-of-disk-space errors because large files were downloaded to `~/.triton` during installation; setting `TRITON_HOME` worked around the issue. After seconding #4007, they contributed this documentation fix. Co-authored-by: sree <sree@buckyball>
- [BACKEND] Fix regression in i1 reduction (#4215). Recent refactoring broke i1 shared memory load.
- [BUILD] Update URL for LLVM tarballs (#4216).
- [BACKEND] Fix divisibility analysis for shift ops (#4221). Divisibility does not ensure that a value is non-zero, so the divisibility of a shift amount cannot be used as its minimum value (a shift amount divisible by 8 may still be 0, in which case nothing is shifted).
- Support FP8 constant (#4222). Unblocks compilation of kernels like the one below, which create FP8 values without operating on them arithmetically (a distilled sketch of this pattern follows the list):

```
@triton.jit
def triton_poi_fused__scaled_mm__to_copy_constant_pad_nd_lift_fresh_2(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 400624
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 784
    x1 = (xindex // 784)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 769, tl.int64)
    tmp2 = tmp0 < tmp1
    tmp3 = tl.load(in_ptr0 + (x0 + (769*x1)), tmp2 & xmask, other=0.0)
    tmp4 = tmp3.to(tl.float8e4nv)
    tmp5 = tl.full(tmp4.shape, 0.0, tmp4.dtype)
    tmp6 = tl.where(tmp2, tmp4, tmp5)
    tl.store(out_ptr0 + (x2), tmp6, xmask)
```

- [INTERPRETER] Implement implicit tensor conversion for assignment operators (#4214).
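As a footnote to the "Support FP8 constant" (#4222) item above, the essential pattern can be distilled to a few lines. This is a hypothetical sketch rather than code from that commit; it assumes a Triton build that includes #4222 and a PyTorch version that provides `torch.float8_e4m3fn`.

```
# Hypothetical distillation of the FP8-constant pattern from #4222:
# the kernel only materializes and stores FP8 values; no FP8 arithmetic.
import torch
import triton
import triton.language as tl


@triton.jit
def fp8_fill_kernel(out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    zeros = tl.full([BLOCK], 0.0, tl.float8e4nv)  # FP8 constant creation
    tl.store(out_ptr + offs, zeros)


out = torch.empty(256, device="cuda", dtype=torch.float8_e4m3fn)
fp8_fill_kernel[(1,)](out, BLOCK=256)
```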
bertmaher pushed a commit to bertmaher/triton that referenced this pull request on Dec 10, 2024:
Recent refactoring broke i1 shared memory load.