
Performance Evaluation methods? It's not very clear. #6

Closed · MaryRand opened this issue Sep 14, 2016 · 5 comments

@MaryRand

I'd like to do some performance evaluation of the ISAAC BLAS library.

For example, clBLAS provides clBLAS-client, which produces the following output:
```
./clBLAS-client
StatisticalTimer:: Pruning 0 samples from clfunc
StatisticalTimer:: Pruning 0 samples from clGemm
BLAS kernel execution time < ns >: 116170
BLAS kernel execution Gflops < 2.0*M*N*K/time >: 36.1049
```
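(As a consistency check on the formula: the output above corresponds to 2·M·N·K = 4,194,304 floating-point operations, i.e. a 128×128×128 GEMM, and 4,194,304 FLOPs / 116,170 ns ≈ 36.1 GFLOPS, matching the reported figure.)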

On the other hand, it is not very obvious how to do a quick performance evaluation using ISAAC.

Even if there is no tool like clBLAS-client, can you provide info on how to run some evaluations of SGEMM and DGEMM for a few sizes? If you have some C++ code that does this, would you provide or send it? [email protected]

Thank you, I appreciate it.

ptillet (Collaborator) commented Sep 14, 2016

Hello,

You can use the bench-blas executable included in the package. If CMake detects other BLAS implementations on your computer (clBLAS, OpenBLAS, cuBLAS...), it will benchmark against those.
Usage: `${BUILD_DIR}/bench/bench-blas gemm`

Alternatively, you can link whatever executable you want against Isaac instead of clBLAS. It'll work for BLAS1, GEMV and GEMM.
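For anyone who wants a standalone C++ harness along those lines, here is a minimal, hypothetical timing sketch. It uses only the standard clBLAS entry points (clblasSetup, clblasSgemm), so the same source can be linked against either clBLAS or ISAAC's clBLAS-compatible library; the sizes, device choice, and absence of error checking are simplifications, and a real benchmark should average several runs the way clBLAS-client does:

```
// sgemm_bench.cpp -- sketch only: no error checking, single timed run.
#include <CL/cl.h>
#include <clBLAS.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t M = 5000, N = 5000, K = 5000;   // one square instance

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);
    clblasSetup();

    std::vector<float> hA(M * K, 1.f), hB(K * N, 1.f), hC(M * N, 0.f);
    cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR,
                              M * K * sizeof(float), hA.data(), nullptr);
    cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR,
                              K * N * sizeof(float), hB.data(), nullptr);
    cl_mem C = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              M * N * sizeof(float), hC.data(), nullptr);

    cl_event event;
    // Warm-up call: lets the library select/compile kernels before timing.
    clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K,
                1.f, A, 0, M, B, 0, K, 0.f, C, 0, M,
                1, &queue, 0, nullptr, &event);
    clWaitForEvents(1, &event);

    auto t0 = std::chrono::high_resolution_clock::now();
    clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K,
                1.f, A, 0, M, B, 0, K, 0.f, C, 0, M,
                1, &queue, 0, nullptr, &event);
    clWaitForEvents(1, &event);
    auto t1 = std::chrono::high_resolution_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("SGEMM %zux%zux%zu: %.1f GFLOPS\n",
                M, N, K, 2.0 * M * N * K / s * 1e-9);

    clblasTeardown();
    return 0;
}
```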

@MaryRand

Quick questions:
(1) Is there an option to specify single or double precision for GEMM, i.e. SGEMM vs. DGEMM?
(2) Can I run just one "N" instance, e.g. square 5000 x 5000?
(3) I see that ISAAC is outperforming clBLAS below, 395 vs. 123. I interpreted that right, right?
(4) Compared to ViennaCL, what is the high-level difference?

Thanks for the helpful answers!

```
./bench-blas 0 gemm
#Benchmark : BLAS
#----------------
#gemm (GFLOPS)
"N" "ISAAC" "clBLAS"
"square896" 395 123
"square2560" 390 117
"conv1" 251 59
"conv2" 326 124
"conv3" 225 109
"conv4" 286 103
"conv5" 214 93
"ica32" 83 13
"ica256" 337 64
"32rank1-4096" 275 111
"32rank1-3456" 270 116
"32rank1-896" 179 97
```

ptillet (Collaborator) commented Sep 14, 2016

(1) Although DGEMM is supported, I haven't had time to run the auto-tuner for double precision on all existing architectures. I will make it more easily benchmarkable once everything is included.
(2) For now, you can edit bench/blas.cpp to add the shapes you want. Ideally, I should indeed provide a config file that lets one benchmark ISAAC more easily for arbitrary shapes. The shapes included in this benchmark are: square matrices; those found in 5 layers of AlexNet convolutions (if you do im2col); shapes found in covariance/ICA computation; and shapes found in SVD (32 rank-1 updates on square matrices of 896, 3456, and 4096 elements).
(3) Yes, that's correct! 395 GFLOPS vs. 123 GFLOPS.
(4) Oh, many things! I wrote the BLAS3 kernels for ViennaCL, actually. The most notable difference is that ViennaCL is tuned for square matrices, while ISAAC uses a machine learning model so that it is tuned for any input shape. ViennaCL's GEMM also has less efficient bounds checking (ISAAC uses tricky pointer arithmetic, ViennaCL uses cleanup kernels).
Ultimately, the device databases of ViennaCL, clBLAS, etc. are data structures that map compute devices to kernel parameters, while ISAAC's maps compute devices to a model that predicts kernel parameters given input shapes (a toy sketch of the distinction follows below).
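To make that distinction concrete, here is a toy sketch; the types and the hard-coded "model" are invented for illustration and are not ISAAC's actual internals:

```
// Toy illustration only -- invented types; not ISAAC's actual internals.
#include <cstdio>

struct KernelParams { int tileM, tileN, tileK; };

// ViennaCL/clBLAS style: one fixed, device-specific parameter set,
// typically tuned on large square matrices.
KernelParams fixedLookup() { return {128, 128, 16}; }

// ISAAC style: the device entry is a predictor evaluated per shape
// (here a trivial hand-written rule standing in for the learned model).
KernelParams predict(int M, int N, int K) {
    return { M >= 1024 ? 128 : 32,   // skinny dimensions get small tiles
             N >= 1024 ? 128 : 32,
             K >=  256 ?  16 :  8 };
}

int main() {
    KernelParams a = fixedLookup();           // same answer for every shape
    KernelParams b = predict(32, 32, 4096);   // rank-update-like shape
    std::printf("fixed %dx%dx%d, shape-aware %dx%dx%d\n",
                a.tileM, a.tileN, a.tileK, b.tileM, b.tileN, b.tileK);
    return 0;
}
```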

@MaryRand

Great. Thank you for the information!


ptillet (Collaborator) commented Sep 14, 2016

You're welcome :)

ptillet closed this as completed Sep 14, 2016
goostavz pushed a commit to goostavz/triton that referenced this issue Aug 4, 2023
ThomasRaoux pushed a commit that referenced this issue Mar 13, 2024
There are two tests that failed under AddressSanitizer:
* test/TritonGPU/loop-pipeline.mlir
* python/test/regression/test_functional_regressions.py

with an error: 

```
==8475==ERROR: AddressSanitizer: heap-use-after-free on address 0x50c000bd0be0 at pc 0x557b03278847 bp 0x7ffd69b2c4a0 sp 0x7ffd69b2c498
READ of size 8 at 0x50c000bd0be0 thread T0
    #0 0x557b03278846 in getNextOperandUsingThisValue [third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:43](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h?l=43&ws=aliia/3018&snapshot=215):58
    #1 0x557b03278846 in operator++ [third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:322](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h?l=322&ws=aliia/3018&snapshot=215):39
    #2 0x557b03278846 in mlir::ResultRange::UseIterator::operator++() [third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp:614](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp?l=614&ws=aliia/3018&snapshot=215):5
    #3 0x557affde38c4 in operator++ [third_party/llvm/llvm-project/llvm/include/llvm/ADT/iterator.h:281](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/include/llvm/ADT/iterator.h?l=281&ws=aliia/3018&snapshot=215):5
    #4 0x557affde38c4 in createAsyncCopy [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:117](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=117&ws=aliia/3018&snapshot=215):26
    #5 0x557affde38c4 in createAsyncLoad [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:135](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=135&ws=aliia/3018&snapshot=215):3
    #6 0x557affde38c4 in createAsynOps [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:501](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=501&ws=aliia/3018&snapshot=215):5
    #7 0x557affde38c4 in mlir::triton::preProcessLoopAndGetSchedule(mlir::scf::ForOp&, int, mlir::triton::PipeliningOption&) [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:740](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=740&ws=aliia/3018&snapshot=215):7
    #8 0x557affe01c0c in pipelineLoop [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/SoftwarePipeliner.cpp:76](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/SoftwarePipeliner.cpp?l=76&ws=aliia/3018&snapshot=215):19
...
```
This is likely happening due to the iterator being invalidated after
`alloc.erase()`. This PR moves the erasure of allocations outside of the
loop, fixing the heap-use-after-free issue.
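
A minimal sketch of the deferred-erase pattern this fix applies; the range, the "safe to remove" check, and the function name below are hypothetical, not the actual MatmulLoopPipeline code:

```
#include "mlir/IR/Operation.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"

// Erasing an op while walking a range that still refers to it is what
// produces the heap-use-after-free above; collecting candidates first and
// erasing after the loop keeps the iteration valid.
void eraseDeadAllocs(llvm::ArrayRef<mlir::Operation *> allocOps) {
  llvm::SmallVector<mlir::Operation *> toErase;
  for (mlir::Operation *alloc : allocOps)
    if (alloc->use_empty())        // hypothetical "safe to remove" check
      toErase.push_back(alloc);    // defer instead of alloc->erase() here
  for (mlir::Operation *op : toErase)
    op->erase();
}
```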

Do you know if there is an easy way to run the tests under sanitizers
upstream? It would be handy if we could automate it, so we catch this kind
of error early on.
htyu pushed a commit to htyu/triton that referenced this issue Mar 20, 2024
jlebar pushed a commit that referenced this issue Jun 21, 2024
When running
[convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924),
Triton ends up computing the rank of a matrix with 0 columns during linear
layout lowering, which trips up f2reduce and causes undefined behavior,
detectable through
[UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling
f2reduce.
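
A sketch of what such an early-return guard can look like; the wrapper name, the single-word-per-row assumption, and the rank computation are illustrative, not the actual LinearLayout.cpp code:

```
#include <cstdint>

namespace f2reduce {
// Declaration matching the call in the stack trace below.
void inplace_rref_strided(uint64_t *matrix, uint64_t rows, uint64_t cols,
                          uint64_t stride);
}

uint64_t getMatrixRankGuarded(uint64_t *matrix, uint64_t numRows,
                              uint64_t numCols, uint64_t stride) {
  // A matrix with no rows or no columns has rank 0; bail out before
  // f2reduce, whose shift arithmetic assumes at least one column.
  if (numRows == 0 || numCols == 0)
    return 0;
  f2reduce::inplace_rref_strided(matrix, numRows, numCols, stride);
  // After row reduction, rank = number of non-zero rows (assuming each
  // row fits in one 64-bit word, i.e. numCols <= 64).
  uint64_t rank = 0;
  for (uint64_t r = 0; r < numRows; ++r)
    if (matrix[r * stride] != 0)
      ++rank;
  return rank;
}
```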

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 
```
</p>
</details>
ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this issue Aug 5, 2024
…fset/emitIndices of dot layout. (triton-lang#1537)

Support repCluster in emitOffset/emitIndices utilis for dot layout.

---------

Signed-off-by: Tiotto, Ettore <[email protected]>
Co-authored-by: Tiotto, Ettore <[email protected]>
oraluben pushed a commit to oraluben/triton that referenced this issue Sep 11, 2024
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use axis info in memory op lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Mark test_ptx_cast as enabled for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support umulhi operation.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.clamp, tl.minimum, tl.maximum.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add enable_fp_fusion opt for CPU (only affects ASM dump now).

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix kernel args passing for propagated constants.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add permutations support.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 2-D transfer_read/transfer_write lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Introduce shape info analysis and use it for loads/stores by block pointers.

Delay scalar pointers lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 'other' arg for loads.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.join.

Signed-off-by: Ilya Enkovich <[email protected]>

* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
gglin001 pushed a commit to gglin001/triton that referenced this issue Nov 13, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this issue Dec 10, 2024
stephen-huan pushed a commit to stephen-huan/triton that referenced this issue Dec 24, 2024