Assertion failure in Linear Layouts when num_warps = 8, but passes with num_warps = 4 #5265

Closed
Moerafaat opened this issue Nov 27, 2024 · 6 comments

@Moerafaat (Contributor)

Describe the bug

To reproduce the issue, run the following Python test:

import torch
import triton
import triton.language as tl


@triton.jit
def repro_kernel(q_ref,
                 k_ref,
                 v_ref,
                 output_ptr,
                 ):
    # q is 64x128, k is 128x64, v is 64x128; all contiguous row-major bf16.
    offsets64 = tl.arange(0, 64)
    offsets128 = tl.arange(0, 128)
    q = tl.load(q_ref + (offsets64[:, None] * 128 + offsets128[None, :]))
    k = tl.load(k_ref + (offsets128[:, None] * 64 + offsets64[None, :]))
    # Cast the f32 result of the first dot to bf16 before feeding it to the second dot.
    qk = tl.dot(q, k).to(tl.bfloat16)
    v = tl.load(v_ref + (offsets64[:, None] * 128 + offsets128[None, :]))
    o = tl.dot(qk, v)
    tl.store(output_ptr + (offsets64[:, None] * 128 + offsets128[None, :]), o.to(tl.bfloat16))


def repro(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    output = torch.empty((64, 128), dtype=torch.bfloat16, device='cuda')
    grid = lambda meta: (1, 1)
    # num_warps=8 triggers the assertion failure; num_warps=4 does not.
    kernel = repro_kernel[grid](q, k, v, output, num_warps=8, num_ctas=1, num_stages=3)
    # print(kernel.asm['ttir'])
    return output

torch.manual_seed(0)
q = torch.ones((64, 128), dtype=torch.bfloat16, device='cuda')
k = torch.ones((128, 64), dtype=torch.bfloat16, device='cuda')
v = torch.ones((64, 128), dtype=torch.bfloat16, device='cuda')
output_torch = (q @ k) @ v
output_triton = repro(q, k, v)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')

You will encounter the following error:

python3: /tmp/triton/lib/Tools/LinearLayout.cpp:526: mlir::triton::LinearLayout mlir::triton::LinearLayout::reshapeOuts(llvm::ArrayRef<std::pair<mlir::StringAttr, int> >) const: Assertion `getTotalOutDimSize() == std::accumulate( newOutDims.begin(), newOutDims.end(), 1, [&](int32_t acc, auto &outDim) { return acc * outDim.second; })' failed.
Aborted
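
For context, the failing assertion in LinearLayout::reshapeOuts checks that the product of the requested new output-dimension sizes equals the layout's current total output size. Below is a minimal Python sketch of that consistency check; check_reshape_outs, total_out_dim_size, and new_out_dims are illustrative names only, not part of the Triton API:

from functools import reduce
import operator

def check_reshape_outs(total_out_dim_size, new_out_dims):
    # Mirrors the assert in LinearLayout::reshapeOuts: reshaping the output
    # dimensions must preserve the total number of output elements.
    product = reduce(operator.mul, (size for _, size in new_out_dims), 1)
    assert total_out_dim_size == product, (
        f"total out-dim size {total_out_dim_size} != "
        f"product of requested out dims {product}")

# Consistent reshape, e.g. out dims [offset (size 4096), iteration (size 1)]:
check_reshape_outs(4096, [("offset", 4096), ("iteration", 1)])  # passes
# A mismatched reshape (hypothetical sizes) would trip the assert:
# check_reshape_outs(8192, [("offset", 4096), ("iteration", 1)])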

I noticed that there was a similar report in #4727 before that issue was re-opened. Interestingly, the failure actually started with the commit linked to that issue; the culprit commit is 49266aa.

The test passes if num_warps is set to 4 instead of 8, and it worked properly before the culprit commit.
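
As a temporary workaround until a fix lands, the same repro runs correctly when the kernel is launched with num_warps=4; only the launch line changes:

# Workaround sketch: identical launch to the repro above, but with num_warps=4
# the kernel compiles and runs without hitting the assertion.
kernel = repro_kernel[grid](q, k, v, output, num_warps=4, num_ctas=1, num_stages=3)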

Environment details

The issue reproduces on H100 with the latest Triton main: commit 8b29bb7

Moerafaat added the bug label on Nov 27, 2024
@Jokeren (Contributor) commented Nov 27, 2024

Interesting. Taking a look now.

#4727 is about TMA, so it's not related.

Jokeren self-assigned this on Nov 27, 2024
@Jokeren (Contributor) commented Nov 27, 2024

FYI, I have a solution that works for it now using stmatrix; will upstream it soon.

With the fix, the out dims are [offset (size 4096), iteration (size 1)], and the repro prints matching outputs:
tensor([[8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        ...,
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.]], device='cuda:0',
       dtype=torch.bfloat16)
tensor([[8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        ...,
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.],
        [8192., 8192., 8192.,  ..., 8192., 8192., 8192.]], device='cuda:0',
       dtype=torch.bfloat16)

@Moerafaat (Contributor, Author)

Thanks! Really appreciate the fast reply on this and looking forward to your fix 🙏

@Jokeren (Contributor) commented Nov 28, 2024

#5277 is a partial fix. More general fixes will be pushed next week.

@Moerafaat (Contributor, Author)

> #5277 is a partial fix.

Tested it and it works great! Thanks for the fast turnaround!

@Moerafaat (Contributor, Author)

Marking this fixed. Thanks for the assistance!
