FP16 huggingface accuracy: 6 models failed #195

Closed
mengfei25 opened this issue May 8, 2024 · 6 comments

Comments

@mengfei25
Contributor

🐛 Describe the bug

Please refer to https://github.com/intel/torch-xpu-ops/actions/runs/8995661712/job/24710982088, where 6 models failed.
Error info
Run failed with return code: -11
Output: None
Error: None
============ Summary for huggingface float16 inference accuracy ============
num_total: 40 (should be 46)
num_passed: 39
num_failed: 1
pass_rate: 97.50%
============ Summary for huggingface float16 training accuracy ============
num_total: 40 (should be 46)
num_passed: 40
num_failed: 0
pass_rate: 100.00%

Versions

#188
Driver: 803.29 LTS
Bundle: 0.5.0
PyTorch: 2024-05-07 nightly release
XPU OPS: d110623

@etaf
Contributor

etaf commented May 9, 2024

The following reproducer segfaults with fp16 but passes with fp32:

import torch
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
import torch._inductor.inductor_prims

import torch._dynamo.config
import torch._inductor.config
import torch._functorch.config
import torch.fx.experimental._config

torch._inductor.config.fallback_random = True
torch._inductor.config.freezing = True
torch._inductor.config.triton.cudagraphs = True
torch._functorch.config.unlift_effect_tokens = True
torch._functorch.config.debug_partitioner = True

isolate_fails_code_str = None

from torch.nn import *
class Repro(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, arg0_1):
        # Reduced graph: aten.isnan followed by aten.any on the fp16 input.
        isnan = torch.ops.aten.isnan.default(arg0_1);  arg0_1 = None
        any_1 = torch.ops.aten.any.default(isnan);  isnan = None
        return (any_1,)

def load_args(reader):
    # Input: a (1, 1024, 1024) float16 tensor on XPU device 0 (2 MiB of storage).
    buf0 = reader.storage(None, 2097152, device=device(type='xpu', index=0), dtype_hint=torch.float16)
    reader.tensor(buf0, (1, 1024, 1024), dtype=torch.float16, is_leaf=True)  # arg0_1
load_args._version = 0
mod = Repro()
if __name__ == '__main__':
    from torch._dynamo.repro.after_aot import run_repro
    with torch.no_grad():
        run_repro(mod, load_args, accuracy=False, command='run', save_dir=None, tracing_mode='real', check_str=None)
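
For a quicker check, the same pattern can be driven through torch.compile directly. This is my own simplified sketch, not part of the minified repro, and it assumes a working XPU device; it may hit the same crash with float16 while float32 passes.

import torch

# Reduced pattern from the failing models: isnan followed by any on an fp16 tensor.
def has_nan(x):
    return torch.any(torch.isnan(x))

compiled_has_nan = torch.compile(has_nan)

x = torch.randn(1, 1024, 1024, device="xpu", dtype=torch.float16)
print(compiled_has_nan(x))  # segfaulted with float16, passed with float32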

@fengyuan14 fengyuan14 added the bug Something isn't working label May 9, 2024
@fengyuan14 fengyuan14 self-assigned this May 9, 2024
@riverliuintel riverliuintel added this to the PT2.4 milestone May 10, 2024
@etaf
Contributor

etaf commented May 10, 2024

The crash happens in Triton; I've submitted an issue to the team: intel/intel-xpu-backend-for-triton#1073

@fengyuan14 fengyuan14 removed their assignment May 10, 2024
@etaf
Contributor

etaf commented May 13, 2024

This issue has been moved to the IGC team; the JIRA ticket: https://jira.devtools.intel.com/browse/GSD-9082

@etaf
Contributor

etaf commented May 15, 2024

The fix for the segmentation fault will land in the next rolling driver and the next LTS driver.

@etaf
Contributor

etaf commented May 15, 2024

This PR pytorch/pytorch#126261 will make Inductor generate a hint for Triton that the input tensor is divisible_by_16, so that Triton can avoid the segfaulting path (see the illustrative sketch below).
@mengfei25 @chuanqi129 @riverliuintel @EikanWang Once this PR is merged, we expect the 6 failed models to pass the accuracy test.
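
As a rough illustration (my own sketch, not code from the PR or Inductor's actual heuristic): the divisible_by_16 hint essentially tells Triton that a kernel argument is divisible by 16 (pointer arguments by address, integer arguments by value), which lets the compiler choose an aligned code path instead of the one that crashed.

import torch

def looks_16_divisible(t: torch.Tensor) -> bool:
    # Hypothetical helper: checks the two facts the hint roughly conveys for a
    # tensor argument, a 16-byte-aligned data pointer and an element count that
    # is a multiple of 16.
    return t.data_ptr() % 16 == 0 and t.numel() % 16 == 0

x = torch.empty(1, 1024, 1024, dtype=torch.float16)
print(looks_16_divisible(x))  # True for the (1, 1024, 1024) fp16 repro input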

@etaf
Contributor

etaf commented May 15, 2024

@mengfei25 PR pytorch/pytorch#126261 has landed, please verify.
