FP16 huggingface accuracy: 6 models failed #195

Closed
mengfei25 opened this issue May 8, 2024 · 6 comments

Comments

@mengfei25
Contributor

🐛 Describe the bug

Please refer to https://github.com/intel/torch-xpu-ops/actions/runs/8995661712/job/24710982088, where 6 models failed.
Error info
Run failed with return code: -11
Output: None
Error: None
============ Summary for huggingface float16 inference accuracy ============
num_total: 40 (should be 46)
num_passed: 39
num_failed: 1
pass_rate: 97.50%
============ Summary for huggingface float16 training accuracy ============
num_total: 40 (should be 46)
num_passed: 40
num_failed: 0
pass_rate: 100.00%

Versions

#188
Driver: 803.29 LTS
Bundle: 0.5.0
PyTorch: 2024-05-07 nightly release
XPU OPS: d110623

@etaf
Contributor

etaf commented May 9, 2024

The following reproducer segfaults with fp16 but passes with fp32:

import torch
from torch import tensor, device
import torch.fx as fx
from torch._dynamo.testing import rand_strided
from math import inf
import torch._inductor.inductor_prims

import torch._dynamo.config
import torch._inductor.config
import torch._functorch.config
import torch.fx.experimental._config

torch._inductor.config.fallback_random = True
torch._inductor.config.freezing = True
torch._inductor.config.triton.cudagraphs = True
torch._functorch.config.unlift_effect_tokens = True
torch._functorch.config.debug_partitioner = True

isolate_fails_code_str = None

from torch.nn import *
class Repro(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, arg0_1):
        # Reduced graph: aten.isnan followed by aten.any on the fp16 input.
        isnan = torch.ops.aten.isnan.default(arg0_1);  arg0_1 = None
        any_1 = torch.ops.aten.any.default(isnan);  isnan = None
        return (any_1,)

def load_args(reader):
    # Input: a (1, 1024, 1024) float16 tensor on XPU device 0 (2 MiB of storage).
    buf0 = reader.storage(None, 2097152, device=device(type='xpu', index=0), dtype_hint=torch.float16)
    reader.tensor(buf0, (1, 1024, 1024), dtype=torch.float16, is_leaf=True)  # arg0_1
load_args._version = 0
mod = Repro()
if __name__ == '__main__':
    from torch._dynamo.repro.after_aot import run_repro
    with torch.no_grad():
        run_repro(mod, load_args, accuracy=False, command='run', save_dir=None, tracing_mode='real', check_str=None)
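
For a quicker check, the same pattern can be driven through torch.compile directly. This is my own simplified sketch, not part of the minified repro, and it assumes a working XPU device; it may hit the same crash with float16 while float32 passes.

import torch

# Reduced pattern from the failing models: isnan followed by any on an fp16 tensor.
def has_nan(x):
    return torch.any(torch.isnan(x))

compiled_has_nan = torch.compile(has_nan)

x = torch.randn(1, 1024, 1024, device="xpu", dtype=torch.float16)
print(compiled_has_nan(x))  # segfaulted with float16, passed with float32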

@fengyuan14 fengyuan14 added the bug Something isn't working label May 9, 2024
@fengyuan14 fengyuan14 self-assigned this May 9, 2024
@riverliuintel riverliuintel added this to the PT2.4 milestone May 10, 2024
@etaf
Contributor

etaf commented May 10, 2024

The crash happens in Triton; I've submitted an issue to the team: intel/intel-xpu-backend-for-triton#1073

@fengyuan14 fengyuan14 removed their assignment May 10, 2024
@etaf
Contributor

etaf commented May 13, 2024

This issue has been moved to the IGC team; the JIRA ticket: https://jira.devtools.intel.com/browse/GSD-9082

@etaf
Contributor

etaf commented May 15, 2024

The fix for the segmentation fault will land in the next rolling driver and the next LTS driver.

@etaf
Contributor

etaf commented May 15, 2024

This PR pytorch/pytorch#126261 will make Inductor generate a hint for Triton that the input tensor is divisible_by_16, so that Triton can avoid the segfaulting path (see the illustrative sketch below).
@mengfei25 @chuanqi129 @riverliuintel @EikanWang Once this PR is merged, we expect the 6 failed models to pass the accuracy test.
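
As a rough illustration (my own sketch, not code from the PR or Inductor's actual heuristic): the divisible_by_16 hint essentially tells Triton that a kernel argument is divisible by 16 (pointer arguments by address, integer arguments by value), which lets the compiler choose an aligned code path instead of the one that crashed.

import torch

def looks_16_divisible(t: torch.Tensor) -> bool:
    # Hypothetical helper: checks the two facts the hint roughly conveys for a
    # tensor argument, a 16-byte-aligned data pointer and an element count that
    # is a multiple of 16.
    return t.data_ptr() % 16 == 0 and t.numel() % 16 == 0

x = torch.empty(1, 1024, 1024, dtype=torch.float16)
print(looks_16_divisible(x))  # True for the (1, 1024, 1024) fp16 repro input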

@etaf
Contributor

etaf commented May 15, 2024

@mengfei25 PR pytorch/pytorch#126261 has landed, please verify.
