[SYCL][HIP] Unresolved Assert/.* tests failures #7634

againull · 2022-12-05T04:59:23Z

LIT testing on the HIP backend is failing with unresolved tests.

Unresolved Tests (5):
SYCL :: Assert/assert_in_kernels.cpp
SYCL :: Assert/assert_in_multiple_tus.cpp
SYCL :: Assert/assert_in_multiple_tus_one_ndebug.cpp
SYCL :: Assert/assert_in_one_kernel.cpp
SYCL :: Assert/assert_in_simultaneously_multiple_tus_one_ndebug.cpp

Example:
https://github.com/intel/llvm/actions/runs/3616100865/jobs/6093960264

Error:

UNRESOLVED: SYCL :: Assert/assert_in_kernels.cpp (1031 of 1031)
******************** TEST 'SYCL :: Assert/assert_in_kernels.cpp' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
File "/__w/llvm/llvm/lit/lit/worker.py", line 76, in _execute_test_handle_errors
result = test.config.test_format.execute(test, lit_config)
File "/__w/llvm/llvm/lit/lit/formats/shtest.py", line 27, in execute
return lit.TestRunner.executeShTest(test, litConfig,
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 2005, in executeShTest
return _runShTest(test, litConfig, useExternalSh, script, tmpBase)
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 1966, in _runShTest
output = """Script:\n--\n%s\n--\nExit Code: %d\n""" % (
TypeError: %d format: a number is required, not NoneType
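
The `TypeError` at the bottom of the traceback is what Python raises when `%d` is handed `None`. Below is a minimal sketch of that failure, assuming the exit code reaches the formatting step as `None` (for example because the test process never reported one). `format_report` and `format_report_safe` are hypothetical helpers for illustration only, not lit's actual code.

```python
def format_report(script, exit_code):
    # Same shape as the format string shown in the traceback above.
    return "Script:\n--\n%s\n--\nExit Code: %d\n" % (script, exit_code)


def format_report_safe(script, exit_code):
    # %s accepts any object, so a missing exit code is reported instead of raising.
    shown = "none (process did not exit normally)" if exit_code is None else exit_code
    return "Script:\n--\n%s\n--\nExit Code: %s\n" % (script, shown)


try:
    format_report("echo hi", None)
except TypeError as err:
    # The exact wording of the message varies between Python versions.
    print("TypeError:", err)

print(format_report_safe("echo hi", None))
```

This would explain why the tests are reported as UNRESOLVED rather than FAILED: the exception comes from the reporting step, not from the test itself.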

pvchupin · 2022-12-07T20:32:08Z

I couldn't reproduce this problem on the same machine outside of CI, even with the same Docker image.
Runs succeed there, with the following in assert_in_kernels.cpp.tmp.gpu.txt:

SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.
SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.
:0:rocdevice.cpp            :2672: 248509772177 us: 156615: [tid:0x7fa71f6f7700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

I logged into the machine while CI was running the test and found that all these tests hang for a few minutes with only the first two lines in assert_in_kernels.cpp.tmp.gpu.txt:

SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.
SYCL/Assert/assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion `Buf[wiID] == 0 && "from assert statement"` failed.

We run cpu, then gpu, then acc, and the acc log doesn't exist. So I assume execution is stuck on the GPU at that point.

Also there is the following in dmesg:

[580980.987697] static-buffer-d[1046286]: segfault at 39 ip 00007f4b0fc7a871 sp 00007ffefa681d40 error 4 in libamdhip64.so.5.4.50400[7f4b0fbc1000+37e000]
[580980.987703] Code: 48 85 f6 0f 84 a6 00 00 00 4c 8d 2d 99 1c 8f 01 48 63 93 b0 00 00 00 4c 8d 25 bb 86 00 00 49 8b 45 00 48 8b 04 d0 48 8b 40 68 <48> 8b 40 18 48 8b 38 48 8b 07 48 8b 80 e8 00 00 00 4c 39 e0 0f 85

It seems that under some conditions test execution doesn't return normally and is killed from the outside.
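
The dmesg segfault line can be decoded a little further. The sketch below assumes an x86-64 Linux kernel, where the `error` field is the page-fault error code printed in hex (bit 0: page was present, bit 1: write access, bit 2: user mode). `decode_segfault` is a hypothetical helper written for this note.

```python
import re

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault message into its fields (x86-64 layout assumed)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip, base, err = int(m["ip"], 16), int(m["base"], 16), int(m["err"], 16)
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),
        # Offset of the faulting instruction inside the mapped library.
        "offset_in_library": ip - base,
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("[580980.987697] static-buffer-d[1046286]: segfault at 39 ip 00007f4b0fc7a871 "
        "sp 00007ffefa681d40 error 4 in libamdhip64.so.5.4.50400[7f4b0fbc1000+37e000]")
info = decode_segfault(line)
print(info["library"], hex(info["offset_in_library"]), hex(info["fault_address"]))
```

For the line above this gives offset 0xb9871 into libamdhip64.so.5.4.50400, and error 4 decodes to a user-mode read of an unmapped page at address 0x39, which looks like a near-null pointer dereference inside the HIP runtime. The offset could be fed to a symbolizer if debug symbols for that library build are available.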


bader · 2022-12-07T23:34:21Z

These tests seem to exceed the timeout limit:

Slowest Tests:

600.02s: SYCL :: Assert/assert_in_kernels.cpp
600.02s: SYCL :: Assert/assert_in_multiple_tus_one_ndebug.cpp
600.02s: SYCL :: Assert/assert_in_multiple_tus.cpp
600.02s: SYCL :: Assert/assert_in_one_kernel.cpp
600.02s: SYCL :: Assert/assert_in_simultaneously_multiple_tus_one_ndebug.cpp

600 sec is the limit for a single test - https://github.com/intel/llvm-test-suite/blob/intel/SYCL/lit.cfg.py#L430-L435.
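
To illustrate how a timeout can leave a test without an exit code, here is a minimal sketch using Python's `subprocess` module. It is not how lit enforces its per-test limit; `run_with_timeout` is a hypothetical helper, and the 0.5 s limit stands in for the 600 s one.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Return the exit code, or None if the command was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no exit code exists.
        return None


# Stand-in for a hanging test: sleeps far longer than the limit.
hang = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(hang, timeout_s=0.5))
print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

If the harness then formats that missing exit code with `%d`, it would produce exactly the `NoneType` error from the original report, though that link is an inference rather than something confirmed in the logs.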

bader · 2022-12-07T23:36:04Z

> Also there is the following in dmesg:
>
> [580980.987697] static-buffer-d[1046286]: segfault at 39 ip 00007f4b0fc7a871 sp 00007ffefa681d40 error 4 in libamdhip64.so.5.4.50400[7f4b0fbc1000+37e000]
> [580980.987703] Code: 48 85 f6 0f 84 a6 00 00 00 4c 8d 2d 99 1c 8f 01 48 63 93 b0 00 00 00 4c 8d 25 bb 86 00 00 49 8b 45 00 48 8b 04 d0 48 8b 40 68 <48> 8b 40 18 48 8b 38 48 8b 07 48 8b 80 e8 00 00 00 4c 39 e0 0f 85
>
> It seems in some conditions test execution doesn't return normally and killed from the outside.

I suggest we temporarily disable asserts on the HIP backend to minimize the impact on the CI system while we investigate the root cause.

pvchupin · 2022-12-07T23:42:05Z

Yes, they've been disabled in intel/llvm-test-suite#1441.


bader · 2023-09-23T20:36:58Z

I've added the cuda label because Assert/assert_in_multiple_tus_one_ndebug.cpp failed in a nightly run - https://github.com/intel/llvm/actions/runs/6281231172/job/17059577474.

JackAKirk · 2023-09-28T15:33:00Z

There are a few possible causes I've identified so far (for HIP). I have not been able to reproduce the specific issue either, though.

Does anyone know which ROCm version the CI was using when these test failures were reported, and which version it is using now? I find that there is an issue (with an admittedly different error output) with a corresponding HIP assert test on ROCm 4, but ROCm 5 versions work. BTW, there was another, unrelated HIP driver issue that made compiling at O0 fail; it has been fixed in ROCm 5.7.0.

Also, I see that the CI is using Ubuntu 22.04: you need ROCm 5.3.0 or later to be compatible with 22.04: ROCm/ROCm#1730.

I can see from the CI that it is using an AMD Radeon RX 6700 XT (gfx1031). gfx1031 is not officially supported by ROCm on Linux. I think I remember that in the past the CI used a gfx1030, which is officially supported. I don't know whether switching devices could have led to some CI issues, but I think it makes sense to use an officially supported AMD device on the CI. Lots of AMD GPUs that are not officially supported seem to work, at least to some degree, but using one for testing seems like a bad idea. Does anyone know when the gfx1031 started being used on the CI?

Thanks

JackAKirk · 2023-09-28T15:41:18Z

Also, depending on which point release of 22.04 you use, you may need a later ROCm version. For example, I noticed this:

"New in version 5.7.0: Ubuntu 22.04.3 support was added."

from https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus

steffenlarsen · 2023-10-03T11:50:04Z

@aelovikov-intel - Do you know the answer to @JackAKirk's question above?

aelovikov-intel · 2023-10-03T14:44:02Z

For our amdgpu-2 runner (I assume the same should be for the others, but haven't verified):

# ls -d /opt/rocm-*
/opt/rocm-4.5.1

JackAKirk · 2023-10-03T15:01:09Z

> For our amdgpu-2 runner (I assume the same should be for the others, but haven't verified):
>
> # ls -d /opt/rocm-*
> /opt/rocm-4.5.1

OK, that version of ROCm isn't supported on any version of Ubuntu 22.04 (which the CI is using). I suggest upgrading the CI to ROCm 5.7. Also, if possible, an officially supported ROCm GPU from this list should be used: https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus.
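
The version comparison discussed here can be sketched as a small check over the install directories reported by `ls -d /opt/rocm-*`. The 5.3.0 minimum is taken from the earlier comment in this thread and is not independently verified; `parse_rocm_version`, `meets_minimum`, and the `/opt/rocm-5.7.0` path are hypothetical, for illustration only.

```python
import re

# Minimum ROCm version for Ubuntu 22.04 as stated earlier in this thread.
MIN_ROCM_FOR_UBUNTU_22_04 = (5, 3, 0)


def parse_rocm_version(path):
    """Extract (major, minor, patch) from a path such as /opt/rocm-4.5.1."""
    m = re.search(r"rocm-(\d+)\.(\d+)\.(\d+)$", path)
    if m is None:
        raise ValueError(f"no ROCm version in {path!r}")
    return tuple(int(part) for part in m.groups())


def meets_minimum(path, minimum=MIN_ROCM_FOR_UBUNTU_22_04):
    # Tuples compare element by element, so (4, 5, 1) < (5, 3, 0).
    return parse_rocm_version(path) >= minimum


for install in ["/opt/rocm-4.5.1", "/opt/rocm-5.7.0"]:
    print(install, meets_minimum(install))
```

Comparing integer tuples rather than strings avoids the usual pitfall where "5.10.0" sorts before "5.3.0".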

jinz2014 · 2023-10-04T16:50:38Z

I tried to build and run the test (assert_in_kernels.cpp) on an MI100 GPU with ROCm 5.7 and CentOS 8.

The output message:

./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
:0:rocdevice.cpp :2692: 6204498005977 us: [pid:2254935 tid:0x153e8b359700] Callback: Queue 0x153a70a00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)

JackAKirk · 2023-10-23T13:06:31Z

> I tried to build and run the test (assert_in_kernels.cpp) on an MI100 GPU. rocm 5.7 and centos 8
>
> The output message:
>
> ./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
> ./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [2,0,0], local id: [2,0,0] Assertion Buf[wiID] == 0 && "from assert statement" failed.
> :0:rocdevice.cpp :2692: 6204498005977 us: [pid:2254935 tid:0x153e8b359700] Callback: Queue 0x153a70a00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
> Aborted (core dumped)

I think this is the expected behavior, right? At least the assert part.

JackAKirk · 2023-11-24T15:37:13Z

@aelovikov-intel

We think that the hangs could be due to missing PCIe atomics on the CI machine's bus.
PCIe atomics are stated to be required for ROCm:

https://docs.amd.com/en/docs-5.6.0/release/gpu_os_support.html#cpu-support

And we think that assert is one place where PCIe atomics are used.

Would it be possible to send us the output of

lspci -vv

on the CI (with sudo) to confirm this?

Thanks
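
For comparing runners, the relevant lines of `lspci -vv` can be pulled out with a short script. This is a sketch that assumes the `AtomicOpsCap:` line format printed by recent pciutils under `DevCap2`; the sample text is invented for illustration and is not output from the CI runners. `atomic_caps` is a hypothetical helper.

```python
import re

# Device header lines start at column 0 with an optional domain and bus:dev.fn.
DEVICE_RE = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s")
FLAG_RE = re.compile(r"(\w+)([+-])")


def atomic_caps(lspci_text):
    """Map each device address to its AtomicOpsCap flags (True means '+')."""
    caps = {}
    device = None
    for line in lspci_text.splitlines():
        if DEVICE_RE.match(line):
            device = line.split()[0]
        elif device and "AtomicOpsCap:" in line:
            flags = line.split("AtomicOpsCap:", 1)[1]
            caps[device] = {name: sign == "+" for name, sign in FLAG_RE.findall(flags)}
    return caps


# Invented sample in the assumed format, not real runner output.
SAMPLE = """\
00:01.1 PCI bridge: Example root port
\t\tDevCap2: Completion Timeout: Range ABCD, TimeoutDis+
\t\t\t AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
03:00.0 VGA compatible controller: Example GPU
\t\tDevCap2: Completion Timeout: Not Supported, TimeoutDis-
\t\t\t AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
"""

for dev, flags in atomic_caps(SAMPLE).items():
    print(dev, flags)
```

Running this over the output from each runner would make it easy to diff which bridges and endpoints report atomics support.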

aelovikov-intel · 2023-11-27T18:14:16Z

On amdgpu-3 runner natively (not inside docker image): lspci.txt
lspci_root.txt

JackAKirk · 2023-11-28T13:38:07Z

> On amdgpu-3 runner natively (not inside docker image): lspci.txt lspci_root.txt

Thanks, that's very interesting. The "Internal" PCIe entry, which we guessed would be the relevant hardware for the assert, is marked negative for all atomics (other PCIe hardware is marked positive), but we are not 100% sure about this. However, I've realized that amdgpu-3 is the runner that passed for all the assert runs I made. It also has a CPU that I had expected to fully support the relevant PCIe atomics.
amdgpu-4 is the one that seems to be leading to the assert timeouts. Would it be possible for you to also post an lspci_root.txt for amdgpu-4? It would be very useful to compare the output of the two runners.

Am I right that the only two runners used in the AMD CI are amdgpu-3 and amdgpu-4?

Many thanks for your help with this.

aelovikov-intel · 2023-11-28T17:40:03Z

lspci_root_amdgpu-4.txt

JackAKirk · 2023-11-29T15:15:57Z

> lspci_root_amdgpu-4.txt

It seems to say that neither runner supports atomics. There is little difference between their outputs.
Could you also post the output of

lspci -t -vv

for both runners?

Thanks


againull added bug Something isn't working hip Issues related to execution on HIP backend. labels Dec 5, 2022

againull mentioned this issue Dec 5, 2022

[SYCL] Implement queue::ext_oneapi_empty() API to get queue status #7583

Merged

AlexeySachkov mentioned this issue Dec 5, 2022

[SYCL] Refactor unit-tests to ignore fallback libs #7534

Merged

steffenlarsen mentioned this issue Dec 5, 2022

[SYCL] Fix accessor CTAD for unittests #7638

Merged

AlexeySachkov mentioned this issue Dec 6, 2022

[SYCL] Remove customizations from legacy pipeine #7392

Merged

againull mentioned this issue Dec 6, 2022

[SYCL] Add a unittest for is_compatible #7619

Merged

aelovikov-intel mentioned this issue Dec 6, 2022

[SYCL][Level Zero] Implement sycl_ext_intel_queue_index extension #7599

Merged

pvchupin mentioned this issue Dec 6, 2022

[SYCL][NFC] Fix documentation typo intel/llvm-test-suite#1437

Merged

aelovikov-intel mentioned this issue Dec 7, 2022

[NFC][SYCL] Switch to std:: equivalents for utilities in stl_type_traits.hpp #7668

Merged

v-klochkov mentioned this issue Dec 7, 2022

[SYCL] Update support of online_compiler; advance default GPU device #7674

Merged

KseniyaTikhomirova mentioned this issue Dec 7, 2022

[SYCL] Defer buffer release when no host memory to be updated #6837

Merged

aelovikov-intel added a commit to aelovikov-intel/llvm-test-suite that referenced this issue Dec 7, 2022

[SYCL] Disable failing SYCL/Assert tests on HIP

4767521

intel/llvm#7634

This was referenced Dec 7, 2022

[SYCL] Disable failing SYCL/Assert tests on HIP intel/llvm-test-suite#1441

Merged

[NFC][SYCL] Use "inline constexpr" unconditionally #7688

Merged

pvchupin pushed a commit to intel/llvm-test-suite that referenced this issue Dec 7, 2022

[SYCL] Disable failing SYCL/Assert tests on HIP (#1441)

cb4fea5

intel/llvm#7634

steffenlarsen mentioned this issue Dec 8, 2022

[SYCL][Fusion] JIT compiler kernel fusion passes #7661

Merged

aelovikov-intel mentioned this issue Dec 8, 2022

[NFC][SYCL] Remove checks for C++17 #7687

Merged

aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this issue Mar 27, 2023

[SYCL] Disable failing SYCL/Assert tests on HIP (intel/llvm-test-suit…

41b269a

…e#1441) intel#7634

bader added the cuda CUDA back-end label Sep 23, 2023

JackAKirk mentioned this issue Oct 3, 2023

DeviceLib/built-ins/ftz-flag.cpp timeout on HIP backend #11378

Open

JackAKirk mentioned this issue Oct 23, 2023

[AMD] Investigate test failures for different hardware oneapi-src/unified-runtime#985

Open

npmiller referenced this issue Jan 31, 2024

[NFC][SYCL] Set some target restrictions for some tests (#12505)

a934d57

A few tests in the driver area require amdgpu or nvptx targets to be built in order to properly run. Add these requirements to the tests.

JackAKirk mentioned this issue Mar 8, 2024

[HIP] Mark assert tests unsupported / 2dmem tests Xfail. #12955

Closed

JackAKirk mentioned this issue Oct 31, 2024

[SYCL][P2P] Fix info query for P2P #15873

Merged

aarongreig mentioned this issue Nov 1, 2024

Add device info query to report support for native asserts. oneapi-src/unified-runtime#2269

Open

JackAKirk mentioned this issue Feb 11, 2025

bindless_images/sampling_3D.cpp and bindless_images/sampling_2D.cpp tests failing with UR_RESULT_ERROR_UNSUPPORTED_FEATURE on HIP/AMD #16933

Open
