[SYCL][HIP] Unresolved Assert/.* tests failures #7634
Comments
I couldn't reproduce this problem on the same machine outside of CI, even with the same docker image.
I logged into the machine while CI was running the test and found that all these tests hang for a few minutes with only the first 2 lines in assert_in_kernels.cpp.tmp.gpu.txt:
The test runs on cpu, then gpu, then acc, and the acc log doesn't exist, so I assume execution is stuck on the GPU at that point. dmesg also shows the following:
It seems that under some conditions the test doesn't return normally and is killed from the outside. |
These tests seem to exceed the timeout limit:
600 sec is the limit for a single test - https://github.com/intel/llvm-test-suite/blob/intel/SYCL/lit.cfg.py#L430-L435. |
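For illustration, here is a minimal Python sketch of a per-test timeout: the child is killed from the outside once the limit expires. This is not lit's actual implementation. The helper name `run_with_timeout` and the choice to return `None` as the exit code on timeout are assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd and return (exit_code, timed_out).

    exit_code is None when the child had to be killed (hypothetical convention).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_sec), False
    except subprocess.TimeoutExpired:
        proc.kill()  # the child did not return on its own, so kill it
        proc.wait()  # reap the killed process
        return None, True


if __name__ == "__main__":
    # A child that sleeps longer than the limit stands in for a hung test.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_sec=1))
```

With a 600 sec limit, a test that hangs on the GPU would occupy the runner for the full ten minutes before being killed.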
I suggest we temporarily disable asserts on the HIP backend to minimize the impact on the CI system while we investigate the root cause. |
Yes, they've been disabled at intel/llvm-test-suite#1441 |
I've added |
I've identified a few possible causes so far (for HIP), although I haven't been able to reproduce the specific issue either. Does anyone know which ROCm version the CI was using when these test failures were reported, and which version it is using now? I find that there is an issue (with an admittedly different error output) with a corresponding HIP assert test on ROCm 4, but ROCm 5 versions work. BTW, there was another, unrelated HIP driver issue that made compiling at O0 fail; it has been fixed in ROCm 5.7.0.
Also, I see that the CI is using Ubuntu 22.04: you need ROCm 5.3.0 or later to be compatible with 22.04: ROCm/ROCm#1730.
I can see from the CI that it is using an AMD Radeon RX 6700 XT (gfx1031). gfx1031 is not officially supported by ROCm on Linux. I think I remember that the CI used to use a gfx1030, which is officially supported. I don't know whether switching devices could have led to some CI issues, but I think it makes sense to use an officially supported AMD device on the CI. Lots of AMD GPUs that aren't officially supported seem to work, at least to some degree, but using one for testing seems like a bad idea. Does someone know when the gfx1031 started being used on the CI? Thanks |
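As a rough illustration of the version constraint mentioned above, the following Python sketch checks a ROCm version string against the 5.3.0 minimum for Ubuntu 22.04. The threshold comes from the comment above (ROCm/ROCm#1730). The function names are hypothetical, and the check ignores the fact that particular 22.04 point releases may need a later ROCm version.

```python
# Minimum ROCm version for Ubuntu 22.04, as stated in the comment above.
MIN_ROCM_FOR_UBUNTU_2204 = (5, 3, 0)


def parse_version(text):
    """Turn a dotted version string such as '5.7.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def rocm_ok_for_ubuntu_2204(rocm_version):
    """Return True if rocm_version meets the 5.3.0 minimum (hypothetical helper)."""
    version = parse_version(rocm_version)
    # Pad short versions like '5.3' to three components before comparing.
    version = version + (0,) * (3 - len(version))
    return version >= MIN_ROCM_FOR_UBUNTU_2204


if __name__ == "__main__":
    for candidate in ("4.5.2", "5.2.3", "5.3", "5.7.0"):
        print(candidate, rocm_ok_for_ubuntu_2204(candidate))
```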
Also, depending on which point release of 22.04 you use, you may need a later ROCm version. E.g. I noticed this: " New in version 5.7.0:
" from https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus |
@aelovikov-intel - Do you know the answer to @JackAKirk's question above? |
For our
|
OK, that version of ROCm isn't supported on any version of Ubuntu 22.04 (which the CI is using). I suggest upgrading the CI to ROCm 5.7. Also, if possible, an officially supported ROCm GPU from this list should be used: https://docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus. |
I tried to build and run the test (assert_in_kernels.cpp) on an MI100 GPU. The output message: ./assert_in_kernels.hpp:25: void kernelFunc2(int *, int): global id: [0,0,0], local id: [0,0,0] Assertion |
I think this is the expected behavior, right? At least the assert part. |
We think that the hangs could be due to missing PCIe atomics on the CI bus. https://docs.amd.com/en/docs-5.6.0/release/gpu_os_support.html#cpu-support And we think that Would it be possible to send us the output of
on the CI (with sudo) to confirm this? Thanks |
On amdgpu-3 runner natively (not inside docker image): lspci.txt |
Thanks, that's very interesting. The "Internal" PCIe device is marked negative for all atomics (whereas other PCIe hardware is marked positive). We guessed that this would be the relevant hardware for the assert, but we are not 100% sure. However, I've realized that amdgpu-3 is the runner that passed for all the assert runs I made. Its CPU is also one that I had expected to fully support the relevant PCIe atomics. Am I right that the only two runners used in the AMD CI are amdgpu-4 and amdgpu-3? Many thanks for your help with this. |
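To compare runners without reading the full dump by eye, a small Python sketch like the one below could summarize the atomics flags per device. It assumes the dump is verbose `lspci` output (for example from `lspci -vvv`, which is an assumption, since the exact command is not shown above) in which each device's capabilities include an `AtomicOpsCap:` line. The sample text is made up for the example and is not taken from lspci.txt.

```python
import re

# Matches tokens such as '32bit+' or '128bitCAS-'.
FLAG = re.compile(r"(\w+)([+-])")


def atomic_caps(lspci_text):
    """Map each device header line to its AtomicOpsCap flags, e.g. {'32bit': True}."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            device = line.strip()  # unindented lines start a new device
        elif device and "AtomicOpsCap:" in line:
            flags = line.split("AtomicOpsCap:", 1)[1]
            result[device] = {name: sign == "+" for name, sign in FLAG.findall(flags)}
    return result


# Made-up sample in the assumed format; not real output from the CI runners.
SAMPLE = (
    "00:01.1 PCI bridge: Example root port (made-up device)\n"
    "\t\tDevCap2: Completion Timeout: Range ABCD, TimeoutDis+\n"
    "\t\t\t AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-\n"
    "03:00.0 VGA compatible controller: Example GPU (made-up device)\n"
    "\t\t\t AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-\n"
)

if __name__ == "__main__":
    for dev, flags in atomic_caps(SAMPLE).items():
        print(dev, flags)
```

Running it on the dumps from both runners would make any difference in atomics support between amdgpu-3 and amdgpu-4 easy to spot.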
It seems to be saying that both runners do not support atomics. There is little difference in their output.
for both runners? Thanks |
A few tests in the driver area require amdgpu or nvptx targets to be built in order to properly run. Add these requirements to the tests.
LIT testing on HIP backend is failing:
Unresolved tests
Unresolved Tests (5):
SYCL :: Assert/assert_in_kernels.cpp
SYCL :: Assert/assert_in_multiple_tus.cpp
SYCL :: Assert/assert_in_multiple_tus_one_ndebug.cpp
SYCL :: Assert/assert_in_one_kernel.cpp
SYCL :: Assert/assert_in_simultaneously_multiple_tus_one_ndebug.cpp
Example:
https://github.com/intel/llvm/actions/runs/3616100865/jobs/6093960264
Error:
UNRESOLVED: SYCL :: Assert/assert_in_kernels.cpp (1031 of 1031)
******************** TEST 'SYCL :: Assert/assert_in_kernels.cpp' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
File "/__w/llvm/llvm/lit/lit/worker.py", line 76, in _execute_test_handle_errors
result = test.config.test_format.execute(test, lit_config)
File "/__w/llvm/llvm/lit/lit/formats/shtest.py", line 27, in execute
return lit.TestRunner.executeShTest(test, litConfig,
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 2005, in executeShTest
return _runShTest(test, litConfig, useExternalSh, script, tmpBase)
File "/__w/llvm/llvm/lit/lit/TestRunner.py", line 1966, in _runShTest
output = """Script:\n--\n%s\n--\nExit Code: %d\n""" % (
TypeError: %d format: a number is required, not NoneType
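The TypeError above is what Python raises when `%d` is given `None`. A plausible reading, though only an assumption, is that the test process was killed before it produced an exit code. The sketch below reproduces the failure with the same format string. `format_result` and `format_result_safe` are hypothetical names for illustration, not lit functions, and this is not a proposed patch.

```python
def format_result(script, exit_code):
    # Same shape as the format string in the traceback; %d requires a number.
    return "Script:\n--\n%s\n--\nExit Code: %d\n" % (script, exit_code)


def format_result_safe(script, exit_code):
    # %s accepts None, so a missing exit code is reported instead of raising.
    return "Script:\n--\n%s\n--\nExit Code: %s\n" % (script, exit_code)


if __name__ == "__main__":
    try:
        format_result("true", None)
    except TypeError as err:
        print("TypeError:", err)
    print(format_result_safe("true", None))
```

This would explain why the tests are reported as UNRESOLVED rather than as failures or timeouts: the exception is raised while formatting the result, not by the test itself.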