SYCL NVidia build failing #6026

AidanBeltonS · 2024-03-12T17:54:13Z

The NVidia SYCL build has been failing since the multiple-GPU support PR.
The failure comes from https://github.com/ggerganov/llama.cpp/pull/5806/files#diff-6af12449fa63d10882b68b8230ff092164a786e01683813a005267630ab9c0b2R3330.
On the CUDA backend the version is derived from the SM number, so for SM80 the major version is 8.

I believe we should refrain from using versions where possible in the SYCL backend, as they are not backend agnostic.

Steps to reproduce:

$ ./bin/test-backend-ops -b SYCL0

ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted


AidanBeltonS · 2024-03-12T17:54:34Z

PR introducing bug: #5806

NeoZhangJianyu · 2024-03-13T13:29:51Z

#5806 is already reverted.
Please check again!

NeoZhangJianyu · 2024-03-13T14:24:19Z

I guess the parameter "SYCL0" is the cause.
SYCL0 may not be the GPU device.
It's better not to set any parameter; the UT case will detect the GPU devices.

Please run the following commands to list the SYCL devices:

source /opt/intel/oneapi/setvars.sh
./build/bin/ls-sycl-device

AidanBeltonS · 2024-03-13T14:48:26Z

> #5806 is already reverted. Please check again!

I have checked on the current tip. It is still broken. Could you link me to the revert commit?

Commit: d8fd0cc

$ ./bin/test-backend-ops 
ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted

$ ./bin/test-backend-ops -b SYCL0
ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted

backtrace:

terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)

Thread 1 "test-backend-op" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737315194752, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff76dc476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76c27f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cdda49 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#6  0x00007ffff7ce907a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:48
#7  0x00007ffff7ce90e5 in std::terminate () at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:58
#8  0x00007ffff7ce9337 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff7c0a388 <typeinfo for sycl::_V1::invalid_parameter_error>, 
    dest=0x7ffff7ab0400 <sycl::_V1::invalid_parameter_error::~invalid_parameter_error()>) at ../../.././libstdc++-v3/libsupc++/eh_throw.cc:98
#9  0x00007ffff7bc0f42 in sycl::_V1::context::context(std::vector<sycl::_V1::device, std::allocator<sycl::_V1::device> > const&, std::function<void (sycl::_V1::exception_list)>, sycl::_V1::property_list const&) () from /opt/slurm/intel/oneapi/2024.0.1.46/compiler/2024.0/lib/libsycl.so.7
#10 0x00007ffff7bc0550 in sycl::_V1::context::context(std::vector<sycl::_V1::device, std::allocator<sycl::_V1::device> > const&, sycl::_V1::property_list const&) ()
   from /opt/slurm/intel/oneapi/2024.0.1.46/compiler/2024.0/lib/libsycl.so.7
#11 0x00000000005677ae in sycl_gpu_mgr::create_context_with_gpus (this=0x1d62100) at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:3411
#12 sycl_gpu_mgr::sycl_gpu_mgr (this=0x1d62100) at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:3407
#13 0x00000000004bcbdb in ggml_backend_sycl_reg_devices () at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:17311
#14 0x0000000000480958 in ggml_backend_registry_init () at /home/aidanbelton/source/llama.cpp/ggml-backend.c:377
#15 0x0000000000480ae4 in ggml_backend_reg_get_count () at /home/aidanbelton/source/llama.cpp/ggml-backend.c:419
#16 0x000000000040b868 in main (argc=<optimized out>, argv=<optimized out>) at /home/aidanbelton/source/llama.cpp/tests/test-backend-ops.cpp:2254

AidanBeltonS · 2024-03-13T14:50:48Z

> I guess the parameter "SYCL0" is the cause. SYCL0 maybe not the GPU device. It's better not set any parameter, the UT case will detect the GPU devices.
>
> Please run following cmd to list the SYCL device:
> source /opt/intel/oneapi/setvars.sh
> ./build/bin/ls-sycl-device

The problem is not tied to passing a specific device.
The device list passed when constructing the context is empty. This is because a device is only added to the device list if its major version == 1. This check was introduced in PR5806 and remains in the code base.

The major version for an NVidia A100 is 8; the major version for an AMD MI210 is 90.
The version numbers are very much a backend-specific value, which I do not recommend using.
What are you trying to do with this query? I would like to help provide a suitable alternative.

FYI:

$ ./bin/ls-sycl-device
found 3 SYCL devices:
|ID| Name                                        |compute capability|Max compute units|Max work group|Max sub group|Global mem size|
|--|---------------------------------------------|------------------|-----------------|--------------|-------------|---------------|
| 0|                        NVIDIA A100-PCIE-40GB|               8.0|              108|          1024|           32|    42298834944|
| 1|               Intel(R) FPGA Emulation Device|               1.2|                2|      67108864|           64|    67117649920|
| 2|     Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz|               3.0|                2|          8192|           64|    67117649920|
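
To make the failure mode concrete, here is a minimal sketch in plain C++ (no SYCL headers) of why a major version == 1 check leaves the list empty on this machine. `DeviceInfo`, `DeviceKind` and the two `select_*` functions are hypothetical names made up for illustration; they are not the code in ggml-sycl.cpp. The device kinds are inferred from the names in the listing above, and it is an assumption that the original check is applied to GPU devices only. In real SYCL code the type-based variant would roughly correspond to checking `dev.is_gpu()` rather than a version number.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the properties a SYCL runtime reports per device.
enum class DeviceKind { Gpu, Cpu, Accelerator };

struct DeviceInfo {
    std::string name;
    DeviceKind  kind;
    int         major_version;  // backend specific, e.g. 8 for SM80 on CUDA
};

// Devices from the ls-sycl-device listing (kinds inferred from the names).
std::vector<DeviceInfo> example_devices() {
    return {
        {"NVIDIA A100-PCIE-40GB", DeviceKind::Gpu, 8},
        {"Intel(R) FPGA Emulation Device", DeviceKind::Accelerator, 1},
        {"Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz", DeviceKind::Cpu, 3},
    };
}

// Version-gated selection: keep GPUs whose major version is exactly 1.
std::vector<DeviceInfo> select_version_gated(const std::vector<DeviceInfo>& devices) {
    std::vector<DeviceInfo> out;
    for (const auto& d : devices) {
        if (d.kind == DeviceKind::Gpu && d.major_version == 1) {
            out.push_back(d);
        }
    }
    return out;
}

// Backend-agnostic selection: keep every GPU, whatever version it reports.
std::vector<DeviceInfo> select_by_type(const std::vector<DeviceInfo>& devices) {
    std::vector<DeviceInfo> out;
    for (const auto& d : devices) {
        if (d.kind == DeviceKind::Gpu) {
            out.push_back(d);
        }
    }
    return out;
}

// On the listed machine the version-gated list is empty, so a context built
// from it would have no devices; the type-based list contains the A100.
void check_example() {
    const auto devices = example_devices();
    assert(select_version_gated(devices).empty());
    assert(select_by_type(devices).size() == 1);
}
```

Under these assumptions the A100 (major version 8) is dropped by the version check, and an AMD MI210 (major version 90) would be dropped in the same way, while a selection based on device type keeps both.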

NeoZhangJianyu · 2024-03-15T11:28:34Z

@AidanBeltonS if it's fixed, please close this issue.

AidanBeltonS added the bug-unconfirmed label Mar 12, 2024

AidanBeltonS mentioned this issue Mar 13, 2024

[SYCL] Fix non-intel device selection #6042

Merged

AidanBeltonS closed this as completed Mar 19, 2024

jiriks74 mentioned this issue May 16, 2024

[BUG]: A370M doesn't work, prevents iGPU from working #6808

Closed
