SYCL NVidia build failing #6026

Closed
AidanBeltonS opened this issue Mar 12, 2024 · 6 comments

@AidanBeltonS
Contributor

The NVIDIA SYCL build is failing because of the multiple-GPU support PR.
The failure comes from https://github.com/ggerganov/llama.cpp/pull/5806/files#diff-6af12449fa63d10882b68b8230ff092164a786e01683813a005267630ab9c0b2R3330.
On the CUDA backend the version is derived from the SM number, so for SM80 the major version is 8.

I believe we should avoid using version numbers in the SYCL backend where possible, since they are backend-specific and make the code less backend-agnostic.

Steps to reproduce:

$ ./bin/test-backend-ops -b SYCL0

ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted
@AidanBeltonS
Contributor Author

PR introducing bug: #5806

@NeoZhangJianyu
Collaborator

#5806 is already reverted.
Please check again!

@NeoZhangJianyu
Collaborator

I suspect the "SYCL0" parameter is the cause.
SYCL0 may not be the GPU device.
It is better not to set any parameter; the unit test will detect the GPU devices itself.

Please run the following commands to list the SYCL devices:

source /opt/intel/oneapi/setvars.sh
./build/bin/ls-sycl-device

@AidanBeltonS
Contributor Author

AidanBeltonS commented Mar 13, 2024

> #5806 is already reverted. Please check again!

I have checked on the current tip. It is still broken. Could you link me to the revert commit?

Commit: d8fd0cc

$ ./bin/test-backend-ops 
ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted
$ ./bin/test-backend-ops -b SYCL0
ggml_backend_register: registered backend CPU
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
Aborted

backtrace:

terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
  what():  DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)

Thread 1 "test-backend-op" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737315194752) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737315194752, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff76dc476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76c27f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cdda49 in __gnu_cxx::__verbose_terminate_handler () at ../../.././libstdc++-v3/libsupc++/vterminate.cc:95
#6  0x00007ffff7ce907a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:48
#7  0x00007ffff7ce90e5 in std::terminate () at ../../.././libstdc++-v3/libsupc++/eh_terminate.cc:58
#8  0x00007ffff7ce9337 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff7c0a388 <typeinfo for sycl::_V1::invalid_parameter_error>, 
    dest=0x7ffff7ab0400 <sycl::_V1::invalid_parameter_error::~invalid_parameter_error()>) at ../../.././libstdc++-v3/libsupc++/eh_throw.cc:98
#9  0x00007ffff7bc0f42 in sycl::_V1::context::context(std::vector<sycl::_V1::device, std::allocator<sycl::_V1::device> > const&, std::function<void (sycl::_V1::exception_list)>, sycl::_V1::property_list const&) () from /opt/slurm/intel/oneapi/2024.0.1.46/compiler/2024.0/lib/libsycl.so.7
#10 0x00007ffff7bc0550 in sycl::_V1::context::context(std::vector<sycl::_V1::device, std::allocator<sycl::_V1::device> > const&, sycl::_V1::property_list const&) ()
   from /opt/slurm/intel/oneapi/2024.0.1.46/compiler/2024.0/lib/libsycl.so.7
#11 0x00000000005677ae in sycl_gpu_mgr::create_context_with_gpus (this=0x1d62100) at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:3411
#12 sycl_gpu_mgr::sycl_gpu_mgr (this=0x1d62100) at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:3407
#13 0x00000000004bcbdb in ggml_backend_sycl_reg_devices () at /home/aidanbelton/source/llama.cpp/ggml-sycl.cpp:17311
#14 0x0000000000480958 in ggml_backend_registry_init () at /home/aidanbelton/source/llama.cpp/ggml-backend.c:377
#15 0x0000000000480ae4 in ggml_backend_reg_get_count () at /home/aidanbelton/source/llama.cpp/ggml-backend.c:419
#16 0x000000000040b868 in main (argc=<optimized out>, argv=<optimized out>) at /home/aidanbelton/source/llama.cpp/tests/test-backend-ops.cpp:2254

@AidanBeltonS
Contributor Author

> I guess the parameter "SYCL0" is the cause. SYCL0 maybe not the GPU device. It's better not set any parameter, the UT case will detect the GPU devices.
>
> Please run following cmd to list the SYCL device:
>
> source /opt/intel/oneapi/setvars.sh
> ./build/bin/ls-sycl-device

The problem is not tied to passing a specific device.
The device list passed when constructing the context is empty. This is because a device is only added to the list if its major version == 1. This check was introduced in PR #5806 and remains in the code base.

The major version for an NVIDIA A100 is 8; for an AMD MI210 it is 90.
Version numbers are very much backend-specific values, and I do not recommend relying on them.
What are you trying to achieve with this query? I would like to help provide a suitable alternative.

FYI:

$ ./bin/ls-sycl-device
found 3 SYCL devices:
|ID| Name                                        |compute capability|Max compute units|Max work group|Max sub group|Global mem size|
|--|---------------------------------------------|------------------|-----------------|--------------|-------------|---------------|
| 0|                        NVIDIA A100-PCIE-40GB|               8.0|              108|          1024|           32|    42298834944|
| 1|               Intel(R) FPGA Emulation Device|               1.2|                2|      67108864|           64|    67117649920|
| 2|     Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz|               3.0|                2|          8192|           64|    67117649920|
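
The failure mode described in this comment can be modelled without a SYCL toolchain. The following is a minimal sketch, not the actual ggml-sycl.cpp code: `DeviceInfo`, `example_devices`, `select_by_version`, and `select_by_type` are hypothetical names, and the assumption that the check combines "is a GPU" with "major version == 1" is inferred from the discussion and the empty device list, not confirmed from the source. The device entries mirror the `ls-sycl-device` output above, with the FPGA emulation device assumed to be a non-GPU device.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-device properties a SYCL runtime reports.
struct DeviceInfo {
    std::string name;
    bool is_gpu;        // assumption: device type as reported by the runtime
    int major_version;  // backend-specific value, e.g. 8 for an A100 via CUDA
};

// The three devices from the ls-sycl-device table.
inline std::vector<DeviceInfo> example_devices() {
    return {
        {"NVIDIA A100-PCIE-40GB", true, 8},
        {"Intel(R) FPGA Emulation Device", false, 1},
        {"Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz", false, 3},
    };
}

// Assumed shape of the failing check: a GPU is kept only if its major version is 1.
inline std::vector<DeviceInfo> select_by_version(const std::vector<DeviceInfo>& devs) {
    std::vector<DeviceInfo> out;
    for (const auto& d : devs) {
        if (d.is_gpu && d.major_version == 1) {
            out.push_back(d);
        }
    }
    return out;
}

// Version-agnostic alternative: keep every GPU, whatever version it reports.
inline std::vector<DeviceInfo> select_by_type(const std::vector<DeviceInfo>& devs) {
    std::vector<DeviceInfo> out;
    for (const auto& d : devs) {
        if (d.is_gpu) {
            out.push_back(d);
        }
    }
    return out;
}

// Self-check against the devices from the table.
inline void check_example() {
    const auto devs = example_devices();
    assert(select_by_version(devs).empty());   // no device passes the version gate
    assert(select_by_type(devs).size() == 1);  // only the A100 is a GPU
}
```

With these three devices, the version-gated filter returns an empty list, which corresponds to the empty device list that `sycl::context` rejects in the backtrace. The type-based filter keeps the A100. Whether filtering on device type alone is the right fix depends on what the original version query was meant to achieve.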

@NeoZhangJianyu
Collaborator

@AidanBeltonS if it's fixed, please close this issue.
