[SYCL] - Run hang with undefined temporary symbol error on Windows 11 #596

aahouzi · 2025-02-12T16:10:22Z

Type of issue

Started by installing oneAPI 2025.0.1 and activating the environment with:

 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

Verified SYCL devices with sycl-ls:

[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.31896]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 9 285K OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) B580 Graphics OpenCL 3.0 NEO  [32.0.101.6559]

Used the following CMake commands to build the project with SYCL backend on my B580 dGPU:

cmake -B build -G "Ninja" -DSD_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx 
cmake --build build --config Release -j

The build completes successfully. When I run the binary on the sd3-medium model, the run hangs indefinitely after printing the following errors:

C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp>build\bin\sd.exe -m sd3_medium_incl_clips_t5xxlfp16.safetensors --cfg-scale 5 --steps 30 --sampling-method euler  -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12450M|            1.6.31896|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:195  - loading model from 'sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] model.cpp:888  - load sd3_medium_incl_clips_t5xxlfp16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242  - Version: SD3.x
[INFO ] stable-diffusion.cpp:275  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:278  - VAE weight type:             f16
[INFO ] stable-diffusion.cpp:319  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:322  - CLIP: Using CPU backend
[INFO ] mmdit.hpp:706  - MMDiT layers: 24 (including 0 MMDiT-x layers)
  |==================================================| 1665/1668 - 0.00it/s[INFO ] model.cpp:1868 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
  |==================================================| 1668/1668 - 5.68it/s
[INFO ] stable-diffusion.cpp:516  - total params memory size = 14857.47MB (VRAM 4209.34MB, RAM 10648.12MB): clip 10648.12MB(RAM), unet 4114.77MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:520  - loading model from 'sd3_medium_incl_clips_t5xxlfp16.safetensors' completed, taking 24.17s
[INFO ] stable-diffusion.cpp:538  - running in FLOW mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 44453 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1921.36 MiB
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_i16_i16PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_f16_f16PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_i32_i32PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_f32_f32PKcPc

I also tried on the Intel Core Ultra 7 155H (MTL) iGPU and ended up with the same issue. @zhentaoyu, is there any guidance you can provide on this?

Hardware

Intel Core Ultra 7 155H (MTL) iGPU
Intel Arc B580 dGPU

GPU Driver version

32.0.101.6559

OS

Windows 11

zhentaoyu · 2025-02-13T02:45:49Z

cc @airMeng for awareness.

airMeng · 2025-02-13T13:57:18Z

@aahouzi long time no see. Can you try #597?

aahouzi · 2025-02-14T09:59:12Z

@airMeng FYI, it's still not working on my B580. I tried with SD1.4 too, and I'm getting a different error this time:

C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp>build\bin\sd.exe -m sd-v1-4.ckpt --cfg-scale 5 --steps 30 --sampling-method euler  -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12450M|            1.6.31896|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:195  - loading model from 'sd-v1-4.ckpt'
[INFO ] model.cpp:891  - load sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:242  - Version: SD 1.x
[INFO ] stable-diffusion.cpp:275  - Weight type:                 f32
[INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     f32
[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:278  - VAE weight type:             f32
  |==================================================| 1131/1131 - 0.00it/s
[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:520  - loading model from 'sd-v1-4.ckpt' completed, taking 31.09s
[INFO ] stable-diffusion.cpp:554  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 1065 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 8368.49 MiB
Provided range is out of integer limits. Pass `-fno-sycl-id-queries-fit-in-int' to disable range check.Exception caught at file:C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp\ggml\src\ggml-sycl\common.cpp, line:102

HeyItsBATMAN · 2025-02-14T14:29:48Z

@aahouzi The '-fno-sycl-id-queries-fit-in-int' message appears when the requested resolution is too large. If you remove the -W 1024 -H 1024 arguments, it should get further.
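
For illustration, here is a minimal C++ sketch of the kind of check behind that message. It assumes the runtime rejects a launch when any range dimension, or the total work-item count, exceeds INT_MAX. The helper name range_fits_in_int is hypothetical and is not part of ggml or the SYCL runtime.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper (not a ggml or SYCL API): returns true when a 3-D launch
// range can be indexed with a 32-bit signed int, i.e. every dimension and the
// total number of work-items are <= INT_MAX.
bool range_fits_in_int(std::uint64_t d0, std::uint64_t d1, std::uint64_t d2) {
    const std::uint64_t limit = static_cast<std::uint64_t>(INT_MAX);
    if (d0 > limit || d1 > limit || d2 > limit) {
        return false;
    }
    // Compare against limit / d before multiplying, so the running product
    // never exceeds INT_MAX and cannot overflow.
    std::uint64_t total = d0;
    if (d1 != 0 && total > limit / d1) {
        return false;
    }
    total *= d1;
    if (d2 != 0 && total > limit / d2) {
        return false;
    }
    return true;
}
```

With this helper, a 1024 x 1024 x 1024 range (2^30 work-items) passes, while 2048 x 1024 x 1024 (2^31 work-items) does not. The actual ranges launched by ggml-sycl are not shown in this thread, so treat the numbers as examples only.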

aahouzi · 2025-02-14T17:13:17Z

@HeyItsBATMAN Thanks for the info. With SD1.4 it now runs, but the GPU is still not fully utilized: only small peaks appear continuously, even though the image size is just 256x256. How long does it take on your Linux-based B580?

FYI, here is the perf on my GPU. I still think it's too slow given the model size and image size:

[INFO ] stable-diffusion.cpp:520  - loading model from 'sd-v1-4.ckpt' completed, taking 31.63s
[INFO ] stable-diffusion.cpp:554  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 978 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 57.13 MiB
  |==================================================| 30/30 - 20.31s/it
[INFO ] stable-diffusion.cpp:1472 - sampling completed, taking 578.49s
[INFO ] stable-diffusion.cpp:1480 - generating 1 latent images completed, taking 578.50s
[INFO ] stable-diffusion.cpp:1483 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 416.00 MiB
[INFO ] stable-diffusion.cpp:1493 - latent 1 decoded, taking 33.87s
[INFO ] stable-diffusion.cpp:1497 - decode_first_stage completed, taking 33.87s
[INFO ] stable-diffusion.cpp:1620 - txt2img completed in 613.35s
save result PNG image to 'output.png'

HeyItsBATMAN · 2025-02-14T17:44:20Z

@aahouzi
Kinda sounds like it's not loading the model onto your GPU at all?

Using sd-v1-4.ckpt with euler sampling:
256x256@30steps: 5 seconds
512x512@30steps: 12 seconds

It should print a line like this:

[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)

You can see that everything is loaded into VRAM for me. What does it say for you?

aahouzi · 2025-02-14T18:27:57Z

@HeyItsBATMAN I already checked that, and my model is fully loaded in VRAM:

[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)

Not sure if PR #330 enabling the SYCL backend was even tested on Windows. Also, I just noticed that there is still a TODO task in this PR to add support for large image inputs, which explains the '-fno-sycl-id-queries-fit-in-int' issue.

NeoZhangJianyu · 2025-02-27T15:21:15Z

@aahouzi
With large image sizes, SD asks ggml to allocate more than 4GB of memory.

On client dGPUs, the driver generally imposes a 4GB allocation limit.
The iGPU of 11th Gen Core and PVC have no such limitation.

I tested SD 1.4 and SD 3.x on an Arc A770 (16GB) and hit the same memory allocation issue.
When I change the size from 1024 to 512, everything passes.

Let me check possible methods.
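
To put the buffer sizes from the logs earlier in this thread next to that limit, here is a small sketch. It assumes the limit is read as 4096 MiB; the exact threshold the driver enforces is an assumption here, and both names below are hypothetical.

```cpp
#include <cassert>

// Assumption: the 4GB driver limit mentioned above, read as 4096 MiB.
constexpr double kAssumedLimitMiB = 4096.0;

// Hypothetical helper: true when a single buffer request (in MiB) is larger
// than the assumed limit.
bool exceeds_assumed_limit(double request_mib) {
    return request_mib > kAssumedLimitMiB;
}
```

Under that assumption, the 8368.49 MiB SYCL0 buffer requested for SD 1.4 at 1024x1024 is over the limit, while the 57.13 MiB and 416.00 MiB requests from the later run without -W/-H are well under it.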

airMeng · 2025-03-04T07:46:32Z

@aahouzi With large image sizes, SD asks ggml to allocate more than 4GB of memory.

On client dGPUs, the driver generally imposes a 4GB allocation limit. The iGPU of 11th Gen Core and PVC have no such limitation.

https://github.com/intel/compute-runtime/blob/2cad595a0de0875b5c8735e5245661305318c07b/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

You can try splitting the request into several smaller allocations, as llama.cpp does.
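
As a rough sketch of the "allocate several times" idea, the helper below only computes how a large request could be divided into pieces no larger than a given cap. It is not llama.cpp's or ggml's actual implementation; split_allocation is a hypothetical name, and a real backend would also have to make sure no single tensor straddles two chunks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: split one request of total_bytes into chunk sizes of
// at most max_chunk bytes each. Only the sizes are computed; no memory is
// allocated here.
std::vector<std::uint64_t> split_allocation(std::uint64_t total_bytes,
                                            std::uint64_t max_chunk) {
    assert(max_chunk > 0 && "chunk size must be positive");
    std::vector<std::uint64_t> chunks;
    while (total_bytes > 0) {
        // Take a full chunk, or whatever remains if that is smaller.
        const std::uint64_t n = total_bytes < max_chunk ? total_bytes : max_chunk;
        chunks.push_back(n);
        total_bytes -= n;
    }
    return chunks;
}
```

For example, an 8368 MiB request with a 4096 MiB cap yields three chunks: 4096 MiB, 4096 MiB, and 176 MiB.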
