[SYCL] - Run hang with undefined temporary symbol error on Windows 11 #596

Open
aahouzi opened this issue Feb 12, 2025 · 9 comments

Comments

@aahouzi

aahouzi commented Feb 12, 2025

Type of issue

  • Started by installing oneAPI 2025.0.1 and activating the environment with:
 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
  • Verified SYCL devices with sycl-ls:
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.31896]
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 9 285K OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) B580 Graphics OpenCL 3.0 NEO  [32.0.101.6559]
  • Used the following CMake commands to build the project with the SYCL backend for my B580 dGPU:
cmake -B build -G "Ninja" -DSD_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx 
cmake --build build --config Release -j
  • The build completes successfully. When I run the binary on the sd3-medium model, it hangs indefinitely with the following error:
C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp>build\bin\sd.exe -m sd3_medium_incl_clips_t5xxlfp16.safetensors --cfg-scale 5 --steps 30 --sampling-method euler  -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12450M|            1.6.31896|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:195  - loading model from 'sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] model.cpp:888  - load sd3_medium_incl_clips_t5xxlfp16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242  - Version: SD3.x
[INFO ] stable-diffusion.cpp:275  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:278  - VAE weight type:             f16
[INFO ] stable-diffusion.cpp:319  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:322  - CLIP: Using CPU backend
[INFO ] mmdit.hpp:706  - MMDiT layers: 24 (including 0 MMDiT-x layers)
  |==================================================| 1665/1668 - 0.00it/s[INFO ] model.cpp:1868 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
  |==================================================| 1668/1668 - 5.68it/s
[INFO ] stable-diffusion.cpp:516  - total params memory size = 14857.47MB (VRAM 4209.34MB, RAM 10648.12MB): clip 10648.12MB(RAM), unet 4114.77MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:520  - loading model from 'sd3_medium_incl_clips_t5xxlfp16.safetensors' completed, taking 24.17s
[INFO ] stable-diffusion.cpp:538  - running in FLOW mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 44453 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1921.36 MiB
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_i16_i16PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_f16_f16PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_i32_i32PKcPc
<unknown>:0: error: Undefined temporary symbol .L_ZL13cpy_1_f32_f32PKcPc
  • I also tried the Intel Core Ultra 7 155H (MTL) iGPU and ended up with the same issue. @zhentaoyu, is there any guidance you can provide on this?
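For reading the errors above: in LLVM-based toolchains a `.L` prefix usually marks an assembler-local (temporary) label, and the rest of each name looks like an Itanium-mangled C++ symbol. Decoding `_ZL13cpy_1_f16_f16PKcPc` by hand under that ABI gives an internal-linkage function `cpy_1_f16_f16(const char*, char*)`. Below is a minimal sketch for decoding such names, assuming a GCC or Clang toolchain that ships `<cxxabi.h>` (MSVC does not). `demangle_or_raw` is a hypothetical helper for illustration, not part of ggml or stable-diffusion.cpp.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: strip an assembler-local ".L" prefix (an assumption
// about how the temporary symbol was formed), then try to demangle the rest.
// Falls back to the stripped input when demangling fails.
std::string demangle_or_raw(std::string symbol) {
    if (symbol.rfind(".L", 0) == 0) {
        symbol.erase(0, 2);
    }
    int status = 0;
    char* out = abi::__cxa_demangle(symbol.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return symbol;  // not a name the demangler understands
    }
    std::string result(out);
    std::free(out);  // __cxa_demangle allocates with malloc
    return result;
}
```

Calling `demangle_or_raw(".L_ZL13cpy_1_f16_f16PKcPc")` should yield a readable signature containing `cpy_1_f16_f16`, which helps locate the kernel the device compiler failed to resolve.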

Hardware

  • Intel Core Ultra 7 155H (MTL) iGPU
  • Intel Arc B580 dGPU

GPU Driver version

32.0.101.6559

OS

Windows 11

@zhentaoyu
Contributor

cc @airMeng for awareness.

@airMeng
Contributor

airMeng commented Feb 13, 2025

@aahouzi long time no see. Can you try #597?

@aahouzi
Author

aahouzi commented Feb 14, 2025

@airMeng FYI, it's still not working on my B580. I tried SD1.4 too, and I'm getting a different error this time:

C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp>build\bin\sd.exe -m sd-v1-4.ckpt --cfg-scale 5 --steps 30 --sampling-method euler  -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc B580 Graphics|   20.1|    160|    1024|   32| 12450M|            1.6.31896|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:195  - loading model from 'sd-v1-4.ckpt'
[INFO ] model.cpp:891  - load sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:242  - Version: SD 1.x
[INFO ] stable-diffusion.cpp:275  - Weight type:                 f32
[INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     f32
[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:278  - VAE weight type:             f32
  |==================================================| 1131/1131 - 0.00it/s
[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:520  - loading model from 'sd-v1-4.ckpt' completed, taking 31.09s
[INFO ] stable-diffusion.cpp:554  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 1065 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 8368.49 MiB
Provided range is out of integer limits. Pass `-fno-sycl-id-queries-fit-in-int' to disable range check.Exception caught at file:C:\Users\Intel\Desktop\aahouzi\stable-diffusion.cpp\ggml\src\ggml-sycl\common.cpp, line:102

@HeyItsBATMAN

@aahouzi The '-fno-sycl-id-queries-fit-in-int' message appears when the requested resolution is too large. If you remove the -W 1024 -H 1024 arguments, it should get further.
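Which kernel launch trips the range check is not stated in this thread. As a rough illustration only, the sketch below assumes an SD 1.x model with an 8x VAE downscale and 8 attention heads, and assumes the offending launch covers a tokens x tokens x heads attention-score tensor. All function names are hypothetical. Under those assumptions a 1024x1024 image yields exactly 2^31 elements, one more than `INT_MAX`, while 512x512 stays well inside the limit.

```cpp
#include <climits>
#include <cstdint>

// Latent tokens for a width x height image, ASSUMING an 8x VAE downscale.
constexpr std::int64_t latent_tokens(std::int64_t width, std::int64_t height) {
    return (width / 8) * (height / 8);
}

// Element count of a hypothetical attention-score tensor:
// tokens x tokens x heads (heads = 8 is an assumption for SD 1.x).
constexpr std::int64_t attn_score_elems(std::int64_t width, std::int64_t height,
                                        std::int64_t heads) {
    return latent_tokens(width, height) * latent_tokens(width, height) * heads;
}

// True when a linear work size can still be indexed with a signed 32-bit int,
// which is what the default id-queries-fit-in-int check requires.
constexpr bool fits_in_int(std::int64_t n) {
    return n <= INT_MAX;
}
```

With these helpers, `fits_in_int(attn_score_elems(1024, 1024, 8))` is false and `fits_in_int(attn_score_elems(512, 512, 8))` is true, which is consistent with the smaller resolution getting past the check.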

@aahouzi
Author

aahouzi commented Feb 14, 2025

@HeyItsBATMAN Thanks for the info. With SD1.4 it now runs, but the GPU is still not fully utilized: only small peaks appear continuously, even though the image size is just 256x256. How long does it take on your Linux-based B580?

Image

FYI, here is the performance on my GPU. I still think it's too slow given the model size and image size:

[INFO ] stable-diffusion.cpp:520  - loading model from 'sd-v1-4.ckpt' completed, taking 31.63s
[INFO ] stable-diffusion.cpp:554  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:688  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1241 - apply_loras completed, taking 0.00s
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 1.40 MiB
[INFO ] stable-diffusion.cpp:1374 - get_learned_condition completed, taking 978 ms
[INFO ] stable-diffusion.cpp:1397 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1434 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 57.13 MiB
  |==================================================| 30/30 - 20.31s/it
[INFO ] stable-diffusion.cpp:1472 - sampling completed, taking 578.49s
[INFO ] stable-diffusion.cpp:1480 - generating 1 latent images completed, taking 578.50s
[INFO ] stable-diffusion.cpp:1483 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 416.00 MiB
[INFO ] stable-diffusion.cpp:1493 - latent 1 decoded, taking 33.87s
[INFO ] stable-diffusion.cpp:1497 - decode_first_stage completed, taking 33.87s
[INFO ] stable-diffusion.cpp:1620 - txt2img completed in 613.35s
save result PNG image to 'output.png'

@HeyItsBATMAN

@aahouzi
Kinda sounds like it's not loading the model onto your GPU at all?

Using sd-v1-4.ckpt with euler sampling:
256x256@30steps: 5 seconds
512x512@30steps: 12 seconds

It should print a line like this:

[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)

You can see that everything is loaded into VRAM for me. What does it say for you?

@aahouzi
Author

aahouzi commented Feb 14, 2025

@HeyItsBATMAN I already checked that, and my model is fully loaded in VRAM:

[INFO ] stable-diffusion.cpp:516  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)

Not sure if PR #330 enabling the SYCL backend was even tested on Windows. Also, I just noticed that there is still a TODO task in this PR to add support for large image inputs, which explains the '-fno-sycl-id-queries-fit-in-int' issue.

@NeoZhangJianyu

@aahouzi
SD calls ggml to allocate more than 4GB of memory when the image size is large.

On client dGPUs, the driver generally imposes a 4GB allocation limit.
The iGPU of 11th Gen Core and PVC have no such limitation.

I tested SD 1.4 and SD 3.x on an Arc770 (16GB) and hit the same memory allocation issue.
After changing the size from 1024 to 512, all runs passed.

Let me look into possible solutions.

@airMeng
Contributor

airMeng commented Mar 4, 2025

@aahouzi SD calls ggml to allocate more than 4GB of memory when the image size is large.

On client dGPUs, the driver generally imposes a 4GB allocation limit. The iGPU of 11th Gen Core and PVC have no such limitation.

https://github.com/intel/compute-runtime/blob/2cad595a0de0875b5c8735e5245661305318c07b/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md

You can try splitting the request into several smaller allocations, like llama.cpp does.
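To make the splitting idea concrete, here is a minimal planning sketch. It assumes a 4 GiB cap per allocation and uses hypothetical helper names. It only computes how many pieces a request would need and how large each piece is. It does not show how llama.cpp actually does this, and real tensors cannot be cut at arbitrary byte offsets.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
// ASSUMED per-allocation cap of 4 GiB; the real limit depends on the driver.
constexpr std::uint64_t kMaxAlloc = 4096ull * kMiB;

// Number of allocations needed so that none exceeds `cap` bytes.
constexpr std::uint64_t chunk_count(std::uint64_t total,
                                    std::uint64_t cap = kMaxAlloc) {
    return (total + cap - 1) / cap;
}

// Size in bytes of chunk `index` (0-based); 0 once past the last chunk.
constexpr std::uint64_t chunk_size(std::uint64_t total, std::uint64_t index,
                                   std::uint64_t cap = kMaxAlloc) {
    return index * cap >= total
               ? 0
               : (total - index * cap < cap ? total - index * cap : cap);
}
```

For a request of roughly the size seen in the SD1.4 log (rounded up to 8369 MiB), `chunk_count` gives 3 and the last chunk is 177 MiB, while the 1921.36 MiB buffer from the SD3 run would fit in a single allocation under this assumed cap.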
