SYCL: Add gated linear attention kernel #11175
Conversation
Thanks for the contribution! Overall the code looks good, though I found an issue. I will try to launch the complete model on an A100 once the barrier issue is fixed.
I was able to run a small perf test with the model:
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rwkv6qwen2 32B Q4_K - Medium | 20.42 GiB | 34.74 B | SYCL | 99 | none | pp512 | 276.94 ± 0.29 |
| rwkv6qwen2 32B Q4_K - Medium | 20.42 GiB | 34.74 B | SYCL | 99 | none | tg128 | 21.67 ± 0.06 |
A run of llama-cli with the model seems fine. LGTM!
@qnixsynapse Let's wait a day for others to review. If no one else comes by, I will merge it.
@Alcpz Sure. BTW, I did some inspection because this seemed a bit slow for an A100. It turns out the dequant matmul kernels aren't vectorized enough and don't make use of local_accessors. The whole thing seems to have been converted using a tool called SYCLomatic, and I find that tool problematic, tbh.
Yes, you are right. The original code for this backend was primarily generated using SYCLomatic, which prioritizes functionality over optimal design due to the nature of the tool. The reference code from which it was converted has had significant improvements since then. Regarding local memory, there is no universal guarantee that adding it to matrix multiplication kernels will improve performance across all devices, given the hardware differences across vendors. Achieving consistent performance could require splitting code paths or adjusting kernels for specific hardware, so keep that in mind in case you want to start working there.
Actually, my aim is to reduce the number of global memory accesses during an operation by caching data that is accessed repeatedly (like scales for dequantization) in local memory, as sketched below. For now, I will wait for someone from the Codeplay/Intel side, or the person who wrote the original code, to improve it. They have better knowledge of and access to the hardware than I do.
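For illustration only (not code from this PR or the current backend), here is a minimal, self-contained SYCL sketch of the idea: staging per-block dequantization scales into work-group local memory through a `sycl::local_accessor`, so each work-item reads them from local rather than global memory. All sizes, buffer names and layouts are made up.

```cpp
// Sketch: cache per-block dequantization scales in work-group local memory.
// Block/tile sizes and buffer names are illustrative only.
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    constexpr int n_blocks   = 1024;  // number of quantization blocks
    constexpr int block_size = 32;    // quantized values per block
    constexpr int wg_size    = 256;   // work-group size

    std::vector<int8_t> q(n_blocks * block_size, 1); // fake quantized weights
    std::vector<float>  scales(n_blocks, 0.5f);      // one scale per block
    std::vector<float>  out(n_blocks * block_size, 0.0f);

    sycl::queue queue{sycl::default_selector_v};
    {
        sycl::buffer<int8_t, 1> q_buf(q.data(),      sycl::range<1>(q.size()));
        sycl::buffer<float, 1>  s_buf(scales.data(), sycl::range<1>(scales.size()));
        sycl::buffer<float, 1>  o_buf(out.data(),    sycl::range<1>(out.size()));

        queue.submit([&](sycl::handler &cgh) {
            sycl::accessor q_acc(q_buf, cgh, sycl::read_only);
            sycl::accessor s_acc(s_buf, cgh, sycl::read_only);
            sycl::accessor o_acc(o_buf, cgh, sycl::write_only, sycl::no_init);

            // Local (shared) memory tile: one scale per block handled by this work-group.
            constexpr int blocks_per_wg = wg_size / block_size;
            sycl::local_accessor<float, 1> s_local(sycl::range<1>(blocks_per_wg), cgh);

            cgh.parallel_for(
                sycl::nd_range<1>(sycl::range<1>(n_blocks * block_size), sycl::range<1>(wg_size)),
                [=](sycl::nd_item<1> it) {
                    const int gid = it.get_global_id(0);
                    const int lid = it.get_local_id(0);

                    // A few work-items stage this group's scales from global to local memory.
                    if (lid < blocks_per_wg) {
                        const int first_block = (gid - lid) / block_size;
                        s_local[lid] = s_acc[first_block + lid];
                    }
                    sycl::group_barrier(it.get_group());

                    // Every work-item now reads its scale from local memory
                    // instead of re-reading it from global memory.
                    const float scale = s_local[lid / block_size];
                    o_acc[gid] = scale * static_cast<float>(q_acc[gid]);
                });
        });
    }
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```

Whether this actually helps depends on the device, as noted above.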
Great job!
Since the unit tests pass, I think the quality should be OK, at least functionally.
Thank you!
FYI we are not planning to optimize SYCL kernels for Nvidia devices in the short term. There may be longer-term options which would allow us to compile and launch native CUDA kernels via SYCL interop mode.
@Rbiessy It's okay. My primary focus for this backend is Intel GPUs only, since I own one myself.
From the viewpoint of migrating CUDA to SYCL, or of supporting CUDA devices through SYCL, it's fine to use SYCL interop mode to run native CUDA kernels: CUDA users get more choice, and in the end only the best solution will be widely adopted. I don't think the SYCL backend can attract more CUDA users than the CUDA backend. The value of the SYCL backend is supporting Intel GPUs with good performance; that's why some Intel GPU users have moved from other tools to the llama.cpp SYCL backend. The SYCL backend is the first choice for Intel GPUs. This year, my target is still to optimize the SYCL backend for good functionality and performance on Intel GPUs only.
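As a rough, hedged sketch of what "SYCL interop mode" could look like (not an actual plan or code from llama.cpp), the snippet below uses a SYCL `host_task` with an `interop_handle` to obtain the native stream of a queue running on the DPC++ CUDA backend and issue native CUDA work on it. The exact native handle type returned for a queue can differ between compiler versions, so the cast is an assumption.

```cpp
// Sketch: enqueue native CUDA work from a SYCL queue via host_task + interop_handle.
// Assumes DPC++ built with the CUDA backend plugin; illustrative only.
#include <sycl/sycl.hpp>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    constexpr size_t n = 1024;
    float *data = sycl::malloc_device<float>(n, q);

    q.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // Grab the native stream backing this SYCL queue (CUDA backend only).
            auto native = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
            // Assumption: the native handle is convertible to a CUDA runtime stream.
            cudaStream_t stream = reinterpret_cast<cudaStream_t>(native);

            // Any native CUDA call (kernel launch, cuBLAS, memset, ...) can be
            // issued on that stream; a simple async memset stands in here for
            // a hand-written CUDA kernel.
            cudaMemsetAsync(data, 0, n * sizeof(float), stream);
            cudaStreamSynchronize(stream);
        });
    }).wait();

    std::printf("native CUDA work submitted through SYCL interop\n");
    sycl::free(data, q);
    return 0;
}
```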
Following #11001, this adds a gated linear attention kernel based on the logic of the CUDA kernel.
This is my very first attempt at translating CUDA kernels to SYCL, so please excuse any mistakes.
test-backend-ops is passing for now.
I could not test the model (which is 32B) due to lack of memory; testing on an Nvidia GPU should give results.
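For reference, here is a hedged, plain-C++ single-head sketch of the gated linear attention recurrence the kernel parallelizes: state update `S <- diag(g_t) * S + k_t * v_t^T`, output `o_t = scale * q_t^T * S`. The tensor layout, gating convention and helper names are my reading and may not match the ggml/CUDA kernel exactly.

```cpp
// Reference sketch of the gated linear attention recurrence (single head).
// Illustrative only; conventions may differ from the actual kernel.
#include <vector>
#include <cstdio>

void gla_ref(int T, int d, float scale,
             const std::vector<float> &q,   // [T, d] queries (a.k.a. r)
             const std::vector<float> &k,   // [T, d] keys
             const std::vector<float> &v,   // [T, d] values
             const std::vector<float> &g,   // [T, d] per-channel decay gates
             std::vector<float> &S,         // [d, d] running state, updated in place
             std::vector<float> &out) {     // [T, d] outputs
    for (int t = 0; t < T; ++t) {
        for (int j = 0; j < d; ++j) {            // value channel
            float y = 0.0f;
            for (int i = 0; i < d; ++i) {        // key channel
                float &s = S[i * d + j];
                // Decay the previous state, accumulate the new outer product,
                // and read the updated state for the output in one pass.
                s = s * g[t * d + i] + k[t * d + i] * v[t * d + j];
                y += q[t * d + i] * s;
            }
            out[t * d + j] = scale * y;
        }
    }
}

int main() {
    const int T = 4, d = 8;
    std::vector<float> q(T * d, 0.1f), k(T * d, 0.2f), v(T * d, 0.3f), g(T * d, 0.9f);
    std::vector<float> S(d * d, 0.0f), out(T * d, 0.0f);
    gla_ref(T, d, 1.0f, q, k, v, g, S, out);
    std::printf("out[0] = %f\n", out[0]);
    return 0;
}
```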