[SYCL] Align GEMM dispatch #7566
Conversation
This is a welcome addition since it simplifies the MatMul dispatch 😄
We spotted a couple of issues with the code; I've added comments. I hope they're helpful.
ggml-sycl.cpp
Outdated
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q; // check dp4a
These lines don't do anything.
There will be a refactoring of the SYCL compute capabilities in #6408; keeping these here as a reminder.
ggml-sycl.cpp
Outdated
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q; // check dp4a
#ifdef SYCL_USE_XMX
    use_mul_mat_q = use_mul_mat_q && (!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE);
The logic in this block is a little confusing:
(!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE)
It says: use MMQ if either 1) FP16 performance is bad, or 2) the number of columns in src1 is less than or equal to the maximum. Is this the intention?
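To spell the branch out, here is a minimal standalone sketch of that condition as a small truth table. This is illustration only, not the ggml-sycl.cpp code; the MMQ_MAX_BATCH_SIZE value and the keep_mmq helper name are placeholders picked for the example.

```cpp
#include <cstdint>
#include <cstdio>

constexpr int64_t MMQ_MAX_BATCH_SIZE = 32; // placeholder value, for illustration only

// Mirrors the XMX-branch condition: MMQ stays enabled when device FP16 is
// slow, OR when the batch (src1->ne[1]) is small enough for the quantized kernel.
static bool keep_mmq(bool fp16_performance_good, int64_t src1_ncols) {
    return !fp16_performance_good || src1_ncols <= MMQ_MAX_BATCH_SIZE;
}

int main() {
    std::printf("%d\n", keep_mmq(true,  64)); // fast FP16, large batch -> 0: drop MMQ, use FP16 path
    std::printf("%d\n", keep_mmq(true,  16)); // fast FP16, small batch -> 1: keep MMQ
    std::printf("%d\n", keep_mmq(false, 64)); // slow FP16, any batch   -> 1: keep MMQ
    return 0;
}
```

Read this way, MMQ is only dropped when FP16 is fast AND the batch is larger than the threshold.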
This just aligns with the CUDA logic.
I am seeing new test failures on Arc A770; is this to be expected?
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /builds/perseus-performance-libraries/llama_ci/llama.cpp/ggml-sycl.cpp:13858: false
ggml_sycl_op_dequantize_mul_mat_vec unsupported GGML_TYPE 16
ggml-sycl.cpp
Outdated
bool ggml_sycl_supports_mmq(enum ggml_type type) {
    // TODO: accuracy issues in MMQ
    return false;
Could you elaborate on what accuracy issues you are having with MMQ?
The master branch uses ggml_sycl_op_mul_mat_sycl for these 5 cases; you can try to force MMQ to reproduce the issue (see the sketch after the list):
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
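If it helps with reproducing, below is a sketch of one way to force the MMQ path locally. This is a throwaway test edit to ggml-sycl.cpp, not part of this PR, and the q4_0-only switch is just an assumption to match the cases above.

```cpp
// Local testing hack (not part of this PR): stop ggml_sycl_supports_mmq from
// always returning false, so the dispatcher no longer falls back to
// ggml_sycl_op_mul_mat_sycl for q4_0. Relies on ggml.h, which ggml-sycl.cpp
// already includes, for enum ggml_type / GGML_TYPE_Q4_0.
bool ggml_sycl_supports_mmq(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:   // force MMQ for the q4_0 cases listed above
            return true;
        default:
            return false;      // keep everything else on the existing path
    }
}
```

After rebuilding, re-running the MUL_MAT cases in test-backend-ops should surface the accuracy issue mentioned in the TODO.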
bcadd61 to bfed283 (compare)
Fixed.
Co-authored-by: Neo Zhang Jianyu <[email protected]>
Performance increases from 34 tokens/s to 37 on Arc A770 with llama2-7b-Q4. It's good!
The issue was exposed during #6408.
I will split #6408 into several smaller PRs for easier reviewing; the tasks will be updated according to the issues exposed.