Quantized matmul with CUDA sets the result to zero instead of properly computing it #529
Comments
Hm, maybe it is a bug... If I replace line 5545 with […]
I think I came across the same error: ggerganov/llama.cpp#3202 (comment)
@Green-Sky Thanks for pointing me to that. When building and running the file in […]
I'll dig deeper; will try to investigate why […]
I can now force it to work with an ugly crutch: in the main […]
IDK how to do it more... properly.
@JohannesGaessler Hi! Your recent comment was very helpful for me in debugging this issue. If possible, can you advise on how to properly configure CUDA archs in […]?

EDIT: Sorry for pinging you. I had […]. Normally setting […]
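(A general note on the CUDA-arch question above, not necessarily the exact setting used to resolve this issue: with CMake's first-class CUDA support, the target architectures are controlled by the standard `CMAKE_CUDA_ARCHITECTURES` variable, which can be passed at configure time.)

```sh
# Illustrative only: configure the build for compute capability 6.1 (e.g. a GTX 10xx card).
# Use the value(s) matching your GPU; CMake >= 3.24 also accepts "native".
cmake .. -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build . --config Release
```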
SOLVED! Read the thread for the investigation details and the solution.
In rwkv.cpp, I'm updating `ggml` from commit `a1d0ea7` to the most recent commit `8ca2c19`. After the update, FP32, FP16 and quantized inference on CPU works, and FP32 and FP16 inference on GPU (CUDA) also works.

However, quantized inference on GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before: set the tensor's `backend` and call `ggml_cuda_transform_tensor`.

Here's a minimal piece of code that reproduces the behavior:
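The exact repro snippet is not reproduced here; below is only a rough sketch of the offloading pattern described above, assuming the older ggml CUDA API of that era (`GGML_BACKEND_GPU`, `ggml_cuda_transform_tensor`). The tensor sizes, fill values and graph-building calls are illustrative and may differ slightly between ggml revisions.

```c
// Sketch only: a quantized weight offloaded to CUDA, multiplied against an FP32 vector.
// Requires a ggml build with the cuBLAS/CUDA path compiled in.
#include <stdio.h>
#include <stdint.h>

#include "ggml.h"
#include "ggml-cuda.h"

enum { N = 64 }; // must be a multiple of the Q4_0 block size (32)

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Weights in Q4_0, activations in FP32.
    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, N, N);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, N);

    // Fill a float buffer with test data and quantize it into the weight tensor.
    float wf[N * N];
    for (int i = 0; i < N * N; i++) wf[i] = 0.1f * (float)(i % 7);
    int64_t hist[16] = {0};
    ggml_quantize_q4_0(wf, w->data, N * N, N, hist);

    for (int i = 0; i < N; i++) ggml_set_f32_1d(x, i, 1.0f);

    // Offload the quantized weight: mark its backend and upload the data to the GPU.
    w->backend = GGML_BACKEND_GPU;
    ggml_cuda_transform_tensor(w->data, w);

    // y = w * x; the result tensor itself stays on the CPU.
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);

    struct ggml_cgraph graph = ggml_build_forward(y);
    ggml_graph_compute_with_ctx(ctx, &graph, /*n_threads=*/1);

    // With the broken CUDA build described in this issue, every element comes back as 0.0.
    printf("Q4_0 result (offloaded): %f\n", ggml_get_f32_1d(y, 0));

    ggml_free(ctx);
    return 0;
}
```

Running the same graph without setting `w->backend` (pure CPU) gives the reference `Q4_0 result` to compare against.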
On my Windows 10 machine it prints: […]

I expect the `Q4_0 result` when offloading to be equal to the corresponding result when no offloading is performed. I'm 90% sure that this is not a bug in `ggml`, but rather that I am doing something wrong. How can the code above be fixed?