-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix more int overflow during quant (PPL/CUDA). #6563
Fix more int overflow during quant (PPL/CUDA). #6563
Conversation
It would be good to have a set of tests in |
Given blockIdx.x, blockDim.x, and threadIdx.x are all basically uint32_t, we could keep some of those as uint32_t and only cast them to uint64_t or int64_t when actually necessary. |
Edit: Ignore below. I was using the wrong environment. Retesting.
|
There are two disadvantages with 64 bit integers over 32 bit integers: they need 2 registers and they are slower. But for dequantize kernels I would intuitively assume that this is not going to matter because you need very few registers and you're going to be heavily IO bound anyways. So for simplicity I would say to just use 64 bits throughout unless someone can demonstrate that this actually makes a performance difference (I'm not seeing any performance difference on my RTX 3090, my other GPUs are currently busy). |
Just saw @JohannesGaessler's comment (after I pushed the revert). I can revert the revert if decided to be the right approach. |
I personally would in this case prefer to just consistently use 64 bit ints, but ultimately I would say either way is fine. The biggest issue would have been the additional effort from actually changing the code but this has already been done anyways. |
I completely forgot about this PR. @slaren even without the tests, do you think we should just merge it, given that it seems to fix the issue for at least one backend? |
Yes absolutely, we should merge this now if it solves the immediate problem. The changes look good to me. |
ggml-cuda.cu
Outdated
@@ -1225,7 +1225,7 @@ static void ggml_cuda_op_mul_mat_cublas( | |||
|
|||
// the main device has a larger memory buffer to hold the results from all GPUs | |||
// ldc == nrows of the matrix that cuBLAS writes into | |||
int64_t ldc = id == ctx.device ? ne0 : row_diff; | |||
int ldc = id == ctx.device ? ne0 : row_diff; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, why is this being changed? I thought the problem was that certain ints had too few bits for large models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you maybe, in response to one of my earlier comments, accidentally change more places than just the ones originally touched in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JohannesGaessler This one was reverted following an earlier comment questionning why it was changed in the first place. As previously mentioned, I have limited knowledge about these vars and rely on others expertise for the review. And because of the large number of ints that was overflowing, I had to guess and change them in batches until all the crashes were fixed, but surely I most likely changed more than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine to change more int
to int64_t
than necessary. But this is a change where a value was int64_t
on master to int
with your PR. I think this was done on accident when you reverted some of your other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is because my previous PR was merged into master, this is a subsequent PR. I can revert them back if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The revert is in a single commit dranger003@9acb43d so if these are all fine I can delete that one commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just delete the commit I'd say. Using int64_t
has no disadvantages other than maybe slightly worse performance and I was not able to measure any performance difference whatsoever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a rebase to remove the revert commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this particular change is still there. Revert it and I'll merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cublasGemmEx
takes an int
anyway, so this doesn't really matter. There is a 64-bit interface to cublas, but I don't think there are any cases where a single dimension is larger than 2^31-1.
ggml-cuda.cu
Outdated
int i13 = blockIdx.x * blockDim.x + threadIdx.x; | ||
int i12 = blockIdx.y * blockDim.y + threadIdx.y; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question, why the int64_t -> int change?
ggml-cuda/convert.cu
Outdated
@@ -5,16 +5,16 @@ | |||
|
|||
template <int qk, int qr, dequantize_kernel_t dequantize_kernel, typename dst_t> | |||
static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t k) { | |||
const int64_t i = 2*(blockDim.x*blockIdx.x + threadIdx.x); | |||
const int i = 2*(blockDim.x*blockIdx.x + threadIdx.x); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
ggml-cuda/convert.cu
Outdated
const int64_t tid = threadIdx.x; | ||
const int64_t ip = tid/32; // ip is 0 or 1 | ||
const int64_t il = tid - 32*ip; // 0...32 | ||
const int64_t is = 8*ip + il/16; | ||
const int tid = threadIdx.x; | ||
const int ip = tid/32; // ip is 0 or 1 | ||
const int il = tid - 32*ip; // 0...32 | ||
const int is = 8*ip + il/16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
ggml-cuda/convert.cu
Outdated
const int64_t tid = threadIdx.x; | ||
const int64_t ip = tid/16; // 0 or 1 | ||
const int64_t il = tid - 16*ip; // 0...15 | ||
const int tid = threadIdx.x; | ||
const int ip = tid/16; // 0 or 1 | ||
const int il = tid - 16*ip; // 0...15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
ggml-cuda/convert.cu
Outdated
const int64_t i = (int64_t)blockDim.x*blockIdx.x + threadIdx.x; | ||
const int i = blockDim.x*blockIdx.x + threadIdx.x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were all originally int
and I reverted them to avoid changing more than needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they were not. Go to the "files changed" tab and look at the combined changes of all of your commits relative to master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed them in PR #6491.
4947778
to
91c10ef
Compare
Closes #6948. |
* Fix more int overflow during quant. * Fix some more int overflow in softmax. * Revert back to int64_t.
Running perplexity on Command-R+ using CUDA is currently broken without this commit (more info here #6491 (comment)).
Although perplexity now works with all tested quants, I may have move some extra vars to
int64_t
than needed.