-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More GPU threads for dequantization #1341
Conversation
Similar PR: #1221 |
This is good if it increases performance by 14%, but your results seem to contradict the tests from @dfyz here, that showed that increasing occupancy does not always lead to increased performance. Ideall though, we would use |
I assume the discrepancy between my results and the results by @dfyz are due to the differences in hardware. I used a GTX 1070 while he used an RTX 3070 and a Tesla M40. He hypothesized that due to the latency of memory transfers in his case the faster dequantization kernel did not yield an actual performance improvement. In other words, his hardware is sufficiently fast that either version finishes in time. But my hardware is much slower so the dequantization is presumably slow enough that there is room for improvement (I'm assuming that latency does not depend on the GPU). In general, GPU performance optimizations are not 100% portable because they depend on the specifics of the hardware used; on my hardware using more GPU threads is significantly faster and because low occupancy can lead to poor performance I was assuming that this would be universally faster (I was not aware of the other PR). Of course a more sophisticated solution would be preferable but I made this PR because it seemed like a low-hanging fruit. |
This is very encouraging! I think that simply changing the block size (as opposed to what I did in the #1221) is a very non-invasive change, so as long as it results in speeding up kernels at least on some GPUs, it should be merged. I actually tried something similar to your PR here (see the "hacky patch"), but the As I said, that was a hacky patch, so it's possible that I got something wrong. If someone with a relatively recent GPU can confirm that this PR doesn't slow down the kernels, I think we can proceed with increasing the block size for all kernels and merging the PR (thought it's ultimately up to @slaren). |
ggml-cuda.cu
Outdated
@@ -227,7 +227,8 @@ static __global__ void dequantize_block_q8_0(const void * vx, float * y) { | |||
|
|||
static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) { | |||
const int nb = k / QK4_0; | |||
dequantize_block_q4_0<<<nb, 1, 0, stream>>>(vx, y); | |||
GGML_ASSERT(nb % 256 == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slaren already said this, but I want to emphasize that the GPU code shouldn't assume that the number of blocks is divisible by 256. The CPU code only assumes that it is even.
I agree, and even though I appreciate the gesture, I am just a voice here, I don't "own" any code. |
FYI, getting 15% perf increase on my Vega64 as well. |
e37718b
to
5014840
Compare
With |
I did some tests, first line is the old (master) then second is the patch in this PR. Running on the perplexity tool with the first 103 lines from wiki.test.raw, 8 threads, no mmap, LLaMa 7B Q4_2. RX 6900XT, Linux 5.15, ROCm 5.2.3 (based on #1087, not master)
RTX 2080 Ti, Linux 5.8, CUDA 11.4
RTX 3090, Linux 5.15, CUDA 11.4:
The machines are different so not comparable directly to each other. |
I think I passed the wrong argument for determining the data limit. Can you re-run the test for the 3090 and check whether it works now? |
RX 6900XT:
RTX 2080 Ti:
RTX 3090:
Same setup as before. There may be something about the older software versions. I can't update anything since I don't own them, I only have access. Also, other users are running stuff on these machines, too but today seems quieter. I ran the tests a couple times and the numbers were pretty much the same. Sidenote, I should automate this somehow... |
It seems performance isn't consistently better/worse... should we make larger CUDA blocks a compilation option? |
I'd like someone else test as well, maybe on CUDA 12 it is better? |
RTX 3080, CUDA 12.1
|
I tried some hacking: It seems like the early return is liked better by the GPU: static __global__ void dequantize_block_q4_2(const void * vx, float * y, int k) {
const block_q4_2 * x = (const block_q4_2 *) vx;
const int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i >= k) {
return;
}
const float d = x[i].d;
const uint8_t * pp = x[i].qs;
for (int l = 0; l < QK4_2; l += 2) {
const uint8_t vi = pp[l/2];
const int8_t vi0 = vi & 0xf;
const int8_t vi1 = vi >> 4;
const float v0 = (vi0 - 8)*d;
const float v1 = (vi1 - 8)*d;
y[i*QK4_2 + l + 0] = v0;
y[i*QK4_2 + l + 1] = v1;
}
} Also, only check the max block size once, and only use it if it's lower than 256 like the original PR: static void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK4_2;
static int block_size = -1;
if (block_size == -1) {
int min_grid_size = -1;
CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, dequantize_block_q4_2, 0, 0));
if (block_size > 256) block_size = 256;
}
int grid_size = (nb + block_size - 1) / block_size; // Round up.
dequantize_block_q4_2<<<grid_size, block_size, 0, stream>>>(vx, y, nb);
} With that I get to pretty much the same performance as with master, but I didn't test too exhaustively. |
Thanks everyone for the feedback. Caching the block size in particular is a good catch; on my hardware this also seems to be slightly faster though in my case the higher occupancy had presumably offset the additional overhead. I'm surprised though that the earlier return makes a difference; intuitively I would have thought that to the compiler both versions are equivalent. In any case, I integrated the suggestions. I only feel like we've kind of done a more complicated implementation of the original approach where the block size is just set to 256... |
This number, 256, is just magic, maybe it should be an option in the CMake file so the user can set it? Command line? Although, I also tried 128 and 64 and they were slower. |
I already posted this in the PR about faster quantization kernels, but I also quickly tried implementing a faster quantization kernel: #1221 (comment) . With that kernel the number of blocks is 32 (16 for q4_2) times higher. I would intuitively assume that larger blocks would then give you better performance because there are fewer idle threads and (I think?) the GPU will have more opportunity for latency hiding. |
I think it's also closer to how the OpenCL kernel does it. |
Maybe we should just wait for the quantization changes and then simultaneously optimize kernels and block size. |
There is quite a bit of duplicated code now, I recommend refactoring it with a templace such as: typedef void (*to_fp32_kernel_t)(const void * x, float * y, int k);
template<to_fp32_kernel_t F, int QK>
static void launch_kernel(const void * x, float * y, int k, cudaStream_t stream) {
const int nb = k / QK;
static int block_size = -1;
if (block_size == -1) {
int min_grid_size;
CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, F, 0, 0));
block_size = std::min(block_size, GGML_CUDA_MAX_BLOCK_SIZE);
}
const int grid_size = (nb + block_size - 1) / block_size; // Round up.
F<<<grid_size, block_size, 0, stream>>>(x, y, nb);
}
static to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0:
return launch_kernel<dequantize_block_q4_0, QK4_0>;
case GGML_TYPE_Q4_1:
return launch_kernel<dequantize_block_q4_1, QK4_1>;
case GGML_TYPE_Q4_2:
return launch_kernel<dequantize_block_q4_2, QK4_2>;
case GGML_TYPE_Q5_0:
return launch_kernel<dequantize_block_q5_0, QK5_0>;
case GGML_TYPE_Q5_1:
return launch_kernel<dequantize_block_q5_1, QK5_1>;
case GGML_TYPE_Q8_0:
return launch_kernel<dequantize_block_q8_0, QK8_0>;
case GGML_TYPE_F16:
return launch_kernel<convert_fp16_to_fp32, 1>;
default:
return nullptr;
}
} |
Closing this due to the merging of #1412 . |
Wouldn't this still be useful for the prompt processing? |
I think I'll just try to write better quantization kernels instead of trying to optimize parameters for the ones on master. The ones that I'm using in the matrix vector multiplication template should be easy to apply to just dequantization. |
On master CUDA kernels for dequantization are launched as blocks with only a single thread. This is an inefficient way of utilizing GPUs since they can execute many threads in parallel. This PR changes the way CUDA kernels are launched to launch 256 times fewer blocks but with each block containing 256 threads. On my hardware (GTX 1070) this speeds up prompt processing by ~14% for 33b and a prompt with 399 tokens (navy seals copypasta). I only implemented and tested q4_0 and q5_1 because there are pending breaking quantization changes in #1305 but the change is very simple and should be easy to adapt; just tell me how you want to proceed. Notably I was not able to 100% ensure that generations are unchanged by this PR because I am getting a bug that affects the reproducibility of CUDA-accelerated prompt processing as described in #1340 .