ggml: new gpu kernels + extends ggml_leaky_relu + ggml_pad #621
Conversation
As far as I can tell, the CPU implementations of add and mul already supported broadcasting in all dimensions except the first. Is that not enough for clip/sd? I am not sure why the CPU implementation needed to be changed; this seems more restrictive since it only supports broadcasting in one dimension.
I conducted some broadcasting tests on the ggml_add and ggml_mul functions. Previously, broadcasting only worked with the first dimension of b (it had to have the same number of elements as the first dimension of a, thus repeating all rows). This didn't work in stable diffusion, since it needed to repeat the rows in dimension 3, but the original implementation crashed with an assert if I didn't use repeat.
The CUDA implementation didn't support broadcasting other than by repeating the entire tensor, so I think that would only work when broadcasting the highest dimension, but not any other. The CPU implementation supported broadcasting along all dimensions except for the first one: Lines 6860 to 6862 in a5e4560
Unless sd required broadcasting in dimension 0, it should already work with the CPU backend, right? I think we should look into implementing this in a similar way for CUDA, it would be simpler and would work with any dimensions.
I will undo my broadcasting changes to the CPU backend and test it with stable diffusion, assuming that it supports broadcasting in all dimensions as you mentioned. I will also need to change "ggml_can_repeat_rows" to "ggml_can_repeat" since that assert is what is preventing me from using ggml_add without using ggml_repeat. I hope it works and we can reach a consensus.
You must repeat the elements in dimensions 0 and 1, as the size of the bias equals the number of output channels of the ggml_conv_2d operation.
Result of ggml_conv_2d: [out_width, out_height, out_channels, N]
The bias tensor should be reshaped from [out_channels] to [1, 1, out_channels, 1]
But the previous implementation of broadcasting expected a tensor [out_width, 1, 1, 1] to work correctly.
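A minimal sketch of the pattern this enables, assuming ggml_add can broadcast src1 in all dimensions (the helper name and shapes below are illustrative, not code from this PR):

#include "ggml.h"

// conv2d followed by a per-channel bias add, relying on broadcasting in ggml_add
static struct ggml_tensor * conv2d_with_bias(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel, // [kw, kh, in_channels, out_channels]
        struct ggml_tensor  * input,  // [w, h, in_channels, N]
        struct ggml_tensor  * bias) { // [out_channels]
    // result: [out_width, out_height, out_channels, N]
    struct ggml_tensor * conv = ggml_conv_2d(ctx, kernel, input, 1, 1, 1, 1, 1, 1);
    // reshape the bias from [out_channels] to [1, 1, out_channels, 1]
    struct ggml_tensor * b4d  = ggml_reshape_4d(ctx, bias, 1, 1, bias->ne[0], 1);
    // with full broadcasting this adds one value per output channel,
    // without materializing a repeated tensor via ggml_repeat
    return ggml_add(ctx, conv, b4d);
}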
Lines 2152 to 2156 in a5e4560
So, if
.9><lora:lcm-lora:1>beautiful anime girl, short hair, red hair, red eyes, realistic, masterpiece, azur lane, 4k, high quality" --sampling-method lcm --cfg-scale 1 --steps 5 -t 1 -s 424354
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6
[INFO] stable-diffusion.cpp:4432 - loading model from 'AnythingV5_v5PrtRE-f16.gguf'
[INFO] stable-diffusion.cpp:4460 - Stable Diffusion 1.x | AnythingV5_v5PrtRE.safetensors
[INFO] stable-diffusion.cpp:4468 - model data type: f16
[INFO] stable-diffusion.cpp:4638 - total memory buffer size = 1877.33MB (clip 236.18MB, unet 1641.16MB, vae 0.00MB)
[INFO] stable-diffusion.cpp:4640 - loading model from 'AnythingV5_v5PrtRE-f16.gguf' completed, taking 1.56s
[INFO] stable-diffusion.cpp:4664 - running in eps-prediction mode
[INFO] stable-diffusion.cpp:3911 - loading taesd from 'taesd-model.gguf'
[INFO] stable-diffusion.cpp:3990 - taesd model loaded
[INFO] stable-diffusion.cpp:5505 - img2img 512x512
[INFO] stable-diffusion.cpp:5509 - target t_enc is 3 steps
[INFO] stable-diffusion.cpp:4005 - loading LoRA from 'Kana_Arima-10.gguf'
[INFO] stable-diffusion.cpp:4031 - LoRA Type: regular | Kana_Arima-10.safetensors
[INFO] stable-diffusion.cpp:4051 - LoRA data type: f16
[INFO] stable-diffusion.cpp:4748 - lora 'Kana_Arima-10' applied, taking 0.22s
[INFO] stable-diffusion.cpp:4005 - loading LoRA from 'lcm-lora.gguf'
[INFO] stable-diffusion.cpp:4031 - LoRA Type: regular | lcm_lora.safetensors
[INFO] stable-diffusion.cpp:4051 - LoRA data type: f16
[INFO] stable-diffusion.cpp:4748 - lora 'lcm-lora' applied, taking 0.50s
[INFO] stable-diffusion.cpp:5545 - apply_loras completed, taking 0.72s
GGML_ASSERT: C:\proyectos\stable-diffusion.cpp\ggml\src\ggml.c:3185: ggml_are_same_shape(a, b) || ggml_can_repeat_rows(b, a)
Right, we need to implement support for broadcasting dimension 0, but that can be done by extending the current code. The hard part is doing it without either duplicating large amounts of code or severely impacting performance even for non-broadcast cases. This would be much easier to do with C++ templates.
My current implementation works in the most common cases (clip, llama, stable diffusion, and others), although I can undo my changes and keep waiting for an implementation that adapts to every case, which I see as very difficult and likely to hurt performance. Forcing ggml_add to broadcast in stable diffusion means changing
This should be enough to add broadcasting in dimension 0 to add_f32:

diff --git a/src/ggml.c b/src/ggml.c
index 7069542..a0f76c9 100644
--- a/src/ggml.c
+++ b/src/ggml.c
@@ -6897,7 +6897,8 @@ static void ggml_compute_forward_add_f32(
             float * src0_ptr = (float *) ((char *) src0->data + i03*nb03 + i02*nb02 + i01*nb01);
             for (int i0 = 0; i0 < ne0; i0++) {
-                float * src1_ptr = (float *) ((char *) src1->data + i13*nb13 + i12*nb12 + i11*nb11 + i0*nb10);
+                const int64_t i10 = i0 % ne10;
+                float * src1_ptr = (float *) ((char *) src1->data + i13*nb13 + i12*nb12 + i11*nb11 + i10*nb10);
                 dst_ptr[i0] = src0_ptr[i0] + *src1_ptr;
             }

The only issue is that the additional modulus per column may be too costly.
It doesn't work. Besides, you made that change in the branch that handles the case where tensor b (a.k.a. src1) is not contiguous, and in stable diffusion all the tensors that reach ggml_add are contiguous.
The principle is the same, just replace
It works, I applied that for
Something like this might be a bit more efficient in some cases at least (for the contiguous case):

for (int r0 = 0; r0 < ne0 / ne10; ++r0) {
    ggml_vec_add_f32(ne10, dst_ptr + r0*ne10, src0_ptr + r0*ne10, src1_ptr);
}
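For illustration only (this is not the merged code), the contiguous fast path of ggml_compute_forward_add_f32 could combine a modulus on the row indices with that repeated vector add. Thread partitioning over rows is omitted, and the local names follow the GGML_TENSOR locals convention used in ggml.c:

for (int64_t i03 = 0; i03 < ne03; i03++) {
    for (int64_t i02 = 0; i02 < ne02; i02++) {
        for (int64_t i01 = 0; i01 < ne01; i01++) {
            // broadcast src1 in dims 1..3 by wrapping the row coordinates
            const int64_t i13 = i03 % ne13;
            const int64_t i12 = i02 % ne12;
            const int64_t i11 = i01 % ne11;

            float * dst_ptr  = (float *) ((char *) dst->data  + i03*nb3  + i02*nb2  + i01*nb1);
            float * src0_ptr = (float *) ((char *) src0->data + i03*nb03 + i02*nb02 + i01*nb01);
            float * src1_ptr = (float *) ((char *) src1->data + i13*nb13 + i12*nb12 + i11*nb11);

            // broadcast src1 in dim 0 by repeating its row across the row of src0
            for (int64_t r0 = 0; r0 < ne00/ne10; r0++) {
                ggml_vec_add_f32(ne10, dst_ptr + r0*ne10, src0_ptr + r0*ne10, src1_ptr);
            }
        }
    }
}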
Do you need to broadcast just bias tensors? If so, you can do what we do in
Let me know if it is not clear.
@ggerganov the memory cost 💀: for small models it is fine, but for stable diffusion it would be 500MB of bias tensors.
It would be good to support full broadcasting for ease of use regardless.
Well, then what should I do? I honestly didn't intend to get into a fully complete implementation of broadcasting in this pull request. It was just meant to be a solution for the most common cases, easy to implement in the backends, and to speed up the proper adaptation of other models. But it seems we've reached a deadlock. The second intention, which drove me to create this pull request, was to discuss how I could optimize the kernel
Have you tested the performance with the change I suggested in #621 (comment)? I don't think the implementation of broadcasting in this PR can be merged as is; it removes functionality and is more complex than it needs to be.
The
@slaren I can consider reverting the broadcasting changes to ggml.c, applying the modification you have suggested, which allows broadcasting for dimension 0. However, for the CUDA backend, it will remain unchanged for now.
The CUDA backend needs to be updated as well; I can help you with that, but it should be easier because the additional modulus is unlikely to affect performance significantly since it's all parallelized anyway. And if needed, we can create multiple versions of the kernels, with and without broadcasting, using templates.
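A minimal sketch of what such a kernel could look like (illustration only, not ggml-cuda code: the kernel name, the flat launch layout, and the assumption of contiguous tensors are mine):

#include <cstdint>

static __global__ void add_f32_broadcast(
        const float * src0, const float * src1, float * dst,
        const int ne0,  const int ne1,  const int ne2,  const int ne3,    // dst/src0 shape
        const int ne10, const int ne11, const int ne12, const int ne13) { // src1 shape
    const int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    const int64_t n = (int64_t) ne0*ne1*ne2*ne3;
    if (i >= n) {
        return;
    }
    // unflatten the dst/src0 index
    const int i0 = (int) ( i                           % ne0);
    const int i1 = (int) ((i /  ne0)                   % ne1);
    const int i2 = (int) ((i / ((int64_t) ne0*ne1))    % ne2);
    const int i3 = (int) ( i / ((int64_t) ne0*ne1*ne2));
    // wrap each coordinate into src1's extent (this is the broadcast)
    const int i10 = i0 % ne10;
    const int i11 = i1 % ne11;
    const int i12 = i2 % ne12;
    const int i13 = i3 % ne13;
    const int64_t j = ((int64_t) i13*ne12*ne11 + (int64_t) i12*ne11 + i11)*ne10 + i10;
    dst[i] = src0[i] + src1[j];
}

A templated no-broadcast specialization could skip the modulus operations entirely when the shapes match, as mentioned above.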
Ok, let's proceed like this. I will try to implement the Metal kernels. Btw, it would be useful to have some sort of unit tests for this kind of change. You seem to be testing with SD, but I don't have it set up. We should make a simple test that runs 2D and 3D broadcasts.
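A rough standalone check along those lines, as a sketch only (not part of the repo's test suite; it assumes the current graph API with ggml_new_graph and ggml_graph_compute_with_ctx):

#include "ggml.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct ggml_init_params params = { .mem_size = 256*1024*1024, .mem_buffer = NULL, .no_alloc = false };
    struct ggml_context * ctx = ggml_init(params);

    // a: [16, 8, 4, 2], b: [1, 1, 4, 1] -> broadcast in dims 0, 1 and 3
    struct ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 16, 8, 4, 2);
    struct ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32,  1, 1, 4, 1);
    for (int64_t i = 0; i < ggml_nelements(a); i++) ((float *) a->data)[i] = (float) rand()/RAND_MAX;
    for (int64_t i = 0; i < ggml_nelements(b); i++) ((float *) b->data)[i] = (float) rand()/RAND_MAX;

    struct ggml_tensor * out_bc  = ggml_add(ctx, a, b);                      // broadcast path
    struct ggml_tensor * out_rep = ggml_add(ctx, a, ggml_repeat(ctx, b, a)); // reference path

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out_bc);
    ggml_build_forward_expand(gf, out_rep);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    double max_err = 0.0;
    for (int64_t i = 0; i < ggml_nelements(a); i++) {
        double d = fabs(((float *) out_bc->data)[i] - ((float *) out_rep->data)[i]);
        if (d > max_err) max_err = d;
    }
    printf("max abs diff: %g\n", max_err);
    ggml_free(ctx);
    return max_err < 1e-6 ? 0 : 1;
}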
I will try to rewrite the kernel using that. I will need to create a little example to understand how
I'm going to add some tests that I have done.
This article is pretty good: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/. It's a bit outdated because these primitives now have the
I have created this CUDA kernel for flash attention that merges
CUDA Kernel
static __device__ __forceinline__ float warp_reduce_max(float x) {
#pragma unroll
for (int mask = 16; mask > 0; mask >>= 1) {
x = fmaxf(__shfl_xor_sync(0xffffffff, x, mask, 32), x);
}
return x;
}
#define CUDA_FLASH_ATTENTION_BLOCK_SIZE 1024
template<int block_size>
static __global__ void flash_attn_f32(const float* q, const float* k,const float* v, float* dst, float kq_scale,
int d_head, int seq_len, int num_heads) {
const int head = blockIdx.x / seq_len;
const int head_size = d_head * seq_len;
const int s = blockIdx.x % seq_len;
const int tid = threadIdx.x;
extern __shared__ char work_data[];
float* S = (float*)work_data; // theoretical max sequence length: 12848, due to the shared memory limit per block
float* warp_data = (float*)(work_data + seq_len * sizeof(float));
// QK^T
for(int is = tid; is < seq_len; is += block_size) {
S[is] = 0.0f;
int key_offset = is * d_head + head * head_size;
int query_offset = s * d_head + head * head_size;
for(int d = 0; d < d_head; d++) {
S[is] += k[key_offset + d] * q[query_offset + d];
}
S[is] *= kq_scale;
}
__syncthreads();
float max_val = -INFINITY;
// get the max
for(int is = tid; is < seq_len; is += block_size) {
max_val = fmaxf(max_val , S[is]);
}
max_val = warp_reduce_max(max_val);
{ // get max from all threads
int warp_id = threadIdx.x / WARP_SIZE;
int lane_id = threadIdx.x % WARP_SIZE;
if (lane_id == 0) {
warp_data[warp_id] = max_val;
}
__syncthreads();
max_val = warp_data[lane_id];
max_val = warp_reduce_max(max_val);
}
// softmax(QK^T)
float sum = 0.0f;
for(int is = tid; is < seq_len;is += block_size) {
const float val = expf(S[is] - max_val);
S[is] = val;
sum += val;
}
sum = warp_reduce_sum(sum);
{ // sum partials
int warp_id = threadIdx.x / WARP_SIZE;
int lane_id = threadIdx.x % WARP_SIZE;
if (lane_id == 0) {
warp_data[warp_id] = sum;
}
__syncthreads();
sum = warp_data[lane_id];
sum = warp_reduce_sum(sum);
}
float inv_sum = 1.0f / sum;
for(int is = tid; is < seq_len; is += block_size) {
S[is] *= inv_sum;
}
__syncthreads();
// softmax(QK^T)V
for (int d = tid; d < d_head; d += block_size) {
int dst_index = d + s * d_head + head * head_size;
int value_offset = d * seq_len + head * head_size;
dst[dst_index] = 0.0f;
for(int ic = 0; ic < seq_len; ic++) {
dst[dst_index] += v[value_offset + ic] * S[ic];
}
}
}
Launcher
static void flash_attn_f32_cuda(const float* q, const float* k, const float* v, float* dst, float kq_scale, const int d_head, const int seq_len, const int num_heads, cudaStream_t stream) {
int sram_memory_size = seq_len*sizeof(float) + WARP_SIZE * sizeof(float);
int num_blocks = num_heads * seq_len;
flash_attn_f32<CUDA_FLASH_ATTENTION_BLOCK_SIZE><<<num_blocks, CUDA_FLASH_ATTENTION_BLOCK_SIZE, sram_memory_size, stream>>>(
q, k, v, dst, kq_scale, d_head, seq_len, num_heads);
}
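The kernel above also calls warp_reduce_sum, which is not shown there; a minimal sketch of it by analogy with warp_reduce_max (assumed, not copied from ggml-cuda):

// butterfly reduction across the 32 lanes of a warp, summing the values
static __device__ __forceinline__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int mask = 16; mask > 0; mask >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, mask, 32);
    }
    return x;
}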
The results between backends don't match, and it seems that 2 seconds per iteration is too much for it to be an M3 Max on the Metal backend, although I don't know much about how Metal works.
I am not sure that we should expect identical images between backends. I tried reducing the max mse, but didn't see anything clearly wrong. The largest error usually comes from the matrix multiplications, but that's not different with CUDA.
Identical results are not expected between backends like CPU and CUDA, but the difference is usually only small artifacts in the images, not a complete change like this.
@ggerganov I have been trying the Metal debugging flags, and I found some issues running
@slaren In the stable diffusion compute graph, tanh is not used unless specified with
It might have been the soft max src1 issue, but I am surprised that doesn't cause bigger problems if it is really a bug.
Now the performance issue remains: for an M3 Max, the UNet and VAE compute seems very slow to me; on my RTX 3050 laptop I get 1.5 iterations per second.
@ggerganov please review the changes; I am not sure that's the best solution for the soft max problems.
Here is a
Ah interesting - sorry for missing the discussion. I was so focused on the Mixtral issue that I didn't pay attention and only saw this just now. Let me review tomorrow first thing.
I wonder if the soft max issue is related to the early ending with Metal (in mixtral).
@slaren I think this indeed fixes it! ❤️ Still testing
Edit: the
Edit2: All my failure cases with Mixtral are now resolved - the fix does seem to work
@@ -487,6 +488,7 @@ kernel void kernel_soft_max_4(
    }

    const float lsum = lsum4[0] + lsum4[1] + lsum4[2] + lsum4[3];
    threadgroup_barrier(mem_flags::mem_threadgroup);
Do these barriers make a difference in your tests?
To me they seem superfluous, but I could be missing something.
I am not sure why, but without the barrier this test case fails sometimes:
test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {4096, 4096, 8, 1}));
Using only an execution barrier, threadgroup_barrier(mem_flags::mem_none), seems to be enough to fix the issue. My intuition when adding the barrier was to ensure that all threads have finished computing their values before the simd_sum, but I don't know enough about Metal to tell whether that's actually required. In CUDA, the warp shuffle functions have an implicit synchronization, so it is not necessary there.
The plan is to merge the Mixtral branch in
I am not sure if there is a better combination of flags though. I still can't figure out why some mat muls fail with these flags. The full documentation is available in
Yup, I saw the failing mat muls. I also get some invalid loads in the F32 kernel. Will be looking to fix those.
Can I merge this pull request?
Just squash it in a single commit when merging.
Thank you very much, everyone, for the feedback and assistance. I hope to continue contributing. I'm considering implementing Winograd (to reduce memory usage and computation) in the
@FSSRepo Thank you for your contributions - your help is very much appreciated!
The purpose of this PR is to synchronize the changes I made in ggml while working on a PR for the stable-diffusion.cpp project. This adds new CUDA kernels that could help other projects fully support different backends.
New CUDA kernels
New Operation
ggml_pad: adds zero padding, equivalent to PyTorch's pad. Needed in stable-diffusion.cpp (see the usage sketch after the task list).
Tasks:
ggml_group_norm kernel: the current one is definitely very inefficient. @slaren can you help me?
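For reference, a minimal usage sketch of the new operator, assuming the signature is ggml_pad(ctx, a, p0, p1, p2, p3) with zeros appended at the end of each dimension (the helper below is hypothetical):

#include "ggml.h"

// pad a [w, h, c, n] activation so that its width and height become even,
// e.g. before a stride-2 downsampling convolution
static struct ggml_tensor * pad_to_even(struct ggml_context * ctx, struct ggml_tensor * x) {
    const int p0 = (int) (x->ne[0] % 2); // extra zero columns at the end of dim 0
    const int p1 = (int) (x->ne[1] % 2); // extra zero rows at the end of dim 1
    return ggml_pad(ctx, x, p0, p1, 0, 0);
}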