Allow specifying p scale factor for ggml rope and rope_back ops #1967
Conversation
This adds ggml_rope_scaled, ggml_rope_scaled_inplace, and ggml_rope_back_scaled ops. Adds LLAMA_ROPE_SCALE to the Makefile (note: not in cmake yet); if not specified it defaults to 1.0.
Bails out if p_scale != 1.0 in the rope operation for the time being.
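For reference, a minimal sketch of the API surface described above, with parameter lists assumed from the discussion in this thread rather than copied from ggml.h:

```c
// Sketch only: the three ops named above; signatures are assumptions, not the actual header.
struct ggml_context;
struct ggml_tensor;

struct ggml_tensor * ggml_rope_scaled(
        struct ggml_context * ctx, struct ggml_tensor * a,
        int n_past, int n_dims, int mode, float p_scale);

struct ggml_tensor * ggml_rope_scaled_inplace(
        struct ggml_context * ctx, struct ggml_tensor * a,
        int n_past, int n_dims, int mode, float p_scale);

struct ggml_tensor * ggml_rope_back_scaled(
        struct ggml_context * ctx, struct ggml_tensor * a,
        int n_past, int n_dims, int mode, float p_scale);
```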
Man, I'm so confused. CUDA has a rope operation and wasn't set up to handle my changes, but apparently it doesn't actually ever get called? Even just asserting …
CUDA rope gets called when the tensors are loaded onto VRAM (…)
Oh, thanks. That's the kind of thing only people with a decent GPU would notice, I guess!
I have an old card but it can run 7B Q4_0 with its 8GB VRAM.
Enjoying your powerful 8GB VRAM GPU, sitting back in your comfy chair, probably eating avocado toast every day. Life must be pretty sweet! But yeah, I only have 6GB and I'm pretty sure …
Yeah, my wife is Mexican, I get a lot of nice avocado everything, all the time. Maybe you can try the OpenLLaMA 3B? The model itself may not be amazing but getting a comparative baseline may be possible. I have some model files on HF available.
Sounds nice! I love avocados. The joke is that avocados are so expensive you have to mortgage your house to eat avocado toast. :)
It wouldn't really help much currently, because I didn't (and don't have the knowledge to) write an actual implementation for Metal/CUDA. All I did was add an assert to error out if the scale factor isn't 1.0 and it ends up in the Metal or CUDA rope implementations.
I don't know, it seems pretty easy to edit ggml-cuda.cu; the theta calculation is only in one place.
The problem is I made the change just based on the example in the original discussion, not based on actually understanding the math. So I don't know how to convert it from the iterative CPU version to the parallel GPU kernel form and be sure that it's correct. If you want to show me what it should be, I'd be happy to update the PR with the change.
It's in const float theta = p*powf(theta_scale, col/2); I just modified it there like this: const float theta = 0.5*p*powf(theta_scale, col/2); And I got pretty much the same result for perplexity as is posted (5.9839) in that thread. Actually, now that I'm looking at it …
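To make that concrete, here is a minimal plain-C sketch of how a configurable scale factor generalizes the line quoted above (p_scale = 1.0f reproduces the original behavior, 0.5f matches the hard-coded modification):

```c
#include <math.h>
#include <stdio.h>

// Sketch only: theta computed with a configurable scale factor instead of a
// hard-coded 0.5. Names mirror the ggml-cuda.cu line quoted above.
static float scaled_theta(float p, float p_scale, float theta_scale, int col) {
    return p_scale * p * powf(theta_scale, (float)(col / 2));
}

int main(void) {
    // with p_scale = 0.5f this matches the 0.5*p*powf(...) modification quoted above
    printf("%f\n", scaled_theta(10.0f, 0.5f, 0.9f, 4));
    return 0;
}
```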
As @SlyEcho said, the RoPE tensor is only evaluated on the GPU if enough layers are being offloaded, specifically if the K component of the KV cache gets offloaded (num layers + 3). This is because when I tested it the overhead from …
I was able to test this with CUDA (although only with 128 context length and 35 layers). I am a lot less sure about the Metal change. It produces reasonable results with CUDA. (Also, thanks SlyEcho!)
OpenLLaMA 3B doesn't seem to like this method... Or maybe just my power function …
I didn't do it quite the way you said; I tried to use the … passing down … That's just scaling …
@JohannesGaessler Since you already have a branch with this PR's changes (except maybe the GPU stuff, which may well be broken), do you want to just take over adding that functionality?
I think it's better this way.
In the first place, I think we'll need to wait for the opinion of @ggerganov regarding the ggml changes. But I think it's fine to just have two separate PRs since the features should work independently from one another.
Just checking. I'd generally agree, but in this case I'm playing with stuff I don't fully understand (and in the case of Metal can't even test at all) so I thought I'd offer to let someone competent take over.
Didn't you say it doesn't work, though? Or am I misunderstanding what "OpenLLaMA 3B doesn't seem to like this method..." meant?
assert(ggml_nelements(src1) == 4);
const int n_past = (int)((float *) src1->data)[0];
const int n_dims = (int)((float *) src1->data)[1];
const int mode   = (int)((float *) src1->data)[2];
These are UB.
Can you explain in more detail? Are you saying it's illegal to cast float to int in CUDA?
Also, just to be clear, those values were originally ints that were cast to float, so we know they started out as integer values. Also, none of them should be out of the range of a 32-bit float's ability to represent integers (should be ~16 million for signed).
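As a side note on the representability point, a small self-contained C check (illustrative only) that an int survives a round trip through float as long as its magnitude stays within 2^24:

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    // 32-bit floats have a 24-bit significand, so integers with |x| <= 2^24
    // (16,777,216) convert to float and back without any rounding.
    for (int x = -(1 << 24); x <= (1 << 24); x += 4096) {
        float f = (float) x;
        int   y = (int) f;
        assert(x == y);
    }
    printf("round trip is exact up to 2^24\n");
    return 0;
}
```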
The model just doesn't seem to support the context scaling well, same for 7B.
It seems to work somewhat better on larger models. The small models are probably just barely clinging to sanity in the first place, so stuff like this may be enough to push them over the edge. Another thing that might help is using a good sampler and parameters that help keep the model on track, like Mirostat.
As far as I understood it, this technique was designed to take advantage of the overfitting trained into the original LLaMA models. It's unclear if this applies to OpenLLaMA as well. Theoretically it should, but in practice it's possible it might not.
These are suggestions to take into account if we decide to merge the rope scaling option.
But for now I think we should wait it out and see how applicable this thing is, because we don't want to add and support an option that will potentially never be used.
((float *) b->data)[0] = (float)n_past;
((float *) b->data)[1] = (float)n_dims;
((float *) b->data)[2] = (float)mode;
((float *) b->data)[3] = p_scale;
Use memcpy to store the params so we can all sleep well knowing this is not UB :)
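A minimal sketch of the memcpy approach suggested here (the helper names and the 4-float buffer layout are assumptions based on the snippet above, not the actual code):

```c
#include <string.h>

// Sketch: write the rope params into the tensor's data buffer with memcpy
// instead of pointer casts, and read them back the same way.
static void write_rope_params(void * dst, int n_past, int n_dims, int mode, float p_scale) {
    float params[4] = { (float) n_past, (float) n_dims, (float) mode, p_scale };
    memcpy(dst, params, sizeof(params));
}

static void read_rope_params(const void * src, int * n_past, int * n_dims, int * mode, float * p_scale) {
    float params[4];
    memcpy(params, src, sizeof(params));
    *n_past  = (int) params[0];
    *n_dims  = (int) params[1];
    *mode    = (int) params[2];
    *p_scale = params[3];
}
```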
int n_past,
int n_dims,
int mode,
float p_scale);
No need to extend the API - add p_scale to the original ggml_rope_xxx() and add a comment to use p_scale == 1.0f for regular computation. Add GGML_ASSERT(p_scale == 1.0f) in the backward call.
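A sketch of what that suggested shape could look like (types forward-declared here for illustration; this is not the actual ggml header):

```c
struct ggml_context;
struct ggml_tensor;

// p_scale folded into the existing rope call; pass 1.0f for the regular computation
struct ggml_tensor * ggml_rope(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int                   n_past,
        int                   n_dims,
        int                   mode,
        float                 p_scale);
```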
> No need to extend the API - add p_scale to original ggml_rope_xxx()

Won't this break every single thing that currently uses the llama.cpp version of GGML?
What do you think about using a define to enable the p_scale argument for rope and having it be off by default? That way existing stuff can opt in.
It might also be worth thinking about adding something like GGML_API_VERSION, which could be bumped when incompatible changes occur, so stuff building against GGML could handle API changes more gracefully.
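A hypothetical illustration of that idea (no such macro exists in ggml today; the name and value here are made up):

```c
// Hypothetical: a number bumped whenever an incompatible API change lands,
// so downstream projects can adapt at compile time instead of breaking.
#define GGML_API_VERSION 2

#if GGML_API_VERSION >= 2
// build against the rope API that takes p_scale
#else
// build against the older rope API without p_scale
#endif
```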
Unfortunately my system (and internet) recently got fried by a lightning strike. I don't have the ability to make changes to this PR at the moment. If anyone wants to take over and make whatever changes are necessary (if we come to that point) then I'm fine with it. It may be several days before I have a functional system again.
That was my thinking too, which is why I didn't do stuff like add command-line options. I was trying to keep my changes pretty minimal and limited to the basic functionality.
The issue with linearly scaling the token position is that it reduces accuracy for smaller context windows quite a bit. E.g., if I scale with … I think it would be better to have a function that is nearly linear for positions near the beginning of the context window, and then gracefully slows down as the position increases. Here is an example that behaves pretty well for context sizes of up to 3584, in function …
Using this and …
So, basically, we can extend the context window to …
If one wants to get to a context window of 4096, these are examples of mappings that work better than …
They are slightly better than linear scaling at … So, overall, I think that it should be possible to pick a position transformation that works better than linear scaling and that depends on the requested context size. Ideally, this should be done automatically instead of asking the user to recompile. I did some experimentation for contexts beyond 4096, but there things get flaky (hitting asserts for scratch buffers not being big enough, CUDA randomly giving NaNs with the exact same position mapping working fine on the CPU, etc.), so I think we also need to look into making things more stable in that regime.
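To make the general idea concrete, here is a purely illustrative C sketch of a mapping in that spirit (this is not the formula used in the experiments above, which isn't reproduced in this copy): identity up to the trained context length, then a smooth logarithmic compression beyond it.

```c
#include <math.h>

// Illustrative only: ~linear for positions inside the trained context window,
// gracefully compressing positions beyond it. Continuous at the boundary, and
// the slope there is 1 (d/dx of n + n*log(x/n) is n/x, which is 1 at x = n).
static float map_position(int pos, int n_train_ctx) {
    float x = (float) pos;
    float n = (float) n_train_ctx;   // e.g. 2048 for the original LLaMA models
    if (x <= n) {
        return x;
    }
    return n + n * logf(x / n);
}
```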
Also runs into memory limits...
Hello …
@kaiokendev Sorry if my comment came across as being critical. Coming up with the idea of scaling the positions is really great (it would have never occurred to me to try that), and I totally get that one can improve via fine-tuning. But the point of my comment was to try to go beyond the original idea. From the experiments I have done, it looks like one can get away without fine-tuning for contexts up to 4096 using simple non-linear position mappings. Or, if one decided to fine-tune anyway, my uneducated guess would be that one would end up with a better final result having started with a better initial guess. My initial post only showed perplexities for LLaMA-7B. Looking at 13B, I get perplexity of …
@ikawrakow No, I am not taking the comment critically lol, I am really following everyone who is taking it to the limits and it's really exciting. I am just reminding that the attention heads are not calibrated to the interpolated positions, so it is doubtful the model is fully leveraging those interpolated positions. Maybe it is the case up to a point, but the other side is that secretly it is ditching certain concepts from the interpolated portions. I'm not meaning to stop the work; in fact, if you can show it is extrapolating even without fine-tuning and is really using the 3072, then it has a lot of value. Many people will be blown away by that discovery, since it is thought the untrained model cannot extrapolate at all. The value in the chart is higher because I use 4-bit precision models.
I'll look into it in the next few days.
Set to 0.125 with a 16k context LoRA, did have a good result... Sorry, that won't happen in OpenBLAS. I rebuilt with BLAS and ran it.
Closing in favor of #2019, which is probably a better approach.
Based on the description in #1965

This adds ggml_rope_scaled, ggml_rope_scaled_inplace, ggml_rope_back_scaled ops. The existing rope ops just pass 1.0 to the impl for compatibility with the current behavior/API.

Adds LLAMA_ROPE_SCALE to the Makefile (note: not in cmake yet); if not specified it defaults to 1.0. You can run, for example, `make LLAMA_ROPE_SCALE=0.5`. I'd guess we probably don't want to use the LLAMA_ROPE_SCALE approach if this actually gets merged, it's mainly just there to facilitate easier testing.

edit: Theoretically this should work for CUDA and Metal now (untested). Note the GPU ops only apply when loading every possible layer into VRAM (for 7B that would be -ngl 35).
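For context, a rough sketch of how a compile-time default like this could be wired up on the C side (variable names here are assumptions), with the Makefile passing something like -DLLAMA_ROPE_SCALE=0.5:

```c
// Sketch: compile-time default for the rope scale factor.
#ifndef LLAMA_ROPE_SCALE
#define LLAMA_ROPE_SCALE 1.0f   // default: no scaling, i.e. original behavior
#endif

static const float rope_p_scale = LLAMA_ROPE_SCALE;
```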