Implement customizable RoPE #2054
Conversation
Just came across another RoPE adjustment method on Reddit. Thought it might be helpful, so here's the link! |
This still means I will get better perplexity performance when using this PR with @TheBloke 's SuperHOT model variants, I guess? |
Somewhat exciting! I use my merged 13B model (Wizard Vicuña + StarCoder + SuperHOT 16k) with that 16k command.
Looks reasonable. And what if I want to test 32k or higher? How should I set both parameters? Any ideas? |
Empirically, I found that without fine-tuning you could try
With SuperHOT 16k or LongChat 13B, perhaps you could try (the KV cache alone requires 25 GB!!)
|
I used some numbers posted by @JohannesGaessler and made changes to the scratch0 size in this PR. I can rebase this PR on their PR #2056 if needed. |
I think the numbers that I determined for the VRAM scratch buffer will probably work for the RAM scratch buffer but I would still advise you to be cautious since the two types of scratch buffer work differently (the VRAM scratch buffer has tighter limits). |
I tried the
|
I can't reproduce. Is the CUDA build different? I can't test the CUDA build. The number 482344960 seems to be computed from
with |
I am trying this out, and it is working fine for me so far, though I've only tried:
I'll keep trying other models and context sizes and seeing how it goes. Not sure if other fixes are included, but this seems to make inference way faster (via fewer long pauses to do CPU-related tasks) on my machine as well. Possibly just due to not having to recompute the context as I hit the 2048-token mark so often? EDIT: Also working just fine for me:
|
Yes, it's a CUDA build for a 1080ti and I really should have used |
If you could give me a stack trace from when MEM_REQ_SCRATCH0 is called, I could try to figure out what is wrong with the CUDA build. Otherwise, I'll see if I can get a system somewhere with CUDA. |
Can't reproduce the error today, no idea what I did exactly to trigger it... |
This looks great, but similar to #1967 - let's wait for a while before merging. |
Test with my merged 13B Vicuña model (WizardVicuña + StarCoder LoRA + gpt4tools + 16k SuperHOT) at 16k, with perplexity:
Base 70000, scale 0.4: [1] 5.5564
Base 57200, scale 0.5: [1] 6.7699
The number of chunks decreases as the context is enlarged, which might be the reason for some perplexity problems, but obviously not here. 20k causes the chunks to decrease to 16.
20k
32k
I believe 13B MEM_REQ_EVAL is not enough to test 🤷 |
Running perplexity on OpenLLaMA 3B with -c 16384, scale 0.5, base 90000:
Not enough space in the context's memory pool (needed 543758112, available 536870912)
13B, -c 32768, scale 0.25, base 120000 |
Could we quantize the KV cache? |
Another solution: #1955. Btw, I just saw |
I think this was tried and resulted in bad results. It should already be in f16. Edit: do we use FlashAttention for the forward pass? |
For some reason, the server example outputs some random Unicode characters when using
The server gives me a 413 error when the JSON data is large. We need help from those who contributed the server code. |
I believe |
Yeah, SlyEcho is right based on what I saw in the lib; setting
The only thing that I know of that allocates 512 MB (536870912) is from
to something like
and see if it helps? |
It looks like a simple read buffer to me, and it's separate from the overall size limit. |
|
The default is actually
|
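If the 413 really is produced by the HTTP layer, and assuming the server example sits on top of cpp-httplib, one knob worth trying is the library's payload limit, set_payload_max_length. The sketch below is only an illustration of that idea: the endpoint path, the 1 GiB value, and the assumption that this particular limit is what triggers the 413 are placeholders to verify, not something this thread confirms.

```cpp
// Sketch under the assumption that the server example uses cpp-httplib
// and that the 413 comes from its payload limit.
#include <string>
#include "httplib.h"

int main() {
    httplib::Server svr;

    // Raise the maximum accepted request-body size to 1 GiB (placeholder value);
    // bodies larger than this limit should be rejected by the library with 413.
    svr.set_payload_max_length(1024u * 1024u * 1024u);

    // Placeholder endpoint standing in for the server example's JSON handler.
    svr.Post("/completion", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content("{\"received_bytes\": " + std::to_string(req.body.size()) + "}",
                        "application/json");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}
```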
The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

where the defaults match the original:

scale = 1.0
base = 10000

The new command line arguments

--rope-freq-base
--rope-freq-scale

set the two new RoPE parameters. Recent research shows that changing these two parameters extends the context limit with minimal loss.

1. Extending Context to 8K, kaiokendev: https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation, Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian: https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, https://www.reddit.com/user/bloc97: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
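To make the formula concrete, here is a minimal, self-contained C++ sketch of the angle computation and the pairwise rotation it drives. This is not the ggml implementation; the helper name rope_theta and the example values are made up, and the rotation by p * theta_i at position p follows the standard RoPE formulation.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical helper (not part of ggml): angle for the i-th pair at position p,
// using theta_i = scale * base^(-2(i-1)/d) from the description above.
static float rope_theta(int p, int i, int d, float freq_base, float freq_scale) {
    return p * freq_scale * std::pow(freq_base, -2.0f * (i - 1) / d);
}

int main() {
    const int   d     = 128;      // head dimension
    const int   p     = 4096;     // token position
    const float base  = 80000.0f; // --rope-freq-base
    const float scale = 0.5f;     // --rope-freq-scale

    // Rotate each pair (x[2i-2], x[2i-1]) by the angle for dimension pair i.
    std::vector<float> x(d, 1.0f);
    for (int i = 1; i <= d / 2; ++i) {
        const float theta = rope_theta(p, i, d, base, scale);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[2*i - 2], x1 = x[2*i - 1];
        x[2*i - 2] = x0 * c - x1 * s;
        x[2*i - 1] = x0 * s + x1 * c;
    }

    std::printf("first angle %.3f, last angle %.6f\n",
                rope_theta(p, 1, d, base, scale),
                rope_theta(p, d / 2, d, base, scale));
    return 0;
}
```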
What is the latest state of this approach - is it worth merging and supporting? |
I've been using this on a Mac M1 Max since the PR was raised and it's working fine for me. I've been hoping it will get merged so I can go back to compiling from |
Let's merge and maybe then improve later. |
@@ -15759,7 +15759,7 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
             {
                 if (src0->grad) {
                     assert(src1->type == GGML_TYPE_I32);
-                   assert(ggml_nelements(src1) == 4);
+                   assert(ggml_nelements(src1) == 3);
Shouldn't this be 6? Based on the code immediately after, it should be at least 4, I think, not 3.
I went through the code and I also can't see why it's 3, when the lines just below it clearly show it taking 4 elements; it looks like it is designed to fail the assertion.
Should be fixed in 513f861
Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 RoPE frequency base and scale? https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L2955 |
for the |
I left the backward code untouched because I wasn't sure how I could correctly modify and test it. I'm also not sure about the CUDA bits. |
The CUDA part is broken right now, it should be fixed. |
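For illustration only (this is not the actual ggml-cuda.cu code, and the function name here is invented), the change conceptually amounts to substituting the two new parameters wherever the angle computation currently bakes in 10000.0f and an implicit scale of 1.0f:

```cpp
#include <cmath>
#include <cstdio>

// Invented helper, not ggml code: angle for the 0-based pair index i0 at position p.
// With freq_base = 10000.0f and freq_scale = 1.0f this reduces to the original RoPE.
static float rope_theta_custom(float p, int i0, int n_dims,
                               float freq_base  /* --rope-freq-base  */,
                               float freq_scale /* --rope-freq-scale */) {
    const float theta_scale = std::pow(freq_base, -2.0f / n_dims); // was the hardcoded 10000.0f
    return p * freq_scale * std::pow(theta_scale, (float) i0);     // scale was implicitly 1.0f
}

int main() {
    // Original behaviour vs. the customized parameters from this PR's example flags.
    std::printf("original: %f\n", rope_theta_custom(100.0f, 3, 128, 10000.0f, 1.0f));
    std::printf("custom:   %f\n", rope_theta_custom(100.0f, 3, 128, 80000.0f, 0.5f));
    return 0;
}
```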
How do I implement this with RoPE and without it with current LLMs? |
You can read a bit more about RoPE use in llama.cpp in llama.cpp/examples/main/README.md. Though I would recommend you try out the new Self-Extend support added in commit #4815, which I think is better, as you don't need to retrain the model to get better results. |
Thanks @abc-nix! What about the implementation of customized RoPE |
Sorry, @bilal-aamer, I am not sure what you are trying to ask here. This PR adds customized RoPE support. Later, YaRN RoPE scaling was added in PR #2268, and some other fixes were added after that. main's help has this to say about the options and parameters to make use of RoPE/YaRN:
I am not sure what you are trying to achieve or what exactly you are asking. Hopefully someone else isn't as obtuse as me and can help you out. |
Is there any documentation on how to implement this, or an example? I am kind of new to the field; I am fine-tuning Code Llama 2 and I want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it. This is my implementation: This is the error I am getting: |