Removed ContextSize from most examples #663
Conversation
If it's not set it's retrieved from the model, which is usually what you want!
I think that the context size is an important parameter for GPU memory usage. If it is set too high you may not be able to offload all layers to the GPU, or you may not be able to optimize for speed. Some models nowadays also have huge context sizes, which can cause trouble. It may be better to run some tests and see what happens to memory usage with different context sizes.
To be clear, I'm not removing the ability to set the context size if you want to! This just removes the redundant setting of the context size in all the examples. Since you can load any model you like into the examples, that hard-coded context size was almost always wrong, and it was an extra bit of unnecessary complexity in sample code which should be as simple as possible.
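For reference, a minimal sketch of what the simplified examples look like after this change, assuming the usual LLamaSharp ModelParams/LLamaWeights API; with ContextSize left unset the context length comes from the model's own metadata:

using LLama;
using LLama.Common;

// Load a model without specifying ContextSize - the default context
// length stored in the model's metadata will be used.
var parameters = new ModelParams("path/to/model.gguf");

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);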
I think this is a good idea.
As I expected, when running an example with a recent model, the example crashes with:
This is a very small 3 GB model, but the huge context size the model can handle crashes the example. It is not a good idea to remove the context size from the examples, because when people are trying out the library they will just leave, thinking that it is garbage. The context size should be set to the value the example needs, and a comment can be added to explain it.
50GB of context space! What's the model? I think you're right though; we'll probably need to add the context size back in, but with a comment saying that it's not necessary and that leaving it unset will use the default value from the model. We could also do with handling that error better, but I don't think that's possible without changes to llama.cpp (making the APIs fallible).
Phi-3 with a 128K context size (but a small model); it recently came out. This will be the trend, increasing context sizes... Saying that it is not necessary to set it is not a good idea either! Unless you manage this automatically in your load balancing (batching).
I had a feeling it might be Phi-3 😆
The reason I removed this originally is that in most cases you don't want to set it, and yet almost every snippet of code anyone ever pastes does include it. Unfortunately that's the power of example code: whatever you put there will be what everyone uses, even if it's not quite right! In general you should only be setting ContextSize when you have a specific reason to (e.g. the default model context is huge, as in your case), e.g. a few recent examples:
Models with long context lengths will become more and more popular, but I don't think there are many users running them with only a CPU. I can't imagine how slow it would be! The key point is still GPU memory rather than CPU memory. To solve the problem at the root, there are two ideal approaches.
Unfortunately, there are problems with both of them.
The setting is actually important both for the examples and for applying it in a production environment. For the examples, I think it's okay to take the first approach above. We could let the user input a number indicating the GPU memory in GiB, and then try to automatically set the …
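As a rough sketch of that idea (nothing here is an existing LLamaSharp feature; the per-layer size and layer count are made-up placeholders that would really have to come from the model file), the user-supplied GiB figure could be turned into a layer count roughly like this:

using LLama.Common;

Console.Write("How much GPU memory do you have (GiB)? ");
var gpuMemoryGiB = double.Parse(Console.ReadLine()!);

// Placeholder estimate of how much GPU memory one layer of this model needs.
const double gibPerLayer = 0.5;
const int totalLayers = 32;

// Leave ~20% of the budget free for the KV cache and scratch buffers.
var gpuLayerCount = Math.Min(totalLayers, (int)(gpuMemoryGiB * 0.8 / gibPerLayer));

var parameters = new ModelParams(Program.modelPath)
{
    GpuLayerCount = gpuLayerCount
};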
Yeah it's a bit of a usability pain-point for llama.cpp that it doesn't have some way to determine the best split of memory and context size etc. I've seen a lot of discussions over on the main repo about automatically trying to do it, and every time they decide it's too complex to tackle.
I really hope this happens! llama.cpp has way too many APIs that just crash the entire application with an assertion failure. At least if you could detect an allocation failure you could just have a loop shrinking the context size until it fits! Back to the issue at hand - how about adding the context size back into all of the examples, something like this:

var parameters = new ModelParams(Program.modelPath)
{
    // Set the maximum number of tokens the model can hold in memory. If this is set too high it will cause
    // llama.cpp to crash with an out-of-memory error. If you do not set this parameter the default context
    // size of the model will be used.
    ContextSize = 1024,

    // Set how many layers are moved to GPU memory. Setting this parameter too high will cause llama.cpp
    // to crash with an out-of-memory error.
    GpuLayerCount = 80
};

That way these are no longer just magic numbers hardcoded into the example; they teach the user what they mean and how to use them to solve problems (too few tokens, too much memory).
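For illustration, a sketch of the shrinking loop mentioned above, under the assumption (not true today, as noted) that an out-of-memory failure during context creation surfaced as a catchable exception rather than a native abort:

using LLama;
using LLama.Common;

var loadParams = new ModelParams(Program.modelPath) { GpuLayerCount = 80 };
using var model = LLamaWeights.LoadFromFile(loadParams);

uint contextSize = 32768;
LLamaContext? context = null;
while (context == null && contextSize >= 512)
{
    var parameters = new ModelParams(Program.modelPath)
    {
        ContextSize = contextSize,
        GpuLayerCount = 80
    };

    try
    {
        context = model.CreateContext(parameters);
    }
    catch (Exception)
    {
        // Assumed behaviour: the allocation failure throws instead of aborting the process.
        // Halve the context size and try again.
        contextSize /= 2;
    }
}

// If context is still null here, even the smallest context did not fit.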
If we don't pursue maximizing GPU memory utilization but settle for a somewhat lower utilization rate such as 80%, things will be easier. However, unless we find a way to get the free GPU memory size, it could only ever be a toy in our examples. 😞
I agree. We should also tell the users what will happen if their conversation with the LLM exceeds the context length.

var parameters = new ModelParams(Program.modelPath)
{
    // Set the maximum number of tokens the model can hold in memory. If this is set too high it will cause
    // llama.cpp to crash with an out-of-memory error. If your conversation with the LLM exceeds the context length,
    // the earliest tokens in the context will be dropped. If you do not set this parameter the default context
    // size of the model will be used.
    ContextSize = 1024,

    // Set how many layers are moved to GPU memory. Setting this parameter too high will cause llama.cpp
    // to crash with an out-of-memory error.
    GpuLayerCount = 80
};
Yes, but for long contexts (> 16K) I'm afraid few users will use it with only a CPU. Smartphones have become much more powerful at computation than they were 10 years ago; with OpenCL and Vulkan they could benefit from GPU computation, so smartphone != cpu-only.
It would be great if we could have these features! I'm looking forward to your work!
What I meant, Rinne, is that I am doing load balancing of the GPU in my project. I will not start new work on that in LLamaSharp, because you and Martin are already working on scheduling and batching, which, when merged, should in the end include GPU load balancing as well. Since GPU load balancing is relatively simple compared to scheduling and batching, I am sure that you will be able to include it. If you need any info, please ask.
@zsogitbe One of the biggest problems for me is the lack of a library to do the following things you mentioned.
Is there any cross-platform library to achieve them?
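As an aside, and not an answer to the cross-platform question: on NVIDIA hardware alone, free GPU memory can be read by shelling out to the nvidia-smi CLI. A stopgap sketch of that might look like:

using System.Diagnostics;

// NVIDIA-only sketch: query free GPU memory (in MiB) via the nvidia-smi CLI.
// A genuinely cross-platform solution would need a different mechanism per vendor.
static int QueryFreeGpuMemoryMiB()
{
    var psi = new ProcessStartInfo
    {
        FileName = "nvidia-smi",
        Arguments = "--query-gpu=memory.free --format=csv,noheader,nounits",
        RedirectStandardOutput = true,
        UseShellExecute = false
    };

    using var process = Process.Start(psi)!;
    var output = process.StandardOutput.ReadToEnd().Trim();
    process.WaitForExit();

    // With multiple GPUs one line is printed per device; take the first.
    return int.Parse(output.Split('\n')[0].Trim());
}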
I don't think it's something that would be covered by scheduling batched work - instead I think what you're describing would happen first (model loading). Batching works within the constraints of a pre-established KV cache laid down by the loading process. That means you can work on it independently of any other work going on.

I think you'd want to be automatically determining values for these parameters in the … And perhaps also these context params: …

If you do manage to come up with a way to do this it would be great. The llama.cpp project has consistently written it off as too complex a problem to solve, but it would be a huge usability enhancement if it could be done!

Personally I'd approach this by not even automating the process yet. Just design an algorithm to decide on those 4 model values and plug the numbers in by hand to see how well it works. That's the quickest way to validate the idea. (e.g. this solves the problem Rinne is talking about: you can just look up your GPU memory values, and worry about finding a cross-platform library that automates that lookup later.)
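A minimal sketch of that "plug the numbers in by hand" approach, with every figure below (GPU memory, layer count, per-layer and per-token sizes) a hand-measured placeholder for a specific model/GPU pair rather than anything read automatically:

using LLama.Common;

// All hand-measured placeholders for one specific model/GPU pair.
const double gpuMemoryGiB  = 24.0;   // from nvidia-smi / task manager
const int    totalLayers   = 32;     // from the GGUF metadata
const double gibPerLayer   = 0.5;    // estimated: model file size / layer count
const double gibPerKTokens = 0.125;  // estimated KV cache cost per 1024 tokens of context

// Decide how many layers fit, then spend whatever is left on the KV cache.
var layersOnGpu = Math.Min(totalLayers, (int)(gpuMemoryGiB * 0.9 / gibPerLayer));
var memoryLeft  = gpuMemoryGiB * 0.9 - layersOnGpu * gibPerLayer;
var contextSize = (uint)Math.Max(512, (int)(memoryLeft / gibPerKTokens) * 1024);

var parameters = new ModelParams(Program.modelPath)
{
    GpuLayerCount = layersOnGpu,
    ContextSize   = contextSize
};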
@AsakusaRinne, I am sorry, but I do not work multiplatform. My solution is Windows only, and I told Martin the way how to do this on Windows.
@martindevans, yes this is exactly how I do it on Windows and how I tried to explain it above! |
Removed ContextSize from most examples. If it's not set it's retrieved from the model, which is usually what you want. Leaving it in the examples is just unnecessary complexity.