Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed ContextSize from most examples #663

Merged

Conversation

martindevans
Copy link
Member

Removed ContextSize from most examples.

If it's not set it's retrieved from the model, which is usually what you want. Leaving it in the examples is just unnecessary complexity.

…ved from the model, which is usually what you want!
@zsogitbe
Copy link
Contributor

I think that the context size is an important parameter for GPU memory usage. If it is too high, then you may not be able to upload all layers to the GPU or you may not be able to optimize for speed. You can also have some models with huge context size nowadays which may cause trouble. Maybe better to make some tests and see what happens with memory usage based on different context size.

@martindevans
Copy link
Member Author

To be clear I'm not removing the ability to set the context size if you want to!

This is just removing the redundant setting of the context size in all the examples. Since you can load any model you like into the examples that context size was almost always wrong and an extra bit of unnecessary complexity in sample code which should be as simple as possible.

@SignalRT
Copy link
Collaborator

I think this is a good idea.

@Lyrcaxis
Copy link
Contributor

You can remove it from SpeechChat.cs as well:
image

@martindevans martindevans merged commit 274ab6e into SciSharp:master Apr 15, 2024
2 checks passed
@martindevans martindevans deleted the remove_example_context_size branch April 15, 2024 13:24
@zsogitbe
Copy link
Contributor

As I have expected when running an example with a recent model the example crashes with:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 49152.00 MiB on device 0: cudaMalloc failed: out of memory
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

This is a very small 3GB model, but the huge context size the model can handle crashes the model.

It is not a good idea to remove the context size from the examples because when people are trying out the library they will just leave because they will think that it is garbage. The context size should be set to the value the example needs and a comment may be placed to explain it.

@martindevans
Copy link
Member Author

50GB of context space! What's the model?

I think you're right though, we'll probably need to add context size back in but with a comment on it saying that it's not necessary and that leaving it unset will use the default value from the model.

We could also do with handling that error better, but I don't think that's possible without changes to llama.cpp (making the APIs fallible).

@zsogitbe
Copy link
Contributor

zsogitbe commented Apr 25, 2024

Phi-3 128K context size (but small model), recently came out. This will be the trend, increasing context size... Saying that it is not necessary to set is not a good idea either! Unless you will manage this in your load balancing (batching) automatically.

@martindevans
Copy link
Member Author

I had a feeling it might be Phi-3 😆

Saying that it is not necessary to set is not a good idea either

The reason I removed this originally was because in most cases you don't want to set it, and yet almost every snippet of code anyone ever pastes does include it. Unfortunately that's the power of example code, whatever you put there will be what everyone uses even if it's not quite right! In general you should only be setting ContextSize when you have a specific reason to (e.g. default model context is huge, as in your case).

e.g. a few recent examples:

@AsakusaRinne
Copy link
Collaborator

Models with long context-length will be more and more popular, but I don't think there're many users using it with only CPU. I can't imagine how slow it will be! The key point is still the GPU memory instead of CPU memory. To solve the problems from the root, there are two ideal approaches.

  • Automatically set the ContextSize and GpuLayerCount based on the information of model and machine.
  • Catch the error when the memory is not enough.

Unfortunately, there're problems for both of them.

  • Though we could calculate the approximation size of the memory needed the model will use based on model info, ContextSize and GpuLayerCount, it's hard to get the free GPU memory at multiple platforms. AFAIK there's no such .NET library for it.
  • llama.cpp just doesn't support it, though their core team had already be aware of it. Memory estimation utilities ggerganov/llama.cpp#4315

The setting is actually important for both examples and applying to production environment. For examples, I think it's okay to take the first approach above. We could let user input a number which indicates the GPU memory in GiB. Then we could try to automatically set the ContextSize and GpuLayerCount, using a conservative strategy to avoid OOM. Though users are still required to input a value themselves in this way, it's much more easy for users to understand. For production application, however, I haven't had any good idea about it yet. :(

@martindevans
Copy link
Member Author

Yeah it's a bit of a usability pain-point for llama.cpp that it doesn't have some way to determine the best split of memory and context size etc. I've seen a lot of discussions over on the main repo about automatically trying to do it, and every time they decide it's too complex to tackle.

Instead, we should return an error when an allocation fails. Let's fix that first, and then we can think about a memory estimation API.

I really hope this happens! llama.cpp has way too many APIs that just crash the entire application with an assertion failure. At least if you could detect an allocation failure you could just have a loop shrinking context size until it fits!


Back to the issue at hand - how about adding context size back into all of the examples something like this:

var parameters = new ModelParams(Program.modelPath)
{
    // Set the maximum amount of tokens the model can hold in memory. If this is set too high it will cause
    // llama.cpp to crash with an out-of-memory error. If you do not set this parameter the default context
    // size of the model will be used.
    ContextSize = 1024,

    // Set how many layers are moved to GPU memory. Setting this parameter too high will cause llama.cpp
    // to crash with an out-of-memory error.
    GpuLayerCount = 80
};

That way these are no longer just magical numbers hardcoded into the example, but they teach the user what they mean and how to use them to solve problems (too few tokens, too much memory).

@AsakusaRinne
Copy link
Collaborator

AsakusaRinne commented Apr 26, 2024

I've seen a lot of discussions over on the main repo about automatically trying to do it, and every time they decide it's too complex to tackle.

If we don't pursue maximizing the GPU memory utilization but with somewhat lower utilization rate such as 80%, things will be easier. However, unless we find a way to get the free GPU memory size, it could be only a toy in our examples. 😞

adding context size back into all of the examples something like this

I agree. We should also tell the users what will happen if their conversation with LLM exceeds the context length.

var parameters = new ModelParams(Program.modelPath)
{
    // Set the maximum amount of tokens the model can hold in memory. If this is set too high it will cause
    // llama.cpp to crash with an out-of-memory error. If your conversation with the LLM exceeds the context length, 
    // some early memory in the model will be dropped. If you do not set this parameter the default context
    // size of the model will be used.
    ContextSize = 1024,

    // Set how many layers are moved to GPU memory. Setting this parameter too high will cause llama.cpp
    // to crash with an out-of-memory error.
    GpuLayerCount = 80
};

@zsogitbe
Copy link
Contributor

zsogitbe commented Apr 26, 2024

  1. First of all, context size must be set for all applications based on the needs of the application! Some applications need only 100 tokens (e.g., instruction based short answer), but some other need 8000 tokens (e.g., chat), etc. This, in order to optimize memory use and in order to optimize speed. The context size must thus be set by the user case by case.

  2. GPU will always have priority because it is just several times faster than CPU. There are use cases for CPU only application, for example, on small devices (smartphone).

  3. It is possible to have an automatic load balancing of the (GPU) memory use. I am doing it. llama.cpp could do it, but we can also do it. You need to take into account several variables, free GPU memory at the moment you need to do inference, the available GPU's with their available memory (total memory of all devices), the speed of the GPU's (for optimizing the loading - first use the best GPU), etc... I am doing already all of this, thus it is possible.

  4. Estimating the needed GPU memory per model is possible (we just need to overestimate it). The size of the model is already a possible estimate with an extra buffer added, but this can still be more precise. Since we have the option to distribute the layers to GPU's and CPU's it does not matter if the estimate is not exact. Knowing the free GPU memory at all times is possible on Windows, I do not know about other platforms, but with the cuda toolkit it should be possible everywhere.

  5. Until we do not have automatic load balancing in the library (as an option) we need to explain the users that setting the context size and all GPU parameters is very important (for the above reasons). We should also add to the example all other parameters needed for this and not only GpuLayerCount (split mode...)! An example must be comprehensive to help new users to understand how to use the library.

@AsakusaRinne
Copy link
Collaborator

GPU will always have priority because it is just several times faster than CPU. There are use cases for CPU only application, for example, on small devices (smartphone).

Yes, but for long context (> 16K) I'm afraid few users will use it with only CPU. Smartphone has been much more powerful on computation than 10 years ago. With OpenCL and Vulkan it could benefit from GPU computation, so smartphone != cpu-only.

It is possible to have an automatic load balancing of the (GPU) memory use. I am doing it.

Estimating the needed GPU memory per model is possible (we just need to overestimate it).

It will be great if we could have these features! I'm looking forward to your work!

@zsogitbe
Copy link
Contributor

What I meant Rinne is that I am doing load balancing of the GPU in my project. I will not start a new thread/work about that in LLamaSharp because you and Martin are working on the scheduling and batching already which at the end when merged should include GPU load balancing also. Since GPU load balancing is relatively simple compared to scheduling and batching I am sure that you will be able to include it. If you need any info, then please ask.
Short summary of GPU load balancing:

  • check the free GPU memory (all GPU's)
  • check total available GPU memory (all GPU's)
  • estimate the needed GPU memory for the model inference (model+context). Estimation can be simply done by using the model size on disk + an estimate for context use (just test a model with a few context sizes and you will see how much, for example, each 1K context size uses) + small buffer.
  • decide if GPU can be used
  • decide if full offload is possible without splitting layers between GPU's and CPU
  • decide how many layers can be offloaded to all available GPU's (this can be simply calculated from the above information)
  • set LLamaSharp GPU parameters according to the above

@AsakusaRinne
Copy link
Collaborator

@zsogitbe One of the biggest problems for me is the lack of a library to do the following things you mentioned.

  • check the free GPU memory (all GPU's)
  • check total available GPU memory (all GPU's)

Is there any cross-platform library to achieve them?

@martindevans
Copy link
Member Author

martindevans commented Apr 27, 2024

I don't think it's something that would be covered by scheduling batched work - instead I think what you're describing would happen first (model loading). Batching works within the constraints of a pre-established KV cache laid down by the loading process. That means you can work on it indepenently of any other work going on.

I think you'd want to be automatically determining values for these parameters in the IModelParams:

And perhaps also these context params:

If you do manage to come up with a way to do this it would be great. The llama.cpp project have consistently written it off as too complex a problem to solve, but it would be a huge usability enhancement if it could be done!

Personally I'd approach this by not even automating the process yet. Just design an algorithm to decide on those 4 model values and plug the numbers in by hand to see how well it works. That's the quickest way to validate the idea. (e.g. this solves the problem Rinne is talking about, you can just look up your GPU memory values and worry about finding a cross platform library to automates that lookup later).

@zsogitbe
Copy link
Contributor

@zsogitbe One of the biggest problems for me is the lack of a library to do the following things you mentioned.

  • check the free GPU memory (all GPU's)
  • check total available GPU memory (all GPU's)

Is there any cross-platform library to achieve them?

@AsakusaRinne, I am sorry, but I do not work multiplatform. My solution is Windows only, and I told Martin the way how to do this on Windows.
If you want to do it multiplatform, then you will need to have a different solution. I advice you to look at the https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator project. You will need the C++ library and then create a C# bridge to get the info. Since you are probably also a game developer as Martin I guess that you know Vulkan. I don't know Vulkan.

I think you'd want to be automatically determining values for these parameters in the IModelParams:

MainGpu
SplitMode
GpuLayerCount
TensorSplits

@martindevans, yes this is exactly how I do it on Windows and how I tried to explain it above!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants