-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama_model_loader: support multiple split/shard GGUFs #6187
Conversation
Co-authored-by: slaren <[email protected]>
- use only one gguf_context for metadata only - store all ggml_context in a vector as the files and mappings - store all weights in a vector along with the source tensor - rename ctx_gguf to meta - rename ctx_meta to contexts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is something not right when mmap
is enabled and -ngl > 0
(at least with Metal that is). Using LLaMA 7B Q4_0
:
llm_load_tensors: ggml ctx size = 0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 933.75 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 1867.44 / 147456.00)
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 933.69 MiB, ( 2801.12 / 147456.00)
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
ggml_backend_metal_buffer_from_ptr: error: failed to allocate buffer, size = 933.69 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
llm_load_tensors: Metal buffer size = 933.69 MiB
.................................................................llama_model_load: error loading model: vector
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './x-00001-of-00005.gguf'
main: error: unable to load model
…nsor optional Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
So it does not manage to allocate 2 out 5 metal buffer, I think we should stop here. Then it tried to load the mapping to a buffer which does not exist. |
@ggerganov Should we accept this case when we cannot allocate |
I think there is something wrong. It should only require one CPU buffer, since there is only one tensor allocated in the CPU. The first-last range is probably wrong, and it is causing a buffer overflow. |
This should fix it: diff --git a/llama.cpp b/llama.cpp
index cd20ad7a..2b6a5e9e 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3199,6 +3199,9 @@ struct llama_model_loader {
*addr = mapping->addr;
for (ggml_tensor * tensor = ggml_get_first_tensor(ctx); tensor; tensor = ggml_get_next_tensor(ctx, tensor)) {
const auto & w = get_weights(ggml_get_name(tensor));
+ if (w.idx != idx) {
+ continue;
+ }
*first = std::min(*first, w.offs);
*last = std::max(*last, w.offs + ggml_nbytes(tensor));
}
@@ -5145,6 +5148,9 @@ static bool llm_load_tensors(
void * addr = nullptr;
size_t first, last;
ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+ if (first >= last) {
+ continue;
+ }
ggml_backend_buffer_t buf = ggml_backend_cpu_buffer_from_ptr((char *)addr + first, last - first);
if (buf != nullptr) {
bufs.push_back(buf);
@@ -5167,6 +5173,9 @@ static bool llm_load_tensors(
void * addr = nullptr;
size_t first, last;
ml.get_mapping_range(&first, &last, &addr, file_no, ctx);
+ if (first >= last) {
+ continue;
+ }
ggml_backend_buffer_t buf = ggml_backend_metal_buffer_from_ptr((char *) addr + first, last - first, max_size);
if (buf != nullptr) {
bufs.push_back(buf); Maybe we need to add a dummy |
@ggerganov can you please pull and retry ? I have applied the same logic as before: if the allocation failed, fallback to cpu only |
@phymbert the logic is still wrong, it is asking Metal to map a buffer beyond its size. Please check the diff I posted above. |
It works with the patch. Will take an extra look at the PR tomorrow. Thank you all for helping out |
…s in split, throw an exception instead of asserting
…uffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
…fore allocating backend buffer
…t dest max len. Co-authored-by: Xuan Son Nguyen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge after slaren's approval
Excellent, really proud of to have contribute until |
@ggerganov Could we add this line in Hot topics ? - support loading sharded model split using `gguf-split` cli #6187 |
Yes, of course |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for taking time to implement this functionality.
… allocated Co-authored-by: slaren <[email protected]>
// check if dest ends with postfix | ||
int size_prefix = str_split_path.size() - str_postfix.size(); | ||
if (size_prefix > 0 && str_split_path.find(str_postfix, size_prefix) != std::string::npos) { | ||
snprintf(dest, std::min((size_t) size_prefix, maxlen), "%s", split_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sorry I was quite rush this time, will be more careful. Thanks!
* split: support in llama_model_loader * avoid copying the entire vector Co-authored-by: slaren <[email protected]> * split: move llama_tensor_offset to llama_model_loader * llama_model_loader: PR feedbacks: - use only one gguf_context for metadata only - store all ggml_context in a vector as the files and mappings - store all weights in a vector along with the source tensor - rename ctx_gguf to meta - rename ctx_meta to contexts * avoid copying the entire vector * Simplify this by making these optional, switch some layer creation tensor optional Co-authored-by: Georgi Gerganov <[email protected]> * Handle optional tensors Co-authored-by: Georgi Gerganov <[email protected]> * llama_model_loader: fail if backend cannot allocate buffer * fix mmap buffer management * llama_model_loader: map file to backend buffer if the allocation succeeds only * llama_model_loader: only map tensors included in the context * llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast * llama_model_loader: fail if any of backend buffer cannot be allocated * spacing Co-authored-by: slaren <[email protected]> * fix loop over pointer Co-authored-by: slaren <[email protected]> * llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting * llama_model_loader: ensure mappings vector has the expected size * llama_model_loader: use at instead of operator[] if this should never add to the map. * llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size. * llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer * llama_model_loader: fix map -> unordered map * llama_split_prefix: use a clearer version, not pass split path len but dest max len. Co-authored-by: Xuan Son Nguyen <[email protected]> * llama : minor ggml-ci * llama : introduce some typedef helpers * docs: add model shard in hot topic * llama_model_loader: put mapping in a unique_ptr from the moment it is allocated Co-authored-by: slaren <[email protected]> * fix llama_split_prefix --------- Co-authored-by: slaren <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>
* split: support in llama_model_loader * avoid copying the entire vector Co-authored-by: slaren <[email protected]> * split: move llama_tensor_offset to llama_model_loader * llama_model_loader: PR feedbacks: - use only one gguf_context for metadata only - store all ggml_context in a vector as the files and mappings - store all weights in a vector along with the source tensor - rename ctx_gguf to meta - rename ctx_meta to contexts * avoid copying the entire vector * Simplify this by making these optional, switch some layer creation tensor optional Co-authored-by: Georgi Gerganov <[email protected]> * Handle optional tensors Co-authored-by: Georgi Gerganov <[email protected]> * llama_model_loader: fail if backend cannot allocate buffer * fix mmap buffer management * llama_model_loader: map file to backend buffer if the allocation succeeds only * llama_model_loader: only map tensors included in the context * llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast * llama_model_loader: fail if any of backend buffer cannot be allocated * spacing Co-authored-by: slaren <[email protected]> * fix loop over pointer Co-authored-by: slaren <[email protected]> * llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting * llama_model_loader: ensure mappings vector has the expected size * llama_model_loader: use at instead of operator[] if this should never add to the map. * llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size. * llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer * llama_model_loader: fix map -> unordered map * llama_split_prefix: use a clearer version, not pass split path len but dest max len. Co-authored-by: Xuan Son Nguyen <[email protected]> * llama : minor ggml-ci * llama : introduce some typedef helpers * docs: add model shard in hot topic * llama_model_loader: put mapping in a unique_ptr from the moment it is allocated Co-authored-by: slaren <[email protected]> * fix llama_split_prefix --------- Co-authored-by: slaren <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>
* split: support in llama_model_loader * avoid copying the entire vector Co-authored-by: slaren <[email protected]> * split: move llama_tensor_offset to llama_model_loader * llama_model_loader: PR feedbacks: - use only one gguf_context for metadata only - store all ggml_context in a vector as the files and mappings - store all weights in a vector along with the source tensor - rename ctx_gguf to meta - rename ctx_meta to contexts * avoid copying the entire vector * Simplify this by making these optional, switch some layer creation tensor optional Co-authored-by: Georgi Gerganov <[email protected]> * Handle optional tensors Co-authored-by: Georgi Gerganov <[email protected]> * llama_model_loader: fail if backend cannot allocate buffer * fix mmap buffer management * llama_model_loader: map file to backend buffer if the allocation succeeds only * llama_model_loader: only map tensors included in the context * llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast * llama_model_loader: fail if any of backend buffer cannot be allocated * spacing Co-authored-by: slaren <[email protected]> * fix loop over pointer Co-authored-by: slaren <[email protected]> * llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting * llama_model_loader: ensure mappings vector has the expected size * llama_model_loader: use at instead of operator[] if this should never add to the map. * llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size. * llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer * llama_model_loader: fix map -> unordered map * llama_split_prefix: use a clearer version, not pass split path len but dest max len. Co-authored-by: Xuan Son Nguyen <[email protected]> * llama : minor ggml-ci * llama : introduce some typedef helpers * docs: add model shard in hot topic * llama_model_loader: put mapping in a unique_ptr from the moment it is allocated Co-authored-by: slaren <[email protected]> * fix llama_split_prefix --------- Co-authored-by: slaren <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> Co-authored-by: Xuan Son Nguyen <[email protected]>
Motivation
Since we support
gguf-split
CLI in #6135, it is the good time to load model with multiple (potentially distributed) GGUFs for weights. For example, we can expect the Grok-1 weights to not easily fit inside a single GGUF.This change allows to load a model regardless if it is bundled inside a single or multiple GGUFs generated with
gguf-split
.Changes
llama_split_path
andllama_split_prefix
to allow downstream tool to generate their own GGUFs split using the same file name convention:"%s-%05d-of-%05d.gguf"
general.split
tosplit.no
,general.split_count
tosplit.count
and addsplit.tensors.count
: the previous splits created will not be loaded here. Usegguf-split
from d0d5de4 to merge first, then split again with master versionTests
cd models ../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
You will notice the new:
llama_model_loader: additional 6 GGUFs metadata loaded.
References
CI Builds
Tasks
Special thanks to @slaren and @ngxson for having supporting me in this effort.