-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
split: allow --split-max-size option #6343
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that you start looking into this. I did some attempts but resigned. Please reuse the test script I prepared:
https://github.com/phymbert/llama.cpp/blob/hp/split/max-size/examples/gguf-split/tests.sh
@phymbert Thanks for the info, yeah I did managed to get it work, but didn't test it very carefully. I'll use the script that you provided. Meanwhile, would you mind to test the version that I've pushed? |
I can confirm that my version works with your test script @phymbert (thanks again for that) - except for the test case 4 and 5 which requires |
Ah yeah, I forgot this params. Just remove it. It's not necessary for now, and it's not possible to have no tensor in gguf AFAIK. |
@phymbert With this PR, it maybe possible to have a split that does not have any tensors. I encountered this case (as a bug) if inside |
fprintf(stderr, "\033[3Ddone\n"); | ||
void copy_file_to_file(std::ifstream & f_in, std::ofstream & f_out, const size_t in_offset, const size_t len) { | ||
// TODO: detect OS and use copy_file_range() here for better performance | ||
if (read_buf.size() < len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Artefact2 With this refactoring, it will be trivial to add copy_file_range()
as you suggested. The only thing that I'm not sure is how to detect if I can use copy_file_range()
or not (i.e. based on OS or something else?). Do you have any idea on that? Thanks.
It triggers a malloc of 0 size in ggml, with a warning. |
Oh OK I see, that's because we allocate one device buffer per file, so file with no tensor will have 0 size buffer. A quick hack is to add a dummy tensor with size of 1 |
Was it urgent to merge ? unfortunately you did not add the test file, I think it is important to add it here and in the CI to show the complete process between |
I merge it because it has been open for review for a 3 days. I agree that the test is needed, but since I initally have no intend to add test on this PR, so I though that we will do in another PR. Would you mind to open a new PR @phymbert ? Thanks. |
You always can request for review, especially here I did the I think tests are part of the devlopment and must be done in the same PR, by the Author, this is the standard development best practice. I do not plan to test other people code BTW. The shard feature is widely used now on HF and we need to carefully test it and document it, taking into account some previous issue we did: Is it complicated to add |
Sorry I think there’s some misunderstanding here:
I’ll open a new PR today when I finish my works in real-life. Will let you know when it’s done. |
Thanks for the clarification and your effort of course, I understand, merci. FYI I am uploading grok-1 to ggml-org in |
It would also be nice to update the |
* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size
@ngxson it looks |
I think there's a UI bug on huggingface. I downloaded the file and use (My internet is quite laggy so some notebook cells ran twice, but it doesn't change the result anw) Edit: gguf viewer on huggungface is quite weird. I thought that the metadata table is firstly decoded on backend and send to frontend. But turns out, they download the whole model into browser which caused my browser to crash. |
Ok, thanks for your double checks, the code is straightforward indeed: llama.cpp/examples/gguf-split/gguf-split.cpp Line 217 in f87f7b8
Wait and see |
* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size
AFAIK the browser just fetches the GGUF headers - not the entire file |
can you take a look at this @mishig25? |
It seem to be the case of Firefox - it only downloads 2MB of the file. But firefox does not have the required API to decode the file: I tried Chrome, it can read the file but does not stop downloading: @julien-c @mishig25 If you need more information for debugging, please let me know |
@ngxson could i trouble you to open an issue in https://github.com/huggingface/huggingface.js with your comment? 🙏 we'll check asap |
I remember we discussed somewhere to be able to make a split where the first shard is very small and contains primarily the metadata so that it can be downloaded quickly and then start the download of the other shards without waiting for the first to finish. I can't find an issue for this - it might be a good idea to track this and might be related to: huggingface/huggingface.js#601 (comment) We can add extra meta data in the first file that describes all tensors in the shards for example |
yes here:
EDIT: issue created: |
* split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size
Closes #6259
I ended up re-write the
split_strategy
class. How it works now:ctx_out
that is used by the output file (onectx_out
per one split). Thesectx_out
are saved into astd::vector<struct gguf_context *> ctx_outs
. This step only produce a split "plan", not the actual file.split_strategy.print_info()
can be used to print out the "plan" and details for each split (n_tensors, size)split_strategy.write()
to actually write out to output files