
Support using mmap when applying LoRA #2095

Merged · 3 commits · Jul 11, 2023

Conversation

howard0su
Collaborator

@howard0su howard0su commented Jul 4, 2023

On Linux, when a file is mapped with mmap using MAP_PRIVATE, modifications to the mapped buffer are not written back to disk.
On Windows, the same applies when using MapViewOfFile with FILE_MAP_COPY.
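
For illustration, here is a minimal, hypothetical sketch of copy-on-write mapping on both platforms (not the actual patch; names are made up and error handling is reduced to early returns):

```cpp
// Map a file copy-on-write so in-memory changes to the mapped weights
// are never written back to the file on disk.
#include <cstddef>

#ifdef _WIN32
#include <windows.h>

void * map_file_copy_on_write(const char * path, size_t size) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_WRITECOPY, 0, 0, NULL);
    CloseHandle(file);
    if (mapping == NULL) return NULL;
    // FILE_MAP_COPY: modified pages become private copies; the file stays untouched.
    void * addr = MapViewOfFile(mapping, FILE_MAP_COPY, 0, 0, size);
    CloseHandle(mapping);
    return addr;
}
#else
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

void * map_file_copy_on_write(const char * path, size_t size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    // MAP_PRIVATE: copy-on-write; writes go to private pages, never back to the file.
    void * addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    return addr == MAP_FAILED ? NULL : addr;
}
#endif
```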

@howard0su
Collaborator Author

Modified the code based on the documentation. Needs testing.

@Green-Sky
Collaborator

Based on the documentation, on Linux we probably need to set MAP_PRIVATE instead of MAP_SHARED.

@howard0su
Collaborator Author

You are right.

@howard0su
Collaborator Author

The intention behind this PR is that I want to further refactor the loading code:

llm_model_file -> represents a file; it can be a model file or a LoRA file.
llm_model -> represents a model; has virtual functions to override to support multiple models like LLaMA. This class is also responsible for moving tensors to the GPU based on the tensor backend preference.

offload_trait -> a set of classes that represent how to offload for CUDA, OpenCL, CUDA_SPLIT, etc.

mmap and non-mmap make the code a bit complex. Any suggestions on how we can further reduce the complexity? A rough sketch of the intended split is below.
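
A purely illustrative sketch of that split (the names and interfaces below are hypothetical, not existing llama.cpp types):

```cpp
#include <string>
#include <vector>

struct llm_tensor;              // placeholder for a loaded tensor

// Represents a file on disk: either a base model file or a LoRA adapter file.
struct llm_model_file {
    std::string path;
    bool        is_lora = false;
    // Reads raw tensors, via mmap or plain file I/O.
    virtual std::vector<llm_tensor *> load_tensors() = 0;
    virtual ~llm_model_file() = default;
};

// How tensors are offloaded to a backend (CUDA, OpenCL, split across GPUs, ...).
struct offload_trait {
    virtual void offload(llm_tensor * t) = 0;
    virtual ~offload_trait() = default;
};

// Represents a model built from one or more files; subclasses override the
// architecture-specific parts (e.g. LLaMA) and move tensors to the preferred backend.
struct llm_model {
    virtual void load(llm_model_file & file, offload_trait & offload) = 0;
    virtual ~llm_model() = default;
};
```

One way this could reduce the mmap/non-mmap complexity is to hide the choice behind llm_model_file, so llm_model and the offload traits never need to care which path is used.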

@slaren
Member

slaren commented Jul 5, 2023

Are there any possible performance considerations or any other consequences of enabling copy-on-write with mmap by default?

@howard0su
Collaborator Author

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

@Green-Sky
Collaborator

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

https://huggingface.co/tloen/alpaca-lora-7b
Try this classic one.

@howard0su
Collaborator Author

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

https://huggingface.co/tloen/alpaca-lora-7b Try this classic one.

The file doesn't have the correct magic in the header. Anyway, thank you.

@Green-Sky
Collaborator

Green-Sky commented Jul 7, 2023

Ohh, you don't know. There is a convert-lora-to-ggml.py script that converts it. :)
It creates a ggla file.

@howard0su howard0su marked this pull request as ready for review July 11, 2023 09:07
@howard0su
Collaborator Author

Tested functionality on Windows. Also used an MD5 hash to verify that the original weights file is unchanged.

@Green-Sky
Collaborator

Are there any possible performance considerations or any other consequences of enabling copy-on-write with mmap by default?

I think you need to run 2 llama.cpp instances, with at least 1 LoRA, to see duplication in memory.

Member

@slaren slaren left a comment


The changes look good, and the overhead of always mapping files as copy-on-write should be minimal unless the pages are actually modified, so I think we can try this.

@howard0su
Collaborator Author

No noticeable difference:

main:
llama_print_timings:        load time =  1355.03 ms
llama_print_timings:      sample time =     9.58 ms /    19 runs   (    0.50 ms per token,  1984.33 tokens per second)
llama_print_timings: prompt eval time =   815.69 ms /     5 tokens (  163.14 ms per token,     6.13 tokens per second)
llama_print_timings:        eval time =  7154.28 ms /    18 runs   (  397.46 ms per token,     2.52 tokens per second)
llama_print_timings:       total time =  7982.96 ms

llama_print_timings:        load time =  1678.13 ms
llama_print_timings:      sample time =     9.70 ms /    19 runs   (    0.51 ms per token,  1959.17 tokens per second)
llama_print_timings: prompt eval time =   889.12 ms /     5 tokens (  177.82 ms per token,     5.62 tokens per second)
llama_print_timings:        eval time =  7327.71 ms /    18 runs   (  407.09 ms per token,     2.46 tokens per second)
llama_print_timings:       total time =  8229.86 ms



this branch without Lora:
llama_print_timings:        load time =  1366.86 ms
llama_print_timings:      sample time =     9.56 ms /    19 runs   (    0.50 ms per token,  1987.45 tokens per second)
llama_print_timings: prompt eval time =   884.49 ms /     5 tokens (  176.90 ms per token,     5.65 tokens per second)
llama_print_timings:        eval time =  7270.45 ms /    18 runs   (  403.91 ms per token,     2.48 tokens per second)
llama_print_timings:       total time =  8168.13 ms


with Lora:
llama_print_timings:        load time = 23749.24 ms
llama_print_timings:      sample time =    16.09 ms /    33 runs   (    0.49 ms per token,  2050.33 tokens per second)
llama_print_timings: prompt eval time =   824.12 ms /     5 tokens (  164.82 ms per token,     6.07 tokens per second)
llama_print_timings:        eval time = 10654.57 ms /    32 runs   (  332.96 ms per token,     3.00 tokens per second)
llama_print_timings:       total time = 11500.75 ms

@howard0su howard0su merged commit 2347463 into ggml-org:master Jul 11, 2023
howard0su added a commit to howard0su/llama.cpp that referenced this pull request Jul 12, 2023
Has perf regression when mlock is used.

This reverts commit 2347463.
howard0su added a commit that referenced this pull request Jul 13, 2023
Has perf regression when mlock is used.

This reverts commit 2347463.
@Green-Sky Green-Sky mentioned this pull request Aug 4, 2023