
Support using mmap when applying LoRA #2095

Merged · 3 commits · Jul 11, 2023

Conversation

howard0su
Collaborator

@howard0su howard0su commented Jul 4, 2023

On Linux, when a file is mapped with mmap using MAP_PRIVATE, modifications to the mapped buffer are not written back to disk.
On Windows, the same applies when using MapViewOfFile with FILE_MAP_COPY.
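
For illustration, here is a minimal, hypothetical sketch of copy-on-write mapping on both platforms (not the actual patch; names are made up and error handling is reduced to early returns):

```cpp
// Map a file copy-on-write so in-memory changes to the mapped weights
// are never written back to the file on disk.
#include <cstddef>

#ifdef _WIN32
#include <windows.h>

void * map_file_copy_on_write(const char * path, size_t size) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_WRITECOPY, 0, 0, NULL);
    CloseHandle(file);
    if (mapping == NULL) return NULL;
    // FILE_MAP_COPY: modified pages become private copies; the file stays untouched.
    void * addr = MapViewOfFile(mapping, FILE_MAP_COPY, 0, 0, size);
    CloseHandle(mapping);
    return addr;
}
#else
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

void * map_file_copy_on_write(const char * path, size_t size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    // MAP_PRIVATE: copy-on-write; writes go to private pages, never back to the file.
    void * addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    return addr == MAP_FAILED ? NULL : addr;
}
#endif
```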

@howard0su
Collaborator Author

Modified the code based on the documentation. Needs testing.

@Green-Sky
Collaborator

Based on the documentation, on Linux we probably need to set MAP_PRIVATE instead of MAP_SHARED.

@howard0su
Collaborator Author

You are right.

@howard0su
Collaborator Author

The intention behind this PR is that I want to further refactor the loading code:

llm_model_file -> represents a file; it can be a model file or a LoRA file.
llm_model -> represents a model; has virtual functions to override to support multiple models like LLaMA. This class is also responsible for moving tensors to the GPU based on the tensor backend preference.

offload_trait -> a set of classes that represent how to offload for CUDA, OpenCL, CUDA_SPLIT, etc.

mmap and non-mmap make the code a bit complex. Any suggestions on how we can further reduce the complexity? A rough sketch of the intended split is below.
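
A purely illustrative sketch of that split (the names and interfaces below are hypothetical, not existing llama.cpp types):

```cpp
#include <string>
#include <vector>

struct llm_tensor;              // placeholder for a loaded tensor

// Represents a file on disk: either a base model file or a LoRA adapter file.
struct llm_model_file {
    std::string path;
    bool        is_lora = false;
    // Reads raw tensors, via mmap or plain file I/O.
    virtual std::vector<llm_tensor *> load_tensors() = 0;
    virtual ~llm_model_file() = default;
};

// How tensors are offloaded to a backend (CUDA, OpenCL, split across GPUs, ...).
struct offload_trait {
    virtual void offload(llm_tensor * t) = 0;
    virtual ~offload_trait() = default;
};

// Represents a model built from one or more files; subclasses override the
// architecture-specific parts (e.g. LLaMA) and move tensors to the preferred backend.
struct llm_model {
    virtual void load(llm_model_file & file, offload_trait & offload) = 0;
    virtual ~llm_model() = default;
};
```

One way this could reduce the mmap/non-mmap complexity is to hide the choice behind llm_model_file, so llm_model and the offload traits never need to care which path is used.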

@slaren
Member

slaren commented Jul 5, 2023

Are there any possible performance considerations or any other consequences of enabling copy-on-write with mmap by default?

@howard0su
Collaborator Author

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

@Green-Sky
Collaborator

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

https://huggingface.co/tloen/alpaca-lora-7b
Try this classic one.

@howard0su
Collaborator Author

Based on my understanding, performance should be the same, but I need to test. I haven't found a LoRA model yet.

https://huggingface.co/tloen/alpaca-lora-7b Try this classic one.

The file doesn't have the correct magic in the header. Anyway, thank you.

@Green-Sky
Collaborator

Green-Sky commented Jul 7, 2023

Ohh, you don't know. There is a convert-lora-to-ggml.py script that converts it. :)
It creates a ggla file.

@howard0su howard0su marked this pull request as ready for review July 11, 2023 09:07
@howard0su
Collaborator Author

Tested functionality on Windows. Also used an MD5 hash to verify that the original weights file is unchanged.

@Green-Sky
Collaborator

Are there any possible performance considerations or any other consequences of enabling copy-on-write with mmap by default?

I think you need to run 2 llama.cpp instances, with at least 1 LoRA, to see duplication in memory.

Member

@slaren slaren left a comment


The changes look good, and the overhead of always mapping files as copy-on-write should be minimal unless the pages are actually modified, so I think we can try this.

@howard0su
Collaborator Author

No noticeable difference:

main:
llama_print_timings:        load time =  1355.03 ms
llama_print_timings:      sample time =     9.58 ms /    19 runs   (    0.50 ms per token,  1984.33 tokens per second)
llama_print_timings: prompt eval time =   815.69 ms /     5 tokens (  163.14 ms per token,     6.13 tokens per second)
llama_print_timings:        eval time =  7154.28 ms /    18 runs   (  397.46 ms per token,     2.52 tokens per second)
llama_print_timings:       total time =  7982.96 ms

llama_print_timings:        load time =  1678.13 ms
llama_print_timings:      sample time =     9.70 ms /    19 runs   (    0.51 ms per token,  1959.17 tokens per second)
llama_print_timings: prompt eval time =   889.12 ms /     5 tokens (  177.82 ms per token,     5.62 tokens per second)
llama_print_timings:        eval time =  7327.71 ms /    18 runs   (  407.09 ms per token,     2.46 tokens per second)
llama_print_timings:       total time =  8229.86 ms



this branch without Lora:
llama_print_timings:        load time =  1366.86 ms
llama_print_timings:      sample time =     9.56 ms /    19 runs   (    0.50 ms per token,  1987.45 tokens per second)
llama_print_timings: prompt eval time =   884.49 ms /     5 tokens (  176.90 ms per token,     5.65 tokens per second)
llama_print_timings:        eval time =  7270.45 ms /    18 runs   (  403.91 ms per token,     2.48 tokens per second)
llama_print_timings:       total time =  8168.13 ms


with Lora:
llama_print_timings:        load time = 23749.24 ms
llama_print_timings:      sample time =    16.09 ms /    33 runs   (    0.49 ms per token,  2050.33 tokens per second)
llama_print_timings: prompt eval time =   824.12 ms /     5 tokens (  164.82 ms per token,     6.07 tokens per second)
llama_print_timings:        eval time = 10654.57 ms /    32 runs   (  332.96 ms per token,     3.00 tokens per second)
llama_print_timings:       total time = 11500.75 ms

@howard0su howard0su merged commit 2347463 into ggml-org:master Jul 11, 2023
howard0su added a commit to howard0su/llama.cpp that referenced this pull request Jul 12, 2023
Has perf regression when mlock is used.

This reverts commit 2347463.
howard0su added a commit that referenced this pull request Jul 13, 2023
Has perf regression when mlock is used.

This reverts commit 2347463.
@Green-Sky Green-Sky mentioned this pull request Aug 4, 2023