
June 2024 Binary Update #751

Merged
merged 3 commits into master from binary_update_june_2024 on Jun 3, 2024

Conversation

@martindevans (Member) commented May 24, 2024

llama.cpp commit: 1debe72737ea131cb52975da3d53ed3a835df3a6
Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9216317481
Rosetta Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9340131648

Testing:

  • Windows (CPU)
  • Windows (CUDA 11)
  • Windows (CUDA 12)
  • Windows (OpenCL)
  • Linux (CPU)
  • Linux (CUDA 11)
  • Linux (CUDA 12)
  • Linux (OpenCL)
  • MacOS (ARM64)
  • MacOS (Rosetta)

@Hyp3rSoniX

Works on MacOS (Macbook Pro M1 Max).
Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

@abhiaagarwal (Contributor)

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225

@Hyp3rSoniX

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225

I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.

The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.
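
For reference, a minimal sketch of how a context longer than 4k can be requested through LLamaSharp, assuming the same ModelParams API used later in this thread; the model path and the 30k value are illustrative, not copied from anyone's actual setup:

using LLama;
using LLama.Common;

// Illustrative path; point this at the local copy of the GGUF file.
var parameters = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf")
{
    // Anything above the old 4k limit; the 128k model itself allows much more.
    ContextSize = 30_000,
};

// Load the weights once, then create a context with the long window.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);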

@Hyp3rSoniX

Works on Linux x64 CUDA 12 (RTX 3090)
Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

root@lxdocker:/AI/model-configs# nvidia-smi
Sat May 25 15:33:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:2E:00.0 Off |                  N/A |
|  0%   40C    P8             12W /  350W |   11072MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@lxdocker:/AI/model-configs# uname -a
Linux lxdocker 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 x86_64 x86_64 GNU/Linux

@m0nsky (Contributor) commented May 25, 2024

All tests passed on Windows 10 with CUDA 12.
I also upgraded my projects to the Windows 10 OpenCL binaries from this PR, and they are working fine as well.

@abhiaagarwal (Contributor)

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225
>
> I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.
>
> The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.

Yep, I just wanted to see if it was actually working given it's relatively new. Slightly off topic, but out of curiosity, have you observed any degradation above 32k context? I've personally seen it fall apart at ~50k.

@Hyp3rSoniX

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225
>
> I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.
> The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.
>
> Yep, I just wanted to see if it was actually working given it's relatively new. Slightly off topic, but out of curiosity, have you observed any degradation above 32k context? I've personally seen it fall apart at ~50k.

On Mac? I had it fall apart at about 50k to 60k context, but I have LLamaSharp's native library logging enabled, and there I could see that it broke the moment it ran out of memory.

The out-of-memory happens during inference though, so it doesn't error out. Instead it starts to hallucinate badly in different languages.

You'll probably need a 64GB Mac to be able to run the model with a context size of 50k+.

On my Linux server I have the model running in CPU + GPU mode at 100k context, and it works fine. The RAM usage is insane though.
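
A rough sketch of that mixed CPU + GPU setup, again assuming the ModelParams API shown further down in this thread; the path and the layer count are illustrative:

using LLama;
using LLama.Common;

var parameters = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf") // illustrative path
{
    ContextSize = 100_000,  // the 100k context mentioned above; expect very high RAM usage
    GpuLayerCount = 20,     // illustrative split: only this many layers go to the GPU, the rest stay on the CPU
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);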

@AsakusaRinne (Collaborator)

> I had it fall apart at about 50k to 60k context.

@Hyp3rSoniX Hi, does the model perform well when you give it a 50k to 60k context? May I ask which model you were using? There was a user complaining last week that the model behaved badly after about 4k of context, but I'm not sure what the problem is.

@Hyp3rSoniX

> I had it fall apart at about 50k to 60k context.
>
> @Hyp3rSoniX Hi, does the model perform well when you give it a 50k to 60k context? May I ask which model you were using? There was a user complaining last week that the model behaved badly after about 4k of context, but I'm not sure what the problem is.

_modelParams = new ModelParams(modelPathName)
                {
                    SeqMax = 2,
                    BatchSize = 512,
                    ContextSize = 100000,      // 100k token context window
                    Seed = 1337,
                    GpuLayerCount = gpuLayers, // layers offloaded to the GPU; the rest run on the CPU
                    Threads = 4,
                    UseMemorymap = false,
                    FlashAttention = true,
                    UBatchSize = 1024
                };

It's working fine for me.

Small part of a conversation with the settings above:

[Screenshot, 2024-05-26 11:48: excerpt of a conversation using the settings above]

Or do you mean specifically between 50k and 60k? 50k did work on my local machine; I had to reduce n_batch a bit, but I could communicate with the model normally.

This is the model I use (download link): https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF/resolve/main/Phi-3-medium-128k-instruct-Q6_K.gguf?download=true
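
For anyone wanting to reproduce this, a rough usage sketch of wiring such a configuration into a chat loop; the path, layer count, prompt, and MaxTokens value are illustrative stand-ins rather than values taken from the setup above:

using System;
using LLama;
using LLama.Common;

// Stand-ins for the modelPathName / gpuLayers variables used in the snippet above.
var modelParams = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf")
{
    ContextSize = 100_000,
    GpuLayerCount = 28,
    FlashAttention = true,
};

using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);
var executor = new InteractiveExecutor(context);

// Stream the reply token by token; InferenceParams controls per-call limits such as MaxTokens.
await foreach (var token in executor.InferAsync("Hello!", new InferenceParams { MaxTokens = 256 }))
    Console.Write(token);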

@martindevans force-pushed the binary_update_june_2024 branch from bdc6c57 to ac02af9 on June 2, 2024 at 18:25.
@martindevans (Member, Author)

MacOS Rosetta binaries have been added, using a separate build run from the rest of the binaries (to avoid changing everything and needing to re-test). The Rosetta binaries use the same commit as all the other binaries.

@martindevans merged commit 47fb5e8 into SciSharp:master on Jun 3, 2024.
7 checks passed
@martindevans deleted the binary_update_june_2024 branch on June 3, 2024 at 22:12.