
June 2024 Binary Update #751

Merged
merged 3 commits into master from binary_update_june_2024 on Jun 3, 2024

Conversation

@martindevans (Member) commented May 24, 2024

llama.cpp commit: 1debe72737ea131cb52975da3d53ed3a835df3a6
Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9216317481
Rosetta Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9340131648

Testing:

  • Windows (CPU)
  • Windows (CUDA 11)
  • Windows (CUDA 12)
  • Windows (OpenCL)
  • Linux (CPU)
  • Linux (CUDA 11)
  • Linux (CUDA 12)
  • Linux (OpenCL)
  • MacOS (ARM64)
  • MacOS (Rosetta)

@Hyp3rSoniX

Works on MacOS (Macbook Pro M1 Max).
Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

@abhiaagarwal (Contributor)

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225

@Hyp3rSoniX

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225

I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.

The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.
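
For reference, a minimal sketch of how a context longer than 4k can be requested through LLamaSharp, assuming the same ModelParams API used later in this thread; the model path and the 30k value are illustrative, not copied from anyone's actual setup:

using LLama;
using LLama.Common;

// Illustrative path; point this at the local copy of the GGUF file.
var parameters = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf")
{
    // Anything above the old 4k limit; the 128k model itself allows much more.
    ContextSize = 30_000,
};

// Load the weights once, then create a context with the long window.
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);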

@Hyp3rSoniX

Works on Linux x64 CUDA 12 (RTX 3090)
Tested with Phi-3-medium-128k-instruct-Q6_K.gguf

root@lxdocker:/AI/model-configs# nvidia-smi
Sat May 25 15:33:58 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:2E:00.0 Off |                  N/A |
|  0%   40C    P8             12W /  350W |   11072MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
root@lxdocker:/AI/model-configs# uname -a
Linux lxdocker 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 x86_64 x86_64 GNU/Linux

@m0nsky (Contributor) commented May 25, 2024

All tests passed on Windows 10 with CUDA 12.
I also upgraded my projects to the Windows 10 OpenCL binaries from this PR, and they are working fine as well.

@abhiaagarwal (Contributor)

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225
>
> I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.
>
> The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.

Yep, I just wanted to see if it was actually working given it's relatively new. Slightly off topic, but out of curiosity, have you observed any degradation above 32k context? I've personally seen it fall apart at ~50k.

@Hyp3rSoniX

> Works on MacOS (Macbook Pro M1 Max). Tested with Phi-3-medium-128k-instruct-Q6_K.gguf
>
> Have you tested >4k context lengths? They just added support with ggerganov/llama.cpp#7225
>
> I'm currently using a context length of 30k on the 128k model. So I would say it's working with contexts higher than 4k.
> The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.
>
> Yep, I just wanted to see if it was actually working given it's relatively new. Slightly off topic, but out of curiosity, have you observed any degradation above 32k context? I've personally seen it fall apart at ~50k.

On Mac? I had it fall apart at about 50k to 60k context, but I have LLamaSharp's native library logging enabled, and there I could see that it broke the moment it ran out of memory.

The out-of-memory happens during inference though, so it doesn't error out. Instead it starts to hallucinate badly in different languages.

You'll probably need a 64GB Mac to be able to run the model with a context size of 50k+.

On my Linux server I have the model running in CPU + GPU mode at 100k context, and it works fine. The RAM usage is insane though.
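
A rough sketch of that mixed CPU + GPU setup, again assuming the ModelParams API shown further down in this thread; the path and the layer count are illustrative:

using LLama;
using LLama.Common;

var parameters = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf") // illustrative path
{
    ContextSize = 100_000,  // the 100k context mentioned above; expect very high RAM usage
    GpuLayerCount = 20,     // illustrative split: only this many layers go to the GPU, the rest stay on the CPU
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);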

@AsakusaRinne (Collaborator)

> I had it fall apart at about 50k to 60k context.

@Hyp3rSoniX Hi, does the model perform well when you give it a 50k to 60k context? May I ask which model you were using? There was a user complaining last week that the model behaved badly after about 4k of context, but I'm not sure what the problem is.

@Hyp3rSoniX

> I had it fall apart at about 50k to 60k context.
>
> @Hyp3rSoniX Hi, does the model perform well when you give it a 50k to 60k context? May I ask which model you were using? There was a user complaining last week that the model behaved badly after about 4k of context, but I'm not sure what the problem is.

_modelParams = new ModelParams(modelPathName)
                {
                    SeqMax = 2,
                    BatchSize = 512,
                    ContextSize = 100000,      // 100k token context window
                    Seed = 1337,
                    GpuLayerCount = gpuLayers, // layers offloaded to the GPU; the rest run on the CPU
                    Threads = 4,
                    UseMemorymap = false,
                    FlashAttention = true,
                    UBatchSize = 1024
                };

It's working fine for me.

Small part of a conversation with the settings above:

[Screenshot, 2024-05-26 11:48: excerpt of a conversation using the settings above]

Or do you mean specifically between 50k and 60k? 50k did work on my local machine; I had to reduce n_batch a bit, but I could communicate with the model normally.

This is the model I use (download link): https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF/resolve/main/Phi-3-medium-128k-instruct-Q6_K.gguf?download=true
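
For anyone wanting to reproduce this, a rough usage sketch of wiring such a configuration into a chat loop; the path, layer count, prompt, and MaxTokens value are illustrative stand-ins rather than values taken from the setup above:

using System;
using LLama;
using LLama.Common;

// Stand-ins for the modelPathName / gpuLayers variables used in the snippet above.
var modelParams = new ModelParams("Phi-3-medium-128k-instruct-Q6_K.gguf")
{
    ContextSize = 100_000,
    GpuLayerCount = 28,
    FlashAttention = true,
};

using var weights = LLamaWeights.LoadFromFile(modelParams);
using var context = weights.CreateContext(modelParams);
var executor = new InteractiveExecutor(context);

// Stream the reply token by token; InferenceParams controls per-call limits such as MaxTokens.
await foreach (var token in executor.InferAsync("Hello!", new InferenceParams { MaxTokens = 256 }))
    Console.Write(token);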

@martindevans force-pushed the binary_update_june_2024 branch from bdc6c57 to ac02af9 on June 2, 2024 at 18:25.
@martindevans (Member, Author)

MacOS Rosetta binaries have been added, using a separate build run from the rest of the binaries (to avoid changing everything and needing to re-test). The Rosetta binaries use the same commit as all the other binaries.

@martindevans merged commit 47fb5e8 into SciSharp:master on Jun 3, 2024.
7 checks passed
@martindevans deleted the binary_update_june_2024 branch on June 3, 2024 at 22:12.