June 2024 Binary Update #751
Conversation
Works on macOS (MacBook Pro M1 Max).
Have you tested >4k context lengths? They only just added support with ggerganov/llama.cpp#7225.
I'm currently using a context length of 30k on the 128k model, so I would say it's working with contexts higher than 4k. The commit that Martin chose is also newer than the commit you linked, so the functionality should be included.
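For reference, a minimal sketch of how a >4k context can be requested through LLamaSharp is below. The model path, context size, layer count, and prompt are illustrative assumptions, not the exact settings used in this thread; the types (ModelParams, LLamaWeights, InteractiveExecutor, ChatSession) are LLamaSharp's standard API.

```csharp
using System;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

class LongContextDemo
{
    static async Task Main()
    {
        // Hypothetical local path to a 128k-context GGUF model (e.g. Phi-3 medium 128k).
        var parameters = new ModelParams("models/Phi-3-medium-128k-instruct-Q6_K.gguf")
        {
            ContextSize = 30720,   // ~30k tokens, well beyond the old 4k default
            GpuLayerCount = 20     // illustrative; depends on available VRAM
        };

        using var weights = LLamaWeights.LoadFromFile(parameters);
        using var context = weights.CreateContext(parameters);
        var executor = new InteractiveExecutor(context);
        var session = new ChatSession(executor);

        // One-shot chat turn; streaming tokens are written as they arrive.
        await foreach (var token in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, "Summarise the text above."),
            new InferenceParams { MaxTokens = 256 }))
        {
            Console.Write(token);
        }
    }
}
```

Whether the full 30k actually fits depends on available RAM/VRAM, since the KV cache grows with the requested context size.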
Works on Linux x64 CUDA 12 (RTX 3090).
All tests passed on Windows 10 CUDA 12.
Yep, I just wanted to see if it was actually working, given it's relatively new. Slightly off topic, but out of curiosity, have you observed any degradation above 32k context? I've personally seen it fall apart at ~50k.
On Mac? I had it fall apart at about 50k to 60k context, but I have LLamaSharp native libs logging on, and there I could see that it broke the moment it ran out of memory. The out-of-memory happens while inferencing though, so it doesn't error out; instead it starts to badly hallucinate in different languages. You'll probably need a 64GB Mac to be able to run the model with a context size of 50k+. On my Linux server I have the model running in CPU + GPU mode at 100k context, and it works fine. The RAM usage is insane though.
@Hyp3rSoniX Hi, does the model perform well when you give it a 50k to 60k context? May I ask which model you were using? A user complained last week that the model behaves badly after about 4k context, but I'm not sure what the problem is.
It's working fine for me; see the small excerpt of a conversation with the settings above. Or do you mean specifically between 50k and 60k? 50k did work on my local machine too. I had to reduce the n_batch a bit, but I could communicate with the model normally. This is the model I use (download link): https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF/resolve/main/Phi-3-medium-128k-instruct-Q6_K.gguf?download=true
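A rough sketch of the memory-related knobs discussed in this exchange, with purely illustrative values (not the exact settings used above); the property names are LLamaSharp's ModelParams fields, where BatchSize corresponds to llama.cpp's n_batch:

```csharp
using LLama.Common;

// Illustrative settings for squeezing a very large context onto limited hardware.
var parameters = new ModelParams("models/Phi-3-medium-128k-instruct-Q6_K.gguf")
{
    ContextSize = 51200,    // ~50k tokens; KV cache memory grows with this
    BatchSize = 256,        // smaller n_batch lowers peak memory during prompt processing
    GpuLayerCount = 24      // offload only part of the model, keep the rest on the CPU
};
```

Lowering BatchSize trades prompt-processing speed for lower peak memory, which matches the "reduce the n_batch a bit" observation above.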
Force-pushed from bdc6c57 to ac02af9.
macOS Rosetta binaries have been added, using a separate build run from the rest of the binaries (to avoid changing everything and needing to re-test). The Rosetta binaries use the same commit as all the other binaries.
llama.cpp commit: 1debe72737ea131cb52975da3d53ed3a835df3a6
Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9216317481
Rosetta Build Action: https://github.com/SciSharp/LLamaSharp/actions/runs/9340131648
Testing: