Running a Vicuna-13B 4-bit model? #771
Comments
Have a look here --> #643 |
I've had the most success with this model using the following patch to instruct mode.
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 453450a..70b4f45 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -152,13 +152,13 @@ int main(int argc, char ** argv) {
}
// prefix & suffix for instruct mode
- const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
- const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+ const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Human:\n\n", true);
+ const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Assistant:\n\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
- params.antiprompt.push_back("### Instruction:\n\n");
+ params.antiprompt.push_back("### Human:\n\n");
}
    // enable interactive mode if reverse prompt or interactive start is specified
And then I run the model with the following options. If there are better options, please let me know.
./main \
--model ./models/ggml-vicuna-13b-4bit/ggml-vicuna-13b-4bit.bin \
--color \
--threads 7 \
--batch_size 256 \
--n_predict -1 \
--top_k 12 \
--top_p 1 \
--temp 0.36 \
--repeat_penalty 1.05 \
--ctx_size 2048 \
--instruct \
--reverse-prompt '### Human:' \
--file prompts/vicuna.txt
And my prompt file:
Example output
I've made this change to align with FastChat and the roles it uses. Could someone who knows C better than I do add a prompt suffix flag? |
Vicuna is a pretty strict model in terms of following that ### Human/### Assistant format when compared to alpaca and gpt4all. Less flexible but fairly impressive in how it mimics ChatGPT responses. |
It's extremely slow on my M1 MacBook (unusable), quite usable on my 4-year-old i7 workstation, and it doesn't work at all on the same workstation inside Docker. Found #767; adding --mlock solved the slowness issue on the MacBook. |
Although I'm not proficient in C, I was able to make some modifications to llama.cpp by recompiling main.cpp with the changes, renaming the resulting main.exe to vicuna.exe, and moving it into my main llama.cpp folder. To choose a model, I created a bat file that prompts me to select a model, and if I choose the vicuna model, the bat file runs vicuna.exe instead of main.exe. I've included the bat file below for reference:
I hope this helps anyone who may be interested in trying this out! |
Tried these settings and it's really nice! It really has learned the ChatGPT style well, and the 13b model seems to have good underlying knowledge.
But there is a problem: it doesn't seem to stop by itself; it will keep generating the next line of |
I've been able to compile the latest standard llama.cpp with cmake under Windows 10, then run ggml-vicuna-7b-4bit-rev1.bin, and even ggml-vicuna-13b-4bit-rev1.bin, with this command line (assuming your .bin is in the same folder as main.exe):
main -i --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-vicuna-7b-4bit-rev1.bin
And this one runs even faster on my 8-core CPU:
main --color --threads 7 --batch_size 256 --n_predict -1 --top_k 12 --top_p 1 --temp 0.36 --repeat_penalty 1.05 --ctx_size 2048 --instruct --reverse-prompt "### Human:" --model ggml-vicuna-13b-4bit-rev1.bin
It works on a Windows laptop with 16 GB RAM and looks almost like ChatGPT! (Slower, of course, but at roughly the speed a human types!) I agree that it may be the best LLM to run locally! And it seems it can write much more correct and longer program code than gpt4all! It's just amazing! But sometimes, after a few answers, it just freezes forever while continuing to load the CPU. Has anyone noticed this? Why might that be? |
Context swap. The context fills up and then the first half of it is deleted to make more room, but that means the whole context has to be re-evaluated to catch up. The OpenBLAS option is supposed to accelerate it, but I don't know how easy it is to make it work on Windows; vcpkg seems to have some BLAS packages. |
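To make the swap concrete, here is a minimal standalone sketch of the idea (my own simplification, not the actual llama.cpp code): keep the first n_keep tokens, drop the older half of the rest, and re-evaluate whatever is kept beyond that prefix.

// Simplified illustration of the context-swap idea (not the real llama.cpp code).
#include <cstdio>
#include <vector>

int main() {
    const int n_ctx  = 8;   // toy context window size
    const int n_keep = 2;   // tokens (e.g. the initial prompt) that are never discarded

    std::vector<int> ctx_tokens(n_ctx);
    for (int i = 0; i < n_ctx; ++i) ctx_tokens[i] = i + 1;   // pretend the window is full

    // Keep the first n_keep tokens, then only the newer half of the remainder.
    const int n_left = n_ctx - n_keep;
    std::vector<int> swapped(ctx_tokens.begin(), ctx_tokens.begin() + n_keep);
    swapped.insert(swapped.end(), ctx_tokens.end() - n_left / 2, ctx_tokens.end());

    // Everything after the kept prefix has to be fed through the model again,
    // which is the long pause people see when the context fills up.
    printf("kept %zu tokens, %zu of them must be re-evaluated\n",
           swapped.size(), swapped.size() - (size_t) n_keep);
    return 0;
}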
So it's just trying to compress the overfilled context so that the conversation can continue without losing any important details? And that's normal, and I should just have a cup of tea in the meantime instead of restarting it as I did? :-) |
You can use |
Can someone explain to me what the difference between these two options is? (Both options work fine.)
|
About temperature, read here. As I understand it, the higher the temperature, the more stochastic/chaotic the choice of words; the lower the temperature, the more deterministic the result, and at temperature = 0 the result is always the same. So you can tune that parameter for your application: for writing code, temp = 0 may be better; for writing a poem, temp = 1 or even more... (If I'm wrong, correct me!) As for threads, my intuition is that you can use as many threads as your CPU supports minus 1 or 2 (so that other apps and the system won't hang). I think the bottleneck is not the CPU but RAM throughput. If anyone has a different opinion, please correct me! |
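As a rough illustration of what the temperature knob does (a simplified sketch, not llama.cpp's actual sampling code): the logits are divided by the temperature before the softmax, so a low temperature concentrates probability on the top token and a high temperature flattens the distribution toward more random word choice.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax over logits scaled by 1/temp. Lower temp sharpens the distribution
// (nearly always the top token); higher temp flattens it (more chaotic output).
static std::vector<double> softmax_with_temp(const std::vector<double>& logits, double temp) {
    const double max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<double> p(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - max_l) / temp);
        sum += p[i];
    }
    for (double& v : p) v /= sum;
    return p;
}

int main() {
    const std::vector<double> logits = {2.0, 1.0, 0.5};   // toy scores for three candidate tokens
    for (double temp : {0.1, 0.36, 1.0, 2.0}) {           // temp -> 0 approaches greedy argmax
        const auto p = softmax_with_temp(logits, temp);
        printf("temp=%.2f -> %.3f %.3f %.3f\n", temp, p[0], p[1], p[2]);
    }
    return 0;
}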
How do I download the model, like a regular app?
I use the following trick to partly overcome this problem:
|
There is a Vicuna rev1 model with some kind of stop fix on 🤗. Maybe that solves your issue? |
Yes, I'm talking about rev1, so we need to change llama_token_eos() or what? |
This is just the same as the |
Yes, that's right. Thank you. You mean the --reverse-prompt (-r) option. I use a prompt file to start generation in this way. Content of the input.txt file:
The -r option switches the program into interactive mode, so it will not exit at the end and keeps waiting. Therefore I made the following quick fix for Vicuna:
and
|
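For anyone curious how the reverse-prompt check works in principle, here is a minimal standalone sketch (my own illustration, not the quick fix mentioned above): after each generated piece, test whether the output now ends with the antiprompt; if it does, stop generating and hand control back (or simply exit for a one-shot run).

// Standalone sketch of reverse-prompt (antiprompt) detection.
#include <cstdio>
#include <string>
#include <vector>

static bool ends_with(const std::string& text, const std::string& suffix) {
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

int main() {
    const std::string antiprompt = "### Human:";
    std::string generated;

    // Pretend the model emits these pieces one at a time.
    const std::vector<std::string> pieces = {
        "Sure, here is the answer.", "\n", "### ", "Human:"
    };

    for (const auto& piece : pieces) {
        generated += piece;
        if (ends_with(generated, antiprompt)) {
            printf("reverse prompt detected -> stop generating / hand control back\n");
            break;
        }
    }
    return 0;
}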
I've repeated your modifications, but nothing changes; it still shows "### Human:" each time... Does anyone know how to get clean answers (so that only '>' is shown instead of '### Human:')?
You can try my additional quick hack (it removes "### Human:" from the end of each response):
|
The Vicuna v1.1 model used a different setup. See https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L115-L124 and https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/fastchat/conversation.py#L37-L44
IIUC, the prompt as a Bourne shell string is
Their doc says this: https://github.com/lm-sys/FastChat/blob/f85f489f2d5e48c37cceb2f00c3edc075c5d3711/docs/weights_version.md#example-prompt-weight-v11
I think the |
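For reference, here is my rough reconstruction of the v1.1 prompt from the FastChat template linked above; the exact wording and separators are an assumption on my part and may differ from the official doc.

// Rough reconstruction of the Vicuna v1.1 prompt layout (roles USER/ASSISTANT,
// turns joined by a space, "</s>" ending each assistant reply). Treat the
// strings below as assumptions, not authoritative values.
#include <cstdio>
#include <string>

int main() {
    const std::string system =
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed, and polite answers to the user's questions.";

    // One user turn, ending where the assistant is expected to continue.
    const std::string prompt = system + " USER: Hello! ASSISTANT:";

    printf("%s\n", prompt.c_str());
    return 0;
}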
Uhhh... Such a mess... We definitely need some standardization for people teaching LLMs! At least with tokens such as assistant/human/eos it should be possible, 'cos these are just technicalities not directly connected with LLM functionality... Or, on the software side, there should be an easy way to adapt any token without editing C++ code... |
Since #863 may not happen soon, I tested this and it works on 1.1:
    // prefix & suffix for instruct mode
const auto inp_pfx = ::llama_tokenize(ctx, "\nUSER:", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\nASSISTANT:", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_start = true;
params.antiprompt.push_back("USER:");
} |
These are my settings, and they run well on my Mac (only '>' instead of '### Human:'):
@chakflying I have the same issue when using GPT4All with this model: after sending my first prompt, I lose control over it.
I found this model:
[ggml-vicuna-13b-4bit](https://huggingface.co/eachadea/ggml-vicuna-13b-4bit/tree/main), and judging by their online demo it's very impressive.
I tried to run it with the latest version of llama.cpp. The model loads fine, but as soon as it loads it starts hallucinating and quits by itself.
Do I need to have it converted or something like that?