Server: completion_probabilities (tok_str and prob) seem to be broken #7197
Comments
When …
Thank you, @ggerganov. I just saw that the change happened in af0a5b6. It is, however, a bit strange that the tokens for …
Yes, it's confusing. Should improve this - PRs welcome
Sorry, I must have made a mistake while testing.
The intended behavior is that with temperature 0 the top tokens are still returned, but all tokens other than the top one have 0 probability. Essentially you should be getting the same thing as with a temperature of 0.001.
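As a rough sketch of what that implies for the returned candidate list (made-up tokens and values; only the tok_str/prob field names are taken from this issue's title), the entries for one sampled position would be expected to look roughly like this:

    # Hypothetical top-3 candidates for a single sampled position, for illustration only.
    # With temperature 0.001 the distribution is already nearly one-hot:
    probs_temp_0_001 = [
        {"tok_str": " Hello", "prob": 0.998},
        {"tok_str": " Hi", "prob": 0.0015},
        {"tok_str": " Hey", "prob": 0.0005},
    ]

    # With temperature 0 the same candidates should still be listed, but the top
    # token gets probability 1.0 and every other candidate gets 0.0:
    probs_temp_0 = [
        {"tok_str": " Hello", "prob": 1.0},
        {"tok_str": " Hi", "prob": 0.0},
        {"tok_str": " Hey", "prob": 0.0},
    ]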
@JohannesGaessler, I see.
I don't understand why the top token should have 100% probability and the others 0% probability assigned. I mean, their appearance and position in the top n tokens are defined by their respective logits. I would expect the server response to reflect the actual model output. Is the reason for your approach that you save the softmax calculation?
If you sample with 0 temperature, those are simply the probabilities with which the tokens are sampled. It is the correct way to continue the probabilities as the temperature goes towards 0; you would have discontinuities otherwise. Internally llama.cpp does not calculate token probabilities at all with temperature == 0.0f, hence the need to manually set the values. For temperatures < 0.0f the tokens are also sampled greedily, but the backend still calculates token probabilities as you would get them with 1.0 temperature and no other samplers.
It saves you not just the softmax but all the other samplers as well.
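To make the continuity argument concrete, here is a small self-contained Python sketch of temperature-scaled softmax with made-up logits; it is not llama.cpp's actual sampling code, only an illustration of why the one-hot values are the natural limit as the temperature goes to 0:

    import math

    def softmax_with_temperature(logits, temperature):
        # Temperature-scaled softmax; smaller temperatures sharpen the distribution.
        scaled = [l / temperature for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [2.0, 1.5, -0.5]  # made-up logits for three candidate tokens
    for t in (1.0, 0.1, 0.01, 0.001):
        print(t, [round(p, 4) for p in softmax_with_temperature(logits, t)])
    # The printed distributions approach [1.0, 0.0, 0.0] as the temperature shrinks,
    # i.e. exactly the probabilities that the temperature == 0 code path reports directly.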
It was a five-minute fix, so I also opened a PR to yield the intended behavior: #7203. Varying the number of returned tokens is, I think, not a good solution because it leads to weird behavior for temperatures that are almost but not quite 0. With temperature 0 and temperature …
Hello,
I am using the llama.cpp server and noticed strange behavior in the server responses.
When starting a server on commit 637e9a8 using
./server -m ../models/llama-2-7b-chat.Q4_K_M.gguf -c 4096 -ngl 1000 -np 1 -cb
and sending this curl command, I get the following JSON response:
But when running the same command on the latest commit on master (f89fe27), I get
The returned probs are strange, and the tokens seem to be the first n tokens of the tokenizer vocabulary.
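For reference, a request of roughly the following shape is enough to inspect the field in question; the prompt, host/port, and parameter values are illustrative assumptions rather than the original curl command, and the /completion endpoint with its n_probs option is assumed from the server's API:

    import json
    import urllib.request

    # Illustrative request against a locally running llama.cpp server; prompt and
    # parameter values are made up, endpoint and field names are assumptions.
    payload = {
        "prompt": "The capital of France is",
        "n_predict": 4,
        "n_probs": 5,  # ask for the top candidate tokens per sampled position
        "temperature": 0.0,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:8080/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)

    # Each entry should pair a sampled token with its list of tok_str/prob candidates.
    for entry in body.get("completion_probabilities", []):
        print(entry)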
What happened here?
Best
Leon