
Server: completion_probabilities (tok_str and prob) seem to be broken #7197

Closed
reuank opened this issue May 10, 2024 · 8 comments · Fixed by #7203
Labels: bug, good first issue, server

@reuank (Contributor) commented May 10, 2024

Hello,

I am using the llama.cpp server and noticed strange behavior in the server responses.

When I start a server on commit 637e9a8 using ./server -m ../models/llama-2-7b-chat.Q4_K_M.gguf -c 4096 -ngl 1000 -np 1 -cb and run this curl command:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Choose between A, B and C.\n\n","n_predict": 1, "n_probs": 10, "temperature": 0}'

I get the following JSON response:

// commit hash 637e9a86
{
    "content": "A",
    "id_slot": 0,
    "stop": true,
    "model": "../models/llama-2-7b-chat.Q4_K_M.gguf",
    "tokens_predicted": 1,
    "tokens_evaluated": 12,
    "generation_settings":
    {
        ...
    },
    "prompt": "Choose between A, B and C.\n\n",
    "truncated": false,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": true,
    "stopping_word": "",
    "tokens_cached": 12,
    "timings":
    {
        "prompt_n": 12,
        "prompt_ms": 280.894,
        "prompt_per_token_ms": 23.407833333333333,
        "prompt_per_second": 42.720741632074734,
        "predicted_n": 1,
        "predicted_ms": 1.734,
        "predicted_per_token_ms": 1.734,
        "predicted_per_second": 576.7012687427913
    },
    "completion_probabilities":
    [
        {
            "content": "A",
            "probs":
            [
                {
                    "tok_str": "A",
                    "prob": 0.6929230093955994
                },
                {
                    "tok_str": "Option",
                    "prob": 0.04242830350995064
                },
                {
                    "tok_str": "Wh",
                    "prob": 0.035371895879507065
                },
                {
                    "tok_str": "What",
                    "prob": 0.021582460030913353
                },
                {
                    "tok_str": "The",
                    "prob": 0.020988475531339645
                },
                {
                    "tok_str": "Your",
                    "prob": 0.009944385848939419
                },
                {
                    "tok_str": "In",
                    "prob": 0.007504411973059177
                },
                {
                    "tok_str": "You",
                    "prob": 0.0066000730730593204
                },
                {
                    "tok_str": "Question",
                    "prob": 0.006469167303293943
                },
                {
                    "tok_str": "If",
                    "prob": 0.006083796266466379
                }
            ]
        }
    ]
}

But when I run the same command on the latest commit on master (f89fe27), I get:

// commit hash f89fe273
{
    "content": "A",
    "id_slot": 0,
    "stop": true,
    "model": "../models/llama-2-7b-chat.Q4_K_M.gguf",
    "tokens_predicted": 1,
    "tokens_evaluated": 12,
    "generation_settings":
    {
        ...
    },
    "prompt": "Choose between A, B and C.\n\n",
    "truncated": false,
    "stopped_eos": false,
    "stopped_word": false,
    "stopped_limit": true,
    "stopping_word": "",
    "tokens_cached": 12,
    "timings":
    {
        "prompt_n": 12,
        "prompt_ms": 298.66,
        "prompt_per_token_ms": 24.888333333333335,
        "prompt_per_second": 40.17946829170294,
        "predicted_n": 1,
        "predicted_ms": 0.021,
        "predicted_per_token_ms": 0.021,
        "predicted_per_second": 47619.04761904762
    },
    "completion_probabilities":
    [
        {
            "content": "A",
            "probs":
            [
                {
                    "tok_str": "",
                    "prob": 1.0
                },
                {
                    "tok_str": "<s>",
                    "prob": 0.0
                },
                {
                    "tok_str": "</s>",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0000",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0001",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0002",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0003",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0004",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0005",
                    "prob": 0.0
                },
                {
                    "tok_str": "\u0006",
                    "prob": 0.0
                }
            ]
        }
    ]
}

The returned probs are strange, and the tokens seem to be the first n tokens of the tokenizer vocabulary.

What happened here?

Best
Leon

@ggerganov (Member)

When temperature == 0.0f we don't compute probabilities. You can set temperature < 0.0f and it should work as expected
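
For example (same server, prompt, and n_probs as above), a request along these lines should again return meaningful tok_str/prob pairs; the only change is the negative temperature:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Choose between A, B and C.\n\n","n_predict": 1, "n_probs": 10, "temperature": -1}'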

@reuank (Contributor, Author) commented May 10, 2024

Thank you, @ggerganov. I just saw that the change happened in af0a5b6.

It is, however, a bit strange that the tokens for temperature == 0.0f are the first n tokens of the vocab, with the first token getting a prob of 1.0. Maybe it would be more intuitive to return an error, or to leave the probs in the completion_probabilities empty in this case?

@ggerganov (Member)

Yes, it's confusing. Should improve this - PRs welcome

@ggerganov added the bug, good first issue, and server labels and removed the bug-unconfirmed label on May 10, 2024
@JohannesGaessler (Collaborator) commented May 10, 2024

Sorry, I must have made a mistake while testing.

It is, however, a bit strange that the tokens for temperature == 0.0f are the first n tokens of the vocab, with the first token getting a prob of 1.0. Maybe it would be more intuitive to return an error, or to leave the probs in the completion_probabilities empty in this case?

The intended behavior is that with temperature 0 the top tokens are still being returned but that all tokens other than the top one have 0 probability. Essentially you should be getting the same thing as with a temperature of 0.001.
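
For illustration only (not captured server output), the intended temperature == 0 response should therefore contain something like the excerpt below, with the token strings borrowed from the pre-regression response above:

"completion_probabilities":
[
    {
        "content": "A",
        "probs":
        [
            {
                "tok_str": "A",
                "prob": 1.0
            },
            {
                "tok_str": "Option",
                "prob": 0.0
            },
            {
                "tok_str": "Wh",
                "prob": 0.0
            },
            ...
        ]
    }
]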

@reuank (Contributor, Author) commented May 10, 2024

@JohannesGaessler, I see.
I just opened #7202, but that is obsolete then, right?

@reuank (Contributor, Author) commented May 10, 2024

I don't understand why the top token should have 100% probability, and the others 0% probability assigned. I mean, their appearance and position in the top n tokens are defined by their respective logits. I would expect the server response to reflect the actual model output.

Is the reason for your approach that you save the softmax calculation?

@JohannesGaessler (Collaborator)

I don't understand why the top token should have 100% probability, and the others 0% probability assigned.

If you sample with 0 temperature, those are simply the probabilities with which the tokens are sampled. It is the correct way to continue the probabilities as the temperature goes towards 0; otherwise you would have discontinuities. Internally, llama.cpp does not calculate the token probabilities at all with temperature == 0.0f, hence the need to set the values manually.

For temperatures < 0.0f the tokens are also sampled greedily, but the backend still calculates token probabilities as you would get them with a temperature of 1.0 and no other samplers.
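
To make the continuity argument concrete: with logits $l_i$, the probability of sampling token $i$ at temperature $T > 0$ is

$p_i(T) = \frac{\exp(l_i / T)}{\sum_j \exp(l_j / T)}$

and as $T \to 0^+$ this converges to 1 for the highest-logit token and to 0 for every other token, which is exactly the vector reported for temperature == 0.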

Is the reason for your approach that you save the softmax calculation?

It saves you not just the softmax but all other samplers as well.

@JohannesGaessler (Collaborator)

I just opened #7202, but that is obsolete then, right?

It was a five-minute fix, so I also opened a PR to yield the intended behavior: #7203. I think varying the number of returned tokens is not a good solution because it leads to weird behavior for temperatures that are almost but not quite 0. With temperature 0 and temperature $10^{-10}$ you would get different behavior even though both are in effect greedy sampling.
