Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server : fix logprobs, make it OAI-compatible #10783

Merged
merged 21 commits into from
Dec 19, 2024

Conversation

ngxson
Copy link
Collaborator

@ngxson ngxson commented Dec 11, 2024

Fix #10733

The /v1/chat/completions now has full support for logprobs. See oai docs here. Please note that this is pre-sampling probs (i.e. no sampling method is applied)

Example using official openai library:

def test_logprobs():
    client = OpenAI(api_key="dummy", base_url=f"http://{server.server_host}:{server.server_port}")
    res = client.chat.completions.create(
        model="gpt-3.5-turbo-instruct",
        temperature=0.0,
        messages=[
            {"role": "system", "content": "Book"},
            {"role": "user", "content": "What is the best book"},
        ],
        max_tokens=5,
        logprobs=True,
        top_logprobs=10,
    )
    for token in res.choices[0].logprobs.content:
        print(token.token, token.logprob)

For non-OAI compatible endpoint /completion, the completion_probabilities is also changed:

Request:

{ "prompt": "hi", "n_predict": 2, "n_probs": 3, "top_k": 1 }

Response:

{
    "index": 0,
    "content": ", i",
    ...
    "completion_probabilities": [
        {
            "id": 11,
            "token": ",",
            "logprob": 0.0,
            "bytes": [
                44
            ],
            "top_logprobs": [
                {
                    "id": 11,
                    "token": ",",
                    "bytes": [
                        44
                    ],
                    "logprob": -1.9807817935943604
                },
                {
                    "id": 682,
                    "token": " all",
                    "bytes": [
                        32,
                        97,
                        108,
                        108
                    ],
                    "logprob": -2.73698353767395
                },
                {
                    "id": 5127,
                    "token": " everyone",
                    "bytes": [
                        32,
                        101,
                        118,
                        101,
                        114,
                        121,
                        111,
                        110,
                        101
                    ],
                    "logprob": -2.8409159183502197
                }
            ]
        },
        {
            "id": 602,
            "token": " i",
            "logprob": 0.0,
            "bytes": [
                32,
                105
            ],
            "top_logprobs": [
                {
                    "id": 602,
                    "token": " i",
                    "bytes": [
                        32,
                        105
                    ],
                    "logprob": -1.4105287790298462
                },
                {
                    "id": 358,
                    "token": " I",
                    "bytes": [
                        32,
                        73
                    ],
                    "logprob": -1.5139223337173462
                },
                {
                    "id": 856,
                    "token": " my",
                    "bytes": [
                        32,
                        109,
                        121
                    ],
                    "logprob": -3.0677266120910645
                }
            ]
        }
    ]
}

For post-sampling probs, add "post_sampling_probs": true to the request. Example for the response:

{
    "completion_probabilities": [
        {
            "id": 682,
            "token": " all",
            "bytes": [
                32,
                97,
                108,
                108
            ],
            "prob": 0.2798407971858978,
            "top_probs": [
                {
                    "id": 11,
                    "token": ",",
                    "bytes": [
                        44
                    ],
                    "prob": 0.7201591730117798
                },
                {
                    "id": 682,
                    "token": " all",
                    "bytes": [
                        32,
                        97,
                        108,
                        108
                    ],
                    "prob": 0.2798407971858978
                }
            ]
        },
        ...
    ]
}

@ngxson ngxson requested a review from ggerganov December 11, 2024 13:52
@github-actions github-actions bot added examples python python script changes server labels Dec 11, 2024
@ngxson ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Dec 11, 2024
result.tok = ids[i];
result.tok = ids[i];
result.text_to_send = common_token_to_piece(ctx, result.tok, params_base.special);
result.prob = 1.0f; // set later
Copy link
Collaborator Author

@ngxson ngxson Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the branch for speculative decoding. I'm not sure now I can get token probs here. Could you give me some clues? @ggerganov

(Or we can skip this for now if it's complicated)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we need to update common_sampler_sample_and_accept_n() to return the probs. But let's fix this later.

@p-e-w
Copy link

p-e-w commented Dec 12, 2024

Based on the example response, it appears that this implementation returns probabilities (between 0 and 1) but labels them as log probabilities (which by definition must be between -infinity and 0). I don't know which of the two is expected over the /completion endpoint, but naming the field logprob when it contains a plain probability is certainly wrong.

Also, just to clarify, logprob and top_logprobs are the values before samplers, right? That is, before any penalties, truncation, etc. are applied and distort the probability distribution. In other words, they should contain the raw distribution that the model produces, not the transformed distribution that the generator samples from. Because the data is useless for many applications otherwise (such as confidence estimation).

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 12, 2024

I don't know which of the two is expected over the /completion endpoint, but naming the field logprob when it contains a plain probability is certainly wrong.

Ok I didn't noticed that, thanks for spotting it. It's a simple fix, just std::log the prob to turn it into logprob.

Also, just to clarify, logprob and top_logprobs are the values before samplers, right?

It's the prob after common_sampler_sample, so the sampler chain is already applied

@p-e-w
Copy link

p-e-w commented Dec 12, 2024

It's the prob after common_sampler_sample, so the sampler chain is already applied

That is wrong and needs to be changed. Post-sampler probs are essentially worthless.

Imagine you are doing greedy sampling (top_k = 1). You want to know which other choices were available at each token position, how probable your entire token sequence is (one of the classic applications of logprobs) etc. So you set logprobs = true and top_logprobs = 10 in the request.

But with the current logic, the logprob field in the result is simply 0 for every token, corresponding to a probability of 1, because the top_k sampler leaves only the most probable token, and after normalization, that token has probability 100%. Also, the top_logprobs field will contain only that token, because all other tokens were cut off by the sampler.

Samplers destroy information in order to make a choice. The whole point of requesting logprobs is to get that information. We already know which choice was made (the token in the result), so the distorted probability distribution left over by the sampler chain is useless. Pretty much every standard use case for logprobs (confidence estimation, beam search guidance) breaks down if the probs are post-sampling.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 12, 2024

That is wrong and needs to be changed. Post-sampler probs are essentially worthless.

Tbh I'm not even the one who wrote this part of code from the beginning, so I didn't notice that.

Judging by comments from vllm-project/vllm#9453 , this suppose to be pre-sampling as you said then.

Just some lines of code to modify, not a big deal.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 12, 2024

Done, it's now returning the log(softmax(logits))

Copy link

@p-e-w p-e-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! This will make the server much more useful for research-type work. I have tested many inference engines, and all of them have at least one bug related to obtaining logprobs, which led me to hack together my own server to work with them. Glad I'll finally be able to ditch that!

examples/server/README.md Outdated Show resolved Hide resolved
@ngxson ngxson requested a review from ggerganov December 13, 2024 07:51
@@ -664,3 +694,33 @@ static json format_logit_bias(const std::vector<llama_logit_bias> & logit_bias)
static std::string safe_json_to_str(json data) {
return data.dump(-1, ' ', false, json::error_handler_t::replace);
}

static std::vector<llama_token_data> get_token_probabilities(llama_context * ctx, int idx) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: because llama_sampler_init_softmax is deprecated, I need to copy the its code here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is extremely slow because it operates on the entire vocabulary. It can cause many headaches by having this enabled by default for n_probs > 0, so at the very least outputting logprobs has to be gated by a parameter only if the user really knows they want these. In majority of the cases, the llama-server users want the final token probability for visualization in the UI and this should not affect the performance in a significant way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think We can add a gated parameter --max-n-probs and default it to 1

Copy link
Collaborator Author

@ngxson ngxson Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm sorry I misunderstood it a bit, the option should be --probs-multiple-tokens or -pmt instead of max n_probs as a said. Having max n_probs means nothing because the softmax is ran on n_vocab tokens anyways

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I would expect the probabilities to be effectively the same if only, say, the top 1000 tokens are considered in the calculation. In fact I believe the min-p sampler optimization added by @JohannesGaessler does something like this to determine if it is necessary to calculate a full softmax (correct me if I am wrong, I only looked at it briefly a while ago).

The min-p sampler optimization that I did exploits that a ratio of probabilities is equivalent to a difference in logits so min-p sampling can just be done relative to the maximum logit without having to explicitly calculate the softmax of all tokens. It should be fine to apply loose min-p sampling prior to calculating the token probabilities since the vast majority of tokens will effectively have 0 probability.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if "gating" this functionality is a good idea. This is what the OpenAI API does. If llama.cpp claims OAI compatibility, it shouldn't come with a fine print of "...but only if you enable this ad-hoc CLI flag".

Copy link
Collaborator Author

@ngxson ngxson Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggerganov On second thought, I think most users know what they're doing (and the consequence) when they want to get n_probs for multiple tokens. I think we don't really need --multi-token-probs for now. Is it ok if I remove it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, --multi-token-probs is not necessary. What you described in your #10783 (comment) seems good. Ideally, we should add the min-p optimizaion, but we can do it in another PR if necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 8734df7

@JohannesGaessler
Copy link
Collaborator

Regarding log probs of before vs. after samplers: my opinion is that they should reflect the probabilities after samplers since by applying samplers you are intentionally distorting the probability distribution that the model puts out. I don't see why the log likelihood of the original, undistorted distribution would be useful in that scenario. The only real use case I see is to do greedy sampling while still obtaining the probabilities of the original distribution which should be possible by setting the temperature to a negative value. But I ultimately don't feel strongly about this.

@ggerganov
Copy link
Owner

Based on the feedback so far, I think we have to support both raw logprobs and post-sampling probs:

  • For the logprobs, we adopt the same API as the OpenAI spec.
  • For the post-sampler probs, we should mirror the logprobs OpenAI API, but drop the "log" prefix: logprobs -> probs, top_logprobs -> top_probs
  • I don't see a problem both the /completion and /chat/completion endpoints to support the 2 types of probability outputs.
  • For the logprobs, we should apply the "min-p" optimization for some performance improvement.
  • In the llama-server UI, it would be nice to be able to choose to display either the pre-sampling or post-sampling probs, since I can see that both values can make sense in different cases.

Does that sound OK?

@JohannesGaessler
Copy link
Collaborator

For the post-sampler probs, we should mirror the logprobs OpenAI API, but drop the "log" prefix: logprobs -> probs, top_logprobs -> top_probs

I think that would be confusing, my intuitive expectation would be that logprobs is just the input of the softmax for probs. The intuitive name that I would associate with the raw model outputs pre-softmax is "logits". I would prefer if we define probs and logprobs in the documentation (whichever way we choose to do it) and then add something with a more explicit name like probs_softmax or probs_samplers.

@ggerganov
Copy link
Owner

ggerganov commented Dec 14, 2024

that logprobs is just the input of the softmax for probs

To make sure we are on the same page, do these definitions sound correct to you:

  1. logits: raw model output
  2. raw probabilities/probs, pre-sampling probabilities/probs: this is softmax(logits)
  3. raw logprobs, pre-sampling logprobs: this is log(softmax(logits))
  4. post-sampling probs: this are the probabilities that we computed after applying the full sampling chain

So 2. and 3. are essentially the same thing, but with log applied to one of them. The OAI chose to output (3) in their API, so to be compatible, we should do the same.

The raw logits (1) would require too much data to be transported over the network and top-logits don't have much applications, so I think these should not be exposed through the API in any way.

The post-sampling probs (4) are what we have exposed up to now and the suggestion is to keep this as an option.

If I understand correctly, you are suggesting to call (2) probs_softmax and (4) probs_samplers. This sounds intuitive to me, but the problem is that (2) won't match the OAI spec.

@JohannesGaessler
Copy link
Collaborator

JohannesGaessler commented Dec 14, 2024

raw logprobs, pre-sampling logprobs: this is log(softmax(logits))

Sorry, that's what I meant, I agree that this is the correct relationship between probs and logprobs.

If I understand correctly, you are suggesting to call (2) probs_softmax and (4) probs_samplers. This sounds intuitive to me, but the problem is that (2) won't match the OAI spec.

If we want to have an OAI-compatible API we should return the same quantities under the same name. So for the OAI-compatible API (3) should be returned as logprobs. But we should not just return (4) as probs since that would be confusing. So I think (4) should be returned as something like probs_samplers or probs_post_samplers (if at all).

For our own API we are not bound to whatever choices OAI made. We could choose to define probs and logprobs as (the logarithm of) the probabilities post samplers (though I think consistency with the OAI-compatible API would be preferable). In that case I would call the probabilities when using just the softmax something like probs_softmax.

I do not mean to use both probs_softmax and probs_samplers simultaneously, just one of them to clearly distinguish the property from the default probabilities.

@ggerganov
Copy link
Owner

But we should not just return (4) as probs since that would be confusing. So I think (4) should be returned as something like probs_samplers or probs_post_samplers (if at all).

Yes, I agree having "(4) as probs" would be confusing and the suggestion for probs_samplers or probs_post_samplers is better.

@p-e-w
Copy link

p-e-w commented Dec 14, 2024

What we are specifically talking about is the perplexity of a token sequence as it's being generated. The corresponding probability distribution is post-samplers. I don't see why the perplexity of such a sequence using the pre-samplers distribution would be a meaningful number.

Regardless of how one feels perplexity should be computed, doing so isn't the only application of logprobs. For confidence estimation, we absolutely want the pre-sampler probabilities, because otherwise samplers become a self-fulfilling prophecy where the confidence goes up because the choices are narrowed down by the samplers. In the common case where only a single token remains, this would give 100% confidence, which is patently absurd because that "confidence" is manufactured by truncation.

@p-e-w
Copy link

p-e-w commented Dec 14, 2024

Providing more information in extra fields sounds good, but I hope we can all agree that the fields logprob and top_logprobs must contain the pre-sampler log probabilities, and the pre-sampler top tokens, respectively. Those field names are from the OAI spec, and having the same names with different semantics would be a nightmare.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 15, 2024

Hmm ok so to settle the discussion between post and pre sampling probs, I think we can support both and the rule is:

  • For OAI-compat API (i.e. everything under /v1/*), we will return a pre-sampling, aka log(softmax(logits))
  • For non-OAI-compat, we only return the probs and NOT the logprobs:
    • Default to pre-sampling probs (like OAI, softmax(logits))
    • If user specify "post_sampling_probs": true in the request, we will return the post sampling probs

@JohannesGaessler
Copy link
Collaborator

If user specify "pre_sampling_probs": true in the request, we will return the post sampling probs

I assume you mean post_sampling_probs; that would be fine with me.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 19, 2024

@ggerganov Can you please re-review this PR? Thank you!

examples/server/README.md Outdated Show resolved Hide resolved
examples/server/server.cpp Outdated Show resolved Hide resolved
examples/server/server.cpp Outdated Show resolved Hide resolved
examples/server/server.cpp Show resolved Hide resolved
@ngxson ngxson requested a review from ggerganov December 19, 2024 11:28
cur_p->data[i].p
});
// break if we have all the necessary data
if (result.probs.size() == n_probs && found_sampled_tok) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect - if the sampled token is from the tail of the distribution, we will never break here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah it's quite messy to do every thing in one loop. I think it's better to split this into 2 different loops, and then optimize the one that take the prob of sampled token later on (in another PR)

@ngxson ngxson merged commit 57bb2c4 into ggerganov:master Dec 19, 2024
50 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* server : fix logprobs, make it openai-compatible

* update docs

* add std::log

* return pre-sampling p

* sort before apply softmax

* add comment

* fix test

* set p for sampled token

* update docs

* add --multi-token-probs

* update docs

* add `post_sampling_probs` option

* update docs [no ci]

* remove --multi-token-probs

* "top_probs" with "post_sampling_probs"

* resolve review comments

* rename struct token_prob to prob_info

* correct comment placement

* fix setting prob for sampled token
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. examples python python script changes server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Misc. bug: n_probs is not working with llama.cpp server
5 participants