llama-bench : add pp+tg test type #7199

Merged: 2 commits merged into master from sl/bench-pp+tg on May 10, 2024

Conversation

@slaren (Member) commented May 10, 2024

Adds a test type `-pg pp,tg` that consists of a prompt of `pp` tokens followed by a generation of `tg` tokens. The result is the average t/s of the entire process. The default parameters include a pp512+tg128 test, which can be disabled by passing `-pg 0,0`.

Example:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 4918.36 ± 132.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 165.76 ± 1.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512+tg128 | 600.22 ± 3.39 |
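The combined figure can be read as total tokens over total elapsed time. A minimal sketch of that arithmetic, assuming each phase ran at its standalone rate (illustrative helper only, not the actual llama-bench code):

```python
def combined_tps(pp_tokens: int, tg_tokens: int, pp_tps: float, tg_tps: float) -> float:
    """Average t/s of a pp+tg run, assuming each phase runs at its standalone rate."""
    total_time = pp_tokens / pp_tps + tg_tokens / tg_tps  # seconds spent in the two phases
    return (pp_tokens + tg_tokens) / total_time

# Estimate from the standalone pp512/tg128 numbers above:
print(combined_tps(512, 128, 4918.36, 165.76))  # ~730 t/s
# The measured pp512+tg128 value (600.22) is lower, presumably because in the
# combined run generation starts with the 512-token prompt already in the context.
```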

@mofosyne added the Review Complexity : Medium, enhancement, and testing labels on May 10, 2024
@JohannesGaessler (Collaborator)

I think providing an average rate for something with two distinct phases doesn't really make sense. What I think would be a better metric is the total time needed to first process some number of tokens and then generate some other number of tokens.

Somewhat related: I've been thinking that it would be useful if you could use llama-bench to determine how the speed changes depending on how full the context is. (In principle you can already calculate this from multiple runs but it's kind of tedious.)
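For instance, that calculation could be sketched from two standalone prompt-processing runs (hypothetical helper names; llama-bench reports t/s, so the elapsed times have to be recovered from the token counts first):

```python
def elapsed_s(n_tokens: int, tps: float) -> float:
    """Recover the elapsed time of a run from its reported t/s."""
    return n_tokens / tps

def marginal_pp_tps(n1: int, tps1: float, n2: int, tps2: float) -> float:
    """Approximate prompt-processing speed over context positions n1..n2 (n2 > n1),
    derived from two runs that process n1 and n2 tokens from an empty context."""
    return (n2 - n1) / (elapsed_s(n2, tps2) - elapsed_s(n1, tps1))
```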

@slaren (Member, Author) commented May 10, 2024

I think it makes perfect sense to have a value that represents the overall performance of the most common use case of LLMs. It also provides a way to test for cases such as #6766 (comment).

@JohannesGaessler (Collaborator)

My view is this: a metric is only useful if you can use it to do comparisons. One important factor for that is that you would want a higher/lower value to be consistently better. t/s for prompt processing and token generation on their own meet this criterion because a higher value directly translates to a lower time to first token or a higher rate at which the user receives tokens. The average t/s of both phases does not have this property: if you increase the number of tokens in the prompt the average t/s will be higher but the actual user experience will be worse because not only will the time needed to process the prompt be higher but the rate at which tokens are generated afterwards will also be lower. On the other hand, for the total runtime a lower value will always be better.

In my opinion the total runtime is also less abstract and more easily interpretable for real-life use cases such as determining the throughput of a server given some assumed prompt and generation lengths. If you still want to provide a rate rather than a runtime I think something like request throughput/minute would still be more useful than average t/s.
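As a worked example of this argument, with made-up per-phase rates (and ignoring that generation itself slows down as the context grows, which would only widen the gap):

```python
PP_TPS, TG_TPS = 5000.0, 160.0  # hypothetical fixed rates, for illustration only

def run(pp: int, tg: int):
    total_time = pp / PP_TPS + tg / TG_TPS   # seconds for the whole request
    avg_tps = (pp + tg) / total_time         # the combined metric in this PR
    req_per_min = 60.0 / total_time          # the alternative suggested above
    return round(avg_tps), round(total_time, 2), round(req_per_min, 1)

print(run(512, 128))   # (709, 0.9, 66.5)   baseline
print(run(2048, 128))  # (1799, 1.21, 49.6) higher avg t/s, yet slower for the user
```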

@slaren (Member, Author) commented May 10, 2024

> if you increase the number of tokens in the prompt the average t/s will be higher but the actual user experience will be worse because not only will the time needed to process the prompt be higher but the rate at which tokens are generated afterwards will also be lower

You could make that argument about any test whose result is a throughput (i.e. t/s) rather than the total time. I.e. the t/s of a prompt of 512 tokens will be a lot higher than that of a prompt of 32 tokens, yet the user experience will be worse because the overall time will be much higher. IMO the conclusion is not that t/s is not a useful metric, but rather that tests with different numbers of tokens cannot be compared directly. Hence:

> My view is this: a metric is only useful if you can use it to do comparisons.

This is just as useful as any other test that llama-bench performs: it allows comparing the exact same scenario between different options, builds or hardware. It does not allow comparing one scenario with a different scenario, just like every other test that llama-bench performs.
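The same arithmetic applies to the counter-example above, again with made-up rates (small prompts typically run at a lower t/s than large batches):

```python
# A 32-token prompt at a hypothetical 1500 t/s still finishes well before a
# 512-token prompt at 5000 t/s, despite reporting a much lower t/s:
print(32 / 1500)   # ~0.02 s
print(512 / 5000)  # ~0.10 s
```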

@JohannesGaessler (Collaborator)

> I.e. the t/s of a prompt of 512 tokens will be a lot higher than that of a prompt of 32 tokens, yet the user experience will be worse because the overall time will be much higher. IMO the conclusion is not that t/s is not a useful metric, but rather that tests with different numbers of tokens cannot be compared directly.

I will concede the point that, generally speaking, the t/s values for pp and tg are also not directly comparable if you vary the number of tokens. But I think that these values are significantly more stable under such variation and that they also have a much more direct interpretation. And you can (in principle) account for the varying number of tokens in a relatively straightforward but tedious way.

> This is just as useful as any other test that llama-bench performs: it allows comparing the exact same scenario between different options, builds or hardware. It does not allow comparing one scenario with a different scenario, just like every other test that llama-bench performs.

Let me say that I fundamentally agree with you that a combined test for pp and tg would be a useful feature to have. To me it is simply a question of what to normalize the results to. The current metric of average t/s has the useful property of being higher for better hardware or general performance optimizations. I just think that the runtime or something like the request throughput/minute would have the additional useful properties of being less abstract and more closely related to real-life use cases.

@slaren (Member, Author) commented May 10, 2024

I don't disagree that t/s is not the best metric for this type of test, since the result of a particular test cannot be used to extrapolate the performance with different numbers of tokens, but IMO it works well enough for the most important cases for which llama-bench is used. And really, I don't think the results are that stable for the other tests either: the performance drops rather dramatically as the context size is increased or as the batch size is reduced. I think it would be good to explore different metrics, but at this point this is just a small update to patch a hole in the testing capabilities of llama-bench.

@JohannesGaessler (Collaborator)

For me the bottom line is this: I think this PR would be a net benefit. I also think it would be a bigger net benefit with a slightly different metric. I don't think that I have the authority to tell other devs what to spend their time on; it is up to them whether or not they want to follow my advice. I myself would be willing to make the necessary changes for a different metric, but I will likely not have the capacity to do so until Tuesday. I will not block a merge.

@slaren merged commit e849648 into master on May 10, 2024
59 checks passed
@slaren deleted the sl/bench-pp+tg branch on May 10, 2024, 16:03