llama-bench : add pp+tg test type #7199

Merged: 2 commits merged into master from sl/bench-pp+tg on May 10, 2024

Conversation

@slaren (Member) commented May 10, 2024

Adds a test type `-pg pp,tg` that consists of a prompt of `pp` tokens followed by a generation of `tg` tokens. The result is the average t/s of the entire process. The default parameters include a pp512+tg128 test, which can be disabled by passing `-pg 0,0`.

Example:

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 4918.36 ± 132.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 165.76 ± 1.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512+tg128 | 600.22 ± 3.39 |
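The combined figure can be read as total tokens over total elapsed time. A minimal sketch of that arithmetic, assuming each phase ran at its standalone rate (illustrative helper only, not the actual llama-bench code):

```python
def combined_tps(pp_tokens: int, tg_tokens: int, pp_tps: float, tg_tps: float) -> float:
    """Average t/s of a pp+tg run, assuming each phase runs at its standalone rate."""
    total_time = pp_tokens / pp_tps + tg_tokens / tg_tps  # seconds spent in the two phases
    return (pp_tokens + tg_tokens) / total_time

# Estimate from the standalone pp512/tg128 numbers above:
print(combined_tps(512, 128, 4918.36, 165.76))  # ~730 t/s
# The measured pp512+tg128 value (600.22) is lower, presumably because in the
# combined run generation starts with the 512-token prompt already in the context.
```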

@mofosyne added the Review Complexity : Medium, enhancement, and testing labels on May 10, 2024
@JohannesGaessler (Collaborator)

I think providing an average rate for something with two distinct phases doesn't really make sense. What I think would be a better metric is the total time needed to first process some number of tokens and then generate some other number of tokens.

Somewhat related: I've been thinking that it would be useful if you could use llama-bench to determine how the speed changes depending on how full the context is. (In principle you can already calculate this from multiple runs but it's kind of tedious.)
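For instance, that calculation could be sketched from two standalone prompt-processing runs (hypothetical helper names; llama-bench reports t/s, so the elapsed times have to be recovered from the token counts first):

```python
def elapsed_s(n_tokens: int, tps: float) -> float:
    """Recover the elapsed time of a run from its reported t/s."""
    return n_tokens / tps

def marginal_pp_tps(n1: int, tps1: float, n2: int, tps2: float) -> float:
    """Approximate prompt-processing speed over context positions n1..n2 (n2 > n1),
    derived from two runs that process n1 and n2 tokens from an empty context."""
    return (n2 - n1) / (elapsed_s(n2, tps2) - elapsed_s(n1, tps1))
```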

@slaren (Member, Author) commented May 10, 2024

I think it makes perfect sense to have a value that represents the overall performance of the most common use case of LLMs. It also provides a way to test for cases such as #6766 (comment).

@JohannesGaessler (Collaborator)

My view is this: a metric is only useful if you can use it to do comparisons. One important factor for that is that you would want a higher/lower value to be consistently better. t/s for prompt processing and token generation on their own meet this criterion because a higher value directly translates to a lower time to first token or a higher rate at which the user receives tokens. The average t/s of both phases does not have this property: if you increase the number of tokens in the prompt the average t/s will be higher but the actual user experience will be worse because not only will the time needed to process the prompt be higher but the rate at which tokens are generated afterwards will also be lower. On the other hand, for the total runtime a lower value will always be better.

In my opinion the total runtime is also less abstract and more easily interpretable for real-life use cases such as determining the throughput of a server given some assumed prompt and generation lengths. If you still want to provide a rate rather than a runtime I think something like request throughput/minute would still be more useful than average t/s.
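As a worked example of this argument, with made-up per-phase rates (and ignoring that generation itself slows down as the context grows, which would only widen the gap):

```python
PP_TPS, TG_TPS = 5000.0, 160.0  # hypothetical fixed rates, for illustration only

def run(pp: int, tg: int):
    total_time = pp / PP_TPS + tg / TG_TPS   # seconds for the whole request
    avg_tps = (pp + tg) / total_time         # the combined metric in this PR
    req_per_min = 60.0 / total_time          # the alternative suggested above
    return round(avg_tps), round(total_time, 2), round(req_per_min, 1)

print(run(512, 128))   # (709, 0.9, 66.5)   baseline
print(run(2048, 128))  # (1799, 1.21, 49.6) higher avg t/s, yet slower for the user
```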

@slaren (Member, Author) commented May 10, 2024

> if you increase the number of tokens in the prompt the average t/s will be higher but the actual user experience will be worse because not only will the time needed to process the prompt be higher but the rate at which tokens are generated afterwards will also be lower

You could make that argument about any test whose result is a throughput (i.e. t/s) rather than the total time. I.e. the t/s of a prompt of 512 tokens will be a lot higher than that of a prompt of 32 tokens, yet the user experience will be worse because the overall time will be much higher. IMO the conclusion is not that t/s is not a useful metric, but rather that tests with different numbers of tokens cannot be compared directly. Hence:

> My view is this: a metric is only useful if you can use it to do comparisons.

This is just as useful as any other test that llama-bench performs: it allows comparing the exact same scenario between different options, builds or hardware. It does not allow comparing one scenario with a different scenario, just like every other test that llama-bench performs.
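The same arithmetic applies to the counter-example above, again with made-up rates (small prompts typically run at a lower t/s than large batches):

```python
# A 32-token prompt at a hypothetical 1500 t/s still finishes well before a
# 512-token prompt at 5000 t/s, despite reporting a much lower t/s:
print(32 / 1500)   # ~0.02 s
print(512 / 5000)  # ~0.10 s
```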

@JohannesGaessler (Collaborator)

> I.e. the t/s of a prompt of 512 tokens will be a lot higher than that of a prompt of 32 tokens, yet the user experience will be worse because the overall time will be much higher. IMO the conclusion is not that t/s is not a useful metric, but rather that tests with different numbers of tokens cannot be compared directly.

I will concede the point that, generally speaking, the t/s values for pp and tg are also not directly comparable if you vary the number of tokens. But I think that these values are significantly more stable under such variation and that they also have a much more direct interpretation. And you can (in principle) account for the varying number of tokens in a relatively straightforward but tedious way.

> This is just as useful as any other test that llama-bench performs: it allows comparing the exact same scenario between different options, builds or hardware. It does not allow comparing one scenario with a different scenario, just like every other test that llama-bench performs.

Let me say that I fundamentally agree with you that a combined test for pp and tg would be a useful feature to have. To me it is simply a question of what to normalize the results to. The current metric of average t/s has the useful property of being higher for better hardware or general performance optimizations. I just think that the runtime or something like the request throughput/minute would have the additional useful properties of being less abstract and more closely related to real-life use cases.

@slaren (Member, Author) commented May 10, 2024

I don't disagree that t/s is not the best metric for this type of test, since the result of a particular test cannot be used to extrapolate the performance with different numbers of tokens, but IMO it works well enough for the most important cases for which llama-bench is used. And really, I don't think the results are that stable for the other tests either: the performance drops rather dramatically as the context size is increased or as the batch size is reduced. I think it would be good to explore different metrics, but at this point this is just a small update to patch a hole in the testing capabilities of llama-bench.

@JohannesGaessler (Collaborator)

For me the bottom line is this: I think this PR would be a net benefit. I also think it would be a bigger net benefit with a slightly different metric. I don't think that I have the authority to tell other devs what to spend their time on; it is up to them whether or not they want to follow my advice. I myself would be willing to make the necessary changes for a different metric, but I will likely not have the capacity to do so until Tuesday. I will not block a merge.

@slaren merged commit e849648 into master on May 10, 2024
59 checks passed
@slaren deleted the sl/bench-pp+tg branch on May 10, 2024, 16:03