llama-bench : add pp+tg test type #7199
Conversation
I think providing an average rate for something with two distinct phases doesn't really make sense. What I think would be a better metric is the total time needed to first process some number of tokens and then generate some other number of tokens. Somewhat related: I've been thinking that it would be useful if you could use […]
I think it makes perfect sense to have a value that represents the overall performance of the most common use case of LLMs. It also provides a way to test for cases such as #6766 (comment).
My view is this: a metric is only useful if you can use it to do comparisons. One important factor for that is that you would want a higher/lower value to be consistently better. t/s for prompt processing and token generation on their own meet this criterion because a higher value directly translates to a lower time to first token or a higher rate at which the user receives tokens. The average t/s of both phases does not have this property: if you increase the number of tokens in the prompt, the average t/s will be higher, but the actual user experience will be worse, because not only will the time needed to process the prompt be higher, the rate at which tokens are generated afterwards will also be lower. On the other hand, for the total runtime a lower value will always be better. In my opinion the total runtime is also less abstract and more easily interpretable for real-life use cases, such as determining the throughput of a server given some assumed prompt and generation lengths. If you still want to provide a rate rather than a runtime, I think something like request throughput/minute would still be more useful than average t/s.
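To make the pitfall concrete, here is a minimal sketch with made-up rates (none of these numbers are measurements; they only illustrate the shape of the argument):

```cpp
#include <cstdio>

// Combined average t/s for a run of n_pp prompt tokens followed by
// n_tg generated tokens, given assumed per-phase rates in tokens/second.
static double avg_tps(int n_pp, double pp_tps, int n_tg, double tg_tps) {
    const double t_total = n_pp / pp_tps + n_tg / tg_tps; // total runtime (s)
    return (n_pp + n_tg) / t_total;
}

int main() {
    // 512-token prompt at 1000 t/s, then 128 tokens generated at 50 t/s
    printf("pp512+tg128:  %.0f avg t/s\n", avg_tps(512, 1000.0, 128, 50.0));
    // 4x longer prompt with *worse* per-phase rates: both the time to first
    // token and the generation rate degrade, yet the average t/s goes up
    printf("pp2048+tg128: %.0f avg t/s\n", avg_tps(2048, 900.0, 128, 45.0));
    return 0;
}
```

With these hypothetical rates the first scenario yields roughly 208 avg t/s and the second roughly 425 avg t/s, even though the second is strictly worse for the user.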
You could make that argument about any test whose result is a throughput (i.e. t/s) rather than the total time. E.g. the t/s of a prompt of 512 tokens will be a lot higher than that of a prompt of 32 tokens, yet the user experience will be worse because the overall time will be much higher. IMO the conclusion is not that t/s is not a useful metric, but rather that tests with different numbers of tokens cannot be compared directly. Hence:
This is just as useful as any other test that llama-bench performs: it allows comparing the exact same scenario between different options, builds, or hardware. It does not allow comparing a scenario with a different scenario, just like every other test that llama-bench performs.
I will concede the point that, generally speaking, the t/s values for pp and tg are also not directly comparable if you vary the number of tokens. But I think that these values are significantly more stable under varying token counts, and they have a much more direct interpretation. And you can (in principle) account for the varying number of tokens in a relatively straightforward but tedious way.
Let me say that I fundamentally agree with you that a combined test for pp and tg would be a useful feature to have. To me it is simply a question of what to normalize the results to. The current metric of average t/s has the useful property of being higher for better hardware or general performance optimizations. I just think that the runtime, or something like the request throughput/minute, would have the additional useful properties of being less abstract and more closely related to real-life use cases.
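For comparison, the suggested alternative metric is easy to derive from the same per-phase rates; a sketch, again with hypothetical numbers:

```cpp
#include <cstdio>

int main() {
    // assumed request shape and per-phase rates (hypothetical numbers)
    const int    n_pp   = 512,    n_tg   = 128;
    const double pp_tps = 1000.0, tg_tps = 50.0;

    // total runtime per request, and the resulting request throughput
    const double t_req = n_pp / pp_tps + n_tg / tg_tps; // seconds per request
    printf("runtime: %.2f s/request -> %.1f requests/min\n",
           t_req, 60.0 / t_req);
    return 0;
}
```

Unlike the average t/s, both of these values move monotonically with the user experience: a longer or slower request always means a higher runtime and a lower requests/min.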
I don't disagree that t/s is not the best metric for this type of test, since the result of a particular test cannot be used to extrapolate the performance with different numbers of tokens, but IMO it works well enough for the most important cases for which […]
For me the bottom line is this: I think this PR would be a net benefit. I also think it would be a bigger net benefit with a slightly different metric. I don't think that I have the authority to tell other devs what to spend their time on; it is up to them whether or not they want to follow my advice. I myself would be willing to do the necessary changes for a different metric, but I will likely not have the capacity to do so until Tuesday. I will not block a merge.
Adds a test type `-pg pp,tg` that consists of a prompt of `pp` tokens followed by a generation of `tg` tokens. The result is the average t/s of the entire process. The default parameters include a pp512+tg128 test, which can be disabled by passing `-pg 0,0`.

Example:
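A minimal invocation might look like this (the model path is a placeholder):

```sh
# run the combined prompt+generation test with a 512-token prompt
# followed by 128 generated tokens
./llama-bench -m models/7B/ggml-model-q4_0.gguf -pg 512,128
```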