Offline LLM Engine Benchmark Throughput #1968
Conversation
@ByronHsu do you want me to add this to CI?
This PR tries to reuse the script bench_serving.py. However, I think a better approach is to use a standalone script.
Reasons:
- bench_serving.py is for online serving, but the most common use case of the Engine API is offline. We should benchmark the non-async version of the Engine to see the maximum throughput we can get without all of the streaming/asyncio overhead.
- We want to pass in many other arguments to the server. The new script should be similar to bench_latency.py, which takes the full ServerArgs as arguments.
sglang/python/sglang/bench_latency.py
Lines 536 to 541 in 760552e
parser = argparse.ArgumentParser()
ServerArgs.add_cli_args(parser)
BenchArgs.add_cli_args(parser)
args = parser.parse_args()
server_args = ServerArgs.from_cli_args(args)
bench_args = BenchArgs.from_cli_args(args)
Can you try to write a standalone script bench_offline_throughput.py that takes the same arguments as bench_latency.py?
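Not the actual PR code, but a minimal sketch of what such a standalone bench_offline_throughput.py could look like: it reuses the ServerArgs/BenchArgs CLI pattern shown above and assumes the Engine can be constructed from ServerArgs fields as keyword arguments. The BenchArgs fields, the toy prompts, and the final print are illustrative only.

```python
# bench_offline_throughput.py -- illustrative sketch, not the PR's implementation.
import argparse
import dataclasses
import time

import sglang as sgl
from sglang.srt.server_args import ServerArgs


@dataclasses.dataclass
class BenchArgs:
    num_prompts: int = 10   # illustrative benchmark-only options
    output_len: int = 256

    @staticmethod
    def add_cli_args(parser):
        parser.add_argument("--num-prompts", type=int, default=BenchArgs.num_prompts)
        parser.add_argument("--output-len", type=int, default=BenchArgs.output_len)

    @classmethod
    def from_cli_args(cls, args):
        return cls(num_prompts=args.num_prompts, output_len=args.output_len)


if __name__ == "__main__":
    # Same pattern as bench_latency.py: the full ServerArgs plus bench-only args.
    parser = argparse.ArgumentParser()
    ServerArgs.add_cli_args(parser)
    BenchArgs.add_cli_args(parser)
    args = parser.parse_args()
    server_args = ServerArgs.from_cli_args(args)
    bench_args = BenchArgs.from_cli_args(args)

    # Assumption: Engine accepts ServerArgs fields as keyword arguments.
    engine = sgl.Engine(**dataclasses.asdict(server_args))
    prompts = ["Hello, my name is"] * bench_args.num_prompts
    sampling_params = {"max_new_tokens": bench_args.output_len}

    start = time.perf_counter()
    outputs = engine.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    print(f"Ran {len(outputs)} prompts in {elapsed:.2f} s (non-async, offline)")
```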
will do
@merrymercy updated the script, what do you think?
This looks better! Sorry for the back-and-forth, but I think we should support loading real datasets and better random dataset generation, similar to what bench_serving.py does.
bench_latency.py uses a simple way to generate synthetic data because it does not support continuous batching or inputs with variable lengths. The engine supports everything, so we can use more realistic data.
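For illustration only, variable-length random inputs could be generated along the lines of bench_serving.py's random dataset: sample a length per request and decode random token ids. The helper name and signature below are assumptions, not the PR's code.

```python
import random


def gen_random_prompts(tokenizer, num_prompts, max_input_len, range_ratio=0.5):
    """Hypothetical helper: build prompts of varying length from random token ids,
    so the benchmark exercises continuous batching with realistic input shapes."""
    prompts = []
    for _ in range(num_prompts):
        length = random.randint(int(max_input_len * range_ratio), max_input_len)
        token_ids = [random.randint(0, tokenizer.vocab_size - 1) for _ in range(length)]
        prompts.append(tokenizer.decode(token_ids))
    return prompts
```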
Also, please add a unit test here https://github.com/sgl-project/sglang/blob/main/test/srt/test_srt_engine.py to run this benchmark for 10 random prompts.
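A hedged sketch of what such a test could look like in test_srt_engine.py, assuming the new script exposes a callable entry point. The throughput_test function, the result key, and the use of DEFAULT_SMALL_MODEL_NAME_FOR_TEST as the test model are assumptions for illustration.

```python
import unittest

from sglang.bench_offline_throughput import BenchArgs, throughput_test  # assumed entry point
from sglang.srt.server_args import ServerArgs
from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST  # assumed constant


class TestBenchOfflineThroughput(unittest.TestCase):
    def test_offline_throughput(self):
        # Run the benchmark end to end on 10 random prompts and check that
        # it reports a positive throughput number.
        server_args = ServerArgs(model_path=DEFAULT_SMALL_MODEL_NAME_FOR_TEST)
        bench_args = BenchArgs(num_prompts=10)
        result = throughput_test(server_args, bench_args)
        self.assertGreater(result["total_throughput"], 0)


if __name__ == "__main__":
    unittest.main()
```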
@merrymercy made the changes
Can we have an option for
Sure. I would suggest that we keep input + output, since it's a pretty standard way to measure throughput (total system throughput) as far as I know. I'll definitely include both, though. Maybe add input + output to bench_serving as well?
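To make the distinction concrete, here is a minimal sketch of the two numbers being discussed; the function and key names are illustrative, not the PR's API.

```python
def compute_throughputs(num_input_tokens, num_output_tokens, elapsed_s):
    """Total system throughput counts prompt + generated tokens per second;
    output-only (decode) throughput counts generated tokens alone."""
    return {
        "total_token_throughput": (num_input_tokens + num_output_tokens) / elapsed_s,
        "output_token_throughput": num_output_tokens / elapsed_s,
    }
```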
The script now looks good. @ByronHsu can you verify the output format? @zolinthecow can you resolve the remaining comments? Then we can merge this soon!
LGTM. @ByronHsu you can do the final merge. |
Slightly refactored the code and attempted to fix the test. Can merge if the test passes |
Thanks for the contribution!! @zolinthecow |
@zolinthecow great work! |
Motivation
#1865
Add throughput benchmark for engine.generate
Modifications
Added the ability to specify an engine instead of an API URL in the benchmarks (see the sketch below)
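Roughly, this lets the benchmark's request path dispatch on whether an in-process Engine or an API URL was supplied. The sketch below is an assumption about the shape of that change, not the PR's actual diff; the /generate payload mirrors sglang's HTTP API.

```python
import requests


def run_generate(prompt, sampling_params, engine=None, api_url=None):
    """Illustrative dispatch: use the in-process Engine when one is given,
    otherwise fall back to the server's HTTP /generate endpoint."""
    if engine is not None:
        return engine.generate(prompt, sampling_params)
    resp = requests.post(
        f"{api_url}/generate",
        json={"text": prompt, "sampling_params": sampling_params},
    )
    return resp.json()
```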
Checklist