How to benchmark for speedup and acceptance rate? #12
+1 How do you benchmark the speedup? I ran the example code and didn't see any obvious acceleration. How can I reproduce the 4.04x speedup of Llama2-7B on an A100?
To run Sequoia, use the example code. As the framework is written on top of Hugging Face, the baseline should be around 23-25 ms per token, and Sequoia should be around 6-7 ms per token.
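For reference, here is a minimal sketch of how one might measure the baseline's per-token latency with plain Hugging Face generation. The model name, prompt, and token count are placeholder assumptions, not Sequoia's actual benchmark configuration:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: Llama2-7B in fp16 on a single GPU, matching the question above.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
n_new = 128

# Force exactly n_new tokens so the per-token average is well defined.
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_new, min_new_tokens=n_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"baseline: {elapsed / n_new * 1000:.2f} ms per token")
```

On an A100 this should land in the 23-25 ms/token range quoted above; Sequoia's own scripts should then report roughly 6-7 ms/token.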
Thanks for the response. How about the acceptance rate? What do `decoding step` and `large model step` mean?
`decoding step` means how many tokens were generated in total. `large model step` means how many times the large model performed verification. `decoding step / large model step` reflects how many tokens are correctly predicted per verification step with Sequoia's tree. The acceptance rate needs to be measured independently with a separate script.
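To make the ratio concrete, here is toy arithmetic (not Sequoia code; the counter values are made up for illustration) linking the logged counters to the expected speedup:

```python
# Toy arithmetic, not Sequoia code: counter values below are invented.
decoding_steps = 512      # total tokens generated ("decoding step")
large_model_steps = 128   # verification passes by the target model ("large model step")

tokens_per_verification = decoding_steps / large_model_steps
print(tokens_per_verification)  # 4.0 -> ~4 tokens accepted per verification

# If each verification costs roughly one baseline forward pass and the draft
# overhead is small, per-token latency shrinks by about this ratio:
baseline_ms = 24.0  # ~23-25 ms per token, as quoted above
print(f"~{baseline_ms / tokens_per_verification:.1f} ms per token")  # ~6.0 ms
```

This is consistent with the 4.04x figure from the question: a ratio of ~4 turns ~24 ms/token into ~6 ms/token.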
Thank you. This answers all my questions.
After testing both the baseline and Sequoia's greedy mode: it seems that more tokens are being generated in greedy mode than in baseline mode. Although the generation latency is as expected, I wonder if it is unfair to compare latency when the two runs generate different numbers of tokens. Would it be better to set a fixed sequence length and compare total generation time instead?
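A minimal sketch of that proposal, assuming a hypothetical `run_mode` wrapper around either generation loop (this is not an actual Sequoia API):

```python
import time

def time_generation(run_mode, prompt, n_tokens=256):
    """Time a generation call forced to emit exactly n_tokens tokens.

    `run_mode` is a hypothetical callable wrapping either the baseline or
    the Sequoia generation loop.
    """
    start = time.perf_counter()
    run_mode(prompt, max_new_tokens=n_tokens, min_new_tokens=n_tokens)
    return time.perf_counter() - start

# t_base = time_generation(run_baseline, prompt)
# t_seq = time_generation(run_sequoia, prompt)
# print(f"speedup: {t_base / t_seq:.2f}x")
```

Fixing the token budget makes total wall-clock time directly comparable, rather than relying on per-token averages over differently sized outputs.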
Just to make sure I understand this correctly: if all drafts are wrong, then `decoding step` would equal `large model step`?
Your understanding is correct. We only allow the baseline to generate 32 tokens because in some experiments, such as Vicuna-33B, running the baseline can take a lot of time. You can change this manually if you want: what you need to modify is the 32-token limit in the code.
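As an illustration only, one way to make that limit configurable is a command-line flag; the flag name and wiring below are assumptions, not Sequoia's actual interface:

```python
# Hypothetical sketch: expose the hard-coded 32-token baseline limit as a flag.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--baseline-max-new-tokens", type=int, default=32,
                    help="how many tokens the baseline run may generate")
args = parser.parse_args()

# ...then pass it through to the baseline generation call, e.g.:
# model.generate(**inputs, max_new_tokens=args.baseline_max_new_tokens)
```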
Sorry for asking a possibly obvious question, but it would be better if the documentation made this clear.