How to benchmark for speedup and acceptance rate? #12

Open
singularity-s0 opened this issue Apr 22, 2024 · 7 comments

@singularity-s0

Sorry for asking a possibly obvious question, but it would be better if the documentation made this clear.

@cyLi-Tiger

cyLi-Tiger commented Apr 23, 2024

+1. How do we benchmark the speedup? I ran the example code and didn't see obvious acceleration. How can we reproduce the 4.04x speedup of Llama2-7b on an A100?

@dreaming-panda
Contributor

dreaming-panda commented Apr 24, 2024

> +1. How do we benchmark the speedup? I ran the example code and didn't see obvious acceleration. How can we reproduce the 4.04x speedup of Llama2-7b on an A100?

To run Sequoia:
CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode greedy --dataset c4
To run baseline:
CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode baseline --dataset c4

Since the framework is written on top of Hugging Face, the baseline should run at around 23-25 ms per token, and Sequoia at around 6-7 ms per token.
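
As a rough sanity check (my arithmetic, not output from the repo), those per-token latencies already imply a speedup in the same ballpark as the reported 4.04x:

```python
# Back-of-the-envelope check using the latencies quoted above.
baseline_ms = (23 + 25) / 2  # average baseline latency per token, ms
sequoia_ms = (6 + 7) / 2     # average Sequoia latency per token, ms
print(f"implied speedup: {baseline_ms / sequoia_ms:.2f}x")  # ~3.69x
```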

@singularity-s0
Author

singularity-s0 commented Apr 24, 2024

Thanks for the response. What about the acceptance rate? And what do decoding step and large model step mean in the output?

@dreaming-panda
Contributor

decoding step means how many tokens are generated in total. large model step means how many times the large model runs verification. decoding step / large model step therefore reflects how many tokens Sequoia's tree gets right per verification step.
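
Since every verification pass yields at least one token from the large model, the ratio minus one is the average number of accepted draft tokens. A minimal sketch of the relationship (mine, not from the repo):

```python
# Illustrative only: relating the two counters the testbed prints.
decoding_steps = 400      # hypothetical total number of generated tokens
large_model_steps = 100   # hypothetical number of verification passes

tokens_per_verification = decoding_steps / large_model_steps  # 4.0
# Each verification always produces at least one token, so the surplus is
# the average number of draft tokens accepted from Sequoia's tree.
accepted_per_verification = tokens_per_verification - 1       # 3.0
print(tokens_per_verification, accepted_per_verification)
```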

acceptance rate needs to be independently measured with:

python test_accept.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 288 --W 32 --ALG stochastic --dataset cnn

@singularity-s0
Author

Thank you. This answers all my questions.

@briskerkazoos

After testing both baseline and greedy on the C4 dataset on an A100, I get the following results:

Baseline: total time :110.10318s, latency :0.02298s, decoding step: 4791
Greedy: total time :144.56247s, latency :0.00813s, decoding step: 17778, large model step: 4605, 3.8605863192182412

It seems that more tokens are generated in greedy mode than in baseline mode. While the per-token latencies match expectations, I wonder if it is unfair to compare latency when the two runs generate different numbers of tokens. Would it be better to fix the sequence length and compare total generation time instead?
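
To illustrate, normalizing my numbers above to a fixed token count (a back-of-the-envelope calculation, not output from the repo):

```python
# Normalize the two runs above to the same number of generated tokens.
baseline_latency = 0.02298  # s/token, from the baseline run
greedy_total = 144.56247    # s, from the greedy (Sequoia) run
greedy_tokens = 17778       # tokens generated by the greedy run

baseline_total = baseline_latency * greedy_tokens  # ~408.5 s for 17778 tokens
print(f"fixed-length speedup: {baseline_total / greedy_total:.2f}x")  # ~2.83x
```

This comes out equal to the per-token latency ratio (0.02298 / 0.00813 ≈ 2.83), so the two ways of comparing agree.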

> decoding step / large model step reflects how many tokens Sequoia's tree gets right per verification step.

Just to make sure I understand this correctly: if all draft tokens are wrong, then decoding step / large model step = 1. And if decoding step / large model step = 2, it means that on average the drafting model gets 1 token correct per draft. Is this right?

@dreaming-panda
Contributor

Your understanding is correct (in your run, 17778 / 4605 ≈ 3.86, i.e. about 2.86 accepted draft tokens per verification on average). We only allow the baseline to generate 32 tokens because in some experiments, such as Vicuna-33B, running the baseline can cost a lot of time.

You can change this manually if you want. What you need to modify is the condition inner_decoding_step < 32 in testbed.py.
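
For illustration, the change looks something like this (a hypothetical sketch, not the actual testbed.py code; only the name inner_decoding_step and the bound 32 come from the comment above):

```python
# Hypothetical sketch: the baseline decoding loop stops at a hard-coded bound.
MAX_STEPS = 32  # raise this to let the baseline generate more tokens per prompt

inner_decoding_step = 0
while inner_decoding_step < MAX_STEPS:
    # token = generate_one_token(...)  # placeholder for the real decoding call
    inner_decoding_step += 1
```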
We also plan to update the code in the coming weeks and will address this then.
