[Speculative decoding] [Performance]: Re-enable bonus tokens #4212
Comments
Hi @cadedaniel, wondering if anyone is working on this currently? If not, I would like to look into it. Please let me know.
That would be awesome. Let's chat more in the vLLM Slack.
Discussed the best approach to fix this with @sroy745 and @LiuXiaoxuanPKU.
Hi @cadedaniel and @LiuXiaoxuanPKU, here is a PR that I used for doing some measurements. I ran the tests with JackFram/llama-68m with TP 1 on an A100. Without CUDA graphs, the decode time is ~0.89 ms and ~0.87 ms at batch sizes 5 and 10, respectively. This is greater than the ~0.5 ms expected for batch expansion. Given these numbers, should we go with batch expansion then? (Detailed measurements for batch size 5 and batch size 10 were attached to the comment.)
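For reference, a minimal sketch of how such a decode-latency measurement could be set up with the offline API (the prompts, token counts, and timing approach are illustrative assumptions, not the PR's actual benchmark; `enforce_eager=True` is used here to disable CUDA graphs):

```python
# Rough sketch of a per-step decode-latency measurement (illustrative only,
# not the benchmark from the PR). Includes prefill time, which is acceptable
# as a rough estimate for long generations with a 68m-parameter model.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="JackFram/llama-68m", tensor_parallel_size=1, enforce_eager=True)

for batch_size in (5, 10):
    prompts = ["Hello, my name is"] * batch_size
    params = SamplingParams(max_tokens=128, ignore_eos=True)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    steps = max(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size}: ~{elapsed / steps * 1e3:.3f} ms per decode step")
```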
Sounds good to me!
This PR implements the logic for enabling bonus tokens. For this feature, the SpecDecodeWorker maintains state across multiple forward passes of a sequence to determine whether it was assigned a bonus token. If so, it then backfills the KV cache for the penultimate token in the next forward pass. This logic for maintaining state is implemented in the SpecDecodeWorker.

In the current implementation, the SpecDecodeWorker maintains a list of the sequence_ids that were assigned bonus tokens in their last forward pass. If a sequence is not assigned a bonus token in its current pass, it is removed from the list if it was there. However, if generation terminates for a sequence that is still in this list, it is never removed, so over time we accumulate sequence_ids that are no longer active. We therefore need a way to remove such sequence_ids from the list. One way to implement this would be to notify the SpecDecodeWorker which sequences have finished so that it can drop their sequence_ids.
@cadedaniel can you please take a look at this proposal and let me know if this would work?
The proposal looks good. To simplify the interaction between the scheduler and the worker, we should embed the finished seq ids in the ExecuteModelRequest. This is better than adding a new method because in the future the worker procs could run forever in a loop; it is also better than coupling the OutputProcessor with the worker, as the OutputProcessors will live in their own process in the near future. By the way, the folks implementing Jamba support ran into the exact same issue. See the changes to ExecuteModelRequest in this PR: https://github.com/vllm-project/vllm/pull/4115/files
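A minimal sketch of the bookkeeping discussed above, assuming the finished ids are carried on the request as suggested (class and field names such as `finished_seq_ids` and `_seq_with_bonus_token` are placeholders, not vLLM's actual API):

```python
# Sketch of the proposed bookkeeping inside the speculative-decoding worker.
# All names here are placeholders for illustration.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ExecuteModelRequestSketch:
    """Stand-in for vLLM's ExecuteModelRequest with finished ids embedded."""
    seq_ids: List[int]
    finished_seq_ids: List[int] = field(default_factory=list)


class SpecDecodeWorkerSketch:
    def __init__(self) -> None:
        # Sequences that were assigned a bonus token on their last step and
        # therefore need the penultimate token's KV backfilled next step.
        self._seq_with_bonus_token: Set[int] = set()

    def execute_model(self, req: ExecuteModelRequestSketch) -> None:
        # Drop finished sequences so the set does not grow without bound.
        self._seq_with_bonus_token.difference_update(req.finished_seq_ids)

        for seq_id in req.seq_ids:
            accepted_bonus = self._run_spec_decode_step(seq_id)
            if accepted_bonus:
                self._seq_with_bonus_token.add(seq_id)
            else:
                self._seq_with_bonus_token.discard(seq_id)

    def _run_spec_decode_step(self, seq_id: int) -> bool:
        # Placeholder for proposal + scoring + rejection sampling.
        raise NotImplementedError
```

Carrying `finished_seq_ids` on the request keeps the scheduler-to-worker interaction one-way, which is the point of the suggestion above: the worker needs no extra callback or coupling to the OutputProcessor to learn that a sequence ended.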
Thanks for the pointer. Since that PR addresses the same problem, I will wait for it to be merged.
I found that the recent version does not have issues with these options. However, the PRs (#5765, #7244, #8701) still haven't fully resolved the problem with bonus tokens. We continued to experience a sharp drop in acceptance rates when K is not 1, regardless of whether FlashInfer was used. This drop does not occur if bonus tokens are disabled. We tested versions v0.5.4 and v0.6.2. (cc @jeongin601)
My understanding is that the bonus_token logic is enabled by default. Can you point me to where you see it's not being set?
@sroy745 The problem is that, since when we don't use […]. For reference, we tested versions v0.5.4 and v0.6.2 with and without the […].
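For context, a run along the following lines exercises the reported configuration with K > 1; the target/draft model pair is an illustrative choice, and the `speculative_model`/`num_speculative_tokens` arguments reflect the offline API of the vLLM versions discussed here, so they should be verified against the installed version:

```python
# Sketch of a run that exercises speculative decoding with K > 1 (the setting
# where the acceptance-rate drop was observed). Model choices are examples.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",        # target model (example)
    speculative_model="JackFram/llama-68m",  # draft model (example)
    num_speculative_tokens=4,                # K > 1
    use_v2_block_manager=True,               # needed for spec decode on some 0.5.x versions
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
# The draft acceptance rate is reported in the engine's periodic
# "Speculative metrics" log lines during generation.
```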
I made a PR related to this issue.
Proposal to improve performance
In #3951 we disabled bonus tokens (the token sampled from the verifier model assuming all proposal tokens are accepted) because their KV is not generated for the draft model. We can fix this by "prefilling" the KV of bonus tokens in the draft model. Note that for proposal methods not requiring KV (e.g. prompt lookup), we can re-enable bonus tokens and get a speedup there.
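Conceptually, the "prefill" amounts to feeding one extra token to the draft model on the step after a bonus token was emitted. A rough sketch of that idea follows (all names are made up for illustration; this is not the actual draft-worker code):

```python
# Conceptual sketch of "prefilling" the bonus token's KV in the draft model.
# All names are illustrative; this is not vLLM's actual draft-worker code.
from typing import List, Set


def draft_input_tokens_for_step(
    seq_id: int,
    last_token_id: int,
    penultimate_token_id: int,
    seqs_with_bonus_token: Set[int],
) -> List[int]:
    """Which tokens the draft model must process this step for one sequence."""
    if seq_id in seqs_with_bonus_token:
        # A bonus token was appended last step. The draft model never processed
        # the token right before it (its final proposal), so that position has
        # no KV in the draft cache. Feed both tokens so the missing KV is
        # backfilled before proposing new tokens.
        return [penultimate_token_id, last_token_id]
    # Normal case: only the newest token lacks draft KV, so a single-token
    # decode step is enough.
    return [last_token_id]
```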
The impact of this performance improvement depends on the speculation length. For low K, e.g. 1, where the probability of accepting the single speculative token is high (roughly a measure of how aligned the draft and target models are on the sequence), it has high impact: accepting 1 token allows us to emit 2 tokens (1 speculative and 1 bonus), whereas with bonus tokens disabled we can emit only 1 token (the accepted speculative one).
For higher K the impact is smaller, as the likelihood of accepting all speculative tokens is exponentially lower.
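To make the impact argument concrete, here is a small back-of-the-envelope calculation, assuming (as a simplification) that each of the K proposed tokens is accepted independently with probability alpha:

```python
# Back-of-the-envelope: expected tokens emitted per verification step, with
# and without the bonus token, assuming each of the K proposals is accepted
# independently with probability alpha (a simplification).
def expected_tokens(alpha: float, k: int, bonus_enabled: bool) -> float:
    # Expected accepted prefix length: sum over j of P(first j proposals accepted).
    expected_accepted = sum(alpha ** j for j in range(1, k + 1))
    # Rejection sampling emits one extra token on a rejection (the recovered
    # token). If bonus tokens are enabled, the all-accepted case also emits
    # one extra token; if disabled, it emits nothing extra, which costs
    # alpha**k tokens per step on average.
    extra = 1.0 if bonus_enabled else 1.0 - alpha ** k
    return expected_accepted + extra


for k in (1, 3, 5):
    with_bonus = expected_tokens(0.8, k, bonus_enabled=True)
    without = expected_tokens(0.8, k, bonus_enabled=False)
    print(f"K={k}: {with_bonus:.2f} vs {without:.2f} tokens/step "
          f"({(with_bonus / without - 1) * 100:.0f}% gain from bonus)")
```

With alpha = 0.8 this gives roughly an 80% gain at K = 1 but only about 10% at K = 5, matching the argument that the benefit shrinks as K grows.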
(Referenced code: vllm/vllm/model_executor/layers/rejection_sampler.py, lines 311 to 315 at commit 323f27b.)