Speculative decoding with lookahead #2790
base: main
Conversation
Hi @jjjjohnson Could you help resolve the conflicts? Thanks.
Done
Could you share any performance results?
python/sglang/srt/server_args.py
Outdated
parser.add_argument(
    "--speculative-lookahead-path",
    type=str,
    help="The path of the lookahead ",
A more detailed description is needed here. The current description is somewhat confusing as to what this parameter does.
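For illustration, a more descriptive help string might read something like the sketch below; the exact semantics of the path (assumed here to point at a pre-built n-gram lookahead cache) are my assumption, not the wording that was eventually merged.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--speculative-lookahead-path",
    type=str,
    # NOTE: illustrative wording only; what the path actually points to
    # (assumed: a pre-built n-gram lookahead cache) is an assumption.
    help="Path to a pre-built n-gram lookahead cache used to retrieve "
    "draft tokens for lookahead speculative decoding.",
)
```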
Done
@@ -220,6 +220,17 @@ def __init__(
                target_worker=self.tp_worker,
                dp_rank=dp_rank,
            )
        elif self.spec_algorithm.is_lookahead():
Is it more appropriate to use a factory pattern to create different speculative workers? cc @merrymercy
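For reference, a factory along the lines suggested here might look like the following sketch; every name in it is hypothetical rather than the actual sglang API.

```python
# Hypothetical factory for speculative workers; illustrative names only.
from typing import Any, Callable, Dict


class EAGLEWorker:  # stand-in for the existing eagle2 worker
    def __init__(self, **kwargs: Any) -> None: ...


class LookaheadWorker:  # stand-in for the worker added in this PR
    def __init__(self, **kwargs: Any) -> None: ...


_SPEC_WORKERS: Dict[str, Callable[..., Any]] = {
    "EAGLE": EAGLEWorker,
    "LOOKAHEAD": LookaheadWorker,
}


def create_speculative_worker(algorithm: str, **kwargs: Any) -> Any:
    """Look up and construct the speculative worker for an algorithm name."""
    try:
        worker_cls = _SPEC_WORKERS[algorithm]
    except KeyError:
        raise ValueError(f"unknown speculative algorithm: {algorithm}") from None
    return worker_cls(**kwargs)
```

This keeps the scheduler free of per-algorithm `elif` branches; registering a new algorithm means adding one entry to the registry.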
I just followed what EAGLE-2 did in the class SpeculativeAlgorithm...
@@ -666,6 +667,10 @@ def init_cuda_graphs(self):
        tic = time.time()
        logger.info("Capture cuda graph begin. This can take up to several minutes.")
        self.cuda_graph_runner = CudaGraphRunner(self)
        if self.spec_algorithm.is_lookahead():
            # in case lookahead fails to match any draft token, fall back to normal cuda graph decode
Why can’t the same cuda graph runner be reused here?
Because there are two cases that use CUDA graphs:
- Normal decode, where each batch corresponds to 1 token per decode step;
- Lookahead spec decode, where each batch corresponds to more than 1 token per decode step.
These two cases cannot be unified, so I need a tag to differentiate them.
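A toy sketch of the dispatch this tag enables (class and attribute names are hypothetical, not the PR's actual code):

```python
# Two capture shapes, picked at run time depending on the batch.
class GraphRunnerDispatchSketch:
    def __init__(self, spec_is_lookahead: bool):
        # graph captured for the normal 1-token-per-sequence decode
        self.cuda_graph_runner = "normal-decode graph"
        # extra graph captured for multi-token lookahead decode
        self.lookahead_graph_runner = (
            "lookahead-decode graph" if spec_is_lookahead else None
        )

    def pick_runner(self, tokens_per_seq: int):
        # fall back to the normal graph when lookahead matched no draft
        # tokens, i.e. the batch is back to one token per sequence
        if self.lookahead_graph_runner is not None and tokens_per_seq > 1:
            return self.lookahead_graph_runner
        return self.cuda_graph_runner
```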
But when the lookahead algorithm is used, that runner will always be the one that runs. Why do we need to create two cuda graph runners? https://github.com/sgl-project/sglang/pull/2790/files#diff-65c6ac2c41977f68e460f18e35053b97089631f88a9958b0796343fccee78a67R719
Have you profiled the proportion of time spent in the lookahead_cache part? I'm curious about the performance of these functions implemented in Python. (Of course this is not a problem that needs to be solved before merging this PR; I am just curious.)
Generating draft tokens with lookahead_cache is quite fast, costing only about 0.001 s for 8 tokens.
But updating lookahead_cache with the context tokens after prefill sometimes takes a long time, especially when the context is very long, probably because lookahead_cache.put triggers Python dict resizing, which is slow when the dict is very large.
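For intuition, a toy flat-dict version of such an n-gram cache is sketched below; the real lookahead cache is trie-based and more elaborate, so this is a simplification, not the PR's implementation.

```python
from collections import defaultdict


class ToyLookaheadCache:
    """Maps each n-gram to the tokens that followed it in seen contexts."""

    def __init__(self, ngram: int = 2):
        self.ngram = ngram
        self.table = defaultdict(list)  # n-gram tuple -> continuation tokens

    def put(self, token_ids):
        # O(len(context)) dict insertions: on very long contexts this is
        # the step that can stall, e.g. when the dict has to resize.
        n = self.ngram
        for i in range(len(token_ids) - n):
            key = tuple(token_ids[i : i + n])
            self.table[key].append(token_ids[i + n])

    def get_draft(self, last_tokens, max_draft: int = 8):
        # Retrieving drafts just follows the most recent n-gram repeatedly,
        # which is cheap compared to put().
        draft = []
        key = tuple(last_tokens[-self.ngram :])
        while len(draft) < max_draft and self.table.get(key):
            nxt = self.table[key][-1]  # naive choice: latest continuation
            draft.append(nxt)
            key = key[1:] + (nxt,)
        return draft


cache = ToyLookaheadCache(ngram=2)
cache.put([1, 2, 3, 4, 2, 3, 5])
print(cache.get_draft([1, 2, 3]))  # -> [5], following the latest (2, 3) 2-gram
```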
Maybe you could change lookahead_cache.put to an async function so that it overlaps with model computation on the GPU, which may help the module perform better.
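A minimal sketch of that suggestion, using a single background thread so put() calls stay ordered; whether this truly overlaps with GPU work depends on how much time the main thread spends in GIL-releasing CUDA calls, which is an assumption here, not something measured in this PR.

```python
from concurrent.futures import ThreadPoolExecutor

# One worker thread keeps put() calls ordered while moving them off the
# scheduling hot path.
_put_executor = ThreadPoolExecutor(max_workers=1)


def put_async(cache, token_ids):
    """Schedule cache.put(token_ids) in the background; returns a Future."""
    # Copy the ids so the caller can reuse its buffer immediately.
    return _put_executor.submit(cache.put, list(token_ids))
```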
Motivation
n-gram-based speculative decoding is very effective for retrieval-augmented generation (RAG). The cost of generating draft tokens is low compared to EAGLE, so it has great potential to accelerate token generation in RAG. Ant Group has proposed a trie-based retrieval and verification mechanism; they claim that their lookahead implementation on top of vLLM achieves a 1.6x speedup in a real-life single-query scenario. I want to adopt lookahead in SGLang.
Related resources
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Overall workflow
Features
Checklist