[Bugfix] Fix disagg hang caused by the prefill and decode communication issues #12723
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@KuntaiDu, friendly ping for the review :-)
Oh right, when the buffer is almost full, the prefill instance can finish inference of this request, but the KV cache of this request won't be added to the buffer, so the prefill instance will return …
Let me also test this PR locally.
```python
while not is_buffer_available(tokens_roi_recver):
    self.buffer_cv.wait()
```
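For context, `threading.Condition.wait()` releases the underlying lock while blocked and re-acquires it before returning, which is why the check lives in a `while` loop: a wakeup does not guarantee the predicate now holds. A minimal sketch of the surrounding pattern, assuming `buffer_cv` is a `threading.Condition` and `is_buffer_available` checks the shared buffer state:

```python
with self.buffer_cv:
    # wait() atomically releases the lock while sleeping and re-acquires
    # it on wakeup; the while loop re-checks the predicate because a
    # wakeup can be spurious or another waiter may have consumed the data.
    while not is_buffer_available(tokens_roi_recver):
        self.buffer_cv.wait()
```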
Would be nice if we could log here, so people know that the engine is waiting for KV cache that has already been generated but not yet entered into the lookup buffer.
Added some logging.
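For reference, a plausible shape for the added logging; the exact message and placement in the PR may differ:

```python
while not is_buffer_available(tokens_roi_recver):
    # Hypothetical message: surfaces that the engine is waiting for KV
    # cache that was generated but not yet inserted into the lookup buffer.
    logger.debug("KV cache not in the lookup buffer yet; waiting for the "
                 "prefill instance to insert it.")
    self.buffer_cv.wait()
```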
```python
# repeatedly.
logger.debug("KV transfer buffer is full. Handling...")
while self.buffer_size > self.buffer_size_threshold:
    self.full_handler()
```
Maybe also remove the code for `full_handler` if we don't use it.
Removed
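For illustration, one plausible post-removal shape for the full-buffer path, as a sketch only: instead of spinning in `full_handler()`, the producer blocks on the same condition variable until the consumer drains the buffer. The `_add_to_buffer` call and its arguments are assumptions, not the PR's actual code:

```python
with self.buffer_cv:
    # Block instead of busy-polling: sleep until the decode side drains
    # enough of the buffer, then insert and wake any waiters.
    while self.buffer_size > self.buffer_size_threshold:
        self.buffer_cv.wait()
    self._add_to_buffer(input_tokens, roi, key, value, hidden)  # hypothetical
    self.buffer_cv.notify_all()
```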
Just tested, it works (and the performance gets much better). Thank you for contributing!
LGTM!
Currently, prefill acts as a server waiting for decode's KV request. But when the KV is not ready, prefill returns None and decode proceeds anyway. This causes the prefill KV buffer to accumulate and blocks other prefill requests from moving forward. The fix is a poll-wait that relinquishes the lock while waiting, ensuring the KV is ready before the correct KV is sent to decode; see the sketch below.
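A minimal, self-contained sketch of the wait pattern described above. This is an illustration rather than the PR's actual code; the `insert`/`drop_select` names mirror vLLM's lookup-buffer interface, but the bodies here are simplified assumptions:

```python
import threading

class LookupBufferSketch:
    """Illustrative producer/consumer buffer guarded by one condition
    variable, so the decode side blocks until the KV cache is ready."""

    def __init__(self):
        self.buffer = {}  # request_id -> kv_cache
        self.buffer_cv = threading.Condition()

    def insert(self, request_id, kv_cache):
        # Prefill side: publish the KV cache, then wake all waiters.
        with self.buffer_cv:
            self.buffer[request_id] = kv_cache
            self.buffer_cv.notify_all()

    def drop_select(self, request_id):
        # Decode side: instead of returning None when the KV cache is
        # missing, wait() releases the lock and re-checks on each wakeup.
        with self.buffer_cv:
            while request_id not in self.buffer:
                self.buffer_cv.wait()
            return self.buffer.pop(request_id)
```

Because `wait()` relinquishes the lock while blocked, the prefill side can keep inserting KV caches while decode requests wait, which removes the hang described above.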
The PR was created by @jiayisuse. He is in China for the holiday, so I am submitting the PR on his behalf to get the review started.