[Bugfix] Fix disagg hang caused by the prefill and decode communication issues #12723
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@KuntaiDu, friendly ping for the review :-)
Oh right, when the buffer is almost full, the prefill instance can finish inference of this request, but the KV cache of this request won't be added to the buffer, so the prefill instance will return …
Let me also test this PR locally.
```python
while not is_buffer_available(tokens_roi_recver):
    self.buffer_cv.wait()
```
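For context, `threading.Condition.wait()` releases the underlying lock while blocked and re-acquires it before returning, which is why the check lives in a `while` loop: a wakeup does not guarantee the predicate now holds. A minimal sketch of the surrounding pattern, assuming `buffer_cv` is a `threading.Condition` and `is_buffer_available` checks the shared buffer state:

```python
with self.buffer_cv:
    # wait() atomically releases the lock while sleeping and re-acquires
    # it on wakeup; the while loop re-checks the predicate because a
    # wakeup can be spurious or another waiter may have consumed the data.
    while not is_buffer_available(tokens_roi_recver):
        self.buffer_cv.wait()
```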
Would be nice if we could log here, so people know that the engine is waiting for KV cache that has already been generated but not yet entered into the lookup buffer.
Added some logging.
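For reference, a plausible shape for the added logging; the exact message and placement in the PR may differ:

```python
while not is_buffer_available(tokens_roi_recver):
    # Hypothetical message: surfaces that the engine is waiting for KV
    # cache that was generated but not yet inserted into the lookup buffer.
    logger.debug("KV cache not in the lookup buffer yet; waiting for the "
                 "prefill instance to insert it.")
    self.buffer_cv.wait()
```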
```python
# repeatedly.
logger.debug("KV transfer buffer is full. Handling...")
while self.buffer_size > self.buffer_size_threshold:
    self.full_handler()
```
Maybe also remove the code for `full_handler` if we don't use it.
Removed
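For illustration, one plausible post-removal shape for the full-buffer path, as a sketch only: instead of spinning in `full_handler()`, the producer blocks on the same condition variable until the consumer drains the buffer. The `_add_to_buffer` call and its arguments are assumptions, not the PR's actual code:

```python
with self.buffer_cv:
    # Block instead of busy-polling: sleep until the decode side drains
    # enough of the buffer, then insert and wake any waiters.
    while self.buffer_size > self.buffer_size_threshold:
        self.buffer_cv.wait()
    self._add_to_buffer(input_tokens, roi, key, value, hidden)  # hypothetical
    self.buffer_cv.notify_all()
```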
Just tested, it works (and the performance gets much better). Thank you for contributing!
LGTM!
Currently, prefill acts as a server waiting for decode's KV request. But when the KV is not ready, prefill returns None and decode proceeds anyway. This causes the prefill KV buffer to accumulate and blocks other prefill requests from moving forward. The fix is a poll-wait that relinquishes the lock while waiting, ensuring the KV is ready before the correct KV is sent to decode; see the sketch below.
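A minimal, self-contained sketch of the wait pattern described above. This is an illustration rather than the PR's actual code; the `insert`/`drop_select` names mirror vLLM's lookup-buffer interface, but the bodies here are simplified assumptions:

```python
import threading

class LookupBufferSketch:
    """Illustrative producer/consumer buffer guarded by one condition
    variable, so the decode side blocks until the KV cache is ready."""

    def __init__(self):
        self.buffer = {}  # request_id -> kv_cache
        self.buffer_cv = threading.Condition()

    def insert(self, request_id, kv_cache):
        # Prefill side: publish the KV cache, then wake all waiters.
        with self.buffer_cv:
            self.buffer[request_id] = kv_cache
            self.buffer_cv.notify_all()

    def drop_select(self, request_id):
        # Decode side: instead of returning None when the KV cache is
        # missing, wait() releases the lock and re-checks on each wakeup.
        with self.buffer_cv:
            while request_id not in self.buffer:
                self.buffer_cv.wait()
            return self.buffer.pop(request_id)
```

Because `wait()` relinquishes the lock while blocked, the prefill side can keep inserting KV caches while decode requests wait, which removes the hang described above.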
The PR was created by @jiayisuse. He is in China for the holiday, so I am submitting the PR on his behalf to get the review started.