[Core] Eliminate parallel worker per-step task scheduling overhead #3763
Conversation
I measured about a 5% reduction in latency with this change for a single request with 5 input tokens and 1000 output tokens, using llama-2-70b with TP=4.
@zhuohan123 Kindly pinging for PR review.
I think this looks pretty good!
Thanks for the contribution! Left some small comments on coding style.
One additional question: do you think it's possible to let the workers stay in the model loop forever, so that we don't need to call the current halt_model function?
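For illustration, here is a minimal sketch of the loop-plus-stop-signal pattern being discussed. The class and helper names are hypothetical, not taken from the PR; the point is that parallel workers block in a loop and the driver releases them with a sentinel.

# Hypothetical sketch only; names and the transport mechanism are assumptions.
from typing import Optional


class WorkerLoopSketch:

    def start_worker_execution_loop(self) -> None:
        """Run on each non-driver worker: keep executing until told to stop."""
        while True:
            data = self._receive_from_driver()  # e.g. some collective broadcast
            if data is None:  # driver signalled "halt"
                return
            self._execute_model_step(data)

    def stop_remote_worker_execution_loop(self) -> None:
        """Run on the driver: release the workers from the loop."""
        self._send_to_workers(None)

    # Placeholders for whatever RPC/collective actually carries the
    # per-step metadata in a real implementation.
    def _receive_from_driver(self) -> Optional[dict]: ...
    def _send_to_workers(self, data: Optional[dict]) -> None: ...
    def _execute_model_step(self, data: dict) -> None: ...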
vllm/executor/ray_gpu_executor.py (Outdated)
@@ -292,8 +308,7 @@ def _run_workers(
        self,
        method: str,
        *args,
        driver_args: Optional[List[Any]] = None,
        driver_kwargs: Optional[Dict[str, Any]] = None,
        async_remote_only: bool = False,
A slightly more accurate name:
-        async_remote_only: bool = False,
+        remote_worker_only: bool = False,
@zhuohan123 I've renamed it for now to remote_workers_only_async, because in addition to only running on the remote workers, it also changes the behavior to not wait for the responses and to return the future(s) rather than the resulting outputs.
Let me know if you would still prefer to drop the _async though (or if you can think of a better name... maybe start_remote_workers_only?) and I'll update again.
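To make the described behavior concrete, here is an illustrative sketch of what a flag like this could look like inside _run_workers. It assumes Ray actor workers exposing execute_method.remote(...) (similar to vLLM's Ray worker wrapper) and a local driver_worker; it is not the exact code or signature from the PR.

# Illustrative sketch only; class layout and attribute names are assumptions.
from typing import Any, List

import ray


class RayExecutorSketch:

    def __init__(self, driver_worker, workers):
        self.driver_worker = driver_worker
        self.workers = workers  # Ray actor handles for the non-driver workers

    def _run_workers(self,
                     method: str,
                     *args,
                     remote_workers_only_async: bool = False,
                     **kwargs) -> List[Any]:
        # Kick the method off on every remote (non-driver) worker.
        futures = [
            w.execute_method.remote(method, *args, **kwargs)
            for w in self.workers
        ]

        if remote_workers_only_async:
            # Skip the driver worker and don't block: hand back the Ray
            # futures so the caller can wait on them later (or not at all).
            return futures

        # Normal path: also run on the driver worker, then wait for everyone.
        driver_output = getattr(self.driver_worker, method)(*args, **kwargs)
        return [driver_output] + ray.get(futures)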
vllm/executor/ray_gpu_executor.py (Outdated)
            blocks_to_swap_out=blocks_to_swap_out,
            blocks_to_copy=blocks_to_copy)

    def halt_model(self) -> None:
Possibly a more straightforward name:
-    def halt_model(self) -> None:
+    def stop_remote_worker_execution_loop(self) -> None:
I've made this change, but the reason for the original name is that it's defined in the abstract base class where there is no concept of a remote worker. "halt_model" seemed like a more abstract way of describing it.
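Given that the method lives on the abstract executor base class and that the commit messages note the default behaviour is a no-op for single GPU, the base-class shape presumably looks something like the sketch below (class and docstring wording are assumptions, only the renamed method name is taken from the discussion).

# Sketch of the implied base-class default; not copied from the PR.
class ExecutorBaseSketch:

    def stop_remote_worker_execution_loop(self) -> None:
        """Release parallel workers from their execution loop, if any.

        Default is a no-op: a single-GPU executor has no remote workers,
        so there is nothing to stop.
        """
        return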
vllm/worker/worker.py (Outdated)
                           self.gpu_cache)

    @torch.inference_mode()
    def execute_model_parallel(self) -> None:
-    def execute_model_parallel(self) -> None:
+    def start_worker_execution_loop(self) -> None:
And we can rename the original execute_model -> driver_execute_model.
@zhuohan123 I didn't make this change yet because it's actually called from a lot of places, including several in the speculative decoding logic. Please confirm and I can change it everywhere.
@@ -676,7 +676,11 @@ def step(self) -> List[RequestOutput]:
        else:
            output = []

-       return self._process_model_outputs(output, scheduler_outputs)
+       outputs = self._process_model_outputs(output, scheduler_outputs)
Will this return finished outputs twice? Will the LLM object get duplicate outputs for the same request?
@hengxinCheung sorry, I'm not sure I understand the question. This PR doesn't change anything w.r.t. how many outputs are returned.
I am sorry for confusing you. Let me provide a more detailed description: suppose request A is marked as finished in the current execution, but it is still scheduled in the next step. Will this request then return its last generated text twice? I will carefully read your implementation again. Thanks for your reply.
Thanks for the review @zhuohan123 and great comments. I have addressed most of them but have a couple of small follow-up questions, PTAL!
I'd been considering this, but I'm not sure that NCCL is intended to be used in this way, i.e. blocking indefinitely in an event loop. So we could run into unexpected issues. Apart from this, there are some consequences we'd have to address:
Given the above, I thought the current PR changes would make more sense as a first incremental change. But I do like the idea of avoiding the secondary RPC path altogether. Perhaps gloo could instead be used for the event loop, along the lines of what @youkaichao has been looking at.
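For context, a hedged sketch of what a gloo-backed control plane could look like: a separate CPU process group used only to broadcast per-step metadata to looping workers, while NCCL remains for the tensor collectives. This is purely illustrative and not part of this PR; the function names are made up, only the torch.distributed calls are real.

# Illustrative sketch; assumes the usual torch.distributed env vars
# (MASTER_ADDR, MASTER_PORT, ...) are set by the launcher.
import torch.distributed as dist


def init_control_plane_group(world_size: int, rank: int):
    # Default group on NCCL for GPU tensor collectives.
    dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)
    # A second, gloo-backed group over the same ranks for cheap
    # CPU-side object broadcasts (the "event loop" traffic).
    return dist.new_group(backend="gloo")


def broadcast_step_metadata(metadata, gloo_group, rank: int):
    # broadcast_object_list sends arbitrary picklable objects; rank 0
    # (the driver) is the source, all other ranks receive in place.
    obj = [metadata if rank == 0 else None]
    dist.broadcast_object_list(obj, src=0, group=gloo_group)
    return obj[0]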
Force-pushed from 8e56d21 to d95d486
There's no need for the parallel workers to be scheduled each step.
So that any errors are still propagated properly
Default behaviour is no-op (single GPU)
Default behaviour is no-op (single GPU)
@zhuohan123 I've opened #4894 to replace this; it now applies to both the Ray and multiprocessing executor implementations. PTAL!
There's no need for the parallel workers to be scheduled in every step.
Using 80GB A100s, with llama-2-7b via the OpenAI completion API. Single request with 5 input tokens and 2000 generated tokens. I repeated each test request multiple times; the results were very consistent.
Though the relative improvement from this is much smaller in the non-Ray case, it might still be helpful for multi-node deployments with Ray.
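A rough reconstruction of the benchmark request described above, for anyone wanting to reproduce the measurement. The prompt text, port, model name, and repetition count are assumptions, not taken from the PR; only the general shape (single completion request, ~5 input tokens, 2000 generated tokens, repeated a few times) follows the description.

# Hypothetical reproduction script; parameters are assumptions.
import time

import requests


def time_single_completion(host: str = "http://localhost:8000") -> float:
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a city that",  # roughly 5 input tokens
        "max_tokens": 2000,
        "temperature": 0.0,
        "ignore_eos": True,  # force the full 2000 generated tokens
    }
    start = time.time()
    resp = requests.post(f"{host}/v1/completions", json=payload, timeout=600)
    resp.raise_for_status()
    return time.time() - start


if __name__ == "__main__":
    # Repeat the request a few times, as in the measurements above.
    for _ in range(3):
        print(f"latency: {time_single_completion():.2f}s")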