[Core] Eliminate parallel worker per-step task scheduling overhead #3763

Closed. njhill wants to merge 6 commits.
Conversation

njhill (Member) commented Mar 31, 2024:

There's no need for the parallel workers to be scheduled in every step.

Using 80GB A100s with llama-2-7b and the OpenAI completions API. Single request with 5 input tokens, 2000 generated tokens. I repeated each test request multiple times; the results were very consistent.

| Configuration | Time (sec) | Difference |
|---|---|---|
| TP=1 | 24.2 | 0 |
| TP=2 | 24.2 | 0 |
| TP=2 with this PR | 20.2 | -17% |
| TP=2 with #3466 (without Ray) | 17.0 | -30% |
| TP=2 with this PR combined with #3466 (without Ray) | 16.7 | -31% |

Though the relative improvement from this is much smaller in the non-Ray case, it might still be helpful for multi-node with Ray.
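
For illustration, here is a minimal, hypothetical sketch of the idea behind this PR: instead of submitting a new Ray task to every worker at each step, the driver enrolls the remote workers in a persistent execution loop once and then drives each step with a collective broadcast. The names and signatures below are assumptions for the sketch, not the actual vLLM code.

```python
# Illustrative sketch only; not the actual vLLM implementation.
import torch
import torch.distributed as dist


@torch.inference_mode()
def start_worker_execution_loop(worker) -> None:
    """Runs on each non-driver (remote) worker.

    The worker blocks on a broadcast from the driver (rank 0), executes the
    model with the broadcast inputs, and leaves the loop when the driver
    broadcasts a stop sentinel (None).
    """
    while True:
        payload = [None]
        # Blocks until the driver broadcasts this step's inputs or a stop signal.
        dist.broadcast_object_list(payload, src=0)
        execute_model_request = payload[0]
        if execute_model_request is None:
            # No more work scheduled; return so the loop can be restarted
            # with a single RPC once new requests arrive.
            return
        worker.execute_model(execute_model_request)
```

With a loop like this, each engine step on the driver costs one broadcast (plus the driver's own model execution) rather than one task submission per remote worker.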

njhill changed the title from "[Core] Eliminate parallel worker inter-token scheduling overhead" to "[Core] Eliminate parallel worker inter-token task scheduling overhead" on Apr 1, 2024
njhill changed the title from "[Core] Eliminate parallel worker inter-token task scheduling overhead" to "[Core] Eliminate parallel worker per-step task scheduling overhead" on Apr 1, 2024
njhill (Member, Author) commented Apr 1, 2024:

I measured about a 5% reduction in latency with this change for a single request with 5 input tokens and 1000 output tokens, using llama-2-70b with TP=4.

WoosukKwon (Collaborator) commented:

@zhuohan123 Kindly ping for PR review.

Yard1 (Collaborator) left a comment:

I think this looks pretty good!

zhuohan123 (Member) left a comment:

Thanks for the contribution! I left some small comments on coding style.

One additional question: do you think it's possible to let the workers be in the model loop forever (so we don't need to call the current halt_model function)?

vllm/executor/ray_gpu_executor.py (outdated review thread)
```
@@ -292,8 +308,7 @@ def _run_workers(
        self,
        method: str,
        *args,
        driver_args: Optional[List[Any]] = None,
        driver_kwargs: Optional[Dict[str, Any]] = None,
        async_remote_only: bool = False,
```
zhuohan123 (Member) commented:

A slightly more accurate name:

```diff
-    async_remote_only: bool = False,
+    remote_worker_only: bool = False,
```

njhill (Member, Author) replied:

@zhuohan123 I've renamed it for now to remote_workers_only_async because, in addition to only running on the remote workers, it also changes the behavior to not wait for the responses and to return the future(s) rather than the resulting outputs.

Let me know if you would still prefer to drop the _async though (or if you can think of a better name... maybe start_remote_workers_only?) and I'll update again.
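
For context, a rough sketch of what the dispatch behavior described here could look like (assumed names and bodies; not the PR's actual code):

```python
# Hypothetical sketch of the flag discussed above; illustrative only.
import ray


def _run_workers(self, method: str, *args,
                 remote_workers_only_async: bool = False, **kwargs):
    # Kick off the method on all remote Ray workers without waiting.
    remote_futures = [
        worker.execute_method.remote(method, *args, **kwargs)
        for worker in self.workers
    ]

    if remote_workers_only_async:
        # Skip the driver worker and hand the unresolved futures back to the
        # caller instead of the resulting outputs.
        return remote_futures

    # Normal path: also run the method on the driver worker and block until
    # all remote results are available.
    driver_output = getattr(self.driver_worker, method)(*args, **kwargs)
    return [driver_output] + ray.get(remote_futures)
```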

```python
            blocks_to_swap_out=blocks_to_swap_out,
            blocks_to_copy=blocks_to_copy)

    def halt_model(self) -> None:
```
zhuohan123 (Member) commented:

Possibly a more straightforward name:

```diff
-    def halt_model(self) -> None:
+    def stop_remote_worker_execution_loop(self) -> None:
```

njhill (Member, Author) replied:

I've made this change, but the reason for the original name is that it's defined in the abstract base class, where there is no concept of a remote worker; "halt_model" seemed like a more abstract way of describing it.
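
Concretely, stopping the loop would amount to broadcasting the stop sentinel that the worker-side loop waits on; a hypothetical sketch, matching the loop sketched earlier in this thread:

```python
import torch.distributed as dist


def stop_remote_worker_execution_loop(self) -> None:
    # Broadcasting the stop sentinel unblocks every remote worker's pending
    # broadcast, so the worker-side execution loop returns.
    dist.broadcast_object_list([None], src=0)
```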

```python
                             self.gpu_cache)

    @torch.inference_mode()
    def execute_model_parallel(self) -> None:
```
zhuohan123 (Member) commented:

```diff
-    def execute_model_parallel(self) -> None:
+    def start_worker_execution_loop(self) -> None:
```

And we can rename the original execute_model -> driver_execute_model

njhill (Member, Author) replied:

@zhuohan123 I didn't make this change yet because it's actually called from a lot of places, including a number of places in the speculative decoding logic. Please confirm and I can change it everywhere.

vllm/worker/worker.py (outdated review thread)
```diff
@@ -676,7 +676,11 @@ def step(self) -> List[RequestOutput]:
         else:
             output = []
 
-        return self._process_model_outputs(output, scheduler_outputs)
+        outputs = self._process_model_outputs(output, scheduler_outputs)
```

hengxinCheung commented:

Will this return finished outputs twice? Will the LLM object get duplicate outputs for the same request?

njhill (Member, Author) replied:

@hengxinCheung sorry, I'm not sure I understand the question. This PR doesn't change anything w.r.t. how many outputs are returned.

hengxinCheung replied on Apr 9, 2024:

I am sorry for confusing you. Let me provide a more detailed description. For example, suppose request A is marked as finished in the current execution but is still scheduled in the next step. Would this request then return its last generated text twice? I will carefully read your implementation again. Thanks for your reply.

njhill (Member, Author) commented Apr 9, 2024:

Thanks for the review @zhuohan123, and great comments. I have addressed most of them but have a couple of small follow-up questions, PTAL!

> One additional question: do you think it's possible to let the workers be in the model loop forever (so we don't need to call the current halt_model function)?

I'd been considering this, but I'm not sure that NCCL is intended to be used in this way, i.e. blocking indefinitely in an event loop, so we could run into unexpected issues. Apart from this, there are some consequences we'd have to address:

  • NCCL has a timeout, but unless you run in a sub-optimal synchronous mode, it's always treated as fatal. We could possibly disable the timeout, but that might be undesirable. Or we could ensure that the driver wakes up the workers on some interval smaller than the timeout (see the sketch after this comment).
  • I think the processes are essentially frozen while in the collective op, so no other threads can run; this might be due to the GIL being held. So further changes might then be needed to accommodate other control-plane operations, in particular add/remove LoRAs. If we went this route, we may also want to handle these via the same broadcast + gather mechanism.

Given the above, I thought the current PR changes would make more sense as a first incremental change. But I do like the idea of avoiding the secondary RPC path altogether. Perhaps gloo could be instead used for the event loop along the lines of what @youkaichao has been looking at.
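
To make the first bullet concrete, here is a hypothetical sketch of the keep-alive idea (wake the workers on an interval shorter than the NCCL timeout); the interval and names are assumptions, not part of this PR:

```python
# Illustrative only; the worker-side loop would also need to recognize the
# keep-alive sentinel and simply continue waiting.
import time

import torch.distributed as dist

KEEPALIVE_INTERVAL_S = 60.0  # assumed to be well below the NCCL collective timeout


def driver_keepalive(last_broadcast_time: float) -> float:
    """Called periodically by the driver while no requests are runnable."""
    now = time.monotonic()
    if now - last_broadcast_time >= KEEPALIVE_INTERVAL_S:
        # A no-op payload that completes the workers' blocking broadcast
        # before the NCCL timeout would otherwise fire.
        dist.broadcast_object_list([{"keepalive": True}], src=0)
        return now
    return last_broadcast_time
```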

youkaichao (Member) commented:
@njhill FYI, #3904 just adds a gloo backend by default. That process group is available as _CPU_WORLD_GROUP. I will create some APIs for getting the group.
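
For reference, creating such a CPU (gloo) group alongside the default NCCL group looks roughly like the sketch below; this is an illustration of the idea, not the #3904 code, and `_CPU_WORLD_GROUP` itself is internal to vLLM's parallel state.

```python
# Hypothetical sketch; assumes the default (NCCL) process group is already initialized.
import torch.distributed as dist


def init_cpu_world_group(world_size: int):
    """Create a gloo-backed process group spanning all ranks."""
    return dist.new_group(ranks=list(range(world_size)), backend="gloo")


def cpu_broadcast(obj, cpu_group, src: int = 0):
    # Control-plane messages go over gloo (CPU), so blocking here does not
    # occupy a CUDA stream or depend on the NCCL collective timeout.
    payload = [obj]
    dist.broadcast_object_list(payload, src=src, group=cpu_group)
    return payload[0]
```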

njhill added 5 commits on April 15, 2024:

  • There's no need for the parallel workers to be scheduled each step.
  • So that any errors are still propagated properly
  • Default behaviour is no-op (single GPU)
  • Default behaviour is no-op (single GPU)
njhill (Member, Author) commented May 18, 2024:

@zhuohan123 I've opened #4894 to replace this; it now applies to both the Ray and multiprocessing executor implementations. PTAL!

njhill closed this on May 18, 2024.