[Core] Eliminate parallel worker per-step task scheduling overhead #3763
Conversation
I measured about a 5% reduction in latency with this change for a single request with 5 input tokens and 1000 output tokens, using llama-2-70b with TP=4.
@zhuohan123 Kindly pinging for PR review.
I think this looks pretty good!
Thanks for the contribution! Left some small comments on coding style.
One additional question: do you think it's possible to let the workers stay in the model loop forever, so that we don't need to call the current halt_model function?
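For illustration, here is a minimal sketch of the loop-plus-stop-signal pattern being discussed. The class and helper names are hypothetical, not taken from the PR; the point is that parallel workers block in a loop and the driver releases them with a sentinel.

# Hypothetical sketch only; names and the transport mechanism are assumptions.
from typing import Optional


class WorkerLoopSketch:

    def start_worker_execution_loop(self) -> None:
        """Run on each non-driver worker: keep executing until told to stop."""
        while True:
            data = self._receive_from_driver()  # e.g. some collective broadcast
            if data is None:  # driver signalled "halt"
                return
            self._execute_model_step(data)

    def stop_remote_worker_execution_loop(self) -> None:
        """Run on the driver: release the workers from the loop."""
        self._send_to_workers(None)

    # Placeholders for whatever RPC/collective actually carries the
    # per-step metadata in a real implementation.
    def _receive_from_driver(self) -> Optional[dict]: ...
    def _send_to_workers(self, data: Optional[dict]) -> None: ...
    def _execute_model_step(self, data: dict) -> None: ...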
vllm/executor/ray_gpu_executor.py (Outdated)
@@ -292,8 +308,7 @@ def _run_workers(
        self,
        method: str,
        *args,
        driver_args: Optional[List[Any]] = None,
        driver_kwargs: Optional[Dict[str, Any]] = None,
        async_remote_only: bool = False,
A slightly more accurate name:
-        async_remote_only: bool = False,
+        remote_worker_only: bool = False,
@zhuohan123 I've renamed it for now to remote_workers_only_async, because in addition to only running on the remote workers, it also changes the behavior to not wait for the responses and to return the future(s) rather than the resulting outputs.
Let me know if you would still prefer to drop the _async though (or if you can think of a better name... maybe start_remote_workers_only?) and I'll update again.
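To make the described behavior concrete, here is an illustrative sketch of what a flag like this could look like inside _run_workers. It assumes Ray actor workers exposing execute_method.remote(...) (similar to vLLM's Ray worker wrapper) and a local driver_worker; it is not the exact code or signature from the PR.

# Illustrative sketch only; class layout and attribute names are assumptions.
from typing import Any, List

import ray


class RayExecutorSketch:

    def __init__(self, driver_worker, workers):
        self.driver_worker = driver_worker
        self.workers = workers  # Ray actor handles for the non-driver workers

    def _run_workers(self,
                     method: str,
                     *args,
                     remote_workers_only_async: bool = False,
                     **kwargs) -> List[Any]:
        # Kick the method off on every remote (non-driver) worker.
        futures = [
            w.execute_method.remote(method, *args, **kwargs)
            for w in self.workers
        ]

        if remote_workers_only_async:
            # Skip the driver worker and don't block: hand back the Ray
            # futures so the caller can wait on them later (or not at all).
            return futures

        # Normal path: also run on the driver worker, then wait for everyone.
        driver_output = getattr(self.driver_worker, method)(*args, **kwargs)
        return [driver_output] + ray.get(futures)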
vllm/executor/ray_gpu_executor.py (Outdated)
            blocks_to_swap_out=blocks_to_swap_out,
            blocks_to_copy=blocks_to_copy)

    def halt_model(self) -> None:
Possibly a more straightforward name:
-    def halt_model(self) -> None:
+    def stop_remote_worker_execution_loop(self) -> None:
I've made this change, but the reason for the original name is that it's defined in the abstract base class where there is no concept of a remote worker. "halt_model" seemed like a more abstract way of describing it.
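Given that the method lives on the abstract executor base class and that the commit messages note the default behaviour is a no-op for single GPU, the base-class shape presumably looks something like the sketch below (class and docstring wording are assumptions, only the renamed method name is taken from the discussion).

# Sketch of the implied base-class default; not copied from the PR.
class ExecutorBaseSketch:

    def stop_remote_worker_execution_loop(self) -> None:
        """Release parallel workers from their execution loop, if any.

        Default is a no-op: a single-GPU executor has no remote workers,
        so there is nothing to stop.
        """
        return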
vllm/worker/worker.py (Outdated)
                           self.gpu_cache)

    @torch.inference_mode()
    def execute_model_parallel(self) -> None:
-    def execute_model_parallel(self) -> None:
+    def start_worker_execution_loop(self) -> None:
And we can rename the original execute_model -> driver_execute_model.
@zhuohan123 I didn't make this change yet because it's actually called from a lot of places, including several in the speculative decoding logic. Please confirm and I can change it everywhere.
@@ -676,7 +676,11 @@ def step(self) -> List[RequestOutput]:
        else:
            output = []

-       return self._process_model_outputs(output, scheduler_outputs)
+       outputs = self._process_model_outputs(output, scheduler_outputs)
Will this return finished outputs twice? Will the LLM object get duplicate outputs for the same request?
@hengxinCheung sorry, I'm not sure I understand the question. This PR doesn't change anything w.r.t. how many outputs are returned.
I am sorry for confusing you. Let me provide a more detailed description: suppose request A is marked as finished in the current execution, but it is still scheduled in the next step. Will this request then return its last generated text twice? I will carefully read your implementation again. Thanks for your reply.
Thanks for the review @zhuohan123 and great comments. I have addressed most of them but have a couple of small follow-up questions, PTAL!
I'd been considering this, but I'm not sure that NCCL is intended to be used in this way, i.e. blocking indefinitely in an event loop. So we could run into unexpected issues. Apart from this, there are some consequences we'd have to address:
Given the above, I thought the current PR changes would make more sense as a first incremental change. But I do like the idea of avoiding the secondary RPC path altogether. Perhaps gloo could instead be used for the event loop, along the lines of what @youkaichao has been looking at.
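For context, a hedged sketch of what a gloo-backed control plane could look like: a separate CPU process group used only to broadcast per-step metadata to looping workers, while NCCL remains for the tensor collectives. This is purely illustrative and not part of this PR; the function names are made up, only the torch.distributed calls are real.

# Illustrative sketch; assumes the usual torch.distributed env vars
# (MASTER_ADDR, MASTER_PORT, ...) are set by the launcher.
import torch.distributed as dist


def init_control_plane_group(world_size: int, rank: int):
    # Default group on NCCL for GPU tensor collectives.
    dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)
    # A second, gloo-backed group over the same ranks for cheap
    # CPU-side object broadcasts (the "event loop" traffic).
    return dist.new_group(backend="gloo")


def broadcast_step_metadata(metadata, gloo_group, rank: int):
    # broadcast_object_list sends arbitrary picklable objects; rank 0
    # (the driver) is the source, all other ranks receive in place.
    obj = [metadata if rank == 0 else None]
    dist.broadcast_object_list(obj, src=0, group=gloo_group)
    return obj[0]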
Force-pushed from 8e56d21 to d95d486
There's no need for the parallel workers to be scheduled each step.
So that any errors are still propagated properly
Default behaviour is no-op (single GPU)
Default behaviour is no-op (single GPU)
@zhuohan123 I've opened #4894 to replace this; it now applies to both the Ray and multiprocessing executor implementations. PTAL!
There's no need for the parallel workers to be scheduled in every step.
Using 80GB A100s, with llama-2-7b via the OpenAI completion API. Single request with 5 input tokens and 2000 generated tokens. I repeated each test request multiple times; the results were very consistent.
Though the relative improvement from this is much smaller in the non-Ray case, it might still be helpful for multi-node deployments with Ray.
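A rough reconstruction of the benchmark request described above, for anyone wanting to reproduce the measurement. The prompt text, port, model name, and repetition count are assumptions, not taken from the PR; only the general shape (single completion request, ~5 input tokens, 2000 generated tokens, repeated a few times) follows the description.

# Hypothetical reproduction script; parameters are assumptions.
import time

import requests


def time_single_completion(host: str = "http://localhost:8000") -> float:
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a city that",  # roughly 5 input tokens
        "max_tokens": 2000,
        "temperature": 0.0,
        "ignore_eos": True,  # force the full 2000 generated tokens
    }
    start = time.time()
    resp = requests.post(f"{host}/v1/completions", json=payload, timeout=600)
    resp.raise_for_status()
    return time.time() - start


if __name__ == "__main__":
    # Repeat the request a few times, as in the measurements above.
    for _ in range(3):
        print(f"latency: {time_single_completion():.2f}s")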