
[Core][Bugfix][Perf] Refactor Server to Avoid AsyncLLMEngine #8092

Closed — wants to merge 33 commits.

Changes shown from 1 commit (commit f3dc82b, "more awk --> ack").
a7a6e43
[Benchmark] Add async throughput benchmark
njhill Aug 28, 2024
ce7d159
wip
njhill Aug 29, 2024
569cd43
Merge remote-tracking branch 'njhill/async-llm-eng-bench' into reduce…
robertgshaw2-redhat Aug 29, 2024
d99ce6f
stash
robertgshaw2-redhat Aug 31, 2024
8d6b2e9
remove proxy
robertgshaw2-redhat Sep 2, 2024
14f3637
stash
robertgshaw2-redhat Sep 2, 2024
3b8311b
added mp_llm_engine
robertgshaw2-redhat Sep 2, 2024
5e2eb74
fixed
robertgshaw2-redhat Sep 2, 2024
aa62f2e
format
robertgshaw2-redhat Sep 2, 2024
863081b
cleanup
robertgshaw2-redhat Sep 2, 2024
965b97a
revert asyncllmengine
robertgshaw2-redhat Sep 2, 2024
8fd72f6
fix nit
robertgshaw2-redhat Sep 2, 2024
ddeb7c6
format
robertgshaw2-redhat Sep 2, 2024
6539e10
Merge branch 'main' into reduce-asyncio-oh
robertgshaw2-redhat Sep 2, 2024
4b111e4
clean
robertgshaw2-redhat Sep 2, 2024
a5ffd2c
fix
robertgshaw2-redhat Sep 2, 2024
1395872
stash
robertgshaw2-redhat Sep 2, 2024
938cf85
move files
robertgshaw2-redhat Sep 2, 2024
72d1d42
cleanup code
robertgshaw2-redhat Sep 3, 2024
fcdcfc9
refactor, cleanup
robertgshaw2-redhat Sep 3, 2024
659169e
updated
robertgshaw2-redhat Sep 3, 2024
9886f3d
make health check work
robertgshaw2-redhat Sep 3, 2024
5b2f057
format
robertgshaw2-redhat Sep 3, 2024
ae4564c
awk -> ack
robertgshaw2-redhat Sep 3, 2024
f9ccecc
add better shutdown
robertgshaw2-redhat Sep 3, 2024
89b730b
cleanup comment
robertgshaw2-redhat Sep 3, 2024
f3dc82b
more awk --> ack
robertgshaw2-redhat Sep 3, 2024
ac97a9e
use constant
robertgshaw2-redhat Sep 3, 2024
becd7ab
format
robertgshaw2-redhat Sep 3, 2024
b7f49ed
remove set to None
robertgshaw2-redhat Sep 3, 2024
58ae3b0
Merge remote-tracking branch 'origin/main' into reduce-asyncio-oh
njhill Sep 4, 2024
d0f9641
Remove redundant pass
njhill Sep 4, 2024
aa64042
Merge branch 'main' into reduce-asyncio-oh
robertgshaw2-redhat Sep 4, 2024
more awk --> ack
robertgshaw2-redhat committed Sep 3, 2024
commit f3dc82b584f3d2db67c87d5e8fc12d498189e902
17 changes: 3 additions & 14 deletions vllm/engine/multiprocessing/mp_llm_engine.py
@@ -124,17 +124,7 @@ def _init_engine(self, *args,
elif self.worker_use_ray:
Collaborator (PR author) commented:

what is `worker_use_ray`?

Member replied:

it is the same as `--distributed_executor_backend=ray`
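To make the alias relationship concrete, here is a minimal sketch of how a `worker_use_ray`-style flag can map onto the executor backend choice. The function name and the `"mp"` default are hypothetical stand-ins, not vLLM's actual API:

```python
def resolve_executor_backend(distributed_executor_backend=None,
                             worker_use_ray=False):
    """Illustrative sketch: worker_use_ray acts as a legacy alias for
    --distributed-executor-backend=ray; otherwise use the explicit
    setting, falling back to multiprocessing in this sketch."""
    if worker_use_ray:
        return "ray"
    return distributed_executor_backend or "mp"
```

So `resolve_executor_backend(worker_use_ray=True)` and `resolve_executor_backend("ray")` select the same backend, which is the point of the reply above.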

engine_class = ray.remote(num_cpus=0)(self._engine_class).remote
else:
# FIXME(woosuk): This is a bit hacky. Be careful when changing the
# order of the arguments.
cache_config = kwargs["cache_config"]
parallel_config = kwargs["parallel_config"]
if (parallel_config.tensor_parallel_size == 1
and parallel_config.pipeline_parallel_size == 1):
num_gpus = cache_config.gpu_memory_utilization
else:
num_gpus = 1
engine_class = ray.remote(num_gpus=num_gpus)(
self._engine_class).remote
raise NotImplementedError("Not supported yet!")
return engine_class(*args, **kwargs)

def run_background_loop(self):
robertgshaw2-redhat (Collaborator, PR author) commented on Sep 3, 2024:

Inner loop?

How should we make this propagate exceptions and do things like have the BackgroundLoopDeadError?

Collaborator replied:

Is it even worth pulling over the LoopDead logic from the async llm engine?

IIUC that's all there to prevent misconfiguration errors from putting the engine into a state where it only ever responds with an exception, but it looks like it's done pretty bluntly where any exception at all from the model executor will kill the loop. If we want to keep that behavior, we could simply raise from here and exit (after notifying the clients of the exception), and let the frontend die as well.
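The "raise from here and exit after notifying the clients" option suggested above can be sketched without any of the real vLLM machinery. Everything here is illustrative: the function and error names, and the use of a plain queue in place of the ZeroMQ output socket, are assumptions for the sketch:

```python
import pickle
import queue
import threading


class EngineLoopDeadError(RuntimeError):
    """Raised when the engine's inner loop hits an unrecoverable error."""


def run_inner_loop(engine_step, out_queue, shutdown):
    """Run engine steps until shutdown is requested. On any exception,
    send the pickled error to clients first, then re-raise so the engine
    process exits (and the frontend dies with it) instead of lingering
    in a permanently dead state."""
    while not shutdown.is_set():
        try:
            outputs = engine_step()
        except Exception as exc:
            out_queue.put(pickle.dumps(exc))  # notify clients of the failure
            raise EngineLoopDeadError("inner loop terminated") from exc
        out_queue.put(pickle.dumps(outputs))
```

The design choice this encodes: rather than keeping a dead background loop alive that answers every request with the same stored exception, the loop fails fast, so a supervisor (or the frontend) can restart or surface the failure cleanly.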

@@ -218,7 +208,7 @@ def stream_outputs(self, request_outputs: List[RequestOutput]):
self.output_socket.send_multipart((pickle.dumps(request_outputs), ),
copy=False)

def awk_check_health(self):
def ack_check_health(self):
self.health_socket.send_multipart(
(pickle.dumps(VLLM_RPC_SUCCESS_STR), ), copy=False)
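The renamed ack path above pickles a success marker onto the dedicated health socket. A transport-agnostic sketch of both halves of that handshake follows; the constant's value and the `send_multipart` callable are stand-ins here (vLLM sends these frames over ZeroMQ and defines its own `VLLM_RPC_SUCCESS_STR`):

```python
import pickle

VLLM_RPC_SUCCESS_STR = "SUCCESS"  # assumed value for this sketch


def ack_check_health(send_multipart):
    """Server side: acknowledge a health check by sending the pickled
    success marker as a single frame on the health channel."""
    send_multipart((pickle.dumps(VLLM_RPC_SUCCESS_STR),))


def is_healthy(frames):
    """Client side: the engine is healthy iff the first frame unpickles
    to the success marker."""
    return pickle.loads(frames[0]) == VLLM_RPC_SUCCESS_STR
```

Keeping the ack on a separate health socket means a client can time out on health checks independently of the (potentially busy) output stream.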

@@ -255,8 +245,7 @@ def _handle_utility_request(self, request: RPCUtilityRequest):
self.engine.do_log_stats()
elif request == RPCUtilityRequest.CHECK_HEALTH:
self.engine.check_health()
# Special check_health channel for awk check health.
self.awk_check_health()
self.ack_check_health()


def run_mp_engine(engine_args: AsyncEngineArgs, usage_context: UsageContext,