[Hardware][Gaudi][Bugfix] Fix HPU tensor parallelism, enable multiprocessing executor #12167
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
LGTM. @youkaichao Can you take a second look?
LGTM, thanks for the fix!
#11256 changed the input broadcasting flow slightly, and in a tensor-parallel scenario we can no longer assume that all workers will receive `execute_model_req` from the executor. Since the broadcast from the driver worker is now performed within `LocalOrDistributedWorkerBase.prepare_input`, called by `LocalOrDistributedWorkerBase.execute_model`, non-driver workers can expect `execute_model_req` to be None within `HPUWorker.execute_model` (which is essentially a wrapper around `LocalOrDistributedWorkerBase.execute_model` with some additional HPU-specific profiling sugar). Asserting that `execute_model_req is not None` therefore breaks tensor parallelism, and this PR fixes that. Additionally, it enables usage of the multiprocessing executor on HPU, as it is fully functional there (as introduced in #11030).
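For illustration, here is a minimal sketch of the control flow the description implies. The class and attribute names (`HPUWorkerSketch`, `_base`) are hypothetical stand-ins, not the PR's actual code; the real `HPUWorker` subclasses `LocalOrDistributedWorkerBase` rather than wrapping an instance of it.

```python
from typing import Any, List, Optional


class HPUWorkerSketch:
    """Illustrative stand-in for HPUWorker; names are hypothetical."""

    def __init__(self, is_driver_worker: bool, base_worker: Any) -> None:
        self.is_driver_worker = is_driver_worker
        # Plays the role of LocalOrDistributedWorkerBase in this sketch.
        self._base = base_worker

    def execute_model(
        self,
        execute_model_req: Optional[Any] = None,
    ) -> Optional[List[Any]]:
        # Before the fix, an `assert execute_model_req is not None` here
        # failed on non-driver tensor-parallel workers: only the driver
        # receives the request; the others get their inputs via the
        # broadcast inside LocalOrDistributedWorkerBase.prepare_input.
        if execute_model_req is not None:
            # HPU-specific profiling that inspects the request can only
            # run where the request actually exists (the driver worker).
            pass
        # The base class handles both the driver path (broadcasts inputs)
        # and the non-driver path (receives the broadcast).
        return self._base.execute_model(execute_model_req)
```

The key design point is that the None case is now a normal, expected path for non-driver workers rather than an error condition.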