forked from vllm-project/vllm
Support HQT on VLLM #59
Closed
13 commits
4b9b955 support hqt on vllm (nirda7)
f3ffc8c Support HQT on VLLM - KVCache and Mark Step uses (nirda7)
8ffc3d0 HQT on VLLM - prep model and finish measurements and multi cards run (nirda7)
f5f0972 HQT on VLLM - separate kv caches (nirda7)
c521c4d HQT on VLLM - remove code duplications (nirda7)
64c8c7f HQT on VLLM - move matmul and softmax to hpu utils and revert logits … (nirda7)
2e291c5 Move model to hpu when HQT is not used (nirda7)
9d0fbb7 fix CR comments (nirda7)
09e0078 add model weights device load (nirda7)
24847a9 skip replay cached graphs during warmup (nirda7)
90c2527 HQT on VLLM - Enable split value in G3 (nirda7)
f7c2157 pass optimizations flags only in Lazy mode (nirda7)
608123b barak rms norm optimization and a WA to remove transpose nodes (nirda7)
```diff
@@ -61,8 +61,8 @@ def forward(
         orig_shape = x.shape
         residual += x.view(residual.shape)
         # Note: FusedRMSNorm requires 3D tensors as inputs
-        x = FusedRMSNorm.apply(residual.float(), self.weight.float(), self.variance_epsilon)
-        return x.to(orig_dtype).view(orig_shape), residual
+        x = FusedRMSNorm.apply(residual, self.weight, self.variance_epsilon)
+        return x.view(orig_shape), residual
         ops.fused_add_rms_norm(
             x,
             residual,
```
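The hunk above drops the `.float()` upcast and the cast back to the original dtype around `FusedRMSNorm`. As a reference for what the fused add-RMSNorm step computes (accumulate into the residual stream, then RMS-normalize the result), here is a minimal NumPy sketch; the function names are illustrative stand-ins, not vLLM's or HQT's actual API:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Reference RMSNorm: scale each row by the inverse RMS of its elements,
    # then apply the learned per-channel weight.
    variance = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    # The fused-add variant first adds x into the residual stream, then
    # normalizes the updated residual and returns both (as in the diff above).
    residual = residual + x
    return rms_norm(residual, weight, eps), residual
```

Whether skipping the fp32 upcast is numerically safe depends on the `FusedRMSNorm` kernel accumulating in higher precision internally, which is presumably what this change relies on.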
```diff
@@ -72,8 +72,8 @@ def forward(
             return x, residual
         if x.device.type == "hpu" and FusedRMSNorm:
             orig_dtype = x.dtype
-            x = FusedRMSNorm.apply(x.float(), self.weight.float(), self.variance_epsilon)
-            return x.to(orig_dtype)
+            x = FusedRMSNorm.apply(x, self.weight, self.variance_epsilon)
+            return x
         out = torch.empty_like(x)
         ops.rms_norm(
             out,
```

Review comment (on `orig_dtype = x.dtype`): `orig_dtype` is not used, here and also in line 60.
Review comment: why is @hpu_utils.with_mark_steps removed here?

Reply: We want all conversions to/from hf8 to be in the same graph, so we remove this mark step and add one outside the transformer block, aligned with the ohf version.
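For context on the exchange above: a `with_mark_steps`-style decorator typically brackets a module's forward pass with graph-break calls, so removing it lets the block fuse into the surrounding graph. The sketch below is hypothetical; `mark_step` here is a counting stand-in for the real HPU call (on Gaudi, `htcore.mark_step()` from `habana_frameworks.torch.core`), and `with_mark_steps` mimics what a helper like `hpu_utils.with_mark_steps` plausibly does:

```python
from functools import wraps

def mark_step():
    # Stand-in for the HPU graph-break call; here it only records
    # that a graph boundary was requested.
    mark_step.calls += 1
mark_step.calls = 0

def with_mark_steps(fn):
    # Hypothetical sketch of a with_mark_steps-style decorator: it inserts
    # a graph boundary before and after the wrapped function, so the body
    # is compiled as its own graph. Removing the decorator (as in this PR)
    # allows the dtype conversions to stay in one larger graph.
    @wraps(fn)
    def wrapped(*args, **kwargs):
        mark_step()
        out = fn(*args, **kwargs)
        mark_step()
        return out
    return wrapped
```

Moving the single mark step outside the transformer block then keeps each block's to/from-hf8 casts inside one graph instead of splitting them per call.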