[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] #7509
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
LGTM. Can you add a unit test?
Added a unit test case in the last commit that covers this bug when the user does not set up BatchDecodeWithPagedKVCacheWrapper properly.
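For context, here is a minimal sketch of how the flashinfer decode wrapper is typically constructed with the tensor-core path enabled. The workspace size, head counts, and layout string below are assumptions for illustration, not the exact unit test added in this PR:

```python
import torch
import flashinfer

# Workspace buffer required by flashinfer (size chosen arbitrarily here).
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

# Hypothetical GQA shape: 48 query heads / 8 KV heads -> group_size = 6,
# which is outside the [1, 2, 4, 8] set handled by the decode-only kernel.
num_qo_heads, num_kv_heads = 48, 8
group_size = num_qo_heads // num_kv_heads

# Constructing the wrapper with use_tensor_cores=True routes such group sizes
# to a kernel path that supports them; leaving it False is what triggers the
# reported crash.
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace, "NHD", use_tensor_cores=(group_size not in (1, 2, 4, 8))
)
```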
/ready
When will this PR be merged?
@learninmou could you rebase to see if the CI failure is already fixed in the main branch?
Head branch was pushed to by a user without write access
Rebase finished, all checks have passed.
Hi, we recently discovered that this PR caused an 11.4% perf regression, tested on H200 machines with the command below. Have we done a perf test on this PR?
Hmm, this PR shouldn't introduce a performance regression for existing workloads, as it attempts to fix previously uncovered cases. Is your vLLM benchmark run before and after exactly this PR, with the same environment? If so, is it possible for you to identify the root cause? It could be one of the following causes I can think of:
Also, which model and GPU are you benchmarking? Thanks.
I'm not exactly sure why this would cause a perf regression yet, but I reran the benchmark and it shows about an 18% slowdown:
Before: Throughput: 10.56 requests/s, 5409.15 tokens/s
The benchmark was performed with Llama3-70B on H200. Is it possible to revert this for now, pending further investigation?
OK, I probably know the reason. Before this PR:
After this PR:
I'll file a PR to fix this, and you could test that PR to see if it helps.
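The discussion above centers on how the decode wrapper's use_tensor_cores flag is derived from the GQA group size. Below is a rough sketch of the kind of selection logic being debated; the exact rule and thresholds are assumptions, not the code from this PR or the follow-up fix:

```python
def select_use_tensor_cores(num_qo_heads: int, num_kv_heads: int) -> bool:
    """Illustrative heuristic for choosing the flashinfer decode path.

    Group sizes outside [1, 2, 4, 8] need the tensor-core path to avoid the
    crash fixed here; whether to also prefer tensor cores for the supported
    group sizes is the performance question raised above.
    """
    group_size = num_qo_heads // num_kv_heads
    if group_size not in (1, 2, 4, 8):
        return True  # required to avoid the crash
    return group_size > 4  # assumed performance heuristic, not from this PR
```

For reference, a Llama3-70B-style configuration has 64 query heads and 8 KV heads, i.e. group_size = 8, so the choice made for the supported group sizes directly affects the benchmark numbers reported above.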
Sorry, I just noticed this PR; this fix will degrade performance. The reason flashinfer compiles kernels with
It would be great if you could set me as a reviewer for pull requests related to flashinfer; I always miss mentions because of the huge number of notifications.
When I use the flashinfer backend with CUDA graph disabled and load a model with attention group_size=6, vLLM crashes and shows the following log:
This error consistently occurs under the following conditions:
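For reference, the attention group size discussed throughout this thread is the ratio of query heads to KV heads in a grouped-query-attention model. A minimal illustration, with hypothetical head counts chosen to produce the reported group_size of 6:

```python
# Hypothetical GQA configuration that reproduces the reported condition.
num_attention_heads = 48  # query heads
num_key_value_heads = 8   # KV heads

group_size = num_attention_heads // num_key_value_heads
assert group_size == 6  # not in [1, 2, 4, 8] -> hits the crashing code path
```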