forked from vllm-project/vllm
[Bug]: Unexpected decode graph compilation after preemption #158
Labels
external
Issues or PRs submitted by external users
Comments
This issue is related to the Flat PA feature. Performance is about 2 times higher with this feature, but unexpected decode graphs outside the max-model-len range are still present. Investigation is ongoing.
This was referenced Sep 3, 2024
madamczykhabana pushed a commit that referenced this issue Sep 18, 2024
Fix blocks number calculation for Flat PA via adding empty table_block (#158)
Issue is closed.
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this issue Sep 20, 2024
Fix blocks number calculation for Flat PA via adding empty table_block (HabanaAI#158)
Anything you want to discuss about vllm.
On the habana_next branch of vllm-fork (commit 067a243), preemption can cause unexpected decode graph compilation. It can be reproduced with the following command:
Also, add the following code at line 103 of benchmark_throughput.py to set block_size=128:
Note that setting VLLM_GRAPH_RESERVED_MEM=0.08 and VLLM_GRAPH_PROMPT_RATIO=0.1 captures 100% of the pre-determined prefill and decode graphs. With this setup, 3935 blocks can be allocated.
The early logs are shown below:
While prompt graph misses can be handled easily (see PR 109), the decode graph misses are unexpected. The logs show that the length is longer than max_model_len.
This issue stems from (L958~961) in habana_model_runner.py:
After preemption, real_batch_size drops below the maximum batch size (e.g. 256 → 243). In this case, L961 pads seq_group_metadata_list back to a batch of 256 for bucketing, and the padding entries reuse seq_group_metadata_list[0], which carries the non-empty block_table of the first decode request in the batch. As a result, L697 in habana_model_runner.py creates a DecodeMetadata whose block_tables is longer than the maximum number of pages, which leads to unpredictable sequence lengths for the decode graphs.
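The effect described above can be illustrated with a small standalone sketch. It does not use the vLLM code: `FakeSeqGroup`, `pad_by_repeating_first`, and `flat_block_count` are hypothetical names, and the figure of 16 blocks per request is an assumed value chosen only so the arithmetic is easy to follow. The batch sizes (243 and 256) and the block budget (3935) come from the report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSeqGroup:
    """Hypothetical stand-in for one decode request (not the vLLM class)."""
    block_table: List[int] = field(default_factory=list)


def pad_by_repeating_first(batch: List[FakeSeqGroup], target_size: int) -> List[FakeSeqGroup]:
    # Pad the batch up to target_size by reusing the first entry,
    # including its non-empty block_table.
    padding = [batch[0]] * max(0, target_size - len(batch))
    return batch + padding


def flat_block_count(batch: List[FakeSeqGroup]) -> int:
    # With a flattened block list, every entry's blocks are counted,
    # including those contributed by padding entries.
    return sum(len(seq.block_table) for seq in batch)


BLOCKS_PER_REQUEST = 16   # assumed value, for illustration only
ALLOCATABLE_BLOCKS = 3935  # from the report

# 243 real requests remain after preemption.
real = [
    FakeSeqGroup(block_table=list(range(i * BLOCKS_PER_REQUEST, (i + 1) * BLOCKS_PER_REQUEST)))
    for i in range(243)
]
padded = pad_by_repeating_first(real, 256)

print(flat_block_count(real))    # 243 * 16 = 3888, within the 3935 budget
print(flat_block_count(padded))  # 256 * 16 = 4096, above the 3935 budget
```

In this sketch the 13 padding entries each add another copy of the first request's blocks, so the flattened count exceeds the number of blocks that can actually be allocated even though the real requests fit.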
One way to solve this problem is to pad along the batch dimension with a seq_group_metadata that has no block_table. As a temporary workaround, the sequence length for the decode bucket can be increased, but this increases the memory used by captured graphs.
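A minimal sketch of the suggested padding approach follows. It is an illustration under assumptions, not the change made in the repository: `FakeSeqGroup`, `pad_with_empty_block_tables`, and `flat_block_count` are hypothetical names, and 16 blocks per request is an assumed value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSeqGroup:
    """Hypothetical stand-in for one decode request (not the vLLM class)."""
    block_table: List[int] = field(default_factory=list)


def pad_with_empty_block_tables(batch: List[FakeSeqGroup], target_size: int) -> List[FakeSeqGroup]:
    # Pad along the batch dimension with entries that own no blocks,
    # so padding does not change the total number of referenced blocks.
    missing = max(0, target_size - len(batch))
    return batch + [FakeSeqGroup() for _ in range(missing)]


def flat_block_count(batch: List[FakeSeqGroup]) -> int:
    # Total number of blocks referenced across the whole batch.
    return sum(len(seq.block_table) for seq in batch)


BLOCKS_PER_REQUEST = 16  # assumed value, for illustration only

real = [
    FakeSeqGroup(block_table=list(range(i * BLOCKS_PER_REQUEST, (i + 1) * BLOCKS_PER_REQUEST)))
    for i in range(243)
]
padded = pad_with_empty_block_tables(real, 256)

print(len(padded))               # 256 entries, matching the bucketed batch size
print(flat_block_count(padded))  # same block count as the unpadded batch
```

The batch still reaches the bucketed size of 256, but the block count stays tied to the real requests, so the decode metadata stays within the range the pre-captured graphs were sized for.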