[bucketing overhaul 1/n] Add padding-aware scheduling and option to limit prefill batch size #394

kzawora-intel · 2024-10-15T13:18:27Z

This PR adds the following functionality, which can be enabled via engine flags:

- `use_padding_aware_scheduling`: the vLLM scheduler calculates token cost based on the padded prefill shape (similar to "schedule prefills considering padded shape" #109).
- `max_num_prefill_seqs`: the padding-aware scheduler performs an additional check on the prefill batch size and limits it to at most `max_num_prefill_seqs`. If unset, the maximum prefill batch size is `max_num_seqs`.

Both features are generic and do not require HPU, although they may be specialized for a particular vendor's usage. Padding-aware scheduling includes a padding function selector that picks the HPU padding function (which takes the currently used HPU buckets into account) if the current device is HPU. Otherwise, it uses the product batch_size x max_seq_len.
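
The check described above can be illustrated with a minimal sketch. This is not the code from this PR: the function names, the `token_budget` parameter, and the bucket lists are hypothetical and only model the behaviour stated in the description (padded cost as batch_size x max_seq_len, an optional bucket-aware variant standing in for the HPU padding function, and a prefill batch size cap that falls back to `max_num_seqs`).

```python
from typing import List, Optional, Sequence


def round_up_to_bucket(value: int, buckets: Sequence[int]) -> int:
    """Hypothetical bucket lookup: return the smallest bucket that fits `value`."""
    for bucket in sorted(buckets):
        if bucket >= value:
            return bucket
    return value  # assumption: no padding if no bucket is large enough


def padded_prefill_tokens(
    batch_size: int,
    max_seq_len: int,
    bs_buckets: Optional[Sequence[int]] = None,
    seq_buckets: Optional[Sequence[int]] = None,
) -> int:
    """Token cost of a prefill batch after padding.

    Without buckets this is the generic fallback from the description:
    batch_size x max_seq_len. With buckets (a stand-in for a device-specific
    padding function) both dimensions are first rounded up to a bucket.
    """
    if bs_buckets is not None:
        batch_size = round_up_to_bucket(batch_size, bs_buckets)
    if seq_buckets is not None:
        max_seq_len = round_up_to_bucket(max_seq_len, seq_buckets)
    return batch_size * max_seq_len


def can_schedule_prefill(
    scheduled_lens: List[int],
    new_len: int,
    token_budget: int,
    max_num_seqs: int,
    max_num_prefill_seqs: Optional[int] = None,
    bs_buckets: Optional[Sequence[int]] = None,
    seq_buckets: Optional[Sequence[int]] = None,
) -> bool:
    """Decide whether one more prompt fits into the current prefill batch."""
    batch_size = len(scheduled_lens) + 1

    # Prefill batch size cap; falls back to max_num_seqs when unset.
    limit = max_num_prefill_seqs if max_num_prefill_seqs is not None else max_num_seqs
    if batch_size > limit:
        return False

    # Padding-aware cost: every prompt is charged at the padded length.
    max_seq_len = max(scheduled_lens + [new_len])
    cost = padded_prefill_tokens(batch_size, max_seq_len, bs_buckets, seq_buckets)
    return cost <= token_budget
```

With a hypothetical budget of 4096 tokens, prompts of length 100, 900, 1000 and 1000 already scheduled, and a fifth prompt of length 1000, the unpadded sum is 4000 tokens and would fit, but the padded cost is 5 x 1000 = 5000, so `can_schedule_prefill` returns `False`. This is the over-commitment that padding-aware scheduling is meant to avoid.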

kdamaszk

LGTM

kzawora-intel added 2 commits October 15, 2024 16:11

Add padding-aware scheduling

38b044b

format.sh

3ec55be

kdamaszk reviewed Oct 15, 2024

vllm/worker/hpu_model_runner.py (outdated, resolved)

kzawora-intel added 2 commits October 15, 2024 16:41

fix scheduler bugs

ea1ffaa

remove debug stuff

c888889

michalkuligowski reviewed Oct 15, 2024

vllm/core/scheduler.py (resolved)

vllm/core/scheduler.py (outdated, resolved)

vllm/worker/hpu_model_runner.py (resolved)

kzawora-intel changed the title ~~Add padding-aware scheduling and option to limit prefill batch size~~ [bucketing overhaul 1/n] Add padding-aware scheduling and option to limit prefill batch size Oct 15, 2024

kzawora-intel mentioned this pull request Oct 15, 2024

[bucketing overhaul 2/n] Delegate bucket management to HPUBucketingContext #395

Closed

kdamaszk approved these changes Oct 17, 2024

madamczykhabana approved these changes Oct 17, 2024

Merge branch 'habana_main' into private/kzawora/padding_aware_scheduling

231df85

kzawora-intel merged commit 05bcdf5 into habana_main Oct 17, 2024
19 checks passed

tae-su-kim mentioned this pull request Oct 18, 2024

schedule prefills considering padded shape #109

Closed

xuechendi added a commit to xuechendi/vllm-fork that referenced this pull request Oct 23, 2024

Revert "[bucketing overhaul 1/n] Add padding-aware scheduling and opt…

1b4b7cd

…ion to limit prefill batch size (HabanaAI#394)" This reverts commit 05bcdf5.

kzawora-intel mentioned this pull request Nov 6, 2024

Draft: Add max-num-prefill-seqs parameter #253

Closed

kzawora-intel added the habana label (Issues or PRs submitted by Habana Labs) Nov 8, 2024
