
[FEATURE] Implement Dynamic SplitFuse #1562

Closed
casper-hansen opened this issue Nov 4, 2023 · 7 comments
Labels
feature request performance Performance-related issues

Comments

@casper-hansen (Contributor)

Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),

DeepSpeed has released its serving framework, which it claims is faster than vLLM. The main speedup comes from Dynamic SplitFuse, a technique that does the following:

  • Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations) with only the final pass performing any generation.

  • Short prompts will be composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.
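To make the two bullets above concrete, here is a minimal, hypothetical sketch of the scheduling idea (not vLLM's or DeepSpeed's actual implementation): every forward pass is filled to a fixed token budget, long prompts are split into chunks that span multiple passes, and short prompts are packed together to fill the remainder. The function name and data shapes are illustrative assumptions.

```python
# Hypothetical sketch of Dynamic SplitFuse-style scheduling.
# prompts: list of (prompt_id, num_tokens) pairs.
# Returns a list of iterations; each iteration is a list of
# (prompt_id, chunk_tokens) summing to at most `budget` tokens.
def schedule(prompts, budget):
    iterations = []
    current, used = [], 0
    remaining = list(prompts)
    i = 0
    while i < len(remaining):
        pid, n = remaining[i]
        # Take as much of this prompt as fits in the current pass.
        take = min(n, budget - used)
        current.append((pid, take))
        used += take
        if take < n:
            # Long prompt: the rest is carried over to the next pass.
            remaining[i] = (pid, n - take)
        else:
            i += 1
        if used == budget or i == len(remaining):
            # Budget exactly met (or no prompts left): close this pass.
            iterations.append(current)
            current, used = [], 0
    return iterations
```

For example, with a budget of 64 tokens, `schedule([("a", 100), ("b", 30), ("c", 20)], 64)` splits prompt `a` across two passes and packs `b` and `c` into the remaining space, so every pass except possibly the last is filled to exactly 64 tokens.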

Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

Llama 13B (1x A100-80GB): [benchmark figure]

Llama 70B (4x A100-80GB with TP): [benchmark figure]

@WoosukKwon WoosukKwon added the enhancement New feature or request label Nov 7, 2023
@WoosukKwon WoosukKwon added the performance Performance-related issues label Nov 9, 2023
@irasin (Contributor) commented Nov 14, 2023

LGTM

@thesues (Contributor) commented Dec 20, 2023

Hi, is there any progress on this?

@shixianc commented Jan 7, 2024

Do we have an ETA? 😊

@tdene commented Feb 20, 2024

Hi @WoosukKwon @zhuohan123

The absence of a chunked prefill implementation in vLLM is a major blocker for us. Any kind of timeline, or regular communication on progress toward a chunked prefill implementation, would be immensely helpful for future planning.

@sh1ng (Contributor) commented Feb 29, 2024

Keeping batches with aligned lengths definitely helps: #2357

@njhill (Member) commented Feb 29, 2024

Looks like someone has started working on this: #3106

@hmellor (Collaborator) commented Jul 26, 2024

Chunked prefill is now supported.

@hmellor hmellor closed this as completed Jul 26, 2024