[FEATURE] Implement Dynamic SplitFuse #1562
Comments
LGTM
Hi, is there any progress right now?
Do we have an ETA? 😊
The absence of a chunked prefill implementation in vLLM is a major blocker. Any kind of timeline or regular communication on progress towards a chunked prefill implementation would be immensely helpful, just to allow for future planning.
Keeping a batch with aligned length definitely helps #2357
Looks like someone has started working on this: #3106
Chunked prefill is now supported |
Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),
DeepSpeed has released its serving framework which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse which is a technique that does the following:
- Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
- Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well-aligned.
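To make the two behaviors above concrete, here is a minimal, self-contained sketch of a SplitFuse-style token-budget scheduler. All names (`Request`, `schedule_passes`, `budget`) are illustrative and are not the DeepSpeed-MII or vLLM APIs; this only demonstrates the chunk-and-pack idea, not the real implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Request:
    rid: int        # request id
    remaining: int  # prompt tokens not yet prefilled

def schedule_passes(requests: List[Request], budget: int) -> List[List[Tuple[int, int]]]:
    """Plan forward passes so each pass processes at most `budget` prompt tokens.

    Long prompts are split into budget-sized chunks across passes; short prompts
    are packed together to fill the budget. Returns, per pass, the list of
    (request id, chunk size) pairs it runs.
    """
    passes = []
    # Work on copies so callers' Request objects are not mutated.
    pending = [Request(r.rid, r.remaining) for r in requests]
    while pending:
        batch, used = [], 0
        for req in pending:
            if used == budget:
                break  # pass is exactly full
            take = min(req.remaining, budget - used)  # chunk a long prompt if needed
            batch.append((req.rid, take))
            req.remaining -= take
            used += take
        pending = [r for r in pending if r.remaining > 0]
        passes.append(batch)
    return passes

# Example: one 2048-token prompt and two 256-token prompts, 1024-token budget.
# The long prompt is chunked across two full passes; the short prompts are
# packed together into the final pass.
plan = schedule_passes([Request(0, 2048), Request(1, 256), Request(2, 256)], budget=1024)
for i, batch in enumerate(plan):
    print(f"pass {i}: {batch}")
```

This yields three passes: `[(0, 1024)]`, `[(0, 1024)]`, and `[(1, 256), (2, 256)]`, showing how every pass except the last runs at a well-aligned size.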
Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen
Llama 13B (1x A100-80GB):
(figure: throughput/latency comparison, Llama 13B on 1x A100-80GB)
Llama 70B (4x A100-80GB with TP):
(figure: throughput/latency comparison, Llama 70B on 4x A100-80GB with tensor parallelism)