Deadlock detected #339
Comments
The deadlock is reported when we detect that we are not making any progress on any of the generation tasks. This can happen for a few reasons, including lots of concurrent generation requests, very long sequences, or limited GPU memory. Our current solution for this will hurt performance if you are seeing it often. How many requests are you sending to the server at one time? Also, I believe @tohtana is working on an improved solution to this problem.
I am sending a few hundred requests within one batch.
If these requests are generating lots of tokens, then sending this many at once will definitely cause the deadlock situation. If you can send the requests in smaller batches, that would avoid the problem. However, I will let @tohtana comment on any upcoming changes that will allow users to send large batches of requests at once!
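For illustration, here is a minimal sketch of that batching workaround. It assumes the persistent-deployment client API (`mii.client()` / `client.generate()`); the deployment name, batch size, and `max_new_tokens` values below are placeholders, not settings recommended by the maintainers.

```python
# Sketch: send a few hundred prompts in smaller chunks rather than all at once.
# Assumes a running MII deployment; the deployment name, batch size, and
# generation settings below are placeholders.
import mii

client = mii.client("llama2-7b-deployment")  # hypothetical deployment name

def generate_in_batches(prompts, batch_size=32, max_new_tokens=256):
    """Split the prompt list into chunks so the server never sees
    hundreds of concurrent generation requests at the same time."""
    outputs = []
    for start in range(0, len(prompts), batch_size):
        chunk = prompts[start:start + batch_size]
        outputs.extend(client.generate(chunk, max_new_tokens=max_new_tokens))
    return outputs
```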
Hi @flexwang, We understand that tuning the number of requests isn't always straightforward, and we're considering either automating this adjustment or at least making it easier in future versions.
vLLM implements swapping (Section 4.5 of the vLLM paper) as an alternative to recomputation when no space can be allocated for the KV cache of new tokens. Would MII implement KV-cache swapping?
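For readers unfamiliar with the two preemption strategies being compared, here is a toy sketch. It is not vLLM or MII code; the dict-based block pools are simplified stand-ins for a real block-based KV-cache manager.

```python
# Toy illustration of swap vs. recompute preemption for a block-based
# KV cache. Not vLLM or MII code; all structures here are simplified.
from collections import deque

gpu_blocks = {}    # seq_id -> KV-cache blocks resident on GPU
cpu_blocks = {}    # seq_id -> KV-cache blocks swapped out to host memory
waiting = deque()  # preempted sequences awaiting rescheduling

def preempt(seq_id, policy="swap"):
    blocks = gpu_blocks.pop(seq_id, [])
    if policy == "swap":
        # Swapping: keep the computed KV cache by moving it to CPU memory,
        # then copy it back to the GPU when the sequence is resumed.
        cpu_blocks[seq_id] = blocks
    else:
        # Recomputation: discard the cache and rebuild it from the prompt
        # plus previously generated tokens when the sequence runs again.
        pass
    waiting.append(seq_id)
```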
@canamika27 I think #403 resolved the issue. Can you try the latest version?
@tohtana -- Thanks!! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error: "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained." One observation from my end: I am currently using 2 x A100 80GB GPUs, and my prompts are approx. 1000-2000 tokens. When I reduce the prompt length to about 200 tokens it works with batch size 1, but not with larger batches, so it seems we cannot run long prompts. This issue happens only when I use 2 GPUs; with 1 GPU I am able to run with large batches and long prompts.
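As a point of reference for the two-GPU setup described above, a minimal sketch of a tensor-parallel deployment follows. It assumes `mii.serve()` accepts a `tensor_parallel` argument as shown in the MII README; the model path, deployment name, and prompt are placeholders.

```python
# Sketch of a two-GPU (tensor-parallel) persistent deployment with
# DeepSpeed-MII. The model path, deployment name, and generation settings
# are placeholders standing in for the setup described above.
import mii

client = mii.serve(
    "meta-llama/Llama-2-7b-hf",           # placeholder model path
    deployment_name="llama2-7b-deployment",
    tensor_parallel=2,                     # shard the model across both A100s
)

# Long prompts (1000-2000 tokens) reportedly fail in this configuration.
response = client.generate(["<a ~1000-token prompt>"], max_new_tokens=128)
print(response)
```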
I got the same error.
Same error.
I have the same problem.
Any update on this? I am also getting the same error.
I constantly see this issue when running the setup below on an A100 (40GiB) with llama2-7b.