
Deadlock detected #339
Open · flexwang opened this issue Dec 3, 2023 · 13 comments

flexwang commented Dec 3, 2023

Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations.

I constantly see this issue when running the code below on an A100 (40GiB) with llama2-7b.

    import mii
    from deepspeed.inference import RaggedInferenceEngineConfig, DeepSpeedTPConfig
    from deepspeed.inference.v2.ragged import DSStateManagerConfig

    # Example values so the snippet is self-contained; in my run the model is
    # llama2-7b on a single A100 (40GiB) and prompts is a list of strings.
    model = 'meta-llama/Llama-2-7b-hf'  # substitute your model path
    tensor_parallel = 1
    prompts = ['...']                   # batch of prompts
    max_new_tokens = 256                # example value

    tp_config = DeepSpeedTPConfig(tp_size=tensor_parallel)
    mgr_config = DSStateManagerConfig(max_ragged_batch_size=1024,
                                      max_ragged_sequence_count=1024)
    inference_config = RaggedInferenceEngineConfig(tensor_parallel=tp_config,
                                                   state_manager=mgr_config)
    llm = mii.serve(
        model,
        deployment_name='mii',
        tensor_parallel=tensor_parallel,
        inference_engine_config=inference_config,
        replica_num=1,
        task='text-generation'
    )
    outputs = llm.generate(prompts,
                           do_sample=False,
                           top_p=1.0,
                           max_new_tokens=max_new_tokens)
mrwyattii (Contributor) commented

The deadlock is detected when we see that we are not making progress on any of the generation tasks. This can happen for a few reasons, including lots of concurrent generation requests, very long sequences, or limited GPU memory. Our current solution for this will hurt performance if you are seeing it often. How many requests are you sending to the server at a time?

Also, I believe @tohtana is working on an improved solution to this problem.

flexwang (Author) commented Dec 4, 2023

I am sending a few hundred requests within one batch.

mrwyattii (Contributor) commented

I am sending a few hundred requests within one batch.

If these requests are generating lots of tokens, then sending this many at once will definitely cause the deadlock situation. If you can send the requests in smaller batches, that would avoid the problem. However, I will let @tohtana comment on any upcoming changes that will allow users to send large batches of requests at once!
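
For reference, here is a minimal sketch of that "smaller batches" workaround (an editorial illustration, not from the maintainers). It reuses the llm client, prompts, and max_new_tokens from the snippet in the original report; chunk_size is an example value to tune:

    # Submit the prompts in smaller chunks so fewer requests compete for
    # KV-cache blocks at the same time.
    chunk_size = 32  # example value; tune for your prompt/generation lengths

    outputs = []
    for i in range(0, len(prompts), chunk_size):
        chunk = prompts[i:i + chunk_size]
        batch_outputs = llm.generate(chunk,
                                     do_sample=False,
                                     top_p=1.0,
                                     max_new_tokens=max_new_tokens)
        outputs.append(batch_outputs)  # collect per-chunk results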

tohtana (Contributor) commented Dec 6, 2023

Hi @flexwang,
DeepSpeed-FastGen (MII) allocates KV cache for all requests that are processed in a batch. To avoid this warning, a simple workaround is to reduce the number of requests in a batch. In your case, I recommend starting with 10-20 requests, though the optimal number heavily depends on the lengths of the prompts and the generated tokens. If you don't encounter the warning message, you may be able to further enhance efficiency by gradually increasing the number of requests.

We understand that tuning the number of requests isn't always straightforward, and we're considering either automating this adjustment or at least making it easier in future versions.
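
As an illustration of that tuning process (an editorial sketch, not an official recipe), one could time a few batch sizes against the same llm client from the original snippet, starting around 10-20 and growing gradually while watching the server logs for the deadlock warning:

    import time

    # Probe increasing batch sizes; stop growing once the deadlock warning
    # reappears in the server logs or per-prompt latency stops improving.
    for batch_size in (10, 20, 40, 80):
        sample = prompts[:batch_size]
        start = time.time()
        llm.generate(sample, do_sample=False, top_p=1.0,
                     max_new_tokens=max_new_tokens)
        elapsed = time.time() - start
        print(f"batch_size={batch_size}: {elapsed:.1f}s total, "
              f"{elapsed / batch_size:.2f}s per prompt")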

Tan-YiFan commented

vLLM implements swapping (Section 4.5 of the vLLM paper) as an alternative to recomputation when no space can be allocated for the KV cache of new tokens. Would MII implement KV cache swapping?

canamika27 commented

Hi, any update on this issue?

I am also getting the same issue even with a batch size of 1 (using 2 x A100 80GB), but when I use a single A100 80GB I am able to run even with larger batches.


tohtana (Contributor) commented Feb 23, 2024

@canamika27 I think #403 resolved the issue. Can you try the latest version?

canamika27 commented Feb 26, 2024

@tohtana -- Thanks! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error:
assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The batch size is 1
The batch size is 1
The total time is 3.23 secs
The batch size is 1
The total time is 4.21 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
The total time is 3.15 secs
The batch size is 1
Traceback (most recent call last):
  File "/home/AutoAWQ/Digi_human/TP_DP/Deepspeed/test_deepspeed.py", line 37, in <module>
    response = pipe(prompts, max_new_tokens=256)
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 550, in __call__
    self.schedule_requests()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 334, in schedule_requests
    self.reset_request_status()
  File "/home/anaconda3/envs/mlc/lib/python3.10/site-packages/mii/batching/ragged_batching.py", line 359, in reset_request_status
    assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
AssertionError: Function to clear the KV cache is invoked, but no request consumes KV cache
[2024-02-26 00:11:22,782] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766664
[2024-02-26 00:11:25,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 766665

One observation from my end: I am currently using 2 x A100 80GB, and my prompts are approximately 1000-2000 tokens. When I reduce the prompt length to around 200 tokens it works with batch size 1, but not with larger batches, so it seems we cannot run long prompts. This issue happens only when I use 2 GPUs; with 1 GPU I am able to run with large batches and long prompts.

zoyopei commented Mar 19, 2024

@tohtana -- Thanks! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error:
assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
[...]

I got the same error.

geoyg commented Mar 22, 2024

@tohtana -- Thanks! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error:
assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
[...]

same error

aspiridon0v commented

@tohtana -- Thanks! The deadlock issue is solved with the latest DeepSpeed version, but now I get a new error:
assert last_r is not None, "Function to clear the KV cache is invoked, but no request consumes KV cache"
[...]

I have the same problem.

prabin525 commented

Any update on this? I am also getting the same error.

seven-mile commented Jul 30, 2024

Any workaround for the new problem? @arashb Sorry for the ping, can you help?

I just want to run inference serially, but I hit this error after exactly 3 pipeline calls. It reproduces consistently with mixtral8x7b on two machines.

Related issue: #497
