[Bug]: AsyncEngineDeadError: Task finished unexpectedly with qwen2 72b #6208
Comments
ERROR 07-10 11:08:14 async_llm_engine.py:483] Engine iteration timed out. This should never happen!

During handling of the above exception, another exception occurred:
Traceback (most recent call last):

The above exception was the direct cause of the following exception:
Traceback (most recent call last):

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
+1
We have a tracking issue (#5901) for this. Please provide more details there so we can better troubleshoot the underlying cause.
When I used glm4-9b-int8 and qwen2-72b-int4, I hit this problem too.
Me too. Once "Engine iteration timed out. This should never happen!" occurs, the server never responds again.
We encountered the same error ("Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered... Engine iteration timed out. This should never happen!") multiple times across v0.4.3, v0.5.5, and v0.6.1.post2, specifically with an A800-80G * 4 setup and tp=4. In v0.4.3, using --disable-custom-all-reduce resolved the issue, but in v0.5.5 and v0.6.1.post2 that flag no longer helps.

As mentioned in the discussion in #8230, @Sekri0 highlighted that --enable-prefix-caching could cause the CUDA illegal memory access error, with the traceback pointing to FlashAttention as a potential source. Although pull requests #7018 and #7142 appeared to address this issue, it persists in vLLM 0.5.5. Since FlashAttention manages and accesses GPU memory directly, that fine-grained control could indeed increase the likelihood of illegal memory access errors, especially if FlashAttention is not configured properly.

Based on these clues, we uninstalled FlashAttention and switched to xformers. After doing so, we successfully processed thousands of requests with tp=4 without any errors. We also tested abruptly aborting hundreds of requests and resending them in loops; the server remained robust throughout. While generation is slightly slower than with FlashAttention (a difference of a few hundred milliseconds), stability has improved significantly.

If you're facing similar issues, uninstalling FlashAttention and using xformers instead may help.
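For reference, here is a minimal sketch of forcing the xformers attention backend without uninstalling FlashAttention. It assumes the VLLM_ATTENTION_BACKEND environment variable and the disable_custom_all_reduce engine argument behave this way in your vLLM version; the model name and tensor_parallel_size=4 simply mirror the setup described above.

```python
import os

# Assumption: selecting the xformers backend via this environment variable is
# supported in your vLLM release; check the docs for your exact version.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

# tensor_parallel_size=4 mirrors the A800-80G * 4, tp=4 setup above;
# disable_custom_all_reduce is the v0.4.3-era workaround mentioned in the comment.
llm = LLM(
    model="Qwen/Qwen2-72B-Instruct",
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

For the OpenAI-compatible server, setting the same environment variable in the shell before launching the server process should have the equivalent effect.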
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you! |
Your current environment
🐛 Describe the bug
I am using eval-scope to test the concurrent throughput of the Qwen2 72B Instruct model deployed with vLLM. When running 8 concurrent sessions with 8k input tokens and 2k output tokens for a period of time, the vLLM service becomes inaccessible.
https://github.com/modelscope/eval-scope/tree/main
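As a rough stand-in for the eval-scope run, the sketch below sends 8 concurrent requests in a loop against vLLM's OpenAI-compatible server. The endpoint URL, model name, and the crude prompt construction are assumptions; use the actual eval-scope tool for a faithful reproduction.

```python
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM endpoint
PROMPT = "test " * 8000  # crude stand-in for an ~8k-token input

async def one_request(session: aiohttp.ClientSession) -> int:
    payload = {
        "model": "Qwen/Qwen2-72B-Instruct",
        "prompt": PROMPT,
        "max_tokens": 2048,
    }
    async with session.post(URL, json=payload) as resp:
        await resp.text()
        return resp.status

async def main() -> None:
    # Disable the client-side timeout: 2k-token completions can take minutes.
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        while True:
            # 8 concurrent sessions, matching the reported test setup.
            statuses = await asyncio.gather(*(one_request(session) for _ in range(8)))
            print(statuses)

if __name__ == "__main__":
    asyncio.run(main())
```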