[BFCL] The generation in v3 is too slow #649
Comments
Hey @XuHwang, which models are you trying it on, and what's your GPU config? I don't know about the 6x, but we do expect it to take significantly longer, since a) multi-turn responses and b) long-context responses both increase latency.
Hey, thanks for the reply. I evaluate my own model with oss_handler on 4 x V100 (32G), and the model size is 8B. Generation took about 10 minutes in version 2, while it takes more than 1 hour in version 3. I wonder whether this is caused by the gap between the vLLM API serving strategy and batch_generate (e.g., the API serving strategy handles one sample at a time, while batch_generate handles many samples at once?).
Fix #649 Instead of sending requests to the vLLM server one by one in sequence, we should send all requests at once to vLLM to utilize its batching and optimization benefits. Tested on 8 x A100 (40G) with Llama 3.1 70B. The inference speed on single-turn entries is roughly the same (within 1 minute difference) as when using `llm.generate` before the BFCL V3 release in #644. The multi-turn entries still take around 2 hours to complete, but that's largely due to the nature of the multi-turn dataset; it is much faster than before, when it would take 2 days to finish. This PR **will not** affect the leaderboard score.
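The fix described above boils down to dispatching all requests concurrently so the vLLM server's continuous batching can fill its batch, instead of serializing requests client-side. A minimal sketch of that pattern is below; `send_request` is a hypothetical stand-in for the actual HTTP call to the vLLM OpenAI-compatible endpoint, and the worker count is an illustrative value, not the one used in the PR.

```python
from concurrent.futures import ThreadPoolExecutor


def send_request(prompt: str) -> str:
    # Hypothetical stand-in for the real HTTP call to the vLLM
    # server's /v1/completions endpoint; replace with an actual client.
    return f"response for: {prompt}"


def generate_sequential(prompts):
    # One request at a time: each request waits for the previous one,
    # so the server's batcher never sees more than one in-flight sequence.
    return [send_request(p) for p in prompts]


def generate_concurrent(prompts, max_workers=64):
    # Fire all requests at once; the server batches the in-flight
    # sequences internally. pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_request, prompts))
```

With a real server behind `send_request`, the concurrent variant returns the same results in the same order as the sequential one; only the wall-clock time changes.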
Describe the issue
ID datapoint
What is the issue
After migrating to version 3, where the vLLM API serving strategy is adopted, generation speed degrades severely: it takes about 6x longer to get all results.
Proposed Changes
```
{
    'previous_datapoint': [],
    'updated_datapoint': []
}
```
Additional context