
[BFCL] The generation in v3 is too slow #649

Closed
XuHwang opened this issue Sep 23, 2024 · 3 comments · Fixed by #671

Comments

@XuHwang
Contributor

XuHwang commented Sep 23, 2024

Describe the issue

ID datapoint

  1. Datapoint / Model Handler permalink:
  2. Issue:
  3. Gorilla repo commit #:

What is the issue

After migrating to version 3, where the vLLM API serving strategy is adopted, generation speed degrades severely: it takes about 6x more time to get all results.

Proposed Changes

{
'previous_datapoint':[],
'updated_datapoint':[]
}

Additional context

@ShishirPatil
Owner

Hey @XuHwang Which models are you trying it on, and what's your GPU config? I don't know about the 6X, but yeah we do expect it to take significantly longer since a) multi-turn responses, and b) long-context responses both increase latency.

@XuHwang
Contributor Author

XuHwang commented Sep 23, 2024

Hey @XuHwang Which models are you trying it on, and what's your GPU config? I don't know about the 6X, but yeah we do expect it to take significantly longer since a) multi-turn responses, and b) long-context responses both increase latency.

Hey, thanks for the reply.

I'm evaluating my own 8B model with oss_handler on 4 x V100 (32G). The generation process takes about 10 minutes in version 2 but more than 1 hour in version 3. I wonder whether this is caused by the gap between the vLLM API serving strategy and batch_generate (e.g., the API serving strategy handles one sample at a time while batch_generate handles many samples at once?).
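
For illustration, here is a minimal sketch of the two strategies being compared; the model name and prompts are placeholders, and this is not BFCL's actual handler code:

```python
# Sketch contrasting v2-style offline batching with v3-style sequential server requests.
from vllm import LLM, SamplingParams  # offline batched inference (v2-style)
from openai import OpenAI             # client for a vLLM OpenAI-compatible server (v3-style)

prompts = ["prompt 1", "prompt 2", "prompt 3"]      # placeholder prompts
params = SamplingParams(temperature=0.0, max_tokens=512)

# v2-style: one call, vLLM batches and schedules all prompts internally.
llm = LLM(model="my-8b-model")                      # hypothetical model name
batched_outputs = llm.generate(prompts, params)

# v3-style (before the fix): one HTTP request per sample, sent sequentially,
# so the server only ever sees a batch of size 1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sequential_outputs = [
    client.completions.create(model="my-8b-model", prompt=p, max_tokens=512)
    for p in prompts
]
```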

@HuanzhiMao
Collaborator

Hey @XuHwang,
Could you try #671 and see if it solves your issue?

ShishirPatil pushed a commit that referenced this issue Oct 5, 2024
Fix #649 

Instead of sending requests to the vLLM server one by one in sequence, we should send all requests at once to vLLM to utilize its batching and optimization benefits.

Tested on 8 x A100 (40G) with Llama 3.1 70B. The inference speed on single-turn entries is roughly the same (within a 1-minute difference) as when using `llm.generate` before the BFCL V3 release in #644. The multi-turn entries still take around 2 hours to complete, but that's largely due to the nature of the multi-turn dataset; it is still much faster than before, when it would take 2 days to finish.

This PR **will not** affect the leaderboard score.
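
A hedged sketch of the idea behind this fix, assuming the model is served behind vLLM's OpenAI-compatible endpoint; the actual implementation in #671 may differ, and the model name, endpoint URL, and worker count are placeholders:

```python
# Sketch only: submit all requests concurrently instead of one by one,
# so the vLLM server's continuous batching can schedule them together.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server

def complete(prompt: str):
    # One request per benchmark entry; the server decides how to batch them.
    return client.completions.create(model="my-8b-model", prompt=prompt, max_tokens=512)

prompts = [f"prompt {i}" for i in range(100)]         # placeholder workload
with ThreadPoolExecutor(max_workers=32) as pool:      # worker count is an arbitrary choice
    results = list(pool.map(complete, prompts))
```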
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this issue Nov 11, 2024