
[BFCL] The generation in v3 is too slow #649

Closed
XuHwang opened this issue Sep 23, 2024 · 3 comments · Fixed by #671

Comments

@XuHwang
Contributor

XuHwang commented Sep 23, 2024

Describe the issue

ID datapoint

  1. Datapoint / Model Handler permalink:
  2. Issue:
  3. Gorilla repo commit #:

What is the issue

After migrating to version 3, where the vLLM API serving strategy is adopted, generation speed degrades severely: it takes about 6x more time to get all results.

Proposed Changes

{
'previous_datapoint':[],
'updated_datapoint':[]
}

Additional context

@ShishirPatil
Owner

Hey @XuHwang Which models are you trying it on, and what's your GPU config? I don't know about the 6X, but yeah we do expect it to take significantly longer since a) multi-turn responses, and b) long-context responses both increase latency.

@XuHwang
Contributor Author

XuHwang commented Sep 23, 2024

Hey @XuHwang Which models are you trying it on, and what's your GPU config? I don't know about the 6X, but yeah we do expect it to take significantly longer since a) multi-turn responses, and b) long-context responses both increase latency.

Hey, thanks for the reply.

I'm evaluating my own 8B model with oss_handler on 4 x V100 (32G). The generation process takes about 10 minutes in version 2 but more than 1 hour in version 3. I wonder whether this is caused by the gap between the vLLM API serving strategy and batch_generate (e.g., the API serving strategy handles one sample at a time while batch_generate handles many samples at once?).
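
For illustration, here is a minimal sketch of the two strategies being compared; the model name and prompts are placeholders, and this is not BFCL's actual handler code:

```python
# Sketch contrasting v2-style offline batching with v3-style sequential server requests.
from vllm import LLM, SamplingParams  # offline batched inference (v2-style)
from openai import OpenAI             # client for a vLLM OpenAI-compatible server (v3-style)

prompts = ["prompt 1", "prompt 2", "prompt 3"]      # placeholder prompts
params = SamplingParams(temperature=0.0, max_tokens=512)

# v2-style: one call, vLLM batches and schedules all prompts internally.
llm = LLM(model="my-8b-model")                      # hypothetical model name
batched_outputs = llm.generate(prompts, params)

# v3-style (before the fix): one HTTP request per sample, sent sequentially,
# so the server only ever sees a batch of size 1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sequential_outputs = [
    client.completions.create(model="my-8b-model", prompt=p, max_tokens=512)
    for p in prompts
]
```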

@HuanzhiMao
Collaborator

Hey @XuHwang,
Could you try #671 and see if it solves your issue?

ShishirPatil pushed a commit that referenced this issue Oct 5, 2024
Fix #649 

Instead of sending requests to the vLLM server one by one in sequence, we should send all requests at once to vLLM to utilize its batching and optimization benefits.

Tested on 8 x A100 (40G) with Llama 3.1 70B. The inference speed on single-turn entries is roughly the same (within a 1-minute difference) as when using `llm.generate` before the BFCL V3 release in #644. The multi-turn entries still take around 2 hours to complete, but that's largely due to the nature of the multi-turn dataset; it is still much faster than before, when it would take 2 days to finish.

This PR **will not** affect the leaderboard score.
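
A hedged sketch of the idea behind this fix, assuming the model is served behind vLLM's OpenAI-compatible endpoint; the actual implementation in #671 may differ, and the model name, endpoint URL, and worker count are placeholders:

```python
# Sketch only: submit all requests concurrently instead of one by one,
# so the vLLM server's continuous batching can schedule them together.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server

def complete(prompt: str):
    # One request per benchmark entry; the server decides how to batch them.
    return client.completions.create(model="my-8b-model", prompt=prompt, max_tokens=512)

prompts = [f"prompt {i}" for i in range(100)]         # placeholder workload
with ThreadPoolExecutor(max_workers=32) as pool:      # worker count is an arbitrary choice
    results = list(pool.map(complete, prompts))
```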
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this issue Nov 11, 2024