
[BFCL] Reproduction GPT-4o-2024-08-06 (prompt) does not match the official score #662

Closed
liuxize-dhu opened this issue Sep 27, 2024 · 3 comments
Labels
BFCL-General General BFCL Issue

Comments

@liuxize-dhu

Hello, in the Berkeley Function-Calling Leaderboard, when I ran GPT-4o-2024-08-06 (prompt), I only got 32.37 points, while the official score is 53.66 points. This is probably because my result files contain outputs wrapped in ```json fences, which breaks decoding. Could you please provide the official result files for GPT-4o-2024-08-06 (prompt)?

@HuanzhiMao
Collaborator

Hi @liuxize-dhu,

Thanks for the issue.

I have attached the results we used to compute the current score on the leaderboard (the REST category result file has been processed to redact the API keys). The single-turn results were generated in #603, while the multi-turn results were generated in #646.

I have also dispatched a complete regeneration for GPT-4o-2024-08-06 (prompt), and the model does indeed show a significant performance drop, similar to what you reported.

Examining the two versions' result files for the simple category (the 1st image is the version we use, the 2nd is the regenerated version), we can see that the latter contains many more entries with formatting issues, which would cause decoder failures.
For example, for simple_22, the regenerated version has:

```json\n[{\"name\": \"math.gcd\", \"parameters\": {\"num1\": 12, \"num2\": 15}}]\n```

This is wrong, as it is not the format we asked for in the system prompt; the model is not following instructions.
Below is the expected output format.

[math.gcd(num1=12, num2=15)]

[Screenshots: the first shows the simple-category result file used for the current leaderboard score; the second shows the regenerated result file containing the malformed entries.]
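To make the failure mode concrete, below is a minimal Python sketch (an illustration only, not the actual BFCL checker; the decode_prompt_output helper is assumed for this example) showing that the expected call-list format parses cleanly, while a ```json-fenced response fails before any argument checking can happen:

```python
import ast

def decode_prompt_output(raw: str):
    # Illustrative sketch only (NOT the real BFCL decoder): parse a
    # prompt-mode response into a list of (function_name, kwargs) pairs.
    tree = ast.parse(raw.strip(), mode="eval")  # raises SyntaxError on fenced markdown
    calls = []
    for node in tree.body.elts:                 # expect a Python list literal of calls
        name = ast.unparse(node.func)           # e.g. "math.gcd"
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls

expected = "[math.gcd(num1=12, num2=15)]"
FENCE = "`" * 3  # avoid embedding literal triple backticks in this snippet
fenced = FENCE + 'json\n[{"name": "math.gcd", "parameters": {"num1": 12, "num2": 15}}]\n' + FENCE

print(decode_prompt_output(expected))  # [('math.gcd', {'num1': 12, 'num2': 15})]

try:
    decode_prompt_output(fenced)
except SyntaxError:
    print("fenced JSON response cannot be parsed as a call list -> decoder failure")
```

In the evaluation, entries like the fenced one above are scored as decoder failures, which is consistent with the score drop you observed.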

I will regenerate all the single-turn data and update the leaderboard with the most up-to-date scores shortly.

gpt-4o-2024-08-06_result.zip

@curiothinker

curiothinker commented Oct 10, 2024

Hi, I have tried quite a number of models against gpt-4o-mini as a benchmark on LangGraph. However, the results were mediocre at best, even compared to hermes-3-llama-3-1-8b-tools. I am not doubting the benchmarks, but I am puzzled why some models do not even make a tool call. While the results from OpenAI and Anthropic are similar and good, those from Llama, Mistral, Qwen, and others show varying degrees of unwanted output; none that I experimented with come close to even gpt-4o-mini. Unfortunately, gorilla-llm/gorilla-openfunctions-v2-gguf gave possibly one of the worst results. I am really puzzled as to why this happens and how it can be made better. I hope you can point me in the right direction.

@HuanzhiMao
Collaborator


Would you mind opening a separate discussion for this?
