
[BFCL] Reproduction GPT-4o-2024-08-06 (prompt) does not match the official score #662

Closed
liuxize-dhu opened this issue Sep 27, 2024 · 3 comments
Labels
BFCL-General General BFCL Issue

Comments

@liuxize-dhu

Hello, in the Berkeley Function-Calling Leaderboard, when I ran GPT-4o-2024-08-06 (prompt), I only got 32.37 points, while the official score is 53.66 points. This is probably because my result files contain outputs wrapped in ```json fences, which breaks decoding. Could you please provide the official result files for GPT-4o-2024-08-06 (prompt)?

@HuanzhiMao
Collaborator

Hi @liuxize-dhu,

Thanks for the issue.

I have attached the results we used to compute the current score on the leaderboard (the REST category result file has been processed to redact the API keys). The single-turn results were generated in #603, while the multi-turn results were generated in #646.

I have also dispatched a complete regeneration for GPT-4o-2024-08-06 (prompt), and the model does indeed show a significant performance drop, similar to what you reported.

Examining the two versions' result files for the simple category (the 1st image is the version we use, the 2nd is the regenerated version), we can see that the latter contains many more entries with formatting issues, which would cause decoder failures.
For example, for simple_22, the regenerated version has:

```json\n[{\"name\": \"math.gcd\", \"parameters\": {\"num1\": 12, \"num2\": 15}}]\n```

This is wrong, as it is not the format we asked for in the system prompt; the model is not following instructions.
Below is the expected output format.

[math.gcd(num1=12, num2=15)]

[Screenshots: the first shows the simple-category result file used for the current leaderboard score; the second shows the regenerated result file containing the malformed entries.]
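To make the failure mode concrete, below is a minimal Python sketch (an illustration only, not the actual BFCL checker; the decode_prompt_output helper is assumed for this example) showing that the expected call-list format parses cleanly, while a ```json-fenced response fails before any argument checking can happen:

```python
import ast

def decode_prompt_output(raw: str):
    # Illustrative sketch only (NOT the real BFCL decoder): parse a
    # prompt-mode response into a list of (function_name, kwargs) pairs.
    tree = ast.parse(raw.strip(), mode="eval")  # raises SyntaxError on fenced markdown
    calls = []
    for node in tree.body.elts:                 # expect a Python list literal of calls
        name = ast.unparse(node.func)           # e.g. "math.gcd"
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls

expected = "[math.gcd(num1=12, num2=15)]"
FENCE = "`" * 3  # avoid embedding literal triple backticks in this snippet
fenced = FENCE + 'json\n[{"name": "math.gcd", "parameters": {"num1": 12, "num2": 15}}]\n' + FENCE

print(decode_prompt_output(expected))  # [('math.gcd', {'num1': 12, 'num2': 15})]

try:
    decode_prompt_output(fenced)
except SyntaxError:
    print("fenced JSON response cannot be parsed as a call list -> decoder failure")
```

In the evaluation, entries like the fenced one above are scored as decoder failures, which is consistent with the score drop you observed.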

I will regenerate all the single-turn data and update the leaderboard with the most up-to-date scores shortly.

gpt-4o-2024-08-06_result.zip

@curiothinker

curiothinker commented Oct 10, 2024

Hi, I have tried quite a number of models against gpt-4o-mini as a benchmark on LangGraph. However, the results were mediocre at best, even compared to hermes-3-llama-3-1-8b-tools. I am not doubting the benchmarks, but I am puzzled why some models do not even make a tool call. While the results from OpenAI and Anthropic are similar and good, those from Llama, Mistral, Qwen, and others show varying degrees of unwanted output; none that I experimented with come close to even gpt-4o-mini. Unfortunately, gorilla-llm/gorilla-openfunctions-v2-gguf gave possibly one of the worst results. I am really puzzled as to why this happens and how it can be made better. I hope you can point me in the right direction.

@HuanzhiMao
Collaborator


Would you mind opening a separate discussion for this?
