[BFCL] Reproduction GPT-4o-2024-08-06 (prompt) does not match the official score #662
Comments
Hi @liuxize-dhu, thanks for the issue. I have attached the result file we used to compute the current score on the leaderboard. I have also dispatched a complete regeneration. Examine the two versions' result files for the difference.
This is wrong, as it is not the format we asked for in the system prompt; the model is not following instructions.
I will regenerate all the single_turn data and update the leaderboard with the most up-to-date score shortly. |
Hi, I have tried quite a number of models against gpt-4o-mini as a benchmark on LangGraph. However, the output results were mediocre at best, even compared to hermes-3-llama-3-1-8b-tools. I am not doubting the benchmarks, but I am puzzled why some models do not even make a tool call. While the results from OpenAI and Anthropic are similar and good, those from Llama, Mistral, Qwen, and others produce varying degrees of unwanted output; none that I have experimented with comes close to even gpt-4o-mini. Unfortunately, gorilla-llm/gorilla-openfunctions-v2-gguf possibly gave one of the worst results. I am really puzzled as to why, and how it can be made better. Hope you can point me in the right direction. |
Would you mind opening a separate discussion for this? |
Hello, in the Berkeley Function Calling Leaderboard, when I ran GPT-4o-2024-08-06 (prompt), I only got 32.37 points, while the official score is 53.66 points. This is probably because my model's JSON output is wrapped in ```json code fences, which breaks the decoding. Could you please provide the official JSON result file for GPT-4o-2024-08-06 (prompt)?
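For anyone hitting the same symptom: a common workaround is to strip markdown code fences from the model response before decoding. The helper below is a minimal sketch (it is not part of BFCL's actual evaluation code, and the function name and sample payload are hypothetical), assuming the model wraps an otherwise-valid JSON function call in a ```json fence:

```python
import json
import re


def strip_code_fences(text: str) -> str:
    """Remove a surrounding ```json ... ``` (or bare ```) fence, if present.

    Hypothetical helper for illustration only; BFCL's real parser may
    handle this differently.
    """
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()


# Example of a fenced response that would break a plain json.loads call:
raw = '```json\n{"name": "get_weather", "arguments": {"city": "Berkeley"}}\n```'
call = json.loads(strip_code_fences(raw))
print(call["name"])  # -> get_weather
```

Note that stripping fences only patches the symptom on the evaluation side; as the maintainer points out above, fenced output is itself an instruction-following failure, since the system prompt asks for bare output.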