Improve the benchmark by evaluating multiple models and display the results #126

Merged
merged 6 commits into from
Jul 11, 2024

@bugsz bugsz commented Jun 26, 2024

Closes #

📑 Description

As the title suggests, this PR:

  1. Supports evaluating multiple models in a single run: `sotopia benchmark-all --model-list gpt-4o --model-list gpt-3.5-turbo`. Omitting `--model-list` falls back to the default model names.
  2. Supports displaying and saving the results with `sotopia benchmark-display`, in the format used by https://github.com/sotopia-lab/sotopia-space/blob/main/data_dir/models_vs_gpt35.jsonl. (pandas does not seem to be a project requirement, so I am not sure how to display the results in a structured way in the CLI.)
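On the open question of structured CLI output without pandas, here is a minimal standard-library sketch. It is not the code in this PR: it uses `argparse` only to illustrate a repeatable `--model-list` flag, and `DEFAULT_MODELS`, the function names, and the row fields are hypothetical placeholders rather than the real defaults or the real schema of `models_vs_gpt35.jsonl`.

```python
import argparse
import json
from pathlib import Path

# Hypothetical stand-in for the project's default model names (an assumption).
DEFAULT_MODELS = ("gpt-4o", "gpt-3.5-turbo")


def parse_model_list(argv: list[str]) -> list[str]:
    """Collect every --model-list occurrence into one list of model names."""
    parser = argparse.ArgumentParser(prog="benchmark-sketch")
    # action="append" gathers repeated flags; default=None avoids mutating
    # a shared default list and lets us detect "flag not given".
    parser.add_argument("--model-list", action="append", default=None)
    args = parser.parse_args(argv)
    return args.model_list if args.model_list else list(DEFAULT_MODELS)


def format_table(rows: list[dict[str, object]]) -> str:
    """Render rows as a left-aligned plain-text table, no pandas needed."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    cells = [[str(row.get(h, "")) for h in headers] for row in rows]
    # Each column is as wide as its longest cell or its header.
    widths = [
        max(len(h), *(len(line[i]) for line in cells))
        for i, h in enumerate(headers)
    ]

    def fmt(values: list[str]) -> str:
        return "  ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines.extend(fmt(line) for line in cells)
    return "\n".join(lines)


def save_jsonl(rows: list[dict[str, object]], path: Path) -> None:
    """Write one JSON object per line, the usual JSONL layout."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    models = parse_model_list(
        ["--model-list", "gpt-4o", "--model-list", "gpt-3.5-turbo"]
    )
    # Placeholder values only; these are not benchmark results.
    rows = [{"model_name": m, "score": 0.0} for m in models]
    print(format_table(rows))
    save_jsonl(rows, Path("results_sketch.jsonl"))
```

Padding each column to its widest cell keeps the output readable in a terminal without adding a dependency. A table library could replace `format_table` later if one becomes a requirement.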

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/descript (e.g. feature/add-llm-agents)
  • Ready for code review

ℹ Additional Information

…ving the results to compare different models

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 17.02128% with 39 lines in your changes missing coverage. Please review.

Project coverage is 60.90%. Comparing base (701f2a8) to head (d8bfa47).
Report is 4 commits behind head on main.

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
- Coverage   61.71%   60.90%   -0.81%     
==========================================
  Files          55       55              
  Lines        2714     2778      +64     
==========================================
+ Hits         1675     1692      +17     
- Misses       1039     1086      +47     
Files Coverage Δ
sotopia/cli/benchmark/benchmark.py 21.17% <17.02%> (-2.27%) ⬇️

... and 6 files with indirect coverage changes
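
As a sanity check on the figures in this report, the percentages can be reproduced from the hit and line counts in the diff table. This is a reader's sketch, not Codecov's implementation. It assumes the project percentages are truncated (not rounded) to two decimals, which is inferred only from the numbers shown here, and it infers the patch size of 47 lines from the reported 39 missing lines and 17.02128% patch coverage.

```python
def truncated_basis_points(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated using integer math."""
    return hits * 10_000 // lines


# Counts taken from the coverage diff table above.
base = truncated_basis_points(1675, 2714)  # main
head = truncated_basis_points(1692, 2778)  # this PR

print(f"base: {base / 100:.2f}%")
print(f"head: {head / 100:.2f}%")
print(f"delta: {(head - base) / 100:+.2f}%")

# Patch coverage: 39 missing lines at 17.02128% coverage implies the
# total number of changed lines (an inference, not stated in the report).
missing = 39
patch_total = round(missing / (1 - 0.1702128))
patch_covered = patch_total - missing
print(f"patch: {patch_covered}/{patch_total} lines covered "
      f"= {100 * patch_covered / patch_total:.5f}%")
```

Under these assumptions the script yields 61.71%, 60.90%, and -0.81% for the project, and 8 of 47 patch lines covered, matching the report.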


@ProKil ProKil left a comment

Please change this and we can merge this PR.

@ProKil ProKil merged commit 28f053a into main Jul 11, 2024
8 checks passed
@ProKil ProKil deleted the feature/benchmark branch July 11, 2024 22:14
@bugsz bugsz restored the feature/benchmark branch July 12, 2024 19:28